For several years now scientists—in at least some disciplines—have been concerned about low rates of replicability. As scientists in those fields, we worry about the development of cumulative knowledge, and about wasted research effort. An additional challenge is to consider end-users (decision and policy makers) and other consumers of our work: what level of trust should they place in the published literature? How might they judge the reliability of the evidence base?
With the latter questions in mind, our research group recently launched ‘The repliCATS project’ (Collaborative Assessments for Trustworthy Science). In the first phase, we’re focussing on eliciting predictions about the likely outcomes of direct replications of 3,000 (empirical, quantitative) research claims in the social and behavioural sciences. A subset of these 3,000 research claims will be replicated by an independent team of researchers, to serve as an evaluation of elicited forecasts.
The repliCATS project forms part of a broader program of replication research which will cover 8 disciplines: business, criminology, economics, education, political science, psychology, public administration, and sociology. The broader program, Systematizing Confidence in Open Research and Evidence (SCORE), is funded by the US Department of Defense (DARPA).
The repliCATS project uses a structured group discussion—rather than a prediction market or a survey—called the IDEA protocol to elicit predictions about replicability.
Working in diverse groups of 5-10, participants first Investigate a research claim, answering three questions: (i) how comprehensible the claim is; (ii) whether the underlying effect described in the research claim seems real or robust; and (iii) how likely a direct replication is to succeed, given as a private estimate.
They then join their group, either in a face-to-face meeting or in remote, online groups, to Discuss. Discussions start with the sharing of private estimates as well as the information and reasoning that went into forming those estimates. The purpose of the discussion phase is for group members to share and cross-examine each other’s judgements; it is not to form consensus.
After discussion has run its course, researchers are then invited to update their original Estimates, if they wish, providing what we refer to as a ‘round 2 forecast’. These round 2 forecasts are made privately, and not shared with other group members.
Finally, we will mathematically Aggregate these forecasts. For this project, we are trialling a variety of aggregation methods, ranging from unweighted linear averages to aggregating log odds transformed estimates (see figure below).
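To make the two ends of that range concrete, here is a minimal sketch of an unweighted linear average next to an average taken on the log-odds scale. The function names and the five example forecasts are hypothetical, chosen only for illustration; this is not the project's own aggregation code.

```python
import math

def linear_average(probs):
    """Unweighted mean of the probability forecasts."""
    return sum(probs) / len(probs)

def log_odds_average(probs, eps=1e-6):
    """Average the forecasts on the log-odds scale, then map back to a probability."""
    # Clip to avoid infinite log odds for forecasts of exactly 0 or 1 (eps is an assumption).
    clipped = [min(max(p, eps), 1 - eps) for p in probs]
    mean_logit = sum(math.log(p / (1 - p)) for p in clipped) / len(clipped)
    return 1 / (1 + math.exp(-mean_logit))

# Hypothetical round 2 forecasts from one group of five participants.
round2 = [0.60, 0.70, 0.75, 0.80, 0.95]
print(f"linear average:   {linear_average(round2):.3f}")
print(f"log-odds average: {log_odds_average(round2):.3f}")
```

With forecasts like these, the log-odds average gives more pull to the confident 0.95 forecast than the linear average does, which is one reason the two methods can disagree.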
Some previous replication projects have run prediction markets and/or surveys alongside. Over time, these have become more accurate, particularly in the case of the social science replication project (of Science and Nature papers). Our project departs from these previous efforts, not only by using a very different method of elicitation, but also in the qualitative information we gather about reasoning, information sharing, and the process of updating beliefs (following discussion).
575 claims assessed in our first local IDEA workshop
Earlier this month, we ran our first set of large face-to-face IDEA groups, prior to the Society for Improving Psychological Science (SIPS) conference. 156 researchers joined one of 30 groups, each with a dedicated group facilitator. Over two days, those groups evaluated 575 published research claims (20-25 per group) in business, economics and psychology, making a huge contribution to our understanding of:
– those published claims themselves,
– how participants reason about replication, what information cues and heuristics they use to make such predictions, including what counterpoints make them change their minds, and
– the research community’s overall beliefs about the state of our published evidence base.
We’ve also started to learn about how researchers evaluate claims within their direct field of expertise versus slightly outside that scope. We don’t yet know, or necessarily expect, that there will be differences in accuracy, but there do seem to be differences in approach and subjective levels of confidence.
What happens to those predictions? How accurate were they?
The short answer is that we wait. As discussed above, the repliCATS project is part of a larger program. What happens next is that a subset of those 3,000 claims will be fully replicated by an independent research team, serving as evaluation criteria for the accuracy of our elicited predictions.
In about a year’s time, we’ll know how accurate those predictions are. (We’re hoping for at least 80% accuracy.) Our 3,000 predictions will also be used to benchmark machine learning algorithms developed by other research teams (again, independently funded by DARPA).
Following our first workshop, our repliCATS team now has a few thousand probabilistic predictions, and associated qualitative reasoning data to get stuck into. It’s an overwhelming amount of information, and barely one fifth of what we’ll have this time next year!
Feedback on SIPS workshop
As you’ve probably gathered, the success of our project relies heavily on attracting large numbers of participants interested in assessing research claims. So it was hugely heartening for us that the researchers who joined our SIPS workshop gave us very positive overall feedback about the experience.
They particularly enjoyed the core task of thinking about what factors or features of a study contribute to likely replicability (or not).
Early career researchers in particular also appreciated the chance to see and discuss others’ ratings and reasoning, and told us that the workshop has helped build their confidence about writing peer reviews in the future. (In fact, several of us came to the opinion that something like our IDEA protocol would make a good substitute for the current peer review process in some places!)
To hear what others thought, check out Twitter @repliCATS.
In the meantime, we’re ready to deploy our “repliCATS bus” (we’ll come to you, or help you run smaller-scale workshops at your institution), and to offer you the opportunity to join ‘remote IDEA groups’ online.
Fiona Fidler is an Associate Professor at the University of Melbourne with joint appointments in the School of BioSciences and the School of Historical and Philosophical Studies (SHAPS). Fallon Mody is a postdoctoral research fellow in the Department of History and Philosophy of Science at the University of Melbourne.
 Note that here we are specifically concerned with trust in ‘the published literature’ and not trust in science more broadly, or in scientists themselves. The published literature is as much created by the publishing industry as it is by scientists and other scholars.
 In this project, a “research claim” has a very specific meaning: it is used to describe a single major finding from a published study – for example, a journal article – as well as details of the methods and results that support this finding.
 In subsequent phases, we’ll be thinking about conceptual replications, generalisability, and other factors that build confidence for end users of research.
[Excerpts taken from the preprint “’Because it is the Right Thing to Do’: Taking Stock of the Peer Reviewers’ Openness Initiative” by Maike Dahrendorf et al., posted at PsyArXiv Preprints]
“Although the practice of publicly sharing data and code appears relatively straightforward, it is still not the norm…In order to change the status quo and accelerate the adoption of data sharing practices, Morey et al. (2016) introduced the Peer Reviewers’ Openness Initiative (PRO); researchers who sign PRO agree to provide a full review only for manuscripts that publicly share data and code, or else provide a clear reason why sharing is not possible.”
“Almost two years after PRO was launched we sought to take stock of the initiative by surveying signatories on their subjective experiences and opinions.”
“At the time of data collection, 449 researchers had signed the PRO initiative. We successfully retrieved the email addresses of 340 signatories…Compared to surveys on similar topics, the response rate of 37.65%…was relatively high…the final sample size was N = 127.”
“…about 40% of the respondents…indicated that data had been made available, about 30% reported to have received praise from colleagues…and about 25% indicated to have been able to provide a higher-quality review…In contrast, only a small fraction of respondents reported negative experiences…Commensurate with these experiences, 117 respondents indicated that they would sign the initiative again, whereas only 8 indicated that they would not…”
“…it is important to note that this survey only concerns the experiences of a relatively small and highly selective sample. Therefore, one cannot draw general conclusions about the effectiveness and reception of the PRO initiative. Such conclusions necessitate the involvement of researchers on the receiving end of PRO, namely editors and authors.”
[Excerpts taken from “The registered reports revolution: Lessons in cultural reform” by Chris Chambers, published in Significance, a publication of the Royal Statistical Society]
“On 12 November 2012, as my train sped towards London, I received one of the most important emails of my life. The message cordially informed me that the publisher of Cortex, a scientific journal that I had recently joined as an editor, had approved our request to launch a new type of article called a registered report.”
“Having just one journal offer registered reports was never going to be enough, and so, in the months that followed, we lobbied hard for others to follow suit. In June 2013, we brought together over 80 journal editors and wrote a joint article in The Guardian. ‘[A]s a group of scientists with positions on more than 100 journal editorial boards,’ we wrote, ‘we are calling for all empirical journals in the life sciences – including those journals that we serve – to offer pre-registered articles at the earliest opportunity.'”
“Regardless of the storm it created (or perhaps because of it), the letter appeared to work as intended. Beyond the heat of the debate – and the numerous misconceptions voiced by opponents – other scientists were quietly deciding that registered reports made sense. The idea had turned a corner and the number of adopting journals steadily grew. By the end of 2013 we had three adopting journals. A year later we had seven. Through 2015, 2016, 2017 and 2018, the number of adopters rose to 21, 41, 88 and 154.”
“Today, registered reports are offered by 202 journals and rising, and across a wide range of sciences. Nearly 200 completed articles have been published, with hundreds more in the pipeline.”
“Reforming science is a hideously difficult task…If someone then comes along and says, “Whoa, there! You should archive your data in a public repository. You should preregister your protocols to control bias. You should replicate that study before submitting it to Nature”, the response of the player isn’t even “No”. The response is silence.”
“The working scientist sweeps past in a blur of desperation, racing towards the next publication, or tenure, or the next grant or fellowship with a tiny success rate. Every so often, the weary explorer looks up and catches a fragment of the reform argument. ‘You’re telling me I should do X, Y, Z. But why would I? Unless someone is going to make everyone else do it too, I’m just going to become less competitive. I’m a scientist, not a sacrificial lamb.’”
“‘Should’ arguments, like those above, fail because they offer only judgement, not solutions. If should arguments were sufficient to change behaviour then behaviour would have changed decades ago.”
“Breaking this impasse requires aligning incentives so that any reform works for the community and the individual. Registered reports achieve this by turning the pursuit of high quality science into a virtuous transaction: submit your protocol to our journal, receive a positive assessment (most likely after performing some revision based on expert reviews), and we will guarantee to publish your paper regardless of whether or not your results support your hypothesis. Work hard at designing rigorous, careful, important science and we will de-stress your life by making the results a dead currency in quality evaluation.”
“The key lesson of registered reports here is clear: do not tell scientists what they should do. Instead, tell them why what you are proposing is better than the status quo and why the new practice is both in their career interests as well as in the interests of the scientific community. Give them no reason to say, ‘This will harm my career’ or ‘My peers will disapprove’. Give them every reason to say, ‘This works for me and helps me leave a lasting positive legacy on my field’.”
[Excerpts taken from the blog “Improving research transparency through easier, faster access to studies in the AEA RCT Registry” by Keesler Welch and James Turitto, posted at the BITSS blogsite].
“J-PAL and the American Economic Association (AEA) have been working together over the past year to create digital object identifiers, or DOIs, for each trial registered in the AEA Registry for Randomized Controlled Trials….as of August 13, 2019, the Registry has officially launched this new feature!”
“DOIs provide persistent links to web content to ensure that the content is discoverable at all times—even if its URL or location within a site changes.”
“Using DOIs for trial registration in the social sciences is still something new, and we see it as an important move forward for research transparency.”
“This will make it easier for researchers to link their study designs and their (optional) pre-analysis plans directly to their publications and published research data. This also ensures that research study designs that have been registered will remain findable in perpetuity.”
[Excerpts taken from the preprint “Preregistration Is Hard, And Worthwhile” by Brian Nosek and others, posted at PsyArXiv Preprints]
“Preregistration of studies serves at least three aims for improving the credibility and reproducibility of research findings.”
“First, preregistration of analysis plans makes clear which analyses were planned a priori and which were conducted post hoc.”
“Second, preregistration enables detection of questionable research practices such as selective outcome reporting…or Hypothesizing After the Results are Known…”
“Third, preregistration of studies can reduce the impact of publication bias—particularly the prioritization of publishing positive over negative results—by making all planned studies discoverable whether or not they are ultimately published.”
“However, preregistration is a skill that requires experience to hone…Preregistration requires research planning and it is hard, especially contingency planning. It takes practice to make design and analysis decisions in the abstract, and it takes experience to learn what contingencies are most important to anticipate.”
“This might lead researchers to shy away from preregistration for worries about imperfection. Embrace incrementalism…Having some plans is better than having no plans, and sharing those plans in advance is better than not sharing them. With experience, planning will improve and the benefits will increase for oneself and for consumers of the research.”
“There are opportunities to accelerate that skill building. Study registries such as OSF and SREE, and decision tools such as Declare Design provide structured workflows to help researchers anticipate common decisions and provide guidance for articulating those decisions.”
“Other strategies for developing these skills include: (1) refining analysis plans by simulating data to practice making the decisions; (2) splitting the “real” data into exploratory and confirmatory subsamples and preregistering the analysis after viewing the exploratory subset; (3) drafting the study methods and results section in advance to help anticipate what should be done and how you will report it; and (4) submitting the plan as a Registered Report for peer review to get expert feedback on the plan.”
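Strategy (2) above can be sketched in a few lines. This is a minimal illustration, assuming a simple random 50/50 split with a fixed seed so the split can be reproduced and reported; the function name, the seed, and the split share are hypothetical choices, not something the preprint prescribes.

```python
import random

def split_exploratory_confirmatory(records, exploratory_share=0.5, seed=2019):
    """Randomly split records into an exploratory and a confirmatory subsample."""
    rng = random.Random(seed)          # fixed seed: the split itself is reproducible
    shuffled = list(records)           # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * exploratory_share)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical participant IDs standing in for the "real" data.
participant_ids = list(range(1, 101))
exploratory, confirmatory = split_exploratory_confirmatory(participant_ids)
print(len(exploratory), "exploratory;", len(confirmatory), "held out for the preregistered analysis")
```

The confirmatory half stays unopened until the analysis plan, refined on the exploratory half, has been registered.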
“Researchers can embrace the common back-and-forth between planned and unplanned (confirmatory and exploratory) research activities…The key role of preregistration is to clarify which is which.”
“Preregistration is a plan, not a prison…When deviations from the plan will improve the quality of the research, deviate from the plan.”
“Reporting deviations from the plan can be challenging…If possible, report what occurs following the original plan alongside what occurs with the deviations, and share the materials, data, and code so that others can evaluate the reported outcomes and what would have occurred with alternative approaches.”
[Excerpts taken from the blog “Corrigendum: a word you may hope never to encounter” by Dorothy Bishop, published at BishopBlog]
“I have this week submitted a ‘corrigendum’ to a journal for an article published in the American Journal of Medical Genetics B (Bishop et al, 2006). It’s just a fancy word for ‘correction’, and journals use it contrastively with ‘erratum’. Basically, if the journal messes up and prints something wrong, it’s an erratum. If the author is responsible for the mistake, it’s a corrigendum.”
“I discovered the error when someone asked for the data for a meta-analysis….although I could recreate most of what was published, I had the chilling realisation that there was a problem with Table II.”
“I had data on siblings of children with autism, and in some cases there were two or three siblings in the family. … I decided to take a mean value for each family. So if there was one child, I used their score, but if there were 2 or 3, then I averaged them.”
“And here, dear Reader, is where I made a fatal mistake. I thought the simplest way to do this would be by creating a new column in my Excel spreadsheet which had the mean for each family, computing this by manually entering a formula based on the row numbers for the siblings in that family.”
“…I noticed when I opened the file that I had pasted a comment in red on the top row that said ‘DO NOT SORT THIS FILE!’. … Despite my warning message to myself, somewhere along the line, it seems that a change was made to the numbering, and this meant that a few children had been assigned to the wrong family. And that’s why table II had gremlins in it and needed correcting.”
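The failure described here comes from tying a calculation to row positions. A minimal sketch of the alternative, assuming each row carries an explicit family identifier, is to group scores by that identifier so the result no longer depends on row order. The data and names below are hypothetical, and this is not a reconstruction of the author's own workflow.

```python
from collections import defaultdict
from statistics import mean

def family_means(rows):
    """Mean score per family, keyed on the family ID rather than on row position."""
    scores = defaultdict(list)
    for family_id, score in rows:
        scores[family_id].append(score)
    return {family_id: mean(values) for family_id, values in scores.items()}

# Hypothetical (family ID, sibling score) rows: one, two, and three siblings per family.
rows = [("F1", 10), ("F2", 8), ("F2", 12), ("F3", 9), ("F3", 11), ("F3", 13)]

original_order = family_means(rows)
sorted_by_score = family_means(sorted(rows, key=lambda row: row[1]))
print(original_order)
print("unchanged after sorting:", original_order == sorted_by_score)
```

Because the grouping key travels with each row, sorting the file cannot reassign a child to the wrong family.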
“I thought it worth blogging about this to show how much easier my life would have been if I had been using the practices of data management and analysis that I now am starting to adopt. I also felt it does no harm to write about making mistakes, which is usually a taboo subject. I’ve argued previously that we should be open about errors, to encourage others to report them, and to demonstrate how everyone makes mistakes, even when trying hard to be accurate…”
[Excerpts are taken from the article “The Flawed Reasoning Behind the Replication Crisis” by Aubrey Clayton, published at nautil.us]
“Suppose an otherwise healthy woman in her forties notices a suspicious lump in her breast and goes in for a mammogram. The report comes back that the lump is malignant.”
“She wants to know the chance of the diagnosis being wrong. Her doctor answers that, as diagnostic tools go, these scans are very accurate. Such a scan would find nearly 100 percent of true cancers and would only misidentify a benign lump as cancer about 5 percent of the time. Therefore, the probability of this being a false positive is very low, about 1 in 20.”
“…the missing ingredient …is the prior probability for the various hypotheses.”
“For the breast cancer example, the doctor would need to consider the overall incidence rate of cancer among similar women with similar symptoms, not including the result of the mammogram. Maybe a physician would say from experience that about 99 percent of the time a similar patient finds a lump it turns out to be benign. So the low prior chance of a malignant tumor would balance the low chance of getting a false positive scan result. Here we would weigh the numbers: (0.05) * (0.99) vs. (1) * (0.01).”
“We’d find there was about an 83 percent chance the patient doesn’t have cancer.”
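The weighing of (0.05) * (0.99) against (1) * (0.01) can be checked directly. The sketch below uses only the numbers given in the excerpt; the function and parameter names are my own.

```python
def posterior_benign(prior_benign, false_positive_rate, sensitivity):
    """Probability the lump is benign, given a positive scan (Bayes' rule)."""
    weight_benign = false_positive_rate * prior_benign      # (0.05) * (0.99)
    weight_malignant = sensitivity * (1 - prior_benign)     # (1) * (0.01)
    return weight_benign / (weight_benign + weight_malignant)

p = posterior_benign(prior_benign=0.99, false_positive_rate=0.05, sensitivity=1.0)
print(f"chance the patient does not have cancer: {p:.1%}")
```

The two weights are 0.0495 and 0.01, so the benign hypothesis takes 0.0495 / 0.0595 of the total, which is where the figure of about 83 percent comes from.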
“The problem, though, is the dominant mode of statistical analysis these days isn’t Bayesian. Since the 1920s, the standard approach to judging scientific theories has been significance testing, made popular by the statistician Ronald Fisher.”
“To understand what’s wrong, consider the following completely true, Fisherian summary of the facts in the breast cancer example (no false negatives, 5 percent false positive rate):”
“Suppose we scan 1 million similar women, and we tell everyone who tests positive that they have cancer. Then, among those who actually have cancer, we will be correct every single time. And among those who don’t have it, we will only be incorrect 5 percent of the time. So, overall our procedure will be incorrect less than 5 percent of the time.”
“Sounds persuasive, right? But here’s another summary of the facts, including the base rate of 1 percent:”
“Suppose we scan 1 million similar women, and we tell everyone who tests positive that they have cancer. Then we will have correctly told all 10,000 women with cancer that they have it. Of the remaining 990,000 women whose lumps were benign, we will incorrectly tell 49,500 women that they have cancer. Therefore, of the women we identify as having cancer, about 83 percent will have been incorrectly diagnosed.”
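Both summaries describe the same screening procedure, and a short count makes that explicit. The sketch below uses the excerpt's figures (1 million women, 1 percent base rate, no false negatives, 5 percent false positive rate); the function and variable names are hypothetical.

```python
def screening_counts(n, base_rate, false_positive_rate, sensitivity=1.0):
    """Count outcomes when everyone who tests positive is told they have cancer."""
    with_cancer = round(n * base_rate)
    without_cancer = n - with_cancer
    true_positives = round(with_cancer * sensitivity)
    false_positives = round(without_cancer * false_positive_rate)
    false_negatives = with_cancer - true_positives
    return {
        "with_cancer": with_cancer,
        "false_positives": false_positives,
        # Share of all women scanned who receive a wrong answer.
        "overall_error_rate": (false_positives + false_negatives) / n,
        # Share of women told they have cancer who do not have it.
        "wrong_among_positives": false_positives / (true_positives + false_positives),
    }

counts = screening_counts(n=1_000_000, base_rate=0.01, false_positive_rate=0.05)
print(counts)
```

The overall error rate comes out just under 5 percent, while the share of positive diagnoses that are wrong is about 83 percent: the first number answers a question about the procedure, the second answers the patient's question.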
“Suppose the women who received positive test results and a presumptive diagnosis of cancer in our example were tested again by having biopsies. We would see the majority of the initial results fail to repeat, a “crisis of replication” in cancer diagnoses. That’s exactly what’s happening in science today.”
“We Bayesians have seen this coming for years. … Now, a consensus is finally beginning to emerge: Something is wrong with science that’s causing established results to fail. One proposed and long overdue remedy has been an overhaul of the use of statistics.”
“In 2015, the journal Basic and Applied Social Psychology took the drastic measure of banning the use of significance testing in all its submissions, and this March, an editorial in Nature co-signed by more than 800 authors argued for abolishing the use of statistical significance altogether.”
“Similar proposals have been tried in the past, but every time the resistance has been beaten back and significance testing has remained the standard. Maybe this time the fear of having a career’s worth of results exposed as irreproducible will provide scientists with the extra motivation they need.”
[Excerpts taken from the article “The Sad Truth about Happiness Scales” by Timothy Bond and Kevin Lang, forthcoming in the Journal of Political Economy]
“A large literature has attempted to establish the determinants of happiness using ordered response data from questions such as ‘Taking all things together, how would you say things are these days—would you say that you are very happy, pretty happy, or not too happy?'”
“We…reach the striking conclusion that the results from the literature are essentially uninformative about how various factors affect average happiness.”
“The basic argument is as follows. There are a large (possibly infinite) number of states of happiness that are strictly ranked. In order to calculate a group’s “mean” happiness, these states must be cardinalized, but there are an infinite number of arbitrary cardinalizations, each producing a different set of means. The ranking of the means remains the same for all cardinalizations only if the distribution of happiness states for one group first-order stochastically dominates that for the other.”
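A small numerical illustration of this argument, under simplifying assumptions: it works with made-up response shares in three observed categories rather than with the latent happiness states the authors discuss, and both cardinalizations are arbitrary values I picked. Neither group first-order stochastically dominates the other, and the ranking of the means flips with the cardinalization.

```python
def mean_happiness(shares, values):
    """Group mean under a given cardinalization (one numeric value per category)."""
    return sum(share * value for share, value in zip(shares, values))

def cdf(shares):
    """Cumulative shares, from the lowest category upward."""
    total, out = 0.0, []
    for share in shares:
        total += share
        out.append(total)
    return out

def first_order_dominates(a, b, tol=1e-12):
    """True if distribution a first-order stochastically dominates distribution b."""
    return all(fa <= fb + tol for fa, fb in zip(cdf(a), cdf(b)))

# Hypothetical shares answering: not too happy, pretty happy, very happy.
group_a = [0.3, 0.3, 0.4]
group_b = [0.1, 0.8, 0.1]

# Two increasing (order-preserving) cardinalizations of the same three categories.
linear = [0, 1, 2]
concave = [0, 1.8, 2]

for name, values in [("linear", linear), ("concave", concave)]:
    a, b = mean_happiness(group_a, values), mean_happiness(group_b, values)
    print(f"{name}: mean A = {a:.2f}, mean B = {b:.2f}, happier group = {'A' if a > b else 'B'}")

print("A dominates B:", first_order_dominates(group_a, group_b))
print("B dominates A:", first_order_dominates(group_b, group_a))
```

Under the linear values group A has the higher mean; under the concave values group B does, even though the category order never changed.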
“…we do not observe the actual distribution of states. We instead observe their distribution in a small number of discrete categories…Without additional assumptions we cannot rank the average happiness of two groups if each has responses in the highest and lowest category. Using observed covariates to achieve full nonparametric identification of the latent happiness distributions would require making assumptions that happiness researchers generally claim to reject.”
“We are therefore forced to follow the standard approach and assume the latent distributions are from a common unbounded location-scale family (e.g., an ordered probit). If we do, it is (almost) impossible to get stochastic dominance, and the conclusion is therefore not robust to simple monotonic transformations of the scale.”
“…we outlined the conditions under which the rank order of happiness for groups can be identified using categorical data on subjective well-being. We now put these into practice for nine key results from the happiness literature: the Easterlin (1973, 1974) paradox for the United States, whether happiness is U-shaped in age, the optimal policy trade-off between inflation and unemployment, rankings of countries by happiness, whether the Moving to Opportunity program increased happiness, whether marriage increases happiness, whether children decrease happiness, the relative decline of female happiness in the United States, and whether disabilities decrease happiness.”
“Table 1 summarizes the results.”
“None of these results are identified nonparametrically. Moreover, in the eight cases for which we can test for equality of variances under a parametric normal assumption, we reject equality. Thus we never have rank-order identification and can always reverse the standard conclusion by instead assuming a left-skewed or right-skewed lognormal.”
“…if researchers wanted to draw any conclusions from these data…they would have to argue that it is appropriate to inform policy based on one arbitrary cardinalization of happiness but not on another…”
“Researchers who wish to continue to interpret such questions more broadly need to…justify their particular cardinalization or parametric assumption…At a bare minimum, we would require a functional form assumption that survived the joint test of the parametric functional form and common reporting function across groups. Certainly calls to replace GDP with measures of national happiness are premature.”
To read the full article, click here. (NOTE: Article is behind a paywall.)
[Excerpts taken from the working paper “Replicator Degrees of Freedom Allow Publication of Misleading ‘Failures to Replicate’” by Christopher Bryan, David Yeager, and Joseph O’Brien, posted at SSRN]
“…using data from an ongoing debate, we show that commonly-exercised flexibility at the experimental design and data analysis stages of replication testing can make it appear that a finding was not replicated when, in fact, it was.”
“The present analysis is important, in part, because it provides the sort of direct demonstration that has the potential to spur change.”
“We focus here on the debate about whether a subtle manipulation of language—referring to voting in an upcoming election with a predicate noun (e.g., “to be a voter”) vs. a verb (e.g., “to vote”)—can increase voter turnout.”
“A preliminary analysis of the data from just the day before the election revealed that many of the most obvious model specifications yielded significant replications of the original noun-vs.-verb effect.”
“A closer examination of the analyses reported by Gerber and colleagues (35) in support of their claim of non-replication revealed that the replicating authors chose to include three features…that in combination are known to increase the risk of misleading results.”
“…study results often hinge on data analytic decisions about which reasonable and competent researchers can disagree…we employed an analytical approach that is expressly designed to provide a comprehensive assessment of whether study data support an empirical conclusion when the influence of arbitrary researcher decisions on results is minimized.”
“The primary statistical approach we employ, called “Specification-Curve Analysis,” involves running all reasonable model specifications (i.e., ones that are consistent with the relevant hypothesis, expected to be statistically valid, and are not redundant with other specifications in the set)…”
“An associated significance test for the specification curve, called a “permutation test,” quantifies how strongly the specification curve as a whole (i.e., all reasonable model specifications, taken together) supports rejecting the null hypothesis.”
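The general shape of the procedure can be sketched on simulated data. Everything below is an assumption made for illustration: the data are synthetic, the four "specifications" are simple sample-inclusion rules with a difference in means as the estimator, and the test statistic is the median estimate across specifications. It is not the working paper's analysis, whose specifications and statistics are not given in the excerpt.

```python
import random
import statistics

def diff_in_means(rows, keep):
    """Treated-minus-control mean outcome among rows that pass the filter `keep`."""
    treated = [r["y"] for r in rows if keep(r) and r["t"] == 1]
    control = [r["y"] for r in rows if keep(r) and r["t"] == 0]
    return statistics.mean(treated) - statistics.mean(control)

# Hypothetical set of specifications: each one is a sample-inclusion rule.
SPECS = {
    "all": lambda r: True,
    "under_40": lambda r: r["age"] < 40,
    "40_plus": lambda r: r["age"] >= 40,
    "prior_voters": lambda r: r["voted_before"] == 1,
}

def specification_curve(rows):
    """Effect estimates from every specification, sorted from smallest to largest."""
    return sorted(diff_in_means(rows, keep) for keep in SPECS.values())

def permutation_p_value(rows, n_perm=200, seed=0):
    """Shuffle treatment labels and ask how often the median estimate is as extreme."""
    rng = random.Random(seed)
    observed = statistics.median(specification_curve(rows))
    labels = [r["t"] for r in rows]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        shuffled = [dict(r, t=t) for r, t in zip(rows, labels)]
        if abs(statistics.median(specification_curve(shuffled))) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

def simulate(n=400, effect=1.0, seed=1):
    """Synthetic data with a true treatment effect built in (illustration only)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        t = rng.randint(0, 1)
        rows.append({"t": t, "age": rng.randint(18, 80),
                     "voted_before": rng.randint(0, 1),
                     "y": effect * t + rng.gauss(0, 1)})
    return rows

rows = simulate()
observed, p_value = permutation_p_value(rows)
print("specification curve:", [round(e, 2) for e in specification_curve(rows)])
print(f"median estimate = {observed:.2f}, permutation p = {p_value:.3f}")
```

The point of the design is that the inference rests on the whole set of estimates, so no single analytic choice decides whether the result counts as replicated.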
“The results … make clear that noun wording had a significant effect on turnout overall…But the specification curve results also strongly suggest that the replicating authors’ data analysis choices might not be the only replicator degree of freedom influencing results. Rather, a design-stage degree of freedom exercised by the replicating authors, regarding the window of time in which the study was conducted, may have further driven the treatment effect estimate downward…”
“Perhaps the clearest, most concrete implication of the present analysis is that specification-curve analysis should be standard practice in replication testing.”
[Excerpts taken from an email sent out by the American Economic Association to its members on July 16, 2019]
“On July 10, 2019, the Association adopted an updated Data and Code Availability Policy, which can be found at https://www.aeaweb.org/journals/policies/data-code. The goal of the new policy is to improve the reproducibility and transparency of materials supporting research published in the AEA journals.”
“What’s new in the policy?”
“A central role for the AEA Data Editor. The inaugural Data Editor was appointed in January 2018 and will oversee the implementation of the new policy.”
“The Data Editor will regularly ask for the raw data associated with a paper, not just the analysis files, and for all programs that transform raw data into those from which the paper’s results are computed.”
“Replication archives will now be requested prior to acceptance, rather than during the publication process after acceptance…”
“There is a new repository infrastructure, hosted at openICPSR, called the “AEA Data and Code Repository.” Data (where allowed) and code (always required) will be uploaded to the repository and shared with the Data Editor prior to publication.”
“The Data Editor will assess compliance with this policy and will verify the accuracy of the information.”
“Will the Data Editor’s team run authors’ code prior to acceptance? Yes, to the extent that it is feasible. The code will need to produce the reported results, given the data provided. Authors can consult a generic checklist, as well as the template used by the replicating teams.”
“Will code be run even when the data cannot be posted? This was once an exemption, but the Data Editor will now attempt to conduct a reproducibility check of these materials through a third party who has access to the (confidential or restricted) data.”