For several years now scientists—in at least some disciplines—have been concerned about low rates of replicability. As scientists in those fields, we worry about the development of cumulative knowledge, and about wasted research effort. An additional challenge is to consider end-users (decision and policy makers) and other consumers of our work: what level of trust should they place in the published literature? How might they judge the reliability of the evidence base?
With the latter questions in mind, our research group recently launched ‘The repliCATS project’ (Collaborative Assessments for Trustworthy Science). In the first phase, we’re focussing on eliciting predictions about the likely outcomes of direct replications of 3,000 (empirical, quantitative) research claims in the social and behavioural sciences. A subset of these 3,000 research claims will be replicated by an independent team of researchers, to serve as an evaluation of elicited forecasts.
The repliCATS project forms part of a broader program of replication research covering eight disciplines: business, criminology, economics, education, political science, psychology, public administration, and sociology. The broader program, Systematizing Confidence in Open Research and Evidence (SCORE), is funded by the US Defense Advanced Research Projects Agency (DARPA).
The repliCATS project uses a structured group discussion—rather than a prediction market or a survey—called the IDEA protocol to elicit predictions about replicability.
Working in diverse groups of 5-10, participants first Investigate a research claim, answering three questions: (i) how comprehensible the claim is; (ii) whether the underlying effect described in the research claim seems real or robust; and (iii) the likelihood of a successful direct replication, recorded as a private estimate.
They then join their group, either in a face-to-face meeting or in a remote, online group, to Discuss. Discussions start with the sharing of private estimates, along with the information and reasoning that went into forming those estimates. The purpose of the discussion phase is for the group to share and cross-examine each other's judgements; it is not to form a consensus.
After discussion has run its course, researchers are then invited to update their original Estimates, if they wish, providing what we refer to as a ‘round 2 forecast’. These round 2 forecasts are made privately, and not shared with other group members.
Finally, we will mathematically Aggregate these forecasts. For this project, we are trialling a variety of aggregation methods, ranging from unweighted linear averages to aggregating log odds transformed estimates (see figure below).
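To make the two ends of that range concrete, here is a minimal sketch (not the project's actual code) of an unweighted linear average versus log-odds pooling, using hypothetical forecast values. Averaging on the log-odds scale and transforming back tends to give more extreme aggregates than a simple mean.

```python
import math

def linear_pool(probs):
    """Unweighted linear average of probability forecasts."""
    return sum(probs) / len(probs)

def log_odds_pool(probs, eps=1e-6):
    """Average forecasts on the log-odds (logit) scale, then map back
    to a probability. Clipping avoids log(0) for extreme forecasts."""
    clipped = [min(max(p, eps), 1 - eps) for p in probs]
    mean_logit = sum(math.log(p / (1 - p)) for p in clipped) / len(clipped)
    return 1 / (1 + math.exp(-mean_logit))

# Hypothetical round 2 forecasts from one IDEA group for a single claim
forecasts = [0.6, 0.7, 0.8]
print(round(linear_pool(forecasts), 3))
print(round(log_odds_pool(forecasts), 3))
```

With these values the linear pool gives 0.7, while the log-odds pool lands slightly higher, illustrating how the two methods can diverge even on the same inputs.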
Some previous replication projects have run prediction markets and/or surveys alongside their replication studies. Over time, the forecasts from these have become more accurate, particularly in the case of the social science replication project (of Science and Nature papers). Our project departs from these previous efforts, not only by using a very different method of elicitation, but also in the qualitative information we gather about reasoning, information sharing, and the process of updating beliefs following discussion.
575 claims assessed in our first local IDEA workshop
Earlier this month, we ran our first set of large face-to-face IDEA groups, prior to the Society for Improving Psychological Science (SIPS) conference. 156 researchers joined one of 30 groups, each with a dedicated group facilitator. Over two days, those groups evaluated 575 published research claims (20-25 per group) in business, economics and psychology, making a huge contribution to our understanding of:
– those published claims themselves,
– how participants reason about replication: what information cues and heuristics they use to make such predictions, including what counterpoints make them change their minds, and
– the research community’s overall beliefs about the state of our published evidence base.
We’ve also started to learn about how researchers evaluate claims within their direct field of expertise versus slightly outside that scope. We don’t yet know, or necessarily expect, that there will be differences in accuracy, but there do seem to be differences in approach and subjective levels of confidence.
What happens to those predictions? How accurate were they?
The short answer is that we wait. As discussed above, the repliCATS project is part of a larger program. What happens next is that a subset of those 3,000 claims will be fully replicated by an independent research team, serving as the evaluation criteria for the accuracy of our elicited predictions.
In about a year’s time, we’ll know how accurate those predictions are. (We’re hoping for at least 80% accuracy.) Our 3,000 predictions will also be used to benchmark machine learning algorithms, developed by other (again, independently funded by DARPA) research teams.
Following our first workshop, our repliCATS team now has a few thousand probabilistic predictions, and associated qualitative reasoning data to get stuck into. It’s an overwhelming amount of information, and barely one fifth of what we’ll have this time next year!
Feedback on SIPS workshop
As you’ve probably gathered, the success of our project relies heavily on attracting large numbers of participants interested in assessing research claims. So it was hugely heartening for us that the researchers who joined our SIPS workshop gave us very positive overall feedback about the experience.
They particularly enjoyed the core task of thinking about what factors or features of a study contribute to likely replicability (or not).
Early career researchers in particular also appreciated the chance to see and discuss others' ratings and reasoning, and told us that the workshop has helped build their confidence about writing peer reviews in the future. (In fact, several of us came to the opinion that something like our IDEA protocol would make a good substitute for the current peer review process in some places!)
To hear what others thought, check out Twitter @repliCATS.
In the meantime, we’re ready to deploy our “repliCATS bus” (we’ll come to you, or help you run smaller-scale workshops at your institution), and to offer you the opportunity to join ‘remote IDEA groups’ online.
Fiona Fidler is an Associate Professor at the University of Melbourne with joint appointments in the School of BioSciences and the School of Historical and Philosophical Studies (SHAPS). Fallon Mody is a postdoctoral research fellow in the Department of History and Philosophy of Science at the University of Melbourne.
Note that here we are specifically concerned with trust in ‘the published literature’ and not trust in science more broadly, or in scientists themselves. The published literature is as much created by the publishing industry as it is by scientists and other scholars.
 In this project, a “research claim” has a very specific meaning: it is used to describe a single major finding from a published study – for example, a journal article – as well as details of the methods and results that support this finding.
 In subsequent phases, we’ll be thinking about conceptual replications, generalisability, and other factors that build confidence for end users of research.