Replication markets are prediction markets run in conjunction with systematic replication projects. We conducted such markets for the Replication Project: Psychology (RPP), Experimental Economics Replication Project (EERP), Social Science Replication Project (SSRP) and the Many Labs 2 Project (ML2). The participants in these markets trade ‘bets’ on the outcome of replications. Through the pricing of these bets they generate and negotiate quantitative forecasts for the replication results.
This post has three objectives: 1 – Advertise a new replication market project; 2 – Explain why it is useful to run prediction markets on replications; 3 – Discuss caveats with relying on binary interpretations of replication results.
As part of DARPA SCORE, we are currently recruiting participants for a new replication market project. As in our past projects, there is a bunch of studies that are going to be replicated, and we’d love to know how well our participants can forecast the outcome of these replications. There are some important differences from our past markets.
The first major difference is scale: in past replication markets, the number of studies we elicited forecasts for was on the order of 20. For SCORE, we will elicit forecasts for 3,000 studies in total.
Of course, nobody is going to replicate 3,000 studies. Rather, about 100 are selected for replication. We are not informed which ones. We designed our markets to generate forecasts for all 3,000 studies, but only bets for those 100 forecasts that are validated will be paid out.
Second, given the scale, we do the forecasting in monthly rounds over about one year. We will have 10 rounds, each on 300 studies, and 2 types of incentives for our participants.
Each round, prizes are distributed for survey responses to those who are estimated (using a peer assessment method) to be most accurate. In addition, bets in the market are paid out once the 100 replications have been conducted, and the replication outcomes are released. The total prize pool is USD 100,000+. Currently, we have more than 600 active participants. If interested, have a look / sign up at predict.replicationmarkets.com!
Why are we doing this?
We started with prediction markets to see if researchers have an idea about which findings are replicable, and which ones are not; and if prediction markets can aggregate this information into accurate forecasts.
We acknowledge that “an idea about which findings are replicable” is incredibly vague. A more accurate description could involve Bayesian subjective priors on the replicated hypotheses, beliefs on distributions of effect sizes, and considerations about the appropriateness of the instrumentalization. But we’re not there yet.
Our results are encouraging: in our past replication markets, the forecasted probabilities matched the observed outcomes very well. In these projects, we observed the outcomes for (nearly) all forecasted replications. The value of forecasting replications may therefore lie not so much in the forecasts themselves as in the proof-of-principle that the outcome of replications can be forecast.
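One common way to quantify how well probabilistic forecasts fit binary outcomes is the Brier score, the mean squared difference between forecast probability and outcome. The sketch below is only an illustration under that assumption: the function name and the example numbers are hypothetical and are not results from, or the scoring method of, the projects mentioned above.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    Lower is better: 0.0 is a perfect forecast, and always forecasting
    0.5 yields 0.25 regardless of the outcomes.
    """
    if len(probabilities) != len(outcomes):
        raise ValueError("need exactly one outcome per forecast")
    if not probabilities:
        raise ValueError("need at least one forecast")
    squared_errors = [(p - o) ** 2 for p, o in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Invented example: four forecasts and whether each study replicated (1) or not (0).
example_forecasts = [0.9, 0.2, 0.7, 0.4]
example_outcomes = [1, 0, 1, 0]
example_score = brier_score(example_forecasts, example_outcomes)
```

A score well below 0.25 means the forecasts carry more information than an uninformed 50/50 guess on every study.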
In the new SCORE project, this will be different. Approximately 2,900 of the studies will not be selected for replication. For those studies, our forecasts go beyond a proof-of-principle: they will provide valuable pieces of information on the studies’ credibility.
In our replication markets, we typically use a binary criterion to settle the bets: whether the replication result is statistically significant and the effect is in the same direction as in the original study.
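The criterion just described can be sketched in a few lines. This is a minimal illustration, not the settlement code used in the markets: the `Result` type, the function name, and the 0.05 significance threshold are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    effect: float   # signed effect size estimate
    p_value: float  # p-value of the test for that effect


def replication_succeeded(original, replication, alpha=0.05):
    """Binary criterion: significant AND same direction as the original.

    alpha=0.05 is an assumed threshold for this sketch.
    """
    same_direction = original.effect * replication.effect > 0
    significant = replication.p_value < alpha
    return same_direction and significant


# Hypothetical numbers: a smaller effect in the same direction, still significant.
original_study = Result(effect=0.40, p_value=0.01)
replication_study = Result(effect=0.15, p_value=0.03)
bet_pays_yes = replication_succeeded(original_study, replication_study)
```

Note that a replication with a significant effect in the opposite direction, or a same-direction effect that misses the threshold, both count as "not replicated" under this rule.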
The use of such dichotomies has been criticized – have a look at the “dichotomania” thread in Andrew Gelman’s blog – not only for prediction markets, but also for the summaries of large-scale replications (“X out of Y studies replicated”) and for the interpretation of research findings in general (p<0.05 = evidence; p>0.05 = lack of evidence).
Dichotomies are simplifications, and as such entail a loss of information. Publications on large-scale replication projects therefore offer a wealth of additional information, starting from additional binary criteria down to every single replication effect size and how it relates to the original effect.
For prediction markets, the reason for using the above binary criterion is that elicitation for continuous outcomes appears to be much harder than for binary outcomes. We would love to elicit and aggregate the beliefs of our forecasters in all detail and richness, but we haven’t yet figured out how best to do this.
So far, the results we are getting for binary forecasts tend to be more reliable, and therefore we stick to this approach. Among the various options for binary outcomes, we believe that same direction + statistical significance is the best criterion to elicit forecasts for, because this is the most common way researchers judge replications.
Prediction markets might provide a stepping stone out of “dichotomania” in that they encourage dealing with uncertainty in a quantitative way. Rather than providing us with a simple Yes/No, our forecasters use probabilities to quantify and negotiate uncertainty in replication outcomes.
Obviously, there is still a way to go – we are just beginning to explore how best to differentiate between, e.g., a forecaster who believes that an effect exists but is too small to be detected in a replication, and a forecaster who doubts the effect’s existence or the validity of the instrumentalization.
In the meantime, we believe our past prediction markets and the upcoming SCORE project are important as they show that the scientific community has valuable information on the credibility of claims made in scientific publications.
Thomas Pfeiffer is Professor in Computational Biology/Biochemistry at Massey University, New Zealand; and a member of the Professoriate at the New Zealand Institute for Advanced Study. He can be contacted at email@example.com.