[Excerpts taken from the article “Laypeople Can Predict Which Social Science Studies Replicate” by Suzanne Hoogeveen, Alexandra Sarafoglou, and Eric-Jan Wagenmakers, posted at PsyArXiv Preprints]
“…we assess the extent to which a finding’s replication success relates to its intuitive plausibility. Each of 27 high-profile social science findings was evaluated by 233 people without a PhD in psychology. Results showed that these laypeople predicted replication success with above-chance performance (i.e., 58%). In addition, when laypeople were informed about the strength of evidence from the original studies, this boosted their prediction performance to 67%.”
“Participants were presented with 27 studies, a subset of the studies included in the Social Sciences Replication Project…and the Many Labs 2 Project.”
“For each study, participants read a short description of the research question, its operationalization, and the key finding…In the Description Only condition, solely the descriptive texts were provided; in the Description Plus Evidence condition, the Bayes factor and its verbal interpretation (e.g., “moderate evidence”) for the original study were added to the descriptions.”
“After the instructions, participants…indicated whether they believed that this study would replicate or not (yes / no), and expressed their confidence in their decision on a slider ranging from 0 to 100.”
“Figure 1 displays participants’ confidence ratings concerning the replicability of each of the 27 included studies, ordered according to the averaged confidence score.”

“Positive ratings reflect confidence in replicability, and negative ratings reflect confidence in non-replicability, with −100 denoting extreme confidence that the effect would fail to replicate. Note that these data are aggregated across the Description Only and the Description Plus Evidence condition.”
“The top ten rows indicate studies for which laypeople showed relatively high agreement that the associated studies would replicate. Out of these ten studies, nine replicated and only one did not (i.e., the study by Anderson, Kraus, Galinsky, & Keltner, 2012; note that light-grey indicates a successful replication, and dark-grey indicates a failed replication).”
“The bottom four rows indicate studies for which laypeople showed relatively high agreement that the associated studies would fail to replicate. Consistent with laypeople’s predictions, none of these four studies replicated.”
“For the remaining 13 studies in the middle rows, the group response was relatively ambiguous, as reflected by a bimodal density that is roughly equally distributed between the negative and positive end of the scale. Out of these 13 studies, five replicated successfully and eight failed to replicate successfully.”
“Overall, Figure 1 provides a compelling demonstration that laypeople are able to predict whether or not high-profile social science findings will replicate successfully.”
“…the relative ordering of laypeople’s confidence in replicability for a given set of studies may provide estimations of the relative probabilities of replication success.”
“If a replicator’s goal is to purge the literature of unreliable effects, he or she may start by conducting replications of the studies for which replication failure is predicted by naive forecasters.”
“Alternatively, if the goal is to clarify the reliability of studies for which replication outcomes are most uncertain, one could select studies for which the distribution of the expected replicability is characterized by a bi-modal shape.”
“As such, prediction surveys may serve as ‘decision surveys’, instrumental in the selection stage of replication research (cf. Dreber et al., 2015). These informed decisions could not only benefit the replicator, but also optimize the distribution of funds and resources for replication projects.”
To read the article, click here.
[Excerpts taken from the article “New Journal Focused on Reproducibility” by Colleen Flaherty, published at insidehighered.com]
“Cambridge University Press is launching a new open-access journal to help address science’s reproducibility issues and glacial peer-review timelines. Experimental Results, announced today, gives researchers a “place to publish valid, standalone experimental results, regardless of whether those results are novel, inconclusive, negative or supplementary to other published work,” according to the press. It will also publish work about attempts to reproduce previously published experiments.”
[Excerpts taken from the article “Artificial Intelligence Confronts a ‘Reproducibility’ Crisis” by Gregory Barber, published at Wired.com]
“A few years ago, Joelle Pineau, a computer science professor at McGill, was helping her students design a new algorithm when they fell into a rut. …Pineau’s students hoped to improve on another lab’s system. But first they had to rebuild it, and their design, for reasons unknown, was falling short of its promised results. Until, that is, the students tried some “creative manipulations” that didn’t appear in the other lab’s paper. Lo and behold, the system began performing as advertised.”
“The lucky break was a symptom of a troubling trend, according to Pineau. Neural networks, the technique that’s given us Go-mastering bots and text generators that craft classical Chinese poetry, are often called black boxes because of the mysteries of how they work. Getting them to perform well can be like an art, involving subtle tweaks that go unreported in publications. The networks also are growing larger and more complex, with huge data sets and massive computing arrays that make replicating and studying those models expensive, if not impossible for all but the best-funded labs.”
“Pineau is trying to change the standards.”
To read the article, click here.
Replication markets are prediction markets run in conjunction with systematic replication projects. We conducted such markets for the Reproducibility Project: Psychology (RPP), the Experimental Economics Replication Project (EERP), the Social Sciences Replication Project (SSRP), and the Many Labs 2 Project (ML2). The participants in these markets trade ‘bets’ on the outcome of replications. Through the pricing of these bets they generate and negotiate quantitative forecasts for the replication results.
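For readers unfamiliar with how such markets turn trades into forecasts, here is a minimal sketch of one standard mechanism: a logarithmic market scoring rule (LMSR) market maker for a binary “will replicate / will not replicate” contract. The mechanism, the liquidity parameter, and the class and function names are illustrative assumptions for this sketch, not a description of the exact setup used in our markets.

```python
import math

class LMSRMarket:
    """Toy logarithmic market scoring rule (LMSR) market maker for a
    binary 'will this study replicate?' contract. Illustrative only."""

    def __init__(self, b=100.0):
        self.b = b            # liquidity parameter (assumed value)
        self.q = [0.0, 0.0]   # outstanding shares for [replicates, does not]

    def _cost(self, q):
        return self.b * math.log(sum(math.exp(qi / self.b) for qi in q))

    def price(self, outcome):
        """Current price of an outcome, readable as the market's probability."""
        exps = [math.exp(qi / self.b) for qi in self.q]
        return exps[outcome] / sum(exps)

    def buy(self, outcome, shares):
        """Cost a trader pays to buy `shares` of `outcome`; moves the price."""
        new_q = list(self.q)
        new_q[outcome] += shares
        cost = self._cost(new_q) - self._cost(self.q)
        self.q = new_q
        return cost

# Example: a trader confident in replication buys 'yes' shares,
# pushing the implied probability of replication upward.
market = LMSRMarket(b=100.0)
print(f"{market.price(0):.2f}")   # 0.50 before any trades
market.buy(outcome=0, shares=50)
print(f"{market.price(0):.2f}")   # ~0.62 after the purchase
```

The point of the sketch is only that the going price of the “replicates” contract can be read directly as the crowd’s probability forecast for that replication.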
This post has three objectives: 1 – Advertise a new replication market project; 2 – Explain why it is useful to run prediction markets on replications; 3 – Discuss caveats about relying on binary interpretations of replication results.
Advertisement
As part of DARPA SCORE, we are currently recruiting participants for a new replication market project. As in our past projects, there is a bunch of studies that are going to be replicated, and we’d love to know how well our participants can forecast the outcomes of these replications. There are, however, some important differences from our past markets.
The first major difference is scale: in past replication markets, the number of studies we elicited forecasts for was on the order of 20. For SCORE, our forecasted studies will total 3,000.
Of course, nobody is going to replicate 3,000 studies. Rather, about 100 are selected for replication. We are not informed which ones. We designed our markets to generate forecasts for all 3,000 studies, but only the bets on the roughly 100 forecasts that end up being validated by an actual replication will be paid out.
Second, given the scale, we do the forecasting in monthly rounds over about one year. We will have 10 rounds of 300 studies each, and two types of incentives for our participants.
Each round, prizes are distributed for survey responses to those who are estimated (using a peer assessment method) to be most accurate. In addition, bets in the market are paid out once the 100 replications have been conducted, and the replication outcomes are released. The total prize pool is USD 100,000+. Currently, we have more than 600 active participants. If interested, have a look / sign up at predict.replicationmarkets.com!
Why are we doing this?
We started with prediction markets to see whether researchers have an idea about which findings are replicable and which ones are not, and whether prediction markets can aggregate this information into accurate forecasts.
We acknowledge that “an idea about which findings are replicable” is incredibly vague. A more precise description could involve Bayesian subjective priors on the replicated hypotheses, beliefs about distributions of effect sizes, and considerations about the appropriateness of the operationalization. But we’re not there yet.
Our results are encouraging: in our past replication markets, the forecasted probabilities corresponded closely to the observed outcomes. In those projects, however, the outcomes of (nearly) all forecasted replications were observed anyway, so the value of forecasting replications may lie not so much in the forecasts themselves as in the proof of principle that the outcome of replications can be forecasted.
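As a concrete, purely illustrative way of reading “corresponded closely to the observed outcomes”: probabilistic forecasts are commonly scored against the binary replication outcomes with the Brier score. The forecasts and outcomes below are invented for the example.

```python
# Minimal sketch: scoring probability forecasts against binary replication
# outcomes with the Brier score. Lower is better; a constant 50/50 forecast
# scores 0.25. The numbers below are invented for illustration only.
forecasts = [0.85, 0.30, 0.60, 0.15, 0.70]   # market-implied P(replicates)
outcomes  = [1,    0,    1,    0,    1   ]   # 1 = replicated, 0 = did not

brier = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)
print(round(brier, 3))  # 0.077 for these made-up numbers
```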
In the new SCORE project, the situation will be different. Rather than just providing a proof of principle, our forecasts will also cover the approximately 2,900 studies that are not selected for replication, and for those studies they will provide valuable information about the studies’ credibility.
Binary interpretations
In our replication markets, we typically use a binary criterion to settle the bets: the replication counts as a success if its result is statistically significant and the effect is in the same direction as in the original study.
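A minimal sketch of how such a settlement criterion could be coded, assuming a two-sided p-value and signed effect estimates are available for the replication (the function name, arguments, and 0.05 threshold are assumptions of the sketch, not a specification of our settlement procedure):

```python
def replication_successful(original_effect, replication_effect,
                           replication_p_value, alpha=0.05):
    """Binary settlement criterion sketch: the replication counts as a
    success if its result is statistically significant (p < alpha) and
    its effect points in the same direction as the original effect."""
    same_direction = (original_effect * replication_effect) > 0
    significant = replication_p_value < alpha
    return same_direction and significant

# Example: original effect +0.40; replication finds +0.15 with p = 0.03.
print(replication_successful(0.40, 0.15, 0.03))  # True
```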
The use of such dichotomies has been criticized – have a look at the “dichotomania” thread in Andrew Gelman’s blog – not only for prediction markets, but also for the summaries of large-scale replications (“X out of Y studies replicated”) and for the interpretation of research findings in general (p<0.05 = evidence; p>0.05 = lack of evidence).
Dichotomies are simplifications, and as such entail a loss of information. Publications on large-scale replication projects therefore offer a wealth of additional information, starting from additional binary criteria down to every single replication effect size and how it relates to the original effect.
For prediction markets, the reason for using the above binary criterion is that eliciting forecasts for continuous outcomes appears to be much harder than for binary outcomes. We would love to elicit and aggregate the beliefs of our forecasters in all their detail and richness, but we haven’t yet figured out how best to do this.
So far, the results we get for binary forecasts tend to be more reliable, and we therefore stick to this approach. Among the various options for binary outcomes, we believe that same direction plus statistical significance is the best criterion to elicit forecasts for, because it is the most common way researchers judge replications.
Prediction markets might provide a stepping stone out of “dichotomania” in that they encourage dealing with uncertainty in a quantitative way. Rather than providing us with a simple Yes/No, our forecasters use probabilities to quantify and negotiate uncertainty in replication outcomes.
Obviously, there is still a way to go – we are just beginning to explore how best to differentiate between, for example, a forecaster who believes that an effect exists but is too small to be detected in a replication, and a forecaster who doubts the effect’s existence or the validity of the operationalization.
In the meantime, we believe our past prediction markets and the upcoming SCORE project are important as they show that the scientific community has valuable information on the credibility of claims made in scientific publications.
Thomas Pfeiffer is Professor in Computational Biology/Biochemistry at Massey University, New Zealand; and a member of the Professoriate at the New Zealand Institute for Advanced Study. He can be contacted at pfeiffer.massey@gmail.com.
[Excerpts taken from the article “P-value Thresholds: Forfeit at Your Peril” by Deborah Mayo, forthcoming in the European Journal of Clinical Investigation]
“A key recognition among those who write on the statistical crisis in science is that the pressure to publish attention-getting articles can incentivize researchers to produce eye-catching but inadequately scrutinized claims.”
“We may see much the same sensationalism in broadcasting metastatistical research, especially if it takes the form of scapegoating or banning statistical significance.”
“A lot of excitement was generated recently when Ron Wasserstein, Executive Director of the American Statistical Association (ASA), and co-editors A. Schirm and N. Lazar, updated the 2016 ASA Statement on P-Values and Statistical Significance (ASA I).”
“In their 2019 interpretation, ASA I “stopped just short of recommending that declarations of ‘statistical significance’ be abandoned,” and in their new statement (ASA II) announced: “We take that step here….Statistically significant – don’t say it and don’t use it”.”
“To herald the ASA II, and the special issue “moving to a world beyond p < 0.05”, the journal Nature requisitioned a commentary from Amrhein, Greenland and McShane “Retire Statistical Significance” (AGM). With over 800 signatories, the commentary received the imposing title “Scientists rise up against significance tests”!”
“Getting past the appeals to popularity and fear, the reasons ASA II and AGM give are that thresholds can lead to well-known fallacies, and even to some howlers more extreme than those long lampooned.”
“Of course it’s true: “a statistically non-significant result does not ‘prove’ the null hypothesis (the hypothesis that there is no difference between groups or no effect of a treatment …). Nor do statistically significant results ‘prove’ some other hypothesis.” (AGM)”
“It is easy to be swept up in their outrage, but the argument: “significance thresholds can be used very badly, therefore remove significance thresholds” is a very bad argument. Moreover, it would remove the very standards we need to call out the fallacies.”
“The danger of removing thresholds on grounds they could be badly used is that they are not there when you need them.”
“Ioannidis zeroes in on the problem: ‘With the gatekeeper of statistical significance, eager investigators whose analyses yield, for example, P = .09 have to either manipulate their statistics to get to P < .05 or add spin to their interpretation to suggest that results point to an important signal through an observed “trend.” When that gate keeper is removed, any result may be directly claimed to reflect an important signal or fit to a preexisting narrative.’”
“ASA II regards its positions “open to debate”. An open debate is very much needed.”
To read the article, click here.
[Excerpts taken from the blog “What development economists talk about when they talk about reproducibility …” by Luiza Andrade, Guadalupe Bedoya, Benjamin Daniels, Maria Jones, and Florence Kondylis, published on the World Bank’s Development Impact blog]
“Can another researcher reuse the same code on the same data and get the same results as a recently published paper?…This question motivated teams from the World Bank, 3ie, BITSS/CEGA, J-PAL, and Innovations for Poverty Action (IPA) to host researchers from the US Census Bureau, the Odum Institute, Dataverse, the AEA, and several universities at the first Transparency, Reproducibility, and Credibility research symposium last Tuesday.”
“…research practitioners discussed their experiences in panels focused on three key topics: defining transparency in research; balancing privacy and openness in data handling; and practical steps forward on credible research. Here we share some of the highlights of the discussions.”
“‘Reproducibility’ and ‘replicability’ are often confused (case in point: a ‘push-button replication’ is in fact a check for computational reproducibility). Prof. Lorena Barba, one of the authors of a 2019 report by the National Academies of Sciences, Engineering, and Medicine, offered the following definitions. Reproducibility means ‘computational’ reproducibility: can you obtain consistent computational results using the exact same input data, computational steps, methods, code, and conditions of analysis? Replicability means obtaining consistent results across separate studies aimed at answering the same scientific question, each of which has obtained its own data.”
“Complete data publication, unlike reproducibility checks, brings along with it a set of serious privacy concerns, particularly when sensitive data is used in key analyses. The group discussed a number of tools developed to help researchers de-identify data (PII_detection from IPA, PII_scan from JPAL, and sdcMicro from the World Bank). But is it ever possible to fully protect privacy in an era of big data?…we need additional institutional investments to create permanent, secure, and interoperable infrastructure to facilitate restricted and limited access to sensitive data.”
“Finally, the group discussed practical steps to advance the agenda of reproducibility, transparency, and credibility in research. Ideas included incorporating transparent practices into academic training at all levels and verifying computational reproducibility of all research outputs within each institution. Pre-publication review, for example, is a practice DIME has instituted department-wide: over the past two years, no working paper has gone out without computational reproducibility being confirmed by the Analytics team.”
“In a particularly relevant presentation, Guadalupe Bedoya from DIME presented data from a recent short survey designed to survey practitioners’ application of classic “best practices” that are fundamental to reproducible research. The team surveyed principal investigators in top development economics organizations based on the IDEAS ranking and received responses from 99 PIs, as well as from 25 DIME research assistants…On a scale from 1 to 5, PIs that responded rate their preparedness to comply with the AEA’s new policy at 3.5.”
“One issue cited by researchers is the high entry cost to changing their workflow to absorb new practices. Many respondents worried that enforcing reproducibility standards would only accentuate the inequality between researchers. The fear articulated by some respondents was that well-established researchers have more access to funds, training and staff (e.g. more research assistants) – all of which lower the entry costs.”
The repliCATS Project is part of a monumental effort to estimate the replicability of 3,000 published social science research claims. Towards that end, it is sponsoring a one-day workshop in Melbourne, Australia on November 6th. Participants will work in small groups of 5-6 to evaluate approximately a dozen social and behavioral research claims. Each group will have its own facilitator to guide them through a structured assessment protocol.
To help defray costs, travel grants of $400 USD will be made available to participants. In addition, those who wish to stay on and attend the following AIMOS conference (November 7-8) will have their registration costs waived.
The project is currently looking for economists to join its team of other social science researchers.
To learn more about the workshop and sign up, click here.
Hope to see you in Melbourne!
[Excerpts taken from the article “No Data in the Void: Values and Distributional Conflicts in Empirical Policy Research and Artificial Intelligence” by Maximilian Kasy, published at econfip.org]
“Decision making based on data…is becoming ever more widespread. Any time such decisions are made, we need to carefully think about the goals we want to achieve, and the policies we might possibly use to achieve them.”
“…Let us now…turn to the debates about publication bias, replicability, and the various reform efforts aimed at making empirical research in the social and life sciences more credible.”
“…what it is that reforms of academic research institutions and norms wish to ultimately achieve: What is the objective function of scientific research and publishing?”
“Consider, as an example, clinical research on new therapies. Suppose that in some hypothetical area of medicine a lot of new therapies, say drugs or surgical methods, are tested in clinical studies.”
“In this hypothetical scenario, which findings should be published? That is, which subset of studies should doctors read? In order to improve medical practice, it would arguably be best to tell doctors about the small subset of new therapies which were successful in clinical trials.”
“If this is the selection rule used for publication, however, published findings are biased upward. Replications of the published clinical trials will systematically find smaller positive effects or even sometimes negative effects.”
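The selection effect described above is straightforward to illustrate by simulation. The sketch below assumes many studies of a modest true effect, each estimated with noise; “publishing” only the statistically significant estimates makes the published average overshoot the true effect, while exact replications of those published studies fall back toward it. All parameter values are invented for the illustration.

```python
import random, statistics

random.seed(1)
TRUE_EFFECT, SE, Z_CRIT = 0.2, 0.15, 1.96   # assumed values for illustration

def estimate():
    """One noisy study-level estimate of the true effect."""
    return random.gauss(TRUE_EFFECT, SE)

# 'Publish' only studies whose estimate clears the significance
# threshold (z = estimate / SE > 1.96).
published = [e for e in (estimate() for _ in range(10_000)) if e / SE > Z_CRIT]

# Exact replications of the published studies: fresh draws, same true effect.
replications = [estimate() for _ in published]

print(round(statistics.mean(published), 2))     # noticeably above 0.2
print(round(statistics.mean(replications), 2))  # back near the true 0.2
```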
“This reasoning suggests that there is a deep tension between relevance (for decision making) and replicability in the design of optimal publication rules.”
“In Frankel and Kasy (2018), Which findings should be published?, we argue that this type of logic holds more generally, in any setting where published research informs decision makers and there is some cost which prevents us from communicating all the data. In any such setting, it is optimal to selectively publish surprising findings.”
“These considerations leave us with the practical question of what to do about the publication system…A possible solution might be based on a functional differentiation of publication outlets”
“There might be a set of top outlets focused on publishing surprising (“relevant”) findings…These outlets would have the role of communicating relevant findings to attention-constrained readers (researchers and decision makers). A key feature of these outlets would be that their results are biased, by virtue of being selected based on surprisingness.”
“There might then be another, wider set of outlets that are not supposed to select on findings… For experimental studies, pre-analysis plans and registered reports (results-blind review) might serve as institutional safeguards to ensure the absence of selectivity by both researchers and journals.”
“Journals that explicitly invite submission of “null results” might be an important part of this tier of outlets. This wider set of outlets would serve as a repository of available vetted research, and would not be subject to the biases induced by the selectivity of top-outlets.”
To read the article, click here.
[Excerpts taken from the blog “Responding to the replication crisis: reflections on Metascience2019” by Dorothy Bishop, published at her blogsite, BishopBlog]
“I’m just back from MetaScience 2019…It is a sign of a successful meeting, I think, if it gets people…raising more general questions about the direction the field is going in, and it is in that spirit I would like to share some of my own thoughts.”
“…Another major concern I had was the widespread reliance on proxy indicators of research quality. One talk that exemplified this was Yang Yang’s presentation on machine intelligence approaches to predicting replicability of studies…implicit in this study was the idea that the results from this exercise could be useful in future in helping us identify, just on the basis of textual analysis, which studies were likely to be replicable.”
“Now, this seems misguided on several levels…Goodhart’s law would kick in: as soon as researchers became aware that there was a formula being used to predict how replicable their research was, they’d write their papers in a way that would maximise their score.”
“One can even imagine whole new companies springing up who would take your low-scoring research paper and, for a price, revise it to get a better score.”
[Excerpts taken from the article “Affirmative citation bias in scientific myth debunking: A three-in-one case study” by Kåre Letrud and Sigbjørn Hernes, published in PLOS One]
“…we perform case studies of the academic reception of three articles critical of the widely cited yet contentious Hawthorne Effect. By consulting papers citing these critical works, we seek to establish whether, and to what degree the accumulated citations are indeed skewed in favor of the Hawthorne Effect, suggesting the existence of an affirmative citation bias.”
“The idea of a Hawthorne Effect originated from studies on workplace behavior at the Western Electric Company’s Hawthorne Plant during the 1920s and 1930s…Surprisingly, both higher and lower lighting levels supposedly led to increased productivity…‘The Hawthorne Effect’ is now an ambiguous and vague, yet widely used, term, primarily associated with an observer effect: subjects altering their behavior when aware of being observed…”
“We based the case studies on articles arguing against the Hawthorne Effect, interpreted as various observer effects. The selection criteria being that they were unequivocally critical of the effect, that their argumentation was substantial, and that they were extensively cited by peer reviewed articles.”
“Case 1: Franke and Kaul 1978: Franke and Kaul perform the first statistical analysis of the Hawthorne Studies data, and draws conclusions ‘different from those heretofore drawn’…17 articles cited Franke and Kaul, while taking a negative stance towards the Hawthorne Effect, and 63 were neutral…197 affirmed the Hawthorne Effect, and of these 189 cited Franke and Kaul as affirming the Hawthorne Effect.”
“Case 2: Jones 1992: Jones performs an analysis of the data from the relay studies, searching for evidence of the Hawthorne Effect (interpreted as the subjects being aware of changes in experimental conditions before or during the experimental period), and finds none. His conclusion: … ‘I must conclude that there is slender or no evidence of a Hawthorne effect…’”
“Consulting these 140 articles we found that 19 were neutral on the matter. 18 cited Jones while criticizing the Hawthorne Effect, whereas 103 affirmed its validity. Of the affirmative articles, 60 cited Jones as affirming the Hawthorne Effect.”
“Case 3: Wickström and Bendix 2000: Citing former reanalyses, Wickström and Bendix…argue that the original Hawthorne Studies did not show adequate evidence of the effect.”
“We were able to retrieve the text of 196 of 198 titles citing Wickström and Bendix published between 2001 and 2018…Merely five articles took a critical stance towards the Hawthorne Effect. 23 were neutral, while 168 affirmed the effect. 155 of these 168 articles cited Wickström and Bendix as affirming the Hawthorne Effect.”
“Out of 197 affirmative citations of Franke and Kaul, 189 cited the critical articles as affirming the Hawthorne Effect. For Jones, the number was 60 of 103, for Wickström and Bendix 155 of 168…When it comes to academic publishing, the affirming articles are dominant on the issue of the Hawthorne Effect, and are likely the major contributors to the forming of the published consensus.”
“The findings not only demonstrate that the three efforts at criticizing the Hawthorne Effect to varying degrees were unsuccessful, but they also suggest that if the intention behind the critiques were to reduce the frequency of affirmations of the claim in the scientific corpus, they may have achieved the very opposite.”
To read the article, click here.