[From the article “Replications on the Rise” by Stuart Buck, posted at Arnold Ventures]
“Once a poorly rewarded scientific value, replication has seen a boom with studies in everything from psychology to dogs.”
“A significant tipping point was the Reproducibility Project in Psychology, which Arnold Ventures funded, and which was carried out by our grantee Center for Open Science. That project organized more than 200 psychology labs around the world to systematically redo 100 experiments published in top psychology journals.”
“Since 2015, we have seen an explosion in similar efforts. Particularly in social-behavioral science, there are now many replication projects that have organized labs around the world to replicate one or more scientific findings.”
– “The Social Sciences Replication Project (coordinated by the Center for Open Science) sought to replicate all 21 social science experiments published in Science and Nature between 2010 and 2015.”
– “The Many Labs series of studies (1 through 5) have all sought to systematically replicate psychology studies with many labs around the world. Many Labs 1 came out in 2014 and replicated 13 psychological findings; Many Labs 2 did the same for another 28 classic and newer findings; Many Labs 3 looked at whether psychological effects studied on college students would vary depending on the time of the semester; and Many Labs 4 and 5 are in progress.”
– “The Psychological Science Accelerator is a “globally distributed network of psychological science laboratories (currently more than 350),” with the goal of coordinating massive psychology studies across diverse settings.”
– “The Many Babies projects (of which there are now three) are coordinating multiple labs to study infant cognition (such as how babies develop a theory of mind or how they react to speech).”
– “In education, Many Numbers is a project to replicate research on how children develop math skills, while Many Classes will study questions such as curriculum across ‘dozens of contexts, spanning a range of courses, institutions, formats, and student populations.'”
– “The Many Smiles Project is organizing 18 labs from 17 countries to re-examine a contested question in psychology about whether people actually become happier when they are tricked into smiling (by being asked to hold a pencil between their teeth).”
– “The “Many” tagline has even reached animal research, including Many Primates (a collaboration to study primate cognition, such as short-term memory, in much larger samples than are typical for the field), and Many Dogs (a collaboration to study dog cognition).”
“More fields – from medicine to sociology to biology – should see the value in large replication projects that systematically take stock of whether research stands the test of time.”
Replication researchers cite inflated effect sizes as a major cause of replication failure. It turns out this is an inevitable consequence of significance testing. The reason is simple. The p-value of a study depends on the observed effect size, with more extreme observed effect sizes giving better p-values; the true effect size plays no role. Significance testing selects studies with good p-values, hence extreme observed effect sizes. This selection bias guarantees that, on average, the observed effect size will inflate the true effect size. (By “inflate” I mean increase the absolute value.) The overestimate is large, 2-3x, under conditions typical of social science research. Possible solutions are to increase the sample size or effect size, or to abandon significance testing.
Figure 1 illustrates the issue using simulated data colored by p-value. The simulation randomly selects true effect sizes, then simulates a two-group difference-of-means study with sample size n=20 for each true effect size. The effect size statistic is the standardized difference, aka Cohen’s d, and p-values are from the t-test. The figure shows a scatter plot of true vs. observed effect size, with blue and red dots depicting nonsignificant and significant studies. P-values are nonsignificant (blue) for observed effect sizes between about -0.64 and 0.64 and improve as the observed effect size grows. The transition from blue to red at ±0.64 is a critical value that sharply separates nonsignificant from significant results. This value depends only on n and is the least extreme significant effect size for a given n.
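The Figure 1 setup is easy to reproduce. Here is a minimal Python sketch under my own assumptions (normal data with unit SD; the seed, study count, and function name are mine), not the post’s actual code:

```python
import numpy as np
from scipy import stats

n = 20  # per-group sample size, as in the post's simulation

def simulate_study(true_d, n, rng):
    """One two-group difference-of-means study: returns observed
    Cohen's d (pooled SD) and the two-sample t-test p-value."""
    group1 = rng.normal(0.0, 1.0, n)
    group2 = rng.normal(true_d, 1.0, n)
    t, p = stats.ttest_ind(group2, group1)  # equal-variance t-test
    d_obs = t * np.sqrt(2.0 / n)            # Cohen's d from t (equal group sizes)
    return d_obs, p

rng = np.random.default_rng(0)
true_ds = rng.uniform(-1.0, 1.0, 2000)       # random true effect sizes
results = [simulate_study(d, n, rng) for d in true_ds]

# The critical effect size separating red from blue depends only on n
d_crit = stats.t.ppf(0.975, df=2 * n - 2) * np.sqrt(2.0 / n)
print(round(d_crit, 2))  # about 0.64 for n=20
```

Plotting observed against true effect size for `results`, colored by whether p < 0.05, recreates the sharp ±0.64 boundary described above.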
Technical note: The sharpness of the boundary is due to the use of Cohen’s d in conjunction with the t-test. This pairing is mathematically natural because both are standardized, meaning both are relative to the sample standard deviation. In fact, Cohen’s d and the t-statistic are essentially the same statistic, related by the identities d = t∙sqrt(2/n) and t = d∙sqrt(n/2) (for my simulation scenario).
The average significant effect size depends on both d and n. I explore this with a simulation that fixes d to a few values of interest, sets n to a range of values, and simulates many studies for each d and n.
From what I read in the blogosphere, the typical true effect size in social science research is d=0.3. Figure 2 shows a histogram of observed effect sizes for d=0.3 and n=20. The significant results are way out on the tails, mostly on the right tail, which means the average will be large. Figure 3 shows the theoretical equivalent of the histogram (the sampling distribution) for the same parameters and two further cases: same d but larger n, and same n but larger d. Increasing n makes the curve sharper and reduces the critical effect size, causing much more of the area to be under the red (significant) part of the curve. Increasing d slides the curve over, again putting more of the area under the red. These changes reduce the average significant effect size, bringing it closer to the true value.
Figure 4 plots the average significant effect size for d between 0.3 and 0.7 and n ranging from 20 to 200. In computing the average, I only use the right tail, reasoning that investigators usually toss results with the wrong sign whether significant or not, as these contradict the authors’ scientific hypothesis. Let’s look first at n=20. For d=0.3 the average is 0.81, an overestimate of 2.7x. A modest increase in effect size helps a lot. For d=0.5 (still “medium” in Cohen’s d vernacular), the average is 0.86, an overestimate of 1.7x. For d=0.7, it’s 0.93, an overestimate of 1.3x. To reduce the overestimate to a reasonable level, say 1.25x, we need n=122 for d=0.3, but only n=47 for d=0.5, and n=26 for d=0.7.
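The Figure 4 numbers can be checked with a quick Monte Carlo simulation. This is my own sketch, not the post’s code; the study count and seed are arbitrary choices:

```python
import numpy as np
from scipy import stats

def avg_significant_d(true_d, n, n_studies=100_000, seed=0):
    """Average observed Cohen's d among significant, right-tail
    (correct-sign) results, for true effect true_d and group size n."""
    rng = np.random.default_rng(seed)
    g1 = rng.normal(0.0, 1.0, (n_studies, n))
    g2 = rng.normal(true_d, 1.0, (n_studies, n))
    t, p = stats.ttest_ind(g2, g1, axis=1)
    d_obs = t * np.sqrt(2.0 / n)
    keep = (p < 0.05) & (d_obs > 0)  # significant and right sign only
    return d_obs[keep].mean()

avg = avg_significant_d(0.3, 20)
print(round(avg, 2), round(avg / 0.3, 1))  # roughly 0.81 and 2.7x
```

Dividing the average by the true d gives the overestimation factor; repeating the call with larger n or larger true_d shrinks it, as Figure 4 shows.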
Significance testing is a biased procedure that overestimates effect size. This is common knowledge among statisticians yet seems to be forgotten in the replication literature and is rarely explained to statistics users. I hope this post will give readers a visual understanding of the problem and under what conditions it may be worrisome. Shravan Vasishth offers another good explanation in his excellent TRN post and related paper.
You can mitigate the bias by increasing sample size or true effect size. There are costs to each. Bigger studies are more expensive. They’re also harder to run and may require more study personnel and study days, which may increase variability and indirectly reduce the effect size. Increasing the effect size typically involves finding study conditions that amplify the phenomenon of interest. This may reduce the ability to generalize from lab to real world. All in all, it’s not clear that the net effect is positive.
A cheaper solution is to abandon significance testing. The entire problem is a consequence of this timeworn statistical method. Looking back at Figure 1, observed effect size tracks true effect size pretty well. There’s uncertainty, of course, but that seems an acceptable tradeoff for gaining unbiased effect size estimates at reasonable cost.
[From the Twitter thread started by @JessieSunPsych]
Jessie Sun (@JessieSunPsych) relayed the following question that was raised at a recent Psychology conference: “At what point can a theory be falsified (e.g., if the effect size is d = .02)? We often just predict the direction of the effect, but do we need to think about the specificity of effect sizes?”
This led to a large number of responses. Daniel Lakens (@lakens) replied by giving three links to works that he has either authored or co-authored, each addressing a piece of the answer.
Among other things, one of these (a blog post) recommends that a researcher should “power your experiment such that you, or someone else, can conduct a similarly-sized experiment and have high power for detecting an interesting difference from your study. We need to stop thinking about studies as if they are one-offs, only to be interpreted once in light of the hypotheses of the original authors. This does not support cumulative science.”
Another addresses how to test a theory (or the claims of a prior paper) that a given parameter takes a range of values. It also encourages researchers to choose alternative hypotheses that would be unlikely to be true unless the theory were correct, so rejection of the null actually means something.
Nicole Janz (University of Nottingham, @PolSciReplicate) has a great set of resources for those looking for an example of a pre-registration assignment for undergraduates (this one for a political science course).
[From the blog “Misinterpreting Tests, P-Values, Confidence Intervals & Power” by Dave Giles, posted at his blogsite, Econometrics Beat]
“Today I was reading a great paper by Greenland et al. (2016) that deals with some common misconceptions and misinterpretations that arise not only with p-values and confidence intervals, but also with statistical tests in general and the “power” of such tests.”
“These comments by the authors in the abstract for their paper set the tone of what’s to follow rather nicely:”
“A key problem is that there are no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof. Instead, correct use and interpretation of these statistics requires an attention to detail which seems to tax the patience of working scientists. This high cognitive demand has led to an epidemic of shortcut definitions and interpretations that are simply wrong, sometimes disastrously so – and yet these misinterpretations dominate much of the scientific literature.”
“The paper then goes through various common interpretations of the four concepts in question, and systematically demolishes them!”
“The paper is extremely readable and informative. Every econometrics student, and most applied econometricians, would benefit from taking a look!”
To read Greenland et al.’s (2016) paper, click here.
[From the working paper, “Open science and modified funding lotteries can impede the natural selection of bad science” By Paul Smaldino, Matthew Turner, and Pablo Contreras Kallens, posted at OSF Preprints]
“…we investigate the influence of three key factors on the natural selection of bad science: publication of negative results, improved peer review, and criteria for funding allocation.”
“…our results indicate that funding agencies have the potential to play an outsized role in the improvement of science by promoting research that passes tests for rigor. Such tests include commitments to open data and open code (which permit closer scrutiny), preregistration and registered reports, and research programs with strong theoretical bases for their hypotheses. Wide-scale adoption of these and similar criteria for funding agencies can, in theory, have substantial long-term effects on reducing the rates of false discoveries.”
“Our results also highlight the contribution of open science practices. Improving peer review and reducing publication bias led to improvements in the replicability of published findings in our simulations. Alone, each of these open science improvements required extremely high levels of implementation to be effective.”
“Fortunately, we also found that the two factors could work in tandem to improve the replicability of the published literature at lower, though still high, levels of efficacy.”
“… in the absence of specific incentives at the funding or hiring level for methodological rigor, open science improvements are probably not sufficient to stop the widespread propagation of inferior research methods, despite the optimism that often surrounds their adoption.”
[From the press release “Can machines determine the credibility of research claims? The Center for Open Science joins a new DARPA program to find out” from the Center for Open Science]
“The Center for Open Science (COS) has been selected to participate in DARPA’s new program Systematizing Confidence in Open Research and Evidence (SCORE) with a 3-year cooperative agreement potentially totaling more than $7.6 million. This program represents an investment by DARPA to assess and improve the credibility of social and behavioral science research.”
“DARPA identifies the purpose of SCORE as “to develop and deploy automated tools to assign ‘confidence scores’ to different SBS research results and claims. Confidence scores are quantitative measures that should enable a DoD consumer of SBS research to understand the degree to which a particular claim or result is likely to be reproducible or replicable.” If successful, consumers of scientific evidence – researchers, funders, policymakers, etc. – would have readily available information about the uncertainty associated with that evidence.”
“COS will coordinate a massive collaboration of researchers from every area of the social and behavioral sciences to conduct replication and reproducibility studies.”
“Researchers interested in potentially joining this program to conduct replication or reproduction studies are encouraged to review the Call for Collaborators.”
[From the blog “Why you shouldn’t say ‘this study is underpowered’” by Richard Morey, posted at Towards Data Science, at Medium.com]
“The first thing to clear up, as I’ve stated above, is that a study or an experiment is not underpowered; rather: A design and test combination can be underpowered for detecting hypothetical effect sizes of interest.”
“Suppose we worked for a candy company and had determined that our new candy would be either green or purple. We’ve been tasked with finding out whether people like green or purple candy better, so we construct an experiment where we give people both and see which one they reach for first. For each person, the answer is either “green” or “purple”. Let’s call θ the probability of picking purple first, so we’re interested in whether θ>.5 (that is, purple is preferred).”
“Suppose we fix our design at N=50 people picking candy colors. We now need a test. … “If 31 or more people pick purple, we’ll say that purple is preferred (i.e., θ>.5)”. We can now draw the power/sensitivity curve for the design and test, given all the potential, hypothetical effect sizes (shown in the figure to the left, as curve “A”).”
“A “power analysis” is simply noting the features of this curve (perhaps along with changing the potential design by increasing N). Look at curve A. If green candies are preferred (θ<.5) we have a very low chance of mistakenly saying that purple candies are preferred (this is good!). If purple is substantially preferred (θ>.7), we have a good chance of correctly saying that purple is preferred (also good!).”
“Now let’s consider another test for this design: “If 26 or more people pick purple, we’ll say that purple is preferred (θ>.5)”. This could be motivated by saying that we’ll claim that purple is truly preferred whenever the data seem to “prefer” purple. This is curve “B” in the figure above. Let’s do a power analysis. If purple is substantially preferred (θ>.7), we are essentially sure to correctly say that purple is preferred (good!). If green candies are preferred, (θ<.5) we could have a high chance (over 40%) of mistakenly saying that purple candies are preferred (this is bad!).”
“A design sensitivity analysis — what is often called a power analysis — is just making sure the sensitivity is low in the region where the “null” is true (in common lingo, “controlling” α), and making sure the power/sensitivity is high where we’d care about it. None of this has anything to do with “estimating” power from previous results, or anything to do with the actually true effect.”
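Morey’s two power curves come straight from exact binomial calculations. A minimal sketch assuming the setup he describes (the function name is mine):

```python
from scipy.stats import binom

N = 50  # fixed design: 50 people each pick a candy color first

def power(theta, threshold, n=N):
    """P(declare 'purple preferred'), i.e. P(X >= threshold)
    when X ~ Binomial(n, theta) counts purple-first picks."""
    return 1.0 - binom.cdf(threshold - 1, n, theta)

# Test A declares purple preferred at 31+ picks; test B at 26+.
print(f"A: at theta=.5 {power(0.5, 31):.3f}, at theta=.7 {power(0.7, 31):.3f}")
print(f"B: at theta=.5 {power(0.5, 26):.3f}, at theta=.7 {power(0.7, 26):.3f}")
```

Under test B the false-positive rate at θ=.5 comes out near 44%, matching the “over 40%” figure in the excerpt, while under test A it stays around 6%.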
[From the working paper “Predicting the Replicability of Social Science Lab Experiments” by Altmejd et al., posted at BITSS Preprints]
“We have 131 direct replications in our dataset. Each can be judged categorically by whether it succeeded or failed, by a pre-announced binary statistical criterion. The degree of replication can also be judged on a continuous numerical scale, by the size of the effect estimated in the replication compared to the size of the effect in the original study.”
“Our method uses machine learning to predict outcomes and identify the characteristics of study-replication pairs that can best explain the observed replication results.”
“We divide the objective features of the original experiment into two classes. The first contains the statistical design properties and outcomes: among these features we have sample size, the effect size and p-value originally measured, and whether a finding is an effect of one variable or an interaction between multiple variables.”
“The second class is the descriptive aspects of the original study which go beyond statistics: these features include how often a published paper has been cited and the number and past success of authors, but also how subjects were compensated.”
“We compare a number of popular machine learning algorithms … and find that a Random Forest (RF) model has the highest performance.”
“Even with our fairly small data set, the model can forecast replication results with substantial accuracy, around 70%.”
“The statistical features (p-value and effect size) of the original experiment are the most predictive. However, the accuracy of the model is also increased by variables such as the nature of the finding (an interaction, compared to a main effect), number of authors, paper length and the lack of performance incentives.”
“Our method could be used in pre- and post-publication assessment, … For example, when a paper is submitted an editorial assistant can code the features of the paper, plug those features into the models, and derive a predicted replication probability. This number could be used as one of many inputs helping editors and reviewers to decide whether a replication should be conducted before the paper is published.”
“Post-publication, the model could be used as an input to decide which previously published experiments should be replicated.”
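As an illustration only, a toy version of such a predictor can be put together with scikit-learn. Everything below is a synthetic stand-in, not the authors’ data, features, or pipeline; the data-generating rule is invented:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_pairs = 131  # same order as the paper's dataset of study-replication pairs

# Synthetic stand-ins for the kinds of features the paper describes
p_value  = rng.uniform(0.0005, 0.05, n_pairs)        # original p-value
effect   = rng.uniform(0.1, 1.0, n_pairs)            # original effect size
n_sample = rng.integers(20, 300, n_pairs).astype(float)
interact = rng.integers(0, 2, n_pairs).astype(float) # 1 = interaction effect
X = np.column_stack([p_value, effect, n_sample, interact])

# Toy ground truth: bigger effects and smaller p-values replicate more often
logit = 3.0 * effect - 60.0 * p_value - 1.0 * interact
y = rng.random(n_pairs) < 1.0 / (1.0 + np.exp(-logit))

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
pred_prob = clf.predict_proba(X[:1])[0, 1]  # predicted replication probability
```

The `pred_prob` output plays the role of the “predicted replication probability” an editorial assistant would derive after coding a submission’s features, per the workflow the excerpt describes.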
[From the article, “Statistical Rituals: The Replication Delusion and How We Got There” by Gerd Gigerenzer, published in Advances in Methods and Practices in Psychological Science]
“The “replication crisis” has been attributed to misguided external incentives gamed by researchers (the strategic-game hypothesis). Here, I want to draw attention to a complementary internal factor, namely, researchers’ widespread faith in a statistical ritual and associated delusions (the statistical-ritual hypothesis).”
“The crucial delusion is that the p value specifies the probability of a successful replication (i.e., 1 – p), which makes replication studies appear to be superfluous. A review of studies with 839 academic psychologists and 991 students shows that the replication delusion existed among 20% of the faculty teaching statistics in psychology, 39% of the professors and lecturers, and 66% of the students.”
“Two further beliefs, the illusion of certainty (e.g., that statistical significance proves that an effect exists) and Bayesian wishful thinking (e.g., that the probability of the alternative hypothesis being true is 1 – p), also make successful replication appear to be certain or almost certain, respectively.”
“In every study reviewed, the majority of researchers (56%–97%) exhibited one or more of these delusions.”
“Whereas the strategic-game hypothesis takes the incentives as given, the statistical-ritual hypothesis provides a deeper explanation of the roots of the replication crisis. Researchers are incentivized to aim for the product of the null ritual, statistical significance, not for goals that are ignored by it, such as high power, replication, and precise competing theories and proofs. The statistical-ritual hypothesis provides the rationale for the very incentives chosen by editors, administrators, and committees. Obtaining significant results became the surrogate for good science.”