NOTE: This entry is based on the article, “There’s More Than One Way to Conduct a Replication Study: Beyond Statistical Significance” (Psychological Methods, 2016, Vol. 21, No. 1, 1-12).
Following a large-scale replication project in economics (Chang & Li, 2015) that successfully replicated only a third of 67 studies, a recent headline boldly reads, “The replication crisis has engulfed economics” (Ortman, 2015). Several fields are suffering from a “crisis of confidence” (Pashler & Wagenmakers, 2012, p. 528), as widely publicized replication projects in psychology and medicine have shown similarly disappointing results (e.g., Open Science Collaboration, 2015; Prinz, Schlange, & Asadullah, 2011). There are certainly a host of factors contributing to the crisis, but there is a silver lining: the recent increase in attention toward replication has allowed researchers to consider various ways in which replication research can be improved. Our article (Anderson & Maxwell, 2016, Psychological Methods) sheds light on one potential way to broaden the effectiveness of replication research.
In our article, we take the perspective that replication has often been narrowly defined. Namely, if a replication study is statistically significant, it is considered successful, whereas if the replication study does not meet the significance threshold, it is considered a failure. However, replication need not be defined only by this significant/nonsignificant distinction. We posit that what constitutes a successful replication can vary based on a researcher’s specific goal. We outline six replication goals and provide details on the statistical analysis for each, noting that these goals are by no means exhaustive.
Deeming a replication successful when the result is statistically significant is indeed merited in a number of situations (Goal 1). For example, consider the case where two competing theories are pitted against each other. In this situation, we argue that it is the direction of the effect, rather than its magnitude, that matters, because the direction is what validates one theory over the other. Significance-based replication can be quite informative in these cases. However, even in this situation, a nonsignificant result should not be taken to mean that the replication was a failure. Researchers who wish to provide evidence that a reported effect is null can consider Goal 2.
In Goal 2, researchers are interested in showing that an effect does not exist. Although some researchers seem to be aware that this is a valid goal, their choice of analysis often only fails to reject the null, which is rather weak evidence for nonreplication. We encourage researchers who would like to show that a claimed effect is null to use an equivalence test or Bayesian methods (e.g., ROPE, Kruschke, 2011; Bayes factors, Rouder & Morey, 2012), both of which can reliably show that an effect is essentially zero, rather than simply that it is not statistically significant.
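To make the idea of an equivalence test concrete, here is a minimal sketch of the two one-sided tests (TOST) procedure for a standardized mean difference. It is an illustration under stated assumptions, not the analysis from the article: the function names, the equivalence margin, and the use of a large-sample normal approximation to the standard error of Cohen's d are all choices made for this example.

```python
from math import sqrt
from statistics import NormalDist


def se_cohens_d(d, n1, n2):
    """Large-sample standard error of a standardized mean difference (assumed approximation)."""
    return sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))


def tost_equivalence(d, n1, n2, margin, alpha=0.05):
    """Two one-sided tests: is the effect demonstrably inside (-margin, +margin)?

    Returns the TOST p-value and whether equivalence is declared at level alpha.
    The margin is a hypothetical "smallest effect of interest" chosen by the researcher.
    """
    se = se_cohens_d(d, n1, n2)
    norm = NormalDist()
    # First null: true effect <= -margin. Small p means d sits well above -margin.
    p_lower = 1 - norm.cdf((d + margin) / se)
    # Second null: true effect >= +margin. Small p means d sits well below +margin.
    p_upper = norm.cdf((d - margin) / se)
    # Equivalence requires rejecting both nulls, so the larger p-value governs.
    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha


if __name__ == "__main__":
    # Hypothetical replication result: d = 0.05 with 200 participants per group.
    print(tost_equivalence(0.05, 200, 200, margin=0.3))
```

Note that the same small effect estimate from a small sample would not support a claim of equivalence, which is exactly why failing to reject the null is weak evidence that an effect is absent.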
Goal 3 involves accurately estimating the magnitude of a claimed effect. Research has shown that effect sizes in published research are upwardly biased (Lane & Dunlap, 1978; Maxwell, 2004), and effect sizes from underpowered studies may have wide confidence intervals. Thus, a replication researcher may have reason to question the reported effect size of a study and desire to obtain a more accurate estimate of the effect. Researchers with this goal in mind can use accuracy in parameter estimation (AIPE; Maxwell, Kelley, & Rausch, 2008) approaches to plan their sample sizes so that a desired degree of precision in the effect size estimate can be achieved. In the analysis phase, we encourage these researchers to report a confidence interval around the replication effect size. Thus, successful replication for Goal 3 is defined by the degree of precision in estimating the effect size.
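As a rough illustration of Goal 3, the sketch below computes a confidence interval around a standardized mean difference and searches for the per-group sample size that brings the interval's half-width under a target. This is a simplified stand-in for AIPE, not the published procedure: it assumes a large-sample normal approximation (formal AIPE methods typically work with noncentral distributions), and the function names and inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def se_cohens_d(d, n1, n2):
    """Large-sample standard error of a standardized mean difference (assumed approximation)."""
    return sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))


def d_confidence_interval(d, n1, n2, level=0.95):
    """Approximate confidence interval around an observed effect size d."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * se_cohens_d(d, n1, n2)
    return d - half_width, d + half_width


def n_per_group_for_half_width(d, target_half_width, level=0.95):
    """Smallest equal group size whose expected CI half-width is at most the target.

    d is the anticipated effect size; the result is only as good as that guess.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    n = 2
    while z * se_cohens_d(d, n, n) > target_half_width:
        n += 1
    return n


if __name__ == "__main__":
    print(d_confidence_interval(0.5, 50, 50))
    print(n_per_group_for_half_width(0.5, target_half_width=0.2))
```

The design choice here is to plan for precision rather than power: the sample size is driven by how narrow the interval should be, not by the chance of crossing a significance threshold.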
Goal 4 involves combining data from a replication study with a published original study, effectively conducting a small meta-analysis on the two studies. Importantly, access to the raw data from the original study is often not necessary. This approach is in keeping with the idea of continuously cumulating meta-analysis (CCMA; Braver, Thoemmes, & Rosenthal, 2014), wherein each new replication can be incorporated into the previous knowledge. Researchers can report a confidence interval around the average (weighted) effect size of the two studies (e.g., Bonett, 2009). This goal begins to correct some of the issues associated with underpowered studies, even when only a single replication study is involved. For example, Braver and colleagues (2014) illustrate a situation in which the p-value combining original and replication studies (p = .016) was smaller than both the original study (p = .033) and the replication study (p = .198), emphasizing the power advantage of this technique.
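The following sketch shows one generic way to combine two studies from summary statistics alone: inverse-variance (fixed-effect) pooling of two standardized mean differences. It is offered as an assumption-laden illustration, not as Bonett's (2009) estimator or the exact CCMA computation; the variance formula is a large-sample approximation and the input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def var_cohens_d(d, n1, n2):
    """Large-sample sampling variance of a standardized mean difference (assumed approximation)."""
    return (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))


def pool_two_studies(d_orig, n1_orig, n2_orig, d_rep, n1_rep, n2_rep, level=0.95):
    """Inverse-variance weighted average of an original and a replication effect size.

    Returns the pooled estimate, its confidence interval, and a two-sided p-value.
    """
    w_orig = 1 / var_cohens_d(d_orig, n1_orig, n2_orig)
    w_rep = 1 / var_cohens_d(d_rep, n1_rep, n2_rep)
    pooled = (w_orig * d_orig + w_rep * d_rep) / (w_orig + w_rep)
    se = sqrt(1 / (w_orig + w_rep))
    norm = NormalDist()
    z_crit = norm.inv_cdf(0.5 + level / 2)
    p_value = 2 * (1 - norm.cdf(abs(pooled) / se))
    return pooled, (pooled - z_crit * se, pooled + z_crit * se), p_value


if __name__ == "__main__":
    # Hypothetical inputs: original d = 0.5, replication d = 0.3, 40 per group in each study.
    print(pool_two_studies(0.5, 40, 40, 0.3, 40, 40))
```

Because the pooled standard error is smaller than that of either study alone, the combined test can be more sensitive than either individual test, which is the power advantage described above.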
In Goal 5, researchers aim to show that a replication effect size is inconsistent with that of the original study. A simple difference in statistical significance is not suited for this goal. In fact, the difference between a statistically significant and nonsignificant finding is not necessarily statistically significant (Gelman & Stern, 2006). Rather, we encourage researchers to consider testing the difference in effect sizes between the two studies, using a confidence interval approach (e.g., Bonett, 2009). Although some authors declare a replication to be a failure when the replication effect size is smaller in magnitude than that reported by the original study, testing the difference in effect sizes for significance is a much more precise indicator of replication success in this situation. Specifically, a nominal difference in effect sizes does not imply that the effects differ statistically (Bonett & Wright, 2007).
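A minimal sketch of the confidence interval approach for Goal 5 follows. It assumes two independent studies, each summarized by a standardized mean difference and its group sizes, and uses a large-sample normal approximation; it illustrates the general idea rather than reproducing Bonett's (2009) exact interval. All names and numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def var_cohens_d(d, n1, n2):
    """Large-sample sampling variance of a standardized mean difference (assumed approximation)."""
    return (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))


def effect_size_difference_ci(d_orig, n1_orig, n2_orig, d_rep, n1_rep, n2_rep, level=0.95):
    """Confidence interval for (original effect - replication effect).

    The studies are independent, so the variances of the two estimates add.
    An interval that excludes zero suggests the effects are inconsistent.
    """
    diff = d_orig - d_rep
    se = sqrt(var_cohens_d(d_orig, n1_orig, n2_orig) + var_cohens_d(d_rep, n1_rep, n2_rep))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff - z * se, diff + z * se


if __name__ == "__main__":
    # Hypothetical case: original d = 0.5, replication d = 0.2, 40 per group in each study.
    print(effect_size_difference_ci(0.5, 40, 40, 0.2, 40, 40))
```

In the hypothetical case shown, the original estimate would be significant on its own and the replication would not, yet the interval for their difference still includes zero, so the data do not establish that the two effects differ.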
Finally, Goal 6 involves showing that a replication effect is consistent with the original effect. In a combination of the recommended analyses for Goals 2 and 5, we recommend conducting an equivalence test on the difference in effect sizes. Authors who declare their replication study successful when the effect size appears similar to the original study could benefit from knowledge of these analyses, as descriptively similar effect sizes may statistically differ.
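One way to sketch an equivalence test on the difference in effect sizes is through a confidence interval: with alpha = .05, equivalence is declared when the 90% interval for the difference lies entirely inside a prespecified margin, which corresponds to running two one-sided tests at the .05 level. The code below is an illustration under assumptions (large-sample normal approximation, a hypothetical margin, hypothetical function names), not the article's exact procedure.

```python
from math import sqrt
from statistics import NormalDist


def var_cohens_d(d, n1, n2):
    """Large-sample sampling variance of a standardized mean difference (assumed approximation)."""
    return (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))


def effects_equivalent(d_orig, n1_orig, n2_orig, d_rep, n1_rep, n2_rep, margin, alpha=0.05):
    """Equivalence test on (original effect - replication effect).

    Builds a (1 - 2 * alpha) confidence interval for the difference and checks
    whether it falls entirely within (-margin, +margin).
    """
    diff = d_orig - d_rep
    se = sqrt(var_cohens_d(d_orig, n1_orig, n2_orig) + var_cohens_d(d_rep, n1_rep, n2_rep))
    z = NormalDist().inv_cdf(1 - alpha)
    lower, upper = diff - z * se, diff + z * se
    return (lower, upper), (-margin < lower and upper < margin)


if __name__ == "__main__":
    # Hypothetical studies with nearly identical estimates but different sample sizes.
    print(effects_equivalent(0.40, 500, 500, 0.38, 500, 500, margin=0.2))
    print(effects_equivalent(0.40, 40, 40, 0.38, 40, 40, margin=0.2))
```

The two calls use the same pair of effect size estimates; only the larger samples yield an interval narrow enough to support a claim of consistency, so descriptive similarity alone does not settle the question.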
We hope that the broader view of replication that we present in our article allows researchers to expand their goals for replication research as well as utilize more precise indicators of replication success and non-success. Although recent replication attempts have painted a grim picture in many fields, we are confident that the recent emphasis on replication will bring about a literature in which readers can be more confident, in economics, psychology, and beyond.
Scott Maxwell is Professor and Matthew A. Fitzsimon Chair in the Department of Psychology at the University of Notre Dame. Samantha Anderson is a PhD student, also in the Department of Psychology at Notre Dame. Correspondence about this blog should be addressed to her at Samantha.F.Anderson.firstname.lastname@example.org.
Bonett, D. G. (2009). Meta-analytic interval estimation for standardized and unstandardized mean differences. Psychological Methods, 14, 225–238. doi:10.1037/a0016619
Bonett, D. G., & Wright, T. A. (2007). Comments and recommendations regarding the hypothesis testing controversy. Journal of Organizational Behavior, 28, 647–659. doi:10.1002/job.448
Braver, S. L., Thoemmes, F. J., & Rosenthal, R. (2014). Continuously cumulating meta-analysis and replicability. Perspectives on Psychological Science, 9, 333–342. doi:10.1177/1745691614529796
Chang, A. C., & Li, P. (2015). Is economics research replicable? Sixty published papers from thirteen journals say “usually not.” Finance and Economics Discussion Series 2015-083. Washington: Board of Governors of the Federal Reserve System. doi:10.17016/FEDS.2015.083
Gelman, A., & Stern, H. (2006). The difference between “significant” and “not significant” is not itself statistically significant. The American Statistician, 60, 328–331. doi:10.1198/000313006X152649
Kruschke, J. K. (2011). Bayesian assessment of null values via parameter estimation and model comparison. Perspectives on Psychological Science, 6, 299–312. doi:10.1177/1745691611406925
Lane, D. M., & Dunlap, W. P. (1978). Estimating effect size: Bias resulting from the significance criterion in editorial decisions. British Journal of Mathematical and Statistical Psychology, 31, 107–112. doi:10.1111/j.2044-8317.1978.tb00578.x
Maxwell, S. E. (2004). The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods, 9, 147–163. doi:10.1037/1082-989X.9.2.147
Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology, 59, 537–563. doi:10.1146/annurev.psych.59.103006.093735
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349, aac4716. doi:10.1126/science.aac4716
Pashler, H., & Wagenmakers, E. J. (2012). Editors’ introduction to the special section on replicability in psychological science: A crisis of confidence? Perspectives on Psychological Science, 7, 528–530. doi:10.1177/1745691612465253
Prinz, F., Schlange, T., & Asadullah, K. (2011). Believe it or not: How much can we rely on published data on potential drug targets? Nature Reviews Drug Discovery, 10, 712–713.
Rouder, J. N., & Morey, R. D. (2012). Default Bayes factors for model selection in regression. Multivariate Behavioral Research, 47, 877–903. doi:10.1080/00273171.2012.734737