ANDERSON & MAXWELL: There’s More than One Way to Conduct a Replication Study – Six, in Fact

NOTE: This entry is based on the article, “There’s More Than One Way to Conduct a Replication Study: Beyond Statistical Significance” (Psychological Methods, 2016, Vol. 21, No. 1, 1-12)
Following a large-scale replication project in economics (Chang & Li, 2015) that successfully replicated only a third of 67 studies, a recent headline boldly reads, “The replication crisis has engulfed economics” (Ortman, 2015). Several fields are suffering from a “crisis of confidence” (Pashler & Wagenmakers, 2012, p. 528), as widely publicized replication projects in psychology and medicine have shown similarly disappointing results (e.g., Open Science Collaboration, 2015; Prinz, Schlange, & Asadullah, 2011). There are certainly a host of factors contributing to the crisis, but there is a silver lining: the recent increase in attention toward replication has allowed researchers to consider various ways in which replication research can be improved. Our article (Anderson & Maxwell, 2016, Psychological Methods) sheds light on one potential way to broaden the effectiveness of replication research.
In our article, we take the perspective that replication has often been narrowly defined. Namely, if a replication study is statistically significant, it is considered successful, whereas if the replication study does not meet the significance threshold, it is considered a failure. However, replication need not be defined solely by this significant/nonsignificant distinction. We posit that what constitutes a successful replication can vary based on a researcher’s specific goal. We outline six replication goals and provide details on the statistical analysis for each, noting that these goals are by no means exhaustive.
Deeming a replication successful when the result is statistically significant is indeed merited in a number of situations (Goal 1). For example, consider the case where two competing theories are pitted against each other. In this situation, we argue that it is the direction of the effect, rather than its magnitude, that validates one theory over the other. Significance-based replication can be quite informative in these cases. However, even in this situation, a nonsignificant result should not be taken to mean that the replication was a failure. Researchers who wish to show that a reported effect is null can consider Goal 2.
In Goal 2, researchers are interested in showing that an effect does not exist. Although some researchers seem to be aware that this is a valid goal, their chosen analysis often merely fails to reject the null, which is rather weak evidence for nonreplication. We encourage researchers who would like to show that a claimed effect is null to use an equivalence test or Bayesian methods (e.g., ROPE, Kruschke, 2011; Bayes factors, Rouder & Morey, 2012), both of which can reliably show that an effect is essentially zero, rather than simply that it is not statistically significant.
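For readers unfamiliar with equivalence testing, here is a minimal sketch of the two one-sided tests (TOST) procedure for two independent groups. The equivalence bounds of ±0.3 raw units, the sample sizes, and the simulated data are illustrative assumptions on our part, not values from the article.

```python
# Sketch of a two one-sided tests (TOST) equivalence procedure for the
# difference between two independent group means. Equivalence bounds must
# be chosen on substantive grounds; +/-0.3 here is purely illustrative.
import numpy as np
from scipy import stats

def tost_independent(x, y, low, high):
    """Return the TOST p-value: the larger of the two one-sided p-values.
    Equivalence is claimed when this p-value falls below alpha."""
    nx, ny = len(x), len(y)
    diff = np.mean(x) - np.mean(y)
    # pooled-variance standard error of the mean difference
    sp2 = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    se = np.sqrt(sp2 * (1 / nx + 1 / ny))
    df = nx + ny - 2
    p_lower = stats.t.sf((diff - low) / se, df)   # H0: diff <= low
    p_upper = stats.t.cdf((diff - high) / se, df)  # H0: diff >= high
    return max(p_lower, p_upper)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 500)  # simulated "original" condition
y = rng.normal(0.0, 1.0, 500)  # simulated "replication" condition
p = tost_independent(x, y, low=-0.3, high=0.3)
print(f"TOST p-value: {p:.4f}")
```

A TOST p-value below alpha supports the claim that the true difference lies inside the equivalence bounds; a conventional nonsignificant t test supports no such claim.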
Goal 3 involves accurately estimating the magnitude of a claimed effect. Research has shown that effect sizes in published research are upwardly biased (Lane & Dunlap, 1978; Maxwell, 2004), and effect sizes from underpowered studies may have wide confidence intervals. Thus, a replication researcher may have reason to question the reported effect size of a study and desire to obtain a more accurate estimate of the effect. Researchers with this goal in mind can use accuracy in parameter estimation (AIPE; Maxwell, Kelley, & Rausch, 2008) approaches to plan their sample sizes so that a desired degree of precision in the effect size estimate can be achieved. In the analysis phase, we encourage these researchers to report a confidence interval around the replication effect size. Thus, successful replication for Goal 3 is defined by the degree of precision in estimating the effect size.
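As a rough illustration of why precision matters for Goal 3, the sketch below computes a normal-approximation confidence interval around a hypothetical replication effect size, using the common large-sample variance of Cohen's d. AIPE software typically uses exact noncentral-t intervals instead; the inputs (d = 0.40, n = 50 per group) are invented for illustration.

```python
# A rough normal-approximation confidence interval around a standardized
# mean difference (Cohen's d), using the common large-sample variance
# approximation. Exact noncentral-t intervals are preferable in practice.
import math
from scipy import stats

def d_confidence_interval(d, n1, n2, level=0.95):
    # large-sample variance of d for two independent groups
    var_d = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))
    z = stats.norm.ppf(1 - (1 - level) / 2)
    half = z * math.sqrt(var_d)
    return d - half, d + half

lo, hi = d_confidence_interval(d=0.40, n1=50, n2=50)
print(f"95% CI for d = 0.40 with n = 50 per group: [{lo:.2f}, {hi:.2f}]")
```

With only 50 participants per group, the interval runs from roughly 0 to 0.8, exactly the kind of imprecision that AIPE sample-size planning is designed to avoid.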
Goal 4 involves combining data from a replication study with a published original study, effectively conducting a small meta-analysis on the two studies. Importantly, access to the raw data from the original study is often not necessary. This approach is in keeping with the idea of continuously cumulating meta-analysis (CCMA; Braver, Thoemmes, & Rosenthal, 2014), wherein each new replication can be incorporated into the existing body of evidence. Researchers can report a confidence interval around the weighted average effect size of the two studies (e.g., Bonett, 2009). This goal begins to correct some of the issues associated with underpowered studies, even when only a single replication study is involved. For example, Braver and colleagues (2014) illustrate a situation in which the p-value combining original and replication studies (p = .016) was smaller than both the original study (p = .033) and the replication study (p = .198), emphasizing the power advantage of this technique.
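The figures quoted from Braver et al. (2014) can be reproduced with one simple unweighted combination rule, Stouffer's z method applied to the two-tailed p-values. This is offered as an illustrative sketch, not necessarily the exact CCMA computation those authors used.

```python
# Combining an original and a replication study's two-tailed p-values with
# Stouffer's (unweighted) z method, using the values quoted from
# Braver, Thoemmes, & Rosenthal (2014).
import math
from scipy import stats

def stouffer_two_tailed(p_values):
    # convert each two-tailed p to a |z|, average with equal weights,
    # and convert the combined z back to a two-tailed p
    zs = [stats.norm.ppf(1 - p / 2) for p in p_values]
    z_combined = sum(zs) / math.sqrt(len(zs))
    return 2 * stats.norm.sf(z_combined)

p = stouffer_two_tailed([0.033, 0.198])  # original, replication
print(f"combined p = {p:.3f}")
```

With the quoted inputs this yields a combined p of about .016, matching the value reported above and smaller than either input p-value.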
In Goal 5, researchers aim to show that a replication effect size is inconsistent with that of the original study. A simple difference in statistical significance is not suited for this goal. In fact, the difference between a statistically significant and nonsignificant finding is not necessarily statistically significant (Gelman & Stern, 2006). Rather, we encourage researchers to consider testing the difference in effect sizes between the two studies, using a confidence interval approach (e.g., Bonett, 2009). Although some authors declare a replication to be a failure when the replication effect size is smaller in magnitude than that reported by the original study, testing the difference in effect sizes for significance is a much more precise indicator of replication success in this situation. Specifically, a nominal difference in effect sizes does not imply that the effects differ statistically (Bonett & Wright, 2007).
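In the spirit of the interval approach just described, here is a hedged sketch of a normal-approximation confidence interval for the difference between an original and a replication standardized mean difference. Bonett's (2009) exact variance expressions differ somewhat, and the effect sizes and per-group sample sizes below are hypothetical.

```python
# Sketch: normal-approximation confidence interval for the difference
# between an original and a replication Cohen's d, using the common
# large-sample variance of d. Inputs are hypothetical.
import math
from scipy import stats

def var_d(d, n1, n2):
    # large-sample variance of d for two independent groups
    return (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))

def diff_ci(d_orig, n_orig, d_rep, n_rep, level=0.95):
    # per-group sizes assumed equal within each study for simplicity
    v = var_d(d_orig, n_orig, n_orig) + var_d(d_rep, n_rep, n_rep)
    z = stats.norm.ppf(1 - (1 - level) / 2)
    diff = d_orig - d_rep
    half = z * math.sqrt(v)
    return diff - half, diff + half

lo, hi = diff_ci(d_orig=0.60, n_orig=40, d_rep=0.25, n_rep=80)
print(f"95% CI for the difference in d: [{lo:.2f}, {hi:.2f}]")
```

Although the replication d (0.25) is nominally much smaller than the original d (0.60), the interval for the difference includes zero, so the two estimates do not differ significantly, illustrating the point above.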
Finally, Goal 6 involves showing that a replication effect is consistent with the original effect. Combining the recommended analyses for Goals 2 and 5, we suggest conducting an equivalence test on the difference in effect sizes. Authors who declare their replication study successful when the effect size appears similar to the original study could benefit from knowledge of these analyses, as descriptively similar effect sizes may statistically differ.
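Combining the two ideas, a sketch of an equivalence test on the difference between two d values follows. The ±0.2 equivalence bound and the study sizes are invented for illustration, and a normal approximation stands in for any exact method.

```python
# Sketch: TOST equivalence test on the difference between an original and
# a replication Cohen's d (Goal 6), using a normal approximation and the
# common large-sample variance of d. All inputs are hypothetical.
import math
from scipy import stats

def var_d(d, n1, n2):
    return (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))

def tost_d_difference(d_orig, n_orig, d_rep, n_rep, bound=0.2):
    # per-group sizes assumed equal within each study for simplicity
    se = math.sqrt(var_d(d_orig, n_orig, n_orig) + var_d(d_rep, n_rep, n_rep))
    diff = d_orig - d_rep
    p_lower = stats.norm.sf((diff + bound) / se)   # H0: diff <= -bound
    p_upper = stats.norm.cdf((diff - bound) / se)  # H0: diff >= +bound
    return max(p_lower, p_upper)

p = tost_d_difference(d_orig=0.45, n_orig=250, d_rep=0.40, n_rep=250)
print(f"TOST p-value for equivalence of the two effects: {p:.3f}")
```

Here the two effect sizes look descriptively almost identical (0.45 vs. 0.40), yet the TOST p-value of about .12 fails to establish equivalence at alpha = .05, underscoring that similar-looking effects are not automatically statistically equivalent.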
We hope that the broader view of replication that we present in our article allows researchers to expand their goals for replication research as well as utilize more precise indicators of replication success and non-success. Although recent replication attempts have painted a grim picture in many fields, we are confident that the recent emphasis on replication will bring about a literature in which readers can be more confident, in economics, psychology, and beyond.
Scott Maxwell is Professor and Matthew A. Fitzsimon Chair in the Department of Psychology at the University of Notre Dame. Samantha Anderson is a PhD student, also in the Department of Psychology at Notre Dame. Correspondence about this blog should be addressed to her at Samantha.F.Anderson.350@nd.edu.
REFERENCES
Bonett, D. G. (2009). Meta-analytic interval estimation for standardized and unstandardized mean differences. Psychological Methods, 14, 225–238. doi:10.1037/a0016619
Bonett, D. G., & Wright, T. A. (2007). Comments and recommendations regarding the hypothesis testing controversy. Journal of Organizational Behavior, 28, 647–659. doi:10.1002/job.448
Braver, S. L., Thoemmes, F. J., & Rosenthal, R. (2014). Continuously cumulating meta-analysis and replicability. Perspectives on Psychological Science, 9, 333–342. doi:10.1177/1745691614529796
Chang, A. C., & Li, P. (2015). Is economics research replicable? Sixty published papers from thirteen journals say “usually not.” Finance and Economics Discussion Series 2015-083. Washington, DC: Board of Governors of the Federal Reserve System. doi:10.17016/FEDS.2015.083
Gelman, A., & Stern, H. (2006). The difference between “significant” and “not significant” is not itself statistically significant. The American Statistician, 60, 328 –331. doi:10.1198/000313006X152649
Kruschke, J. K. (2011). Bayesian assessment of null values via parameter estimation and model comparison. Perspectives on Psychological Science, 6, 299–312. doi:10.1177/1745691611406925
Lane, D. M., & Dunlap, W. P. (1978). Estimating effect size: Bias resulting from the significance criterion in editorial decisions. British Journal of Mathematical and Statistical Psychology, 31, 107–112. doi:10.1111/j.2044-8317.1978.tb00578.x
Maxwell, S. E. (2004). The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods, 9, 147–163. doi:10.1037/1082-989X.9.2.147
Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology, 59, 537–563. doi:10.1146/annurev.psych.59.103006.093735
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349, aac4716. doi:10.1126/science.aac4716
Ortman, A. (2015, November 2). The replication crisis has engulfed economics. Retrieved from http://theconversation.com/the-replication-crisis-has-engulfed-economics-49202
Pashler, H., & Wagenmakers, E. J. (2012). Editors’ introduction to the special section on replicability in psychological science: A crisis of confidence? Perspectives on Psychological Science, 7, 528–530. doi:10.1177/1745691612465253
Prinz, F., Schlange, T., & Asadullah, K. (2011). Believe it or not: How much can we rely on published data on potential drug targets? Nature Reviews Drug Discovery, 10, 712–713.
Rouder, J. N., & Morey, R. D. (2012). Default Bayes factors for model selection in regression. Multivariate Behavioral Research, 47, 877–903. doi:10.1080/00273171.2012.734737

IN THE NEWS: BBC (February 22, 2017)

[From the article “Most scientists can’t replicate studies by their peers” from the BBC/News/Science&Environment website]  “Science is facing a “reproducibility crisis” where more than two-thirds of researchers have tried and failed to reproduce another scientist’s experiments, research suggests. This is frustrating clinicians and drug developers who want solid foundations of pre-clinical research to build upon.”
“From his lab at the University of Virginia’s Centre for Open Science, immunologist Dr Tim Errington runs The Reproducibility Project, which attempted to repeat the findings reported in five landmark cancer studies. “The idea here is to take a bunch of experiments and to try and do the exact same thing to see if we can get the same results.” You could be forgiven for thinking that should be easy. Experiments are supposed to be replicable. … Sadly nothing, it seems, could be further from the truth.”
To read more, click here.

Got Reproducibility? No? Check out the Project TIER Workshop for Faculty

[From the Project TIER website]  Project TIER offers faculty development workshops to help faculty incorporate reproducibility in their research supervision and teaching. “These Workshops will introduce participants to the TIER protocol for replicable empirical research. They are intended for faculty members interested in teaching their own students to follow this protocol to document the statistical work they do for senior theses, other independent research projects, or papers written for classes. The Workshops are held on the campus of Haverford College, and last one and one-half days. …The next Faculty Development Workshop will take place at Haverford College on March 31-April 1, 2017. The target deadline for applications is March 1.  Applicants who apply by that date will be notified of acceptance decisions by March 3.  Applications received after March 1 will be reviewed on a rolling basis until the workshop capacity (10 participants) is filled.” To learn more, click here.

Kahneman Shows the Right Way In Getting It Wrong

[From the website Retraction Watch] “Although it’s the right thing to do, it’s never easy to admit error — particularly when you’re an extremely high-profile scientist whose work is being dissected publicly. So while it’s not a retraction, we thought this was worth noting: A Nobel Prize-winning researcher has admitted on a blog that he relied on weak studies in a chapter of his bestselling book.” To read more, click here.

To p-value or Not To p-value, That Is The Question

[From the article, “Scientists, fishing for significance, get a meager catch” by Ivan Oransky and Adam Marcus at the website STAT] “If you cast a wide enough net, you’ll find what looks like a prize-winning fish. But you’ll also catch a lot of seaweed, plastic debris, and maybe even a dolphin you didn’t mean to bring in. Such is the dilemma of interpreting scientific results with statistics. The net, in this analogy, is the statistical concept of a “p-value.” And a growing chorus of experts says that scientific research is using too wide a net — and therefore publishing results that turn out to be false.” To read more, click here.

Check It Out: Replications in International Relations

[from Nicole Janz’s blog at Political Science Replication] “Nils Petter Gledisch and I just published a guest blog post about replication in international relations at the OUP blog. The blog is based on new research in the field, which we published as a symposium in International Studies Perspectives. We negotiated with OUP that all seven articles will be free access for a few weeks. Make sure to download all the pdfs before they go behind paywall again.” To read more, click here.

MUELLER-LANGER, FECHER, HARHOFF & WAGNER: What Matters for Replication

NOTE: This entry is based on the paper, “The Economics of Replication”
Replication studies are considered a hallmark of good scientific practice (1). Yet they are treated among researchers as an ideal to be professed but not practised (2, 3). For science policy makers, journal editors and external research funders to design favourable boundary conditions, it is therefore necessary to understand what drives replication.
Using metadata from all articles published in the top-50 economics journals from 1974 to 2014, we investigated how often replication studies are published and which types of journal articles are replicated.
We find that replication is a matter of impact: high-impact articles and articles by authors from leading institutions are more likely to be replicated. We find no empirical evidence for the hypothesis that the lower cost of replication associated with the availability of data and code has a significant effect on the incidence of replication.
We argue that researchers behave highly rationally in terms of the academic reputation economy, as they tend to replicate high-impact research from renowned researchers and institutions, possibly because in this case replications are more likely to be published (4). Our results are in line with previous assumptions that relate replication to impact (3, 5–7). In this regard, private incentives are well aligned with societal interests, since high-impact publications are also the studies that are most likely to influence political and economic decisions as well as the public discourse.
However, the question remains whether sufficient replications are conducted to guarantee the correctness of published findings. While we have no analytical result that would indicate which rate of replication is optimal for a scientific discipline, having less than 0.1% of articles among the top-50 journals in economics being replications strikes us as unreasonably low. In addition, there is no reason to believe that the share of published replication studies should be significantly higher among non-top-50 articles (2). At such rates, we argue, the prospect of being replicated currently poses little threat to researchers.
We also note that we cannot detect any statistically robust effect of data-disclosure policies. Moreover, for 37% of the empirical articles subject to mandatory data disclosure, the data or program code was not available even though the data were not proprietary. This raises concern regarding the enforcement of mandatory data-disclosure policies.
Our results suggest that replication is, at least partly, driven by the replicator’s reputation considerations. The low number of replication studies would thus likely increase if replication received more formal recognition, e.g., through publication in (high-impact) journals or through specific funding. The same holds true for replicated authors, who should receive formal recognition when their results are successfully replicated; this could additionally motivate authors to ensure the replicability of published results. Moreover, considering the costs of replication, a stronger commitment by publishers to the replicability of research, through establishing and enforcing data-availability policies, would lower the barrier for replicators.
Frank Mueller-Langer is Senior Research Fellow at the Max Planck Institute for Innovation and Competition and the Joint Research Center, Seville. Benedikt Fecher is a doctoral student at the German Institute of Economic Research and the Alexander von Humboldt Institute for Internet and Society. Dietmar Harhoff is Director at the Max Planck Institute for Innovation and Competition. Gert G. Wagner is Board Member of the German Institute for Economic Research and Max Planck Fellow at the MPI for Human Development in Berlin. Correspondence about this blog should be directed to Benedikt Fecher at fecher@hiig.de.
References
(1)  B. R. Jasny, G. Chin, L. Chong, S. Vignieri, Again, and Again, and Again … Science. 334, 1225–1225 (2011).
(2)  M. Duvendack, R. W. Palmer-Jones, W. R. Reed, Replications in Economics: A Progress Report. Econ Journal Watch. 12, 164–191 (2015).
(3)  D. S. Hamermesh, Viewpoint: Replication in economics. Canadian Journal of Economics. 40, 715–733 (2007).
(4)  B. Fecher, S. Friesike, M. Hebing, S. Linek, A. Sauermann, A Reputation Economy: Results from an Empirical Survey on Academic Data Sharing. DIW Berlin Discussion Paper. 1454 (2015) (available at http://dx.doi.org/10.2139/ssrn.2568693).
(5)  D. Hamermesh, What is Replication? The Possibly Exemplary Example of Labor Economics (2017), (available at https://www.aeaweb.org/conference/2017/preliminary/2100?sessionType%5Bsession%5D=1&organization_name=&search_terms=replication&day=&time=). 
(6)  J. L. Furman, K. Jensen, F. Murray, Governing Knowledge in the Scientific Community: Exploring the Role of Retractions in Biomedicine. Research Policy. 41, 276–290 (2012).
(7)  W. G. Dewald, J. G. Thursby, R. G. Anderson, Replication in Empirical Economics: The Journal of Money, Credit and Banking Project. The American Economic Review. 76, 587–603 (1986).

Andrew Gelman Asks, Does Criticizing Bad Research Do More Harm Than Good?

In a recent post at his blogsite, Statistical Modeling, Causal inference, and Social Science, Andrew Gelman asks whether his recent criticisms on statistical grounds of a prominent researcher’s experiments on healthy eating are doing more harm than good. The researcher, Brian Wansink, is John Dyson Endowed Chair in the Applied Economics and Management Department at Cornell University. His book, Mindless Eating: Why We Eat More Than We Think (2006), and associated research has been credited with having a positive impact on the eating habits of thousands (millions?) of consumers (see here). The arguments supporting the idea that criticisms, even when correct, can do more harm than good, are not new.  But they do get at the heart of science, and replications, and are worth a good think.  To read Gelman’s blog, click here.

FECHER, WAGNER & FRÄSSDORF: Social Scientists and Replications: Tell Me What You Really Think!

NOTE: This entry is based on the article, “Perceptions and Practices of Replication by Social and Behavioral Scientists: Making Replications a Mandatory Element of Curricula Would Be Useful”
In times of increasing publication rates and specialization of disciplines, it is particularly important for academia to reflect upon measures to safeguard the integrity of research, beyond the classical peer review. Empirical economics especially faces this challenge due to its responsibility towards society, but also because an increasing number of studies have called the reproducibility of findings into question (1–4). A prominent example is Reinhart and Rogoff’s study “Growth in a Time of Debt” on the effectiveness of austerity-based fiscal policies for highly indebted economies (5). The results of the study directly informed policy even though it rested on fundamental miscalculations, as demonstrated by a replication study by Herndon et al. (6).
Replication studies are important because they contribute to the self-correction abilities of the self-referential scientific ecosystem. Moreover, “low cost” replication studies that use the primary investigator’s original dataset seem increasingly feasible, given the pressure from funding agencies and science policy makers to make research data available (7, 8). Nonetheless, replication studies are rarely conducted (9).
To better understand researchers’ views towards replication, we surveyed the perceptions and replication practices of 300 social and behavioral scientists who use data from the German Socio-Economic Panel Study (SOEP), a widely analyzed multi-cohort study of the German population (10).
Eighty-four per cent of the surveyed researchers agree that replications are necessary for improving scientific output, and 71 per cent disagree with the statement that replications are not worthwhile because major mistakes will be found at some point anyway.
Fifty-eight per cent of our respondents had never attempted a replication, despite the fact that SOEP data are easily obtained, well documented, and frequently analyzed. Among respondents who had conducted a replication study in the past, more than half of these replications were carried out during regular coursework, either while teaching a class (13% of all respondents) or while being taught as a student (9%). Twenty per cent of the respondents used a replication of a SOEP article for their own research. Of those who never conducted a replication study, 76% never saw a need to do so, while the rest considered it too time consuming (15%) or lacked sufficient information (9%), whether about the data, the software, or the way the results in the original article were produced (i.e., the scripts).
As for those who did replicate a SOEP article, 84% were able to reproduce the results of the original article (although the results were not always exactly identical to those found by the original authors), while only 16% were not able to do so. When asked about the reason why the results could not be completely replicated, 69% of the respondents stated that the information about details of the analysis in the original article was insufficient.
The situation regarding replications can be regarded as a “tragedy of the commons”: everybody knows that they are useful, but almost everybody counts on others to conduct them. A possible explanation is that conducting replication studies is not worthwhile in the context of the academic reward system, since they are time-consuming and rarely published (9). Previous research has shown that impact considerations already steer replication efforts (11, 12); for instance, researchers target high-impact studies. Nevertheless, the number of replication studies remains strikingly low. Against this background, we argue that more replications would be conducted if they received more formal recognition (e.g., journals could adapt their policies and publish more replication studies (13)). Our results furthermore show that most replication studies are conducted in the context of teaching. In our view, this is a promising detail: to increase the number of replication studies, it seems useful to make replications a mandatory part of curricula and an optional chapter of (cumulative) doctoral theses.
Benedikt Fecher is a doctoral student at the German Institute of Economic Research and the Alexander von Humboldt Institute for Internet and Society. Mathis Fräßdorf is Head of the Department for Scientific Information at Wissenschaftszentrum Berlin für Sozialforschung. Gert Wagner is Professor of Economics at the Berlin University of Technology. Correspondence about this blog should be directed to Benedikt Fecher at fecher@hiig.de.
References
(1) R. G. Anderson, A. Kichkha, Replication versus Meta-Analysis in Economics: Where Do We Stand 30 Years After Dewald, Thursby and Anderson? (2017), (available at https://www.aeaweb.org/conference/2017/preliminary/1530?page=5&per-page=50).
(2) C. F. Camerer et al., Evaluating replicability of laboratory experiments in economics. Science (2016), doi:10.1126/science.aaf0918.
(3) W. G. Dewald, J. G. Thursby, R. G. Anderson, Replication in Empirical Economics: The Journal of Money, Credit and Banking Project. The American Economic Review. 76, 587–603 (1986).
(4) M. Duvendack, R. Jones, R. Reed, What is Meant by “Replication” and Why Does It Encounter Such Resistance in Economics? (2017), (available at https://www.aeaweb.org/conference/2017/preliminary/1530?page=5&per-page=50).
(5) C. Reinhart, K. Rogoff, “Growth in a Time of Debt” (w15639, National Bureau of Economic Research, Cambridge, MA, 2010), (available at http://www.nber.org/papers/w15639.pdf).
(6) T. Herndon, M. Ash, R. Pollin, Does high public debt consistently stifle economic growth? A critique of Reinhart and Rogoff. Cambridge Journal of Economics. 38, 257–279 (2013).
(7) M. McNutt, Reproducibility. Science. 343, 229–229 (2014).
(8) B. Fecher, G. G. Wagner, A research symbiont. Science. 351, 1405–1406 (2016).
(9) C. L. Park, What is the value of replicating other studies? Research Evaluation. 13, 189–195 (2004).
(10) DIW Berlin, Übersicht über das SOEP (2015), (available at http://www.diw.de/deutsch/soep/26628.html).
(11) D. Hamermesh, What is Replication? The Possibly Exemplary Example of Labor Economics (2017), (available at https://www.aeaweb.org/conference/2017/preliminary/2100?sessionType%5Bsession%5D=1&organization_name=&search_terms=replication&day=&time=).
(12) S. Sukhtankar, Replications in Development (2017), (available at https://www.aeaweb.org/conference/2017/preliminary/2100?sessionType%5Bsession%5D=1&organization_name=&search_terms=replication&day=&time=).
(13) J. H. Hoeffler, Replication and Economics Journal Policies (2017), (available at https://www.aeaweb.org/conference/2017/preliminary/1530?page=5&per-page=50).

Some Tools for Checking Reliability in Others’ Research, and Ensuring Reliability in Your Own

[From the blogsite “The Skeptical Scientist” by Tim van der Zee] “Roughly speaking there are two kinds of tools which I will list here. First, it is important to be skeptical of what you read, such as checking reported statistics. Secondly, it is even more important to be skeptical of yourself; with tools like pre-registration you can substantially increase the evidential value of your studies and prevent fooling yourself (and others).”
Tools listed include: QuickCalcs, GRIMMER, StatCheck, G*Power, Bayes Factor calculator, ShinyApps, AsPredicted.org, Open Science Framework, and Registered Reports. To read more, click here.