“What is a successful replication? My students and I wanted to have a clear guide with examples, but couldn’t find a clear, straightforward article. Here’s a suggested table. Please help: what’s wrong/inaccurate or missing? Any other simple criteria to add?”
[This post is based on the report, “The Irreproducibility Crisis of Modern Science: Causes, Consequences and the Road to Reform”, recently published by the National Association of Scholars]
For more than a decade, and especially since the publication of a famous 2005 article by John Ioannidis, scientists in various fields have been concerned with the problems posed by the replication crisis. The importance of the crisis demands that it be understood by a larger audience of educators, policymakers, and ordinary citizens. To this end, our new report, The Irreproducibility Crisis of Modern Science, outlines the nature, causes, and significance of the crisis, and offers a series of proposals for confronting it.
At its most basic level, the crisis arises from the widespread use of statistical methods that inevitably produce some false positives. Misuse of these methods easily increases the number of false positives, leading to the publication of many spurious findings of statistical significance. “P-hacking” (running repeated statistical tests until a finding of significance emerges) is probably the most common abuse of statistical methods, but inadequate specification of hypotheses and the tendentious construction of datasets are also serious problems. (Gelman and Loken 2014 provide several good examples of how easily these latter faults can vitiate research findings.)
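The “p-hacking” described above can be simulated directly. The sketch below is illustrative only (plain Python, a normal approximation to the t-test, made-up sample sizes; none of it comes from the report): it generates data with no true effect, then compares an honest fixed-sample test against a “peeking” strategy that keeps adding subjects and re-testing until significance emerges.

```python
import math
import random

def t_test_p(sample):
    # Two-sided one-sample test of mean = 0, using a normal
    # approximation to the t distribution (rough but adequate here).
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    t = mean / math.sqrt(var / n)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))

def run_experiment(rng, peek=False):
    # Data are generated under the null hypothesis: there is no real effect,
    # so any "significant" result is a false positive.
    sample = [rng.gauss(0, 1) for _ in range(10)]
    if not peek:
        return t_test_p(sample) < 0.05
    # "p-hacking" via optional stopping: re-test after every added
    # subject and stop as soon as p < .05, giving up at n = 50.
    while len(sample) < 50:
        if t_test_p(sample) < 0.05:
            return True
        sample.append(rng.gauss(0, 1))
    return t_test_p(sample) < 0.05

rng = random.Random(42)
trials = 2000
honest = sum(run_experiment(rng) for _ in range(trials)) / trials
hacked = sum(run_experiment(rng, peek=True) for _ in range(trials)) / trials
print(f"honest false-positive rate:   {honest:.3f}")
print(f"p-hacked false-positive rate: {hacked:.3f}")
```

The honest rate lands in the vicinity of the nominal significance level, while the optional-stopping rate is several times larger, even though every dataset was generated with no effect at all.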
Methodological errors and abuses are enabled by too much researcher freedom and too little openness about data and procedures. Researchers’ unlimited freedom in specifying their research designs—and especially their freedom to change their research plans in mid-course—makes it possible to conjure statistical significance even for obviously nonsensical hypotheses (Simmons, Nelson, and Simonsohn 2011 provide a classic demonstration of this). At the same time, lack of outside access to researchers’ data and procedures prevents other experts from identifying problems in experimental design.
Other factors in the irreproducibility crisis exist at the institutional level. Academia and the media create powerful incentives for researchers to advance their careers by publishing new and exciting positive results, while inevitable professional and political tendencies toward groupthink prevent challenges to an existing consensus.
The consequences of all these problems are serious. Not only is a lot of money being wasted—in the United States, up to $28 billion annually on irreproducible preclinical research alone (Freedman et al. 2015)—but individuals and policymakers end up making bad decisions on the basis of faulty science. Perhaps the worst casualty is public confidence in science, as people awaken to how many of the findings they hear about in the news can’t actually be trusted.
Fixing the replication crisis will require energetic efforts to address its causes at every level. Many scientists have already taken up the challenge, and institutions like the Center for Open Science and the Meta-Research Innovation Center at Stanford (METRICS), both in the U.S., have been established to improve the reproducibility of research. Some academic journals have changed the ways in which they ask researchers to present their results, and other journals, such as the International Journal for Re-Views in Empirical Economics, have been created specifically to push back against publication bias by publishing negative results and replication studies. National and international organizations, including the World Health Organization, have begun delineating more stringent research standards.
But much more remains to be done. In an effort to spark an urgently needed public conversation on how to solve the reproducibility crisis, our report offers a series of forty recommendations. At the level of statistics, researchers should cease to regard p-values as dispositive measures of evidence for or against a particular hypothesis, and should try to present their data in ways that avoid a simple either/or determination of statistical significance. Researchers should also pre-register their research procedures and make their methods and data publicly available upon publication of their results. There should also be more experimentation with “born-open” data—data archived in an open-access repository at the moment of its creation, and automatically time-stamped.
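One way to avoid the either/or verdict criticized above is to report an effect estimate with an uncertainty interval instead of a bare “significant / not significant” label. A minimal sketch (the data are simulated and the normal-approximation 95% interval is a simplification; nothing here comes from the report):

```python
import math
import random

# Hypothetical measurements; in a real study these would be experimental data.
random.seed(0)
data = [random.gauss(0.3, 1.0) for _ in range(40)]

n = len(data)
mean = sum(data) / n
se = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1) / n)

# Report the estimate and a ~95% interval (1.96 from the normal
# distribution) rather than a binary significance verdict.
low, high = mean - 1.96 * se, mean + 1.96 * se
print(f"estimated effect: {mean:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

A reader of such a report sees both the size of the estimated effect and how precisely it was measured, which a lone p-value conceals.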
Given the importance of statistics in modern science, we need better education at all levels to ensure that everyone—future researchers, journalists, legal professionals, policymakers and ordinary citizens—is well-acquainted with the fundamentals of statistical thinking, including the limits to the certainty that statistical methods can provide. Courses in probability and statistics should be part of all secondary school and university curricula, and graduate programs in disciplines that rely heavily on statistics should take care to emphasize the ways in which researchers can misunderstand and misuse statistical concepts and techniques.
Professional incentives have to change too. Universities judging applications for tenure and promotion should look beyond the number of scholars’ publications, giving due weight to the value of replication studies and expecting adherence to strict standards of reproducibility. Journals should make their peer review processes more transparent, and should experiment with guaranteeing publication for research with pre-registered, peer-reviewed hypotheses and procedures. To combat groupthink, scientific disciplines should ask committees of extradisciplinary professionals to evaluate the openness of their fields.
Private philanthropy, government, and scientific industry should encourage all these efforts through appropriate funding and moral support. Governments also need to consider their role as consumers of science. Many government policies are now made on the basis of scientific findings, and the replication crisis means that those findings demand more careful scrutiny. Governments should take steps to ensure that new regulations which require scientific justification rely solely on research that meets strict standards for reproducibility and openness. They should also review existing regulations and policies to determine which ones may be based on spurious findings.
Solving the replication crisis will require a concerted effort from all sectors of society. But this challenge also represents a great opportunity. As we fight to eliminate opportunities and incentives for bad science, we will be rededicating ourselves to good science and cultivating a deeper public awareness of what good science means. Our report is meant as a step in that direction.
David Randall is Director of Research at the National Association of Scholars (NAS). Christopher Welser is an NAS Research Associate.
Freedman, Leonard P., Iain M. Cockburn, and Timothy S. Simcoe (2015), “The Economics of Reproducibility in Preclinical Research.” PLoS Biology, 13(6), e1002165. doi:10.1371/journal.pbio.1002165
Gelman, Andrew and Eric Loken (2014), “The Statistical Crisis in Science.” American Scientist, 102(6), 460–465.
Ioannidis, John P. A. (2005), “Why Most Published Research Findings Are False.” PLoS Medicine, 2(8), e124. doi:10.1371/journal.pmed.0020124
Simmons, Joseph P., Leif D. Nelson, and Uri Simonsohn (2011), “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant.” Psychological Science, 22(11), 1359–1366.
[From the opinion article, “How Bad is the Government’s Science?” by Peter Wood and David Randall, published at http://www.wsj.com]
“A deeper issue is that the irreproducibility crisis has remained largely invisible to the general public and policy makers. That’s a problem given how often the government relies on supposed scientific findings to inform its decisions. Every year the U.S. adds more laws and regulations that could be based on nothing more than statistical manipulations.”
“All government agencies should review the scientific justifications for their policies and regulations to ensure they meet strict reproducibility standards. The economics research that steers decisions at the Federal Reserve and the Treasury Department needs to be rechecked. The social psychology that informs education policy could be entirely irreproducible. The whole discipline of climate science is a farrago of unreliable statistics, arbitrary research techniques and politicized groupthink.”
“The process of policy-making also needs to be overhauled. Federal agencies that give out research grants should immediately adopt the NIH’s new standards for funding reproducible research. Congress should pass a law—call it the Reproducible Science Reform Act—to ensure that all future regulations are based on similar high standards.”
[From the white paper, “Practical Challenges for Researchers in Data Sharing”, posted at springernature.com]
“In one of the largest surveys of researchers about research data (with over 7,700 respondents), Springer Nature finds widespread data sharing associated with published works and a desire from researchers that their data are discoverable. … 63% of respondents stated that they generally submit data files as supplementary information, deposit the files in a repository, or both. 76% of researchers rated the importance of making their data discoverable highly – with an average rating of 7.3 out of 10 and the most popular rating being 10 out of 10 (25%).”
“The results suggest two areas of focus that could increase the sharing of data amongst researchers, regardless of subject specialism or location:”
–“Increased education and support on good data management for all researchers, but particularly at early stages of researchers’ careers.”
–“Faster, easier routes to optimal ways of sharing data.”
[From the article “A survey on data reproducibility and the effect of publication process on the ethical reporting of laboratory research,” forthcoming in the journal Clinical Cancer Research]
“We developed an anonymous online survey intended for trainees involved in bench research. The survey included questions related to mentoring/career development, research practice, integrity and transparency, and how the pressure to publish, and the publication process itself influence their reporting practices.”
“…39.2% revealed having been pressured by a principal investigator or collaborator to produce “positive” data. 62.8% admitted that the pressure to publish influences the way they report data.”
“… This survey indicates that trainees believe that the pressure to publish impacts honest reporting, mostly emanating from our system of rewards and advancement. The publication process itself impacts faculty and trainees and appears to influence a shift in their ethics from honest reporting (“negative data”) to selective reporting, data falsification, or even fabrication.”
[From the preprint article “Researcher conduct determines data reliability” by Mark Wass, Larry Ray, and Martin Michaelis]
“Our findings demonstrate the need for systematic meta-research on the issue of data reproducibility. A reproducibility crisis is widely recognised among researchers from many different fields. There is no shortage of suggestions on how data reproducibility could be improved, but quantitative data on the subject (including the scale of the problem) are largely missing.”
Andrew Gelman had a great post yesterday that highlights a major issue — a really major issue — with replication. The problem is, there is no commonly accepted definition of what a “replication” is. Even when a definition is provided, there is no commonly accepted standard for how to interpret the results of a replication.
The post consists of a series of email excerpts between the author of an original study (Dan Kahan) and the co-authors of a study that claimed “failure to replicate” his study (Christina Ballarini and Steve Sloman), with occasional commentary from Gelman.
The post goes like this:
— Kahan emails Ballarini and Sloman to dispute their claim that they “failed to replicate” his study.
— Ballarini and Sloman both agree that they should not have said their study “failed to replicate” Kahan’s.
— Kahan asks that they make an effort to publicly correct the record.
— Sloman responds by walking back his concession, implying that the report’s claim stands after all: “I stand by our report even if you didn’t like one of our verbs [replicate].”
— Kahan then writes a paper refuting the claim that Ballarini and Sloman “failed to replicate” his research. (The paper’s title: “Rumors of the ‘Nonreplication’ of the ‘Motivated Numeracy Effect’ are Greatly Exaggerated.”)
Kahan’s conclusion: “This is a case study in how replication can easily go off the rails. The same types of errors people make in non-replicated papers will now be used in replications.”
Alternatively, one could argue this is NOT a case study in how replication can easily go off the rails. Rather, it illustrates that there are no rails.
To read Gelman’s post in its entirety, click here.
[From the website of the Journal of Economic Psychology announcing a special issue on “Replications in Economic Psychology and Behavioral Economics”]
“In this special issue, we aim to contribute to ongoing efforts in both disciplines to test the replicability of important findings, but also to tackle theoretical questions such as how to improve replicability, how to conduct proper replications, and how to decide which studies should be replicated in the first place. As such, we invite both empirical and theoretical contributions. Regarding empirical contributions to the special issue, we invite replications of previous findings relevant for economic psychology and/or behavioral or experimental economics. We invite two formats of replication studies: classic submissions based on already existing data and submissions of registered reports. Likewise, we invite two types of theoretical contributions: classic submissions of full manuscripts as well as brief proposals of planned theoretical contributions.”
[From the article, “Hundreds of Researchers Are Trying to Replicate High-Profile Psychology Studies” by Stephanie M. Lee in Buzzfeed]
“More than 400 psychologists worldwide are teaming up to fight a looming problem in their field: headline-making research that doesn’t hold up.”
“As part of a new network called the Psychological Science Accelerator, the researchers are trying to fix the so-called replication crisis that’s punctured splashy findings, from Diederik Stapel’s fabricated claims that messy environments lead to discrimination, to Brian Wansink’s retracted studies about eating behavior. …”
“So at the Accelerator, scientists will select a handful of influential studies, attempt to redo them, and share their results with the public, whether or not they’re able to reproduce the original finding.”
“The Accelerator grew out of a blog post that Chartier, an associate psychology professor at Ashland University in Ohio, penned last year. It isn’t the first effort of its kind. In 2015, the Center for Open Science’s Reproducibility Project sought to replicate 100 psychology experiments — and reproduced less than half of the original findings.”
“But while the Reproducibility Project sought to provide an overview of the field, the Accelerator is assigning many researchers to verify just a couple studies.”
[From the article “How (and Whether) to Teach Undergraduates About the Replication Crisis in Psychological Science”, recently published by William Chopik, Ryan Bremner, Andrew Defever, and Victor Keller in Teaching of Psychology]
“Over the past 10 years, crises surrounding replication, fraud, and best practices in research methods have dominated discussions in the field of psychology. However, no research exists examining how to communicate these issues to undergraduates and what effect this has on their attitudes toward the field. We developed and validated a 1-hr lecture communicating issues surrounding the replication crisis and current recommendations to increase reproducibility. Pre- and post-lecture surveys suggest that the lecture serves as an excellent pedagogical tool. Following the lecture, students trusted psychological studies slightly less but saw greater similarities between psychology and natural science fields. We discuss challenges for instructors taking the initiative to communicate these issues to undergraduates in an evenhanded way.”