
In support of recent efforts by social scientists to address the ‘reproducibility crisis’, the Journal of Development Effectiveness (JDEff) devoted its final issue of 2019 to replication research. Most journals continue to favor new research over replication work and to publish studies whose data and code have not been tested. As editors of the journal, and as the past and current executive directors of the International Initiative for Impact Evaluation (3ie), which hosts it, we felt that such an issue could increase awareness among funders and researchers of how replication strengthens the reliability, rigor and relevance of their investment. It would also help ensure that replication research is acknowledged and appreciated by the larger development community.
The special issue was devoted to the topic of enhancing financial inclusion in developing countries. 3ie, which has championed replication research for many years, worked closely with the Bill and Melinda Gates Foundation’s Financial Services for the Poor (FSP) program to identify the studies, screen the applicants, and quality-assure the replication research in this important area. FSP invests millions of dollars to broaden the reach of low-cost digital financial services for the poor by supporting the most catalytic approaches to financial inclusion, such as developing digital payment systems, advancing gender equality, and supporting national and regional strategies. In doing so, it relies heavily on research evidence, much of which, although heavily cited, has never been replicated.
We hope that these replications can be used appropriately by FSP and other stakeholders to inform future investments in an important part of the development toolkit – expanding financial services to the poor. About 1.7 billion people worldwide are excluded from formal financial services, such as savings, payments, insurance, and credit. In developing economies, nearly one billion people remain outside the formal financial system, and the gender gap in financial inclusion has persisted at 9 percentage points.
Most poor households instead operate almost entirely in a cash economy. They are cut off from potentially stabilizing and uplifting opportunities, such as building credit or getting a loan to start a business. And it is harder for them to weather common financial setbacks, such as a serious illness, a poor harvest, or an economic downturn like the one the world is experiencing now due to the coronavirus epidemic. In fact, just last week, one of us, who had recently moved from Delhi to Manila, was able to help his former Indian housekeeper with a cash transfer only because she had a bank account. It took all of 15 minutes to send her much-needed financial support, something that would have been far more difficult otherwise. Millions of others have no access to such mechanisms.
The JDEff issue replicates six important financial inclusion studies: one on providing banking access to farmers; three that evaluated innovative alternatives to traditional banking, such as using mobile phones or biometric smartcards as payment mechanisms for transfers; and two that examined the effects of different kinds of transfers (cash versus in-kind; conditional versus unconditional) distributed through financial institutions. Importantly, the replications were able to reproduce the principal results of all of the studies. Highlighting this finding is as important as the attention gained through “gotcha” replications that appear to overturn results, as in the “worm wars” of a few years ago.
The replications also uncover useful nuances that the original research may have missed, such as heterogeneous effects.
Research must meet the higher bar of being good enough for decision-making that affects human lives (not merely good enough for publication). Organizations like 3ie, which consider replication an important tool for making research more rigorous, take the following lessons from this JDEff issue for future replication work. One is to ensure that, beyond taking the original research at face value, enough attention is dedicated to results not reported (to avoid reporting bias), to the policy significance of the reported results, to possible rival explanations for the results, and to how the main variables were constructed. Another deficiency in current practice relates to the replication of qualitative research. While there is increasing acceptance among funders and journals that replication of quantitative research is part of best practice, replication in the qualitative research field is nascent. In a new initiative, 3ie is partnering with the Qualitative Data Repository at Syracuse University to archive and share select qualitative data, learn from the experience, and thereby contribute to lessons and guidance on how to do this in the future. Finally, within the evidence architecture, it is worthwhile to promote systematic reviews as a set of replications of studies in differentiated real-life settings.
Marie Gaarder is the current Executive Director of the International Initiative for Impact Evaluation (3ie). Emmanuel Jimenez is a Senior Fellow at and former Executive Director of 3ie, and Editor-in-Chief of the Journal of Development Effectiveness. They can be contacted, respectively, at mgaarder@3ieimpact.org and ejimenez@3ieimpact.org.
[Excerpts are taken from two articles, “The Unfortunately Long Life of Some Retracted Biomedical Research Publications”, by James M. Hagberg, published in the Journal of Applied Physiology; and “Inflated citations and metrics of journals discontinued from Scopus for publication concerns: the GhoS(t)copus Project”, by Andrea Cortegiani et al., posted at BioArXiv]
The Unfortunately Long Life of Some Retracted Biomedical Research Publications
“In 2005 the scientific misconduct case of a noted researcher concluded with, among other things, the retraction of 10 papers. However, these articles continue to be cited at relatively high rates.”
“While it initially appears there was a relative “cleansing”, as citation rates for these articles did decrease after retraction, the reductions in citation rates for these articles (-28%) were the same as those for matched non-retracted publications both by the same author (-28%) and by another investigator (-29%) over the same time frame.”
To read the article, click here. (NOTE: Article is behind a paywall.)
Inflated citations and metrics of journals discontinued from Scopus for publication concerns: the GhoS(t)copus Project
“The journals included in Scopus are periodically re-evaluated to ensure they meet indexing criteria. Afterwards, some journals might be discontinued for publication concerns. Despite their discontinuation, previously published articles remain indexed and continue to be cited.”
“This study aimed (1) to evaluate the main features and citation metrics of journals discontinued from Scopus for publication concerns, before and after discontinuation, and (2) to determine the extent of predatory journals among the discontinued journals.”
“A total of 317 journals were evaluated. The mean number of citations per year after discontinuation was significantly higher than before (median of difference 64 citations, p<0.0001), and so was the number of citations per document (median of difference 0.4 citations, p<0.0001).”
“Twenty-two percent (72/317) of the journals were included in the Cabell blacklist.”
To read the article, click here.
[Excerpts taken from the article “The Stewart Retractions: A Quantitative and Qualitative Analysis”, by Justin Pickett, published in Econ Journal Watch]
“This study analyzes the recent retraction of five articles from three sociology journals—Social Problems, Criminology, and Law & Society Review.”
“The only coauthor on all five retracted articles was Dr. Eric Stewart. He was the data holder and analyst for each article. I coauthored one of the retracted articles…”
“I organize my analysis of the quantitative and qualitative data into three sections: (1) what happened in the articles, (2) what happened among the coauthors, and (3) what happened at the journals. Everything—data, code, emails, text messages, Excel files, drafts, and university documents—needed to verify my claims is provided online.”
What happened in the five articles
— “Consistently incorrect means and standard deviations”
— “Non-uniform terminal-digit distributions”
— “Unverifiable surveys”
— “Identical statistics after changes in…everything else”
— “Inexplicable sample sizes and statistics”
— “Unreported, implausible county clusters”
What happened among the coauthors
“The coauthors of the five retracted articles include two past editors of Criminology, the flagship journal of the American Society of Criminology (ASC), as well as three ASC Fellows and two ASC vice presidents.”
“Two coauthors, Brian Johnson and Eric Baumer, have written articles about the importance of research ethics (e.g., “What Scholars Should Know about ‘Self-Plagiarism’” (Lauritsen et al. 2019); “Salami-Slicing, Peek-a-Boo, and LPUS: Addressing the Problem of Piecemeal Publication” (Gartner et al. 2012)).”
“To my knowledge, none of the coauthors have spoken publicly about what happened in the retracted articles, except to insist in the retraction notices that the irregularities resulted from “coding mistakes” and “transcription errors” (Law & Society Review 2020; Criminology 2020a; b), and to defend the accuracy of the retracted findings (Law & Society Review 2020).”
“Scientific fraud occurs all too frequently—approximately 1 in 50 scientists admit to fabricating or falsifying data (Fanelli 2009)—and I believe it is the most likely explanation for the data irregularities in the five retracted articles…The retraction notices say honest error, not fraud, is the explanation. Fortunately, if that is true, Dr. Stewart could easily prove it: recreate the original sample (N = 1,184) that produces the findings in Johnson et al. (2011) and then publicly explain how he did it.”
“…many authors are reluctant to share data publicly, and sometimes there are legitimate privacy concerns or externally imposed restrictions. Sharing data with coauthors, however, should be uncontroversial and feasible.”
“Yet without institutional support, coauthors may feel uncomfortable requesting data. For example, once irregularities were identified in their articles, Dr. Stewart’s coauthors were reluctant to press him for the data, probably because of concerns related to friendship and loyalty.”
What happened at the journals
“None of the editors followed COPE’s guidelines when alerted to the irregularities in Dr. Stewart’s articles. One editor seemingly tried to coordinate a collective response of ignoring the allegations, even though she recognized their potential seriousness…At Criminology, what seems to have driven how the co-editors responded was sympathy for some of the authors and a low opinion of critics.”
“Connections between the co-editors and authors are likely to blame; Dr. Johnson, the lead author of one article, was a co-editor, and Dr. Stewart, the lead author of the other, was to become a co-editor.”
Conclusion and recommendations
“The Stewart scandal took place over five months, and it required considerable time and effort from the editors involved. The editors corresponded extensively with each other and with other parties…the Criminology co-editors wrote multiple public statements about the steps they were taking to address the problems. Why did they not simply ask Dr. Stewart for his data?”
To read the article, click here.
NOTE FROM TRN: The excerpts above do not do justice to the article. The article should be read!
[Excerpts taken from the transcript to Series 2, Episode 15 of the Big Bang Theory – “The Maternal Capacitance”]
LEONARD’S MOTHER: Leonard, it’s one o’clock, weren’t you going to show me your laboratory at one o’clock?
LEONARD: There’s no hurry, Mother…
LEONARD’S MOTHER: But it’s one o’clock, you were going to show me your laboratory at one o’clock.
SHELDON: Her reasoning is unassailable. It is one o’clock.
LEONARD: Fine. Let’s go. I think you’ll find my work pretty interesting. I’m attempting to replicate the dark matter signal found in sodium iodide crystals by the Italians.
LEONARD’S MOTHER: So, no original research?
LEONARD: No.
LEONARD’S MOTHER: Well, what’s the point of my seeing it? I could just read the paper the Italians wrote.
To read the full transcript, click here.
[Excerpts taken from the article, “What is replication?” by Brian Nosek and Tim Errington, published in PloS Biology]
“Credibility of scientific claims is established with evidence for their replicability using new data. This is distinct from retesting a claim using the same analyses and same data (usually referred to as reproducibility or computational reproducibility) and using the same data with different analyses (usually referred to as robustness).”
“Prior commentators have drawn distinctions between types of replication such as “direct” versus “conceptual” replication and argue in favor of valuing one over the other…By contrast, we argue that distinctions between “direct” and “conceptual” are at least irrelevant and possibly counterproductive for understanding replication and its role in advancing knowledge.”
“We propose an alternative definition for replication that is more inclusive of all research and more relevant for the role of replication in advancing knowledge. Replication is a study for which any outcome would be considered diagnostic evidence about a claim from prior research. This definition reduces emphasis on operational characteristics of the study and increases emphasis on the interpretation of possible outcomes.”
“To be a replication, 2 things must be true: outcomes consistent with a prior claim would increase confidence in the claim, and outcomes inconsistent with a prior claim would decrease confidence in the claim.”
“Because replication is defined based on theoretical expectations, not everyone will agree that one study is a replication of another.”
“Because there is no exact replication, every replication test assesses generalizability to the new study’s unique conditions. However, every generalizability test is not a replication…there are many conditions in which the claim might be supported, but failures would not discredit the original claim.”
“This exposes an inevitable ambiguity in failures-to-replicate. Was the original evidence a false positive or the replication a false negative, or does the replication identify a boundary condition of the claim?”
“We can never know for certain that earlier evidence was a false positive. It is always possible that it was “real,” and we cannot identify or recreate the conditions necessary to replicate successfully…Accumulating failures-to-replicate could result in a much narrower but more precise set of circumstances in which evidence for the claim is replicable, or it may result in failure to ever establish conditions for replicability and relegate the claim to irrelevance.”
“The term “conceptual replication” has been applied to studies that use different methods to test the same question as a prior study. This is a useful research activity for advancing understanding, but many studies with this label are not replications by our definition.”
“Recall that “to be a replication, 2 things must be true: outcomes consistent with a prior claim would increase confidence in the claim, and outcomes inconsistent with a prior claim would decrease confidence in the claim.”
“Many “conceptual replications” meet the first criterion and fail the second…“conceptual replications” are often generalizability tests. Failures are interpreted, at most, as identifying boundary conditions. A self-assessment of whether one is testing replicability or generalizability is answering—would an outcome inconsistent with prior findings cause me to lose confidence in the theoretical claims? If no, then it is a generalizability test.”
To read the article, click here.
[Excerpts taken from the article “When We’re Wrong, It’s Our Responsibility as Scientists to Say So” by Ariella Kristal et al., published in Scientific American.]
“What simple, costless interventions can we use to try to reduce tax fraud? As behavioral scientists, we tried to answer this question using what we already know from psychology: People want to see themselves as good.”
“…we thought that by reminding people of being truthful before reporting their income, they would be more honest. Building on this idea, in 2012, we came up with a seemingly costless simple intervention: Get people to sign a tax or insurance audit form before they reported critical information (versus after, the common business practice).”
“While our original set of studies found that this intervention worked in the lab and in one field experiment, we no longer believe that signing before versus after is a simple costless fix….Seven years and hundreds of citations and media mentions later, we want to update the record.”
“Based on research we recently conducted—with a larger number of people—we found abundant evidence that signing a veracity statement at the beginning of a form does not increase honesty compared to signing at the end.”
“Why are we updating the record? In an attempt to replicate and extend our original findings, three people on our team (Kristal, Whillans and Bazerman) found no evidence for the observed effects across five studies with 4,559 participants.”
“We brought the original team together and reran an identical lab experiment from the original paper (Experiment 1). The only thing we changed was the sample size: we had 20 times more participants per condition. And we found no difference in the amount of cheating between signing at the top of the form and signing at the bottom.”
“This matters because governments worldwide have spent considerable money and time trying to put this intervention into practice with limited success.”
“We also hope that this collaboration serves as a positive example, whereby upon learning that something they had been promoting for nearly a decade may not be true, the original authors confronted the issue directly by running new and more rigorous studies, and the original journal was open to publishing a new peer-reviewed article documenting the correction.”
“We believe that incentives need to continue to change in research, such that researchers are able to publish what they find and that the rigor and usefulness of their results, not their sensationalism, is what is rewarded.”
To read the article, click here.
[From the blog “[81] Data Replicada” by Joe Simmons and Leif Nelson, posted in December at Data Colada]
“With more than mild trepidation, we are introducing a new column called Data Replicada. In this column, we will report the results of exact (or close) preregistered replications of recently published findings.”
“…why does the world need Data Replicada? Allow us to justify our existence.”
“Even though it is much easier than ever to publish exact replications, it is still extremely hard to publish exact replications. We know for a fact that many high-quality replications remain unreported. You can’t publish failures to replicate unless you convince a review team both that the failure is robust and that it is really important. And you can’t publish a successful replication, well, ever…”
“Many published replications focus on studies that were conducted a long time ago, or at least before the field became aware of the perils of p-hacking and the importance of better research practices. We will focus on trying to replicate recently published findings, so as to get a sense for whether non-obvious research published after the “renaissance” is in fact reliable.”
“…replications have not caught on in our subfield of behavioral marketing. So at least at the outset, we intend to focus our efforts on trying to replicate studies recently published in the two top behavioral marketing journals: the Journal of Consumer Research and the Journal of Marketing Research.”
“Conducting replications is a great way to learn about the messy business of scientific discovery, and we hope to communicate what we learn to you.”
“We are hoping to make Data Replicada a regular, monthly feature of Data Colada. But since it will be effortful and expensive, we only feel comfortable committing to doing it for six months or so, at which point we will re-assess.”
To read the blog, click here.
[* EiR = Econometrics in Replications, a feature of TRN that highlights useful econometrics procedures for re-analysing existing research.]
In recent years, DAGs (Directed Acyclic Graphs) have received increased attention in the medical and social sciences as a tool for determining whether causal effects can be estimated. A brief introduction can be found here. While DAGs are commonly used to guide model specification, they can also be used in the post-publication assessment of studies.
Despite widespread recognition of the dangers of drawing causal inferences from observational studies, and despite routine acknowledgement that “correlation does not imply causation”, it is still standard practice for researchers to discuss estimated relationships from observational studies as if they represented causal effects.
In this blog, we show how one can apply DAGs to previously published studies to assess whether implied claims of causal effects are justified. For our example, we use the Mincer earnings regression, which has appeared in hundreds, if not thousands, of economic studies. The associated wage equation relates individuals’ observed wages to a number of personal characteristics:
ln(wage) = b0 + b1 Educ + b2 Exp + b3 Black + b4 Female + error,
where ln(wage) is the natural log of wages, Educ is a measure of years of formal education, Exp is a measure of years of labor market experience, and Black and Female are dummy variables indicating an individual’s race (black) and sex.
The parameters b1 and b2 are commonly interpreted as the rates of return to education and labor market experience, respectively. The coefficients on Black and Female are commonly interpreted as measuring labor market discrimination against blacks and women.
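For readers who want to follow along in code, here is a minimal sketch of how such a specification is typically estimated. The data frame df and its column names (wage, educ, exper, black, female) are placeholders for illustration, not variables from any particular study.

```python
# A minimal sketch of estimating the Mincerian wage regression by OLS.
# Assumes a pandas DataFrame `df` with (placeholder) columns:
# wage, educ, exper, black, female.
import numpy as np
import statsmodels.formula.api as smf

df["lwage"] = np.log(df["wage"])

# ln(wage) = b0 + b1*Educ + b2*Exp + b3*Black + b4*Female + error
fit = smf.ols("lwage ~ educ + exper + black + female", data=df).fit()
print(fit.params)  # b1, b2: returns to education/experience; b3, b4: race/sex gaps
```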
Suppose one came across an estimated Mincer wage regression like the one above in a published study. Suppose further that the author of that study attached causal interpretations to the respective estimated parameters. One could use DAGs to determine whether those interpretations were justified.
To do that, one would first hypothesize a DAG that summarized all the common cause relationships between the variables. By way of illustration, consider the DAG in the figure below, where U is an unobserved confounder.1
[Figure omitted: hypothesized DAG relating Educ, Exp, Black, Female, U, and Wage]
In this DAG, Educ affects Wage through a direct channel, Educ -> Wage, and an indirect channel, Educ -> Exp -> Wage. The Mincerian regression specification captures the first of these channels. However, it omits the second because the inclusion of Exp in the specification blocks the indirect channel. Assuming both channels carry positive associations, the estimated rate of return to education in the Mincerian wage regression will be downwardly biased.
We can use the same DAG to assess the other estimated parameters. Consider the estimated rate of return to labor market experience. The DAG identifies both a direct causal path (Exp -> Wage) and a number of non-causal paths. Exp <- Female -> Wage is one non-causal path, as is Exp <- Educ -> Wage. Including the variables Educ and Female in the regression equation blocks these non-causal paths. As a result, the specification isolates the direct causal effect and thus provides an unbiased estimate of the rate of return to labor market experience.
In a similar fashion, one can show that, given the DAG above, one cannot interpret the estimated values of b3 and b4 as estimates of the causal effects of racial and sex discrimination in the labor market.
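The figure itself is not reproduced here, but the sketch below shows how such a DAG can be interrogated programmatically. The edge list, including where the unobserved confounder U enters, is one plausible configuration consistent with the paths and independences discussed in this post, not necessarily the exact figure. The example uses networkx to check that {Educ, Female} satisfies the back-door criterion for Exp -> Wage, which is what licenses the causal reading of b2.

```python
# A sketch of encoding a hypothesized DAG for the Mincer regression and checking
# the back-door criterion for Exp -> Wage. The edge list is one plausible DAG
# consistent with the paths discussed in the text; it is not necessarily the
# figure from the original post.
import networkx as nx

edges = [
    ("Educ", "Wage"), ("Educ", "Exp"), ("Exp", "Wage"),   # direct and indirect channels
    ("Female", "Exp"), ("Female", "Wage"),                # Exp <- Female -> Wage
    ("Black", "Educ"), ("Black", "Wage"),                 # race affects schooling and wages (assumed)
    ("U", "Black"), ("U", "Wage"),                        # placement of the unobserved confounder (assumed)
]
G = nx.DiGraph(edges)
assert nx.is_directed_acyclic_graph(G)

# Back-door criterion for Exp -> Wage: with Exp's outgoing edges removed,
# Exp must be d-separated from Wage given the adjustment set {Educ, Female}
# (and no adjustment variable may be a descendant of Exp, which holds here).
G_bd = G.copy()
G_bd.remove_edges_from(list(G.out_edges("Exp")))
identified = nx.is_d_separator(G_bd, {"Exp"}, {"Wage"}, {"Educ", "Female"})
# networkx < 3.3 exposes this as nx.d_separated() instead of nx.is_d_separator()
print("b2 has a causal interpretation under this DAG:", identified)  # expect True
```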
DAGs have the further benefit of implying testable restrictions that can be used to assess the validity of a given DAG. In particular, the DAG above implies the following independences:2
1) Educ ⊥ Female
2) Exp ⊥ Black | Educ
3) Female ⊥ Black
Rejection of one or more of these would indicate that the DAG is not supported by the data.
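To illustrate how one of these implications might be taken to data, here is a simple regression-based check of implication (2). It assumes linearity and uses placeholder column names; in practice one might prefer a dedicated conditional-independence test, or DAG software such as DAGitty, which can list a model’s testable implications.

```python
# A linear, regression-based check of implication (2): Exp ⊥ Black | Educ.
# If the hypothesized DAG is right, Black should add no explanatory power
# for experience once education is controlled for. `df` and its column
# names are placeholders.
import statsmodels.formula.api as smf

test = smf.ols("exper ~ educ + black", data=df).fit()
print(test.pvalues["black"])  # a small p-value is evidence against the DAG

# Implications (1) and (3) involve no conditioning set, so a simple
# correlation test (e.g., scipy.stats.pearsonr) would do.
```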
In practice, there are likely to be many possible DAGs for a given estimated equation. If a replicating researcher can obtain the data and code for an original study, he/she could then posit a variety of DAGs that seemed appropriate given current knowledge about the subject.
For each DAG, one could determine whether the conditions exist such that the estimated specification allows for a causal interpretation of the key parameters. If so, one could then use the model implications to assess whether the DAG was “reasonable”, as evidenced by non-conflicting data.
If no DAGs can be found that support a causal interpretation, or if adequacy tests cause one to eliminate all such DAGs, one could then request that the original author provide a DAG that would support their causal interpretations. In this fashion, existing studies could be assessed to determine if there is an evidentiary basis for causal interpretation of the estimated effects.
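As a rough sketch of this workflow, the snippet below loops over a few candidate DAGs, differing only in where the unobserved confounder U enters (all of the edge lists are illustrative assumptions), and reports whether the Mincerian adjustment set still identifies the effect of Exp on Wage under each.

```python
# A sketch of the workflow described above: posit several candidate DAGs and
# check, for each, whether the regression's adjustment set {Educ, Female}
# satisfies the back-door criterion for Exp -> Wage. The edge lists are
# illustrative assumptions, not taken from any published study.
import networkx as nx

BASE = [("Educ", "Wage"), ("Educ", "Exp"), ("Exp", "Wage"),
        ("Female", "Exp"), ("Female", "Wage"),
        ("Black", "Educ"), ("Black", "Wage")]

candidates = {
    "U confounds Black and Wage": BASE + [("U", "Black"), ("U", "Wage")],
    "U confounds Educ and Wage":  BASE + [("U", "Educ"),  ("U", "Wage")],
    "U confounds Exp and Wage":   BASE + [("U", "Exp"),   ("U", "Wage")],
}

def backdoor_ok(edges, treatment, outcome, adjustment):
    """True if `adjustment` blocks every back-door path from treatment to outcome.
    (A full back-door check also requires that no adjustment variable is a
    descendant of the treatment, which holds in all candidates here.)"""
    g = nx.DiGraph(edges)
    g.remove_edges_from(list(g.out_edges(treatment)))  # keep only paths *into* treatment
    return nx.is_d_separator(g, {treatment}, {outcome}, set(adjustment))  # networkx >= 3.3

for name, edges in candidates.items():
    verdict = ("b2 has a causal reading" if backdoor_ok(edges, "Exp", "Wage", {"Educ", "Female"})
               else "b2 is confounded")
    print(f"{name}: {verdict}")
```

The same loop could be extended with the implication tests sketched above, eliminating candidate DAGs that conflict with the data before asking whether any surviving DAG supports a causal reading.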
2 A useful, free online tool for drawing and assessing DAGs is DAGitty, which can be found here.
Bob Reed is a professor of economics at the University of Canterbury in New Zealand. He is also co-organizer of the blogsite The Replication Network. He can be contacted at bob.reed@canterbury.ac.nz.
[Excerpts taken from the article “In praise of replication studies and null results”, an editorial published in Nature]
“The Berlin Institute of Health last year launched an initiative with the words, ‘Publish your NULL results — Fight the negative publication bias! Publish your Replication Study — Fight the replication crises!’”
“The institute is offering its researchers €1,000 (US$1,085) for publishing either the results of replication studies — which repeat an experiment — or a null result, in which the outcome is different from that expected.”
“…Twitter, it seems, took more notice than the thousands of eligible scientists at the translational-research institute. The offer to pay for such studies has so far attracted only 22 applicants — all of whom received the award.”
“Replication studies are important….But publishing this work is not always a priority for researchers, funders or editors — something that must change.”
“Aside from offering cash upfront, the Berlin Institute of Health has an app and advisers to help researchers to work out which journals, preprint servers and other outlets they should be contacting to publish replication studies and data.”
“…more journals need to emphasize to the research community the benefits of publishing replications and null results.”
“At Nature, replication studies are held to the same high standards as all our published papers. We welcome the submission of studies that provide insights into previously published results; those that can move a field forwards and those that might provide evidence of a transformative advance.”
“Not all null results and replications are equally important or informative, but, as a whole, they are undervalued. If researchers assume that replications or null results will be dismissed, then it is our role as journals to show that this is not the case. At the same time, more institutions and funders must step up and support replications — for example, by explicitly making them part of evaluation criteria.”
“We can all do more. Change cannot come soon enough.”
To read the article, click here.
[Excerpts taken from the RFP “Data Enhancement of the DARPA SCORE Claims Dataset” posted at the Center for Open Science website]
“The DARPA SCORE Dataset contains claims from about 3,000 empirical papers published between 2009 and 2018 in approximately 60 journals in the social and behavioral sciences.”
“We seek proposals from research teams to enhance the Dataset with information about the papers that may be relevant to assessing the credibility of the coded claims. Such enhancements could include information such as:”
– “Extraction of statistical variables or reporting errors in the original papers.”
– “Identification of the public availability of data, materials, code, or preregistrations associated with the studies reported in the papers.”
– “Citations or other altmetrics associated with the papers.”
– “Identification of replications or meta-analytic results of findings associated with the paper or claims.”
– “Indicators of credibility, productivity, or other features of the authors of the papers.”
– “Extraction of design features, reporting styles, quality indicators, or language use from the original papers.”
“Possible data enhancements for the Database are not limited to these examples.”
“The key criteria for proposed data enhancements are:”
– “Relevance of the proposed variable additions to assessment of credibility of the papers and claims.”
– “Potential applicability of credibility/validity variable additions to other/broader targets or levels of abstraction besides claims or papers. For example, averaging or aggregation of claim or paper-level scores to generate “average”/expected credibility of authors, journals, or sub-disciplines.”
– “The proportion of papers that are likely to benefit from the data enhancement (i.e., minimization of missing data).”
– “The extent to which the data enhancement is automated.”
– “Evidence that extraction or identification of new data is valid, reliable, and easy to integrate with the Dataset.”
– “Description of data quality control practices that will be employed in proposed data-generation activity (e.g., interrater agreement tests, data sample auditing/revision processes).”
– “Cost for conducting the work.”
– “Completing the proposed work by June 30, 2020.”
“A total of $100,000 is available for data enhancement awards and we expect to make 4 to 15 total awards. Proposals should be no more than 2 pages and address the selection criteria above.”