About 10 years ago, the economist Hoyt Bleakley published two important papers on the impact of health on wealth—more precisely, on the long-term economic impacts of large-scale disease eradication campaigns. In the Quarterly Journal of Economics, “Disease and Development: Evidence from Hookworm Eradication in the American South” found that a hookworm eradication campaign in the American South in the 1910s was followed by a substantial gain in adult earnings. In AEJ: Applied Economics, “Malaria Eradication in the Americas: A Retrospective Analysis of Childhood Exposure” reported similar benefits from 20th-century malaria eradication efforts in Brazil, Colombia, Mexico, and the U.S.
With my colleagues at GiveWell providing inspiration and assistance, I replicated and reanalyzed both studies. The resulting pair of papers has just appeared in the International Journal for Re-Views in Empirical Economics (hookworm, malaria). I’ve blogged my findings on givewell.org (hookworm, malaria). Short version: I can buy the Bleakley findings for malaria, but not for hookworm.
Here I will share some thoughts sparked by my experience about process—about how we generate, review, publish, and revisit research in the social sciences.
To win trust, studies need to be reanalyzed, not just replicated
Psychology is now in the throes of a replication crisis: when published lab experiments are repeated, the original (presumably statistically significant) results disappear about half the time (this, this). Some see a replication crisis in economics too. I do not. In my experience (this, this, this, this, this, this, this, this, …), most empirical research in economics does replicate, in the sense that original results can be matched when applying the reported methods to the reported data. The matches are perfect when original data and code are available and approximate otherwise. A paper by Federal Reserve economists reaches the opposite conclusion only by counting as non-replicable any study whose authors did not respond to a request for data.
I would say, rather, that economics is in a reanalysis crisis. Or perhaps a “robustness crisis.” When I turn from replicating a study to revising it, introducing arguable improvements to data and code, the original findings often slip away like sand through my fingers. About half the time, in fact. The split decision on the two Bleakley papers is a case in point. Another is my “reanalysis review” of the impact of incarceration on crime: of the eight studies for which data availability permitted replication, I found what I deemed to be significant methodological concerns in seven, and that ultimately led me to reverse my reading of four. (Caveat: Essentially all my experience is with observational studies rather than field experiments, which may be more robust.)
This is why I say that half of economics studies are reliable—I’m just not sure which half. Seriously, as a partial answer to “which half?”, I conjecture that young, tenure-track researchers are more apt to produce fragile work, because they are under the most intense pressure to generate significant, non-zero results.
Review of research is under-supplied
Many studies in economics aspire to influence policy decisions with stakes measured in billions or trillions of dollars. Yet society invests only hundreds or thousands of dollars in assessing the quality of any given study, mainly in the form of peer review. And if only half the studies that survive peer review withstand closer scrutiny, then we evidently have not reached the point of diminishing returns to investment in review.
There is something wrong with this picture. Serious assessment of published research is a public good and so is under-supplied. Who will fill the gap?
Reanalysis, like original analysis, cannot be mechanized
I like to think that in reanalyzing research, I strike a judicious balance. Ideally, I introduce appropriately tough robustness tests; yet I avoid “gotcha” specification mining, trying lots of things until I break a regression. Ultimately, it is for readers to assess my success. One might take the discretionary character of reanalysis as a fatal flaw: replication, by contrast, can be fully pre-specified and is in this sense more objective. But by the same argument, one ought not to perform original research. A better approach is to marshal the toolkit that has gradually been assembled to improve the objectivity and reliability of original research, and bring it to reanalysis—for example, posting data and code along with finished analysis, and preregistering one’s analytical plan of attack. In revisiting the Bleakley studies, I did both.
Preregistering reanalysis is a good thing
In fact, this was my first time preregistering. I’ve heard of preregistered analysis plans that run to hundreds of pages. My plans (hookworm, malaria) run just a page or so. The Open Science Framework of the Center for Open Science served as an ideal home for the documents: an independent party that credibly time-stamped them and makes them public.
I tried to use the plans to signal my strategy, recognizing that tactics would need to be refined after encountering the data. But I did not take the plans as binding. I allowed myself to stray outside a plan, while working to inform the reader when I had done so. After all, reanalysis is a creative act too, which I think should be allowed to take unexpected turns. It’s also a social act: helpful or even peremptory comments from the original authors, as well as reviewers and editors, are bound to motivate changes late in a project.
That said, I think I have room to mature as a preregisterer. I could have written my hookworm plan with more care, making it more predictive of what I ultimately did, thus adding to its credibility.
Original authors should be included in the review of replications and reanalyses, in the right way
I always send draft write-ups of replications and reanalyses to the original authors. Some don’t respond (much). Others do, and I always learn from them (Pitt, Bleakley). Clearly original authors should be heard from. But should journals give them the full powers of a referee? Maybe not. Doing so creates an incentive for them to withhold comment on drafts sent to them before submission to a journal and then, when invited to referee, to roll out all their criticisms before the editor. Presumably some of the criticisms will be valid, and ought to be incorporated before involving other referees. Managing editor Martina Grunow explained to me how IREE threads this needle:
“We…decided that contacting the original author must be done by the replicator and before submitting the replication study to IREE (with 4 weeks waiting time whether the author responds) and that the contact (attempt or dialog) must be documented in the paper. This mainly protects the replicator against the killer argument [that the replicator failed to perform the due diligence of sharing the text with the original author]. In the case that an original author wants to comment on the replication, we offer to publish this comment along with the replication study. Up to now this did not happen. As we read in the submissions to IREE, most original authors do not reply when they are contacted by the replicators.”
I like this solution: do not use original authors as ordinary referees, but require replicators to make reasonable efforts to include original authors in the process before journal submission.
The American Economic Association’s archiving policy has holes
In 2003, the American Economic Review published a study by McCullough and Vinod, which tried—and failed—to replicate ten empirical papers in an issue of the AER. At the time, the journal merely required publishing authors to provide data and code to interested researchers upon request:
“Though the policy of the AER requires that ‘Details of computations sufficient to permit replication must be provided,’ we found that fully half of the authors would not honor the replication policy….Two authors provided neither data nor code: in one case the author said he had already lost all the files; in another case, the author initially said it would be ‘next semester’ before he would have time to honor our request, after which he ceased replying to our phone calls, e-mails, and letters. A third author, after several months and numerous requests, finally supplied us with six diskettes containing over 400 files—and no README file. Reminiscent of the attorney who responds to a subpoena with truckloads of documents, we count this author as completely noncompliant. A fourth author provided us with numerous datafiles that would not run with his code. We exchanged several e-mails with the author as we attempted to ascertain how to use the data with the code. Initially, the author replied promptly, but soon the amount of time between our question and his response grew. Finally, the author informed us that we were taking up too much of his time—we had not even managed to organize a useable data set, let alone run his data with his code, let alone determine whether his data and code would replicate his published results.”
I’ve had similar experiences. (As well as plenty of better ones, with cooperative replicatees.)
In response, AER editor Ben Bernanke announced an overhaul: henceforth, the journal would require submission of data and code to a central archive at the time of publication. The policy now applies to all American Economic Association journals, including the one that published the Bleakley study of malaria eradication.
Kudos to Bernanke and the AER, for that policy reform put the journal many years ahead of the QJE, which became the periodical of record for the Bleakley hookworm study. But in taking advantage of the Bleakley malaria data and code archive, I also ran into two serious gaps in the AEA’s policy, or at least its implementation. These leave substantial scope for original authors to impede replication. As I write:
“First, [the AEA journals] provide no access to the primary data, or at least to the code that transforms the primary data into the analysis data. The American Economic Review’s own assessment of compliance with its data availability policy highlighted this omission in 2011. ‘Simply requiring authors to submit their data prior to publication may not be sufficient to improve accuracy….The broken link in the replication process usually lies in the procedures used to transform raw data into estimation data and to perform the statistical analysis, rather than in the data themselves’ (Glandon 2011). Second, code is provided for tables only, not figures. Yet figures can play a central role in a study’s conclusions and impact. Like tables, figures distill large amounts of data to inform inference. They ought to be fully replicable, but only can be if their code is public too.”
I think much of the power of the Bleakley studies lay in figures that seemed to show kinks in long-term earnings trends with timing explicable by the eradication campaigns. In the hookworm study, those kinks pretty substantially faded in the attempted replication—and it is impossible to be sure why, for lack of access to much of the original data and code. Potential causes include discrepancies between original and replication in primary database construction, in the transformation code, or in the figure-generating code.
The AEA and other publishers can and should head off such mysteries with more complete archiving.
David Roodman is a Senior Advisor at GiveWell. He has replicated and reanalyzed research on foreign aid effectiveness, geomagnetic storms, alcohol taxes, immigration, microcredit, and other subjects.