(REPOST FROM JOHN COCHRANE’S BLOG, THE GRUMPY ECONOMIST)
On replication in economics. Just in time for bar-room discussions at the annual meetings.
“I have a truly marvelous demonstration of this proposition which this margin is too narrow to contain.” –Fermat
“I have a truly marvelous regression result, but I can’t show you the data and won’t even show you the computer program that produced the result” – Typical paper in economics and finance.
Science demands transparency. Yet much research in economics and finance uses secret data. The journals publish results and conclusions, but the data and sometimes even the programs are not available for review or inspection. Replication, even just checking what the author(s) did given their data, is getting harder.
Quite often, when one digs in, empirical results are nowhere near as strong as the papers make them out to be.
– Simple coding errors are not unknown. Reinhart and Rogoff are a famous example — which only came to light because they were honest and ethical and posted their data.
– There are data errors.
– Many results are driven by one or two observations, which at least tempers the interpretation of the results. Often a simple plot of the data, not provided in the paper, reveals that fact.
– Standard error computation is a dark art, producing 2.11 t statistics and the requisite two or three stars suspiciously often.
– Small changes in sample period or specification destroy many “facts.”
– Many regressions involve a large set of extra right hand variables, with no strong reason for inclusion or exclusion, and the fact is often quite sensitive to those choices. Just which instruments you use and how to transform variables changes results.
– Many large-data papers difference, difference differences, add dozens of controls and fixed effects, and so forth, throwing out most of the variation in the data in the admirable quest for cause-and-effect interpretability. Alas, that procedure can load the results up on measurement errors, or slightly different and equally plausible variations can produce very different results.
– There is often a lot of ambiguity in how to define variables, which proxies to use, which data series to use, and so forth, and equally plausible variations change the results.
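Several of the checks in this list are cheap once data and programs are posted. As a minimal sketch, here is the "one or two observations" check: re-estimate a regression slope dropping each observation in turn and see how far it moves. The data are made up for illustration, and the function names are hypothetical, not taken from any paper or package.

```python
from statistics import mean


def ols_slope(x, y):
    """Slope from a univariate OLS regression of y on x, with an intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def leave_one_out_slopes(x, y):
    """Re-estimate the slope n times, dropping one observation each time."""
    return [
        ols_slope(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        for i in range(len(x))
    ]


# Made-up data: nine points with almost no relation, plus one extreme point.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0]
y = [2.0, 1.0, 3.0, 2.0, 1.0, 3.0, 2.0, 1.0, 3.0, 20.0]

full_slope = ols_slope(x, y)
loo = leave_one_out_slopes(x, y)

# Index of the observation whose removal moves the slope the most.
most_influential = max(range(len(x)), key=lambda i: abs(loo[i] - full_slope))

print(f"full-sample slope: {full_slope:.3f}")
print(f"slope without observation {most_influential}: {loo[most_influential]:.3f}")
```

In this toy sample the full-sample slope is driven almost entirely by the last point. A real check would do the same with the paper's actual specification, which is only possible if the data and programs are available.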
I have seen many examples of these problems, in papers published in top journals. Many facts that you think are facts are not facts. Yet as more and more papers use secret data, it’s getting harder and harder to know.
The solution is pretty obvious: to be considered peer-reviewed “scientific” research, authors should post their programs and data. If the world cannot see your lab methods, you have an anecdote, an undocumented claim, you don’t have research. An empirical paper without data and programs is like a theoretical paper without proofs.
Faced with this problem, most economists jump to rules and censorship. They want journals to impose replicability rules, and refuse to publish papers that don’t meet those rules. The American Economic Review has followed this suggestion, and other journals, such as the Journal of Political Economy, are following.
On reflection, that instinct is a bit of a paradox. Economists, when studying everyone else, by and large value free markets, demand as well as supply, emergent order, the marketplace of ideas, competition, entry, and so on, not tight rules and censorship. Yet in running our own affairs, the inner dirigiste quickly wins out. In my time at faculty meetings, there were few problems that many colleagues did not want to address by writing more rules.
And with another moment’s reflection (much more below), you can see that the rule-and-censorship approach simply won’t work. There isn’t a set of rules we can write that assures replicability and transparency, without the rest of us having to do any work. And rule-based censorship invites its own type I errors.
Replicability is a squishy concept — just like every other aspect of evaluating scholarly work. Why do we think we need referees, editors, recommendation letters, subcommittees, and so forth to evaluate method, novelty, statistical procedure, and importance, but replicability and transparency can be relegated to a set of mechanical rules?
So, rather than try to restrict supply and impose censorship, let’s work on demand. If you think that replicability matters, what can you do about it? A lot:
– When a journal with a data policy asks you to referee a paper, check the data and program file. Part of your job is to see that this works correctly.
– When you are asked to referee a paper, and data and programs are not provided, see if data and programs are on authors’ websites. If not, ask for the data and programs. If refused, refuse to referee the paper. You cannot properly peer-review empirical work without seeing the data and methods.
– I don’t think it’s necessary for referees to actually do the replication for most papers, any more than we have to verify arithmetic. Nor, in my view, do we have to dot i’s and cross t’s on the journal’s policy, any more than we pay attention to their current list of referee instructions. Our job is to evaluate whether we think the authors have done an adequate and reasonable job, as standards are evolving, of making the data and programs available and documented. Run a regression or two to let them know you’re looking, and to verify that their posted data actually works. Unless of course you smell a rat, in which case, dig in and find the rat.
– Do not cite unreplicable articles. If editors and referees ask you to cite such papers, write back “these papers are based on secret data, so should not be cited.” If editors insist, cite the paper as “On request of the editor, I note that Smith and Jones (2016) claim x. However, since they do not make programs / data available, that claim is not replicable.”
– When asked to write a promotion or tenure letter, check the author’s website or journal websites of the important papers for programs and data. Point out secret data, and say such papers cannot be considered peer-reviewed for the purposes of promotion. (Do this the day you get the request for the letter. You might prompt some fast disclosures!)
– If asked to discuss a paper at a conference, look for programs and data on authors’ websites. If not available, ask for the data and programs. If they are not provided, refuse. If they are, make at least one slide in which you replicate a result, and offer one opinion about its robustness. By example, let’s make replication routinely accepted.
– A general point: Authors often do not want to post data and programs for unpublished papers, which can be reasonable. However, such programs and data can be made available to referees, discussants, letter writers, and so forth, in confidence.
– If organizing a conference, do not include papers that do not post data and programs. If you feel that’s too harsh, at least require that authors post data and programs for published papers and make programs and data available to discussants at your conference.
– When discussing candidates for your institution to hire, insist that such candidates disclose their data and programs. Don’t hire secret data artists. Or at least make a fuss about it.
– If asked to serve on a committee that awards best paper prizes, association presidencies, directorships, fellowships or other positions and honors, or when asked to vote on those, check the authors’ websites or journal websites. No data, no vote. The same goes for annual AEA and AFA elections. Do the candidates disclose their data and programs?
– Obviously, lead by example. Put your data and programs on your website.
– Value replication. One reason we have so little replication is that there is so little reward for doing it. So, if you think replication is important, value it. If you edit a journal, publish replication studies, positive and negative. (Especially if your journal has a replication policy!) When you evaluate candidates, write tenure letters, and so forth, value replication studies, positive and negative. If you run conferences, include a replication session.
In all this, you’re not just looking for some mess on some website, put together to satisfy the letter of a journal’s policy. You’re evaluating whether the job the authors have done of documenting their procedures and data rises to the standards of what you’d call replicable science, within reason, just like every other part of your evaluation.
Though this issue has bothered me a long time, I have not started doing all the above. I will start now.
Here, some economists I have talked to jump to suggesting a call to coordinated action. That is not my view.
I think this sort of thing can and should emerge gradually, as a social norm. If a few of us start doing this sort of thing, others might notice. They think “that’s a good idea,” and start doing it too. They also may feel empowered to start doing it. The first person to do it will seem like a bit of a jerk. But after you read three or four tenure letters that say “this seems like fine research, but without programs and data we won’t really know,” you’ll feel better about writing that yourself. Like “would you mind putting out that cigarette.”
Also, the issues are hard, and I’m not sure exactly what is the right policy. Good social norms will evolve over time to reflect the costs and benefits of transparency in all the different kinds of work we do.
If we all start doing this, journals won’t need to enforce long rules. Data disclosure will become as natural and self-enforced a part of writing a paper as proving your theorems.
Conversely, if nobody feels like doing the above, then maybe replication isn’t such a problem at all, and journals are mistaken in adding policies.
RULES WON’T WORK WITHOUT DEMAND
Journals are treading lightly, and rightly so.
Journals are competitive too. If the JPE refuses a paper because the author won’t disclose data, the QJE publishes it, and the paper goes on to great acclaim and wins its author the Clark Medal and the Nobel Prize, then the JPE falls in stature and the QJE rises. New journals will spring up with more lax policies. Journals themselves are a curious relic of the print age. If readers value empirical work based on secret data, academics will just post their papers on websites, working paper series, SSRN, RePEc, blogs, and so forth.
So if there is no demand, why restrict supply? If people are not taking the above steps on their own — and by and large they are not — why should journals try to shove it down authors’ throats?
Replication is not an issue about which we really can write rules. It is an issue — like all the others involving evaluation of scientific work — for which norms have to evolve over time and users must apply some judgement.
Perfect, permanent replicability is impossible. If replication is done with programs that access someone else’s database, those databases change and access routines change. Within a year, if the programs run at all, they give different numbers. New versions of software give different results. The best you can do is to freeze the data you actually use, hosted on a virtual machine that uses the same operating system, software version, and so on. Even that does not last forever. And no journal asks for it.
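To make "freeze the data you actually use" concrete, here is a minimal sketch of one piece of it: recording a checksum for every data file, along with the software version, so a later replicator can tell whether the inputs have silently changed. This is an illustration under my own assumptions, not a procedure any journal requires; the function names and the file name `returns.csv` are hypothetical.

```python
import hashlib
import platform
import sys
import tempfile
from pathlib import Path


def sha256_of(path):
    """SHA-256 hex digest of a file, read in chunks to handle large data."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Record a checksum per data file plus the Python version and platform."""
    data_dir = Path(data_dir)
    files = {
        p.relative_to(data_dir).as_posix(): sha256_of(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }
    return {
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "files": files,
    }


def changed_files(data_dir, manifest):
    """Names of files added, removed, or altered since the manifest was built."""
    current = build_manifest(data_dir)["files"]
    recorded = manifest["files"]
    return sorted(
        name
        for name in set(current) | set(recorded)
        if current.get(name) != recorded.get(name)
    )


# Demo on a throwaway directory; "returns.csv" is a made-up file name.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "returns.csv"
    data_file.write_text("date,ret\n2015-01,0.01\n")
    manifest = build_manifest(tmp)
    unchanged = changed_files(tmp, manifest)  # nothing has changed yet
    data_file.write_text("date,ret\n2015-01,0.02\n")  # a silent data revision
    revised = changed_files(tmp, manifest)

print(unchanged, revised)
```

A manifest like this does not preserve the data or the software. It only tells you when the numbers you are about to reproduce are no longer coming from the same inputs, which is usually the first question when a replication fails.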
Replication is a small part of a larger problem, data collection itself. Much data these days is collected by hand, or scraped by computer. We cannot and should not ask for a webcam or keystroke log of how data was collected, or hand-categorized. Documenting this step so it can be redone is vital, but it will always be a fuzzy process.
In response to “post your data,” authors respond that they aren’t allowed to do so, and journal rules allow that response. You have only to post your programs, and then a would-be replicator must arrange for access to the underlying data. No surprise, very little replication that requires such extensive effort is occurring.
And rules will never be enough.
Regulation invites just-within-the-boundaries games. Provide the programs, but with poor or no documentation. Provide the data with no headers. Don’t write down what the procedures are. You can follow the letter and not the spirit of rules.
Demand invites serious effort towards transparency. I post programs and data. Judging by emails when I make a mistake, these get looked at maybe once every 5 years. The incentive to do a really good job is not very strong right now.
Poor documentation is already a big problem. My modal referee comment these days is “the authors did not write down what they did, so I can’t evaluate it.” Even without posting programs and data, the authors simply don’t write down the steps they took to produce the numbers. The demand for such documentation has to come from readers, referees, citers, and admirers, and posting the code is only a small part of that transparency.
A hopeful thought: Currently, one way we address these problems is by endless referee requests for alternative procedures and robustness checks. Perhaps these can be answered in the future by “the data and code are online, run them yourself if you’re worried!”
I’m not arguing against rules, such as the AER has put in. I just think that they will not make a dent in the issue until we economists show by our actions some interest in the issue.
PROPRIETARY DATA, COMMERCIAL DATA, GOVERNMENT DATA
Many data sources explicitly prohibit public disclosure of the data. Disclosing such secret data remains beyond the current journal policies, or policies that anyone imagines asking journals to impose. Journals can require that you post code, but then a replicator has to arrange for access to the data. That can be very expensive, or require a coauthor who works at the government agency. No surprise, such replication doesn’t happen very often.
However, this is mostly not an insoluble problem, as there is almost never a fundamental reason why the data needed for verification and robustness analysis cannot be disclosed. Rules and censorship are not strong enough to change things. Widespread demand for transparency might well be.
To substantiate much research, and check its robustness to small variations in statistical method, you do not need full access to the underlying data. An extract is enough, and usually the nature of that extract makes it useless for other purposes.
The extract needed to verify one paper is usually useless for writing other papers. The terms for using posted data could be, you cannot use this data to publish new original work, only for verification and comment on the posted paper. Abiding by this restriction is a lot easier to police than the current replication policies.
Even if the slice of data needed to check a paper’s results cannot be public, it can be provided to referees or discussants, after signing a stack of non-use and non-disclosure agreements. (That is a less-than-optimal outcome of course, since in the end real verification won’t happen unless people can publish verification papers.)
Academic papers take 3 to 5 years or more for publication. A 3 to 5 year old slice of data is useless for most purposes, especially the commercial ones that worry data providers.
Commercial and proprietary (banks) data sets are designed for paying customers who want up-to-the-minute data. Even CRSP data, a month old, is not much used commercially, because traders need up to the minute data useful for trading. Hedge fund and mutual fund data is used and paid for by people researching the histories of potential investments. Two-year old data is useless to them — so much so that getting the providers to keep old slices of data to overcome survivor bias is a headache.
In sum, the 3-5 year old, redacted, minimalist small slice of data needed to substantiate the empirical work in an academic paper is in fact seldom a substantial threat to the commercial, proprietary, or genuine privacy interests of the data collectors.
The problem is fundamentally about contracting costs. We are in most cases secondary or incidental users of data, not primary customers. Data providers’ legal departments don’t want to deal with the effort of writing contracts that allow disclosure of data that is 99% useless but might conceivably be of value or cause them trouble. Both private and government agency lawyers naturally adopt a CYA attitude by just saying no.
But that can change. If academics can’t get a paper conferenced, refereed, read and cited with secret data, if they can’t get tenure, citations, or a job on that basis, the academics will push harder. Our funding centers and agencies (NSF) will allocate resources to hire some lawyers. Government agencies respond to political pressure. If their data collection cannot be used in peer-reviewed research, that’s one less justification for their budget. If Congress hears loudly from angry researchers who want their data, there is a force for change. But so long as you can write famous research without pushing, the apparently immovable rock does not move.
The contrary argument is that if we impose these costs on researchers, then less research will be done, and valuable insights will not benefit society. But here you have to decide whether research based on secret data is really research at all. My premise is that, really, it is not, so the social value of even apparently novel and important claims based on secret data is not that large.
Clearly, nothing of this sort will happen if journals try to write rules, in a profession in which nobody is taking the above steps to demand replicability. Only if there is a strong, pervasive, professional demand for transparency and replicability will things change.
Authors often want to preserve their use of data until they’ve fully mined it. If they put in all the effort to produce the data, they want first crack at the results.
This valid concern does not mean that they cannot create redacted slices of data needed to substantiate a given paper. They can also let referees and discussants access such slices, with the above strict non-disclosure and agreement not to use the data.
In fact, it is usually in authors’ interest to make data available sooner rather than later. Everyone who uses your data is a citation. There are far more cases of authors who gained notoriety and long citation counts from making data public early than there are of authors who jealously guarded data so they would get credit for the magic regression that would appear 5 or more years after data collection.
Yet this property right is up to the data collector to decide. Our job is to say “that’s nice, but we won’t really believe you until you make the data public, at least the data I need to see how you ran this regression.” If you want to wait 5 years to mine all the data before making it public, then you might not get the glory of “publishing” the preliminary results. That’s again why voluntary pressure will work, and rules from above will not work.
One empiricist who I talked to about these issues does not want to make programs public, because he doesn’t want to deal with the consequent wave of emails from people asking him to explain bits of code, or claiming to have found errors in 20-year old programs.
Fair enough. But this is another reason why a loose code of ethics is better than a set of rules for journals.
You should make a best faith effort to document code and data when the paper is published. You are not required to answer every email from every confused graduate student for eternity after that point. Critiques and replication studies can be refereed in the usual way, and must rise to the usual standards of documentation and plausibility.
WHY REPLICATION MATTERS FOR ECONOMICS
Economics is unusual. In most experimental sciences, once you collect the data, the fact is there or not. If it’s in doubt, collect more data. Economics features large and sophisticated statistical analysis of non-experimental data. Collecting more data is often not an option, and not really the crux of the problem anyway. You have to sort through the given data in a hundred or more different ways to understand that a cause and effect result is really robust. Individual authors can do some of that — and referees tend to demand exhausting extra checks. But there really is no substitute for the social process by which many different authors, with different priors, play with the data and methods.
Economics is also unusual, in that the practice of redoing old experiments over and over, common in science, is rare in economics. When Ben Franklin stored lightning in a condenser, hundreds of other people went out to try it too, some discovering that it wasn’t the safest thing in the world. They did not just read about it and take it as truth. A big part of a physics education is to rerun classic experiments in the lab. Yet it is rare for anyone to redo, and question, classic empirical work in economics, even as a student.
Of course everything comes down to costs. If a result is important enough, you can go get the data, program everything up again, and see if it’s true. Even then, the question comes, if you can’t get x’s number, why not? It’s really hard to answer that question without x’s programs and data. But the whole thing is a whole lot less expensive and time consuming, and thus a whole lot more likely to happen, if you can use the author’s programs and data.
WHERE WE ARE
The American Economic Review has a strong data and programs disclosure policy. The JPE adopted the AER data policy. A good John Taylor blog post on replication and the history of the AER policy. The QJE has decided not to; I asked an editor about it and heard very sensible reasons. Here is a very good review article on data policies at journals by Sven Vlaeminck.
The AEA is running a survey about its journals, and asks some replication questions. If you’re an AEA member, you got it. Answer it. I added to mine, “if you care so much about replication, you should show you value it by routinely publishing replication articles.”
How is it working? The Report on the American Economic Review Data Availability Compliance Project
“All authors submitted something to the data archive. Roughly 80 percent of the submissions satisfied the spirit of the AER’s data availability policy, which is to make replication and robustness studies possible independently of the author(s). The replicated results generally agreed with the published results. There remains, however, room for improvement both in terms of compliance with the policy and the quality of the materials that authors submit.”
However, Andrew Chang and Phillip Li disagree, in the nicely titled “Is Economics Research Replicable? Sixty Published Papers from Thirteen Journals Say ‘Usually Not’”
“We attempt to replicate 67 papers published in 13 well-regarded economics journals using author-provided replication files that include both data and code. … Aside from 6 papers that use confidential data, we obtain data and code replication files for 29 of 35 papers (83%) that are required to provide such files as a condition of publication, compared to 11 of 26 papers (42%) that are not required to provide data and code replication files. We successfully replicate the key qualitative result of 22 of 67 papers (33%) without contacting the authors. Excluding the 6 papers that use confidential data and the 2 papers that use software we do not possess, we replicate 29 of 59 papers (49%) with assistance from the authors. Because we are able to replicate less than half of the papers in our sample even with help from the authors, we assert that economics research is usually not replicable.”
I read this as confirmation that replicability must come from a widespread social norm, demand, not journal policies.
The quest for rules and censorship reflects a world-view that once we get procedures in place, then everything published in a journal will be correct. Of course, once stated, you know how silly that is. Most of what gets published is wrong. Journals are for communication. They should be invitations to replication, not carved in stone truths. Yes, peer-review sorts out a lot of complete garbage, but the balance of type 1 and type 2 errors will remain.
A few touchstones:
Mitch Petersen tallied up all papers in the top finance journals for 2001–2004. Out of 207 panel data papers, 42% made no correction at all for cross-sectional correlation of the errors. This is a fundamental error, that typically cuts standard errors by as much as a factor of 5 or more. If firm i had an unusually good year, it’s pretty likely firm j had a good year as well. Clearly, the empirical refereeing process is far from perfect, despite the endless rounds of revisions they typically ask for. (Nowadays the magic wand “cluster” is waved over the issue. Whether it’s being done right is a ripe topic for a similar investigation.)
“Why Most Published Research Findings are False” by John Ioannidis. Medicine, but relevant
A link on the controversy on replicability in psychology
There will be a workshop on replication and transparency in economic research following the ASSA meetings in San Francisco
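Returning to the Petersen touchstone above: the mechanics of that error are easy to see in a simulation. The sketch below is my own illustration, not Petersen's code or data. It builds a made-up panel in which both the regressor and the error share a common component within each year, then compares the usual OLS standard error with one clustered by year. All parameter values are arbitrary assumptions, and the clustered formula omits the small-sample corrections that statistical packages typically apply.

```python
import math
import random


def simulate_panel(n_firms=50, n_years=40, beta=1.0, seed=0):
    """Made-up panel: x and the error each share a common shock within a year."""
    rng = random.Random(seed)
    rows = []
    for year in range(n_years):
        common_x = rng.gauss(0.0, 1.0)  # year-wide shock to the regressor
        common_e = rng.gauss(0.0, 1.0)  # year-wide shock to the error
        for _ in range(n_firms):
            x = common_x + rng.gauss(0.0, 1.0)
            e = common_e + rng.gauss(0.0, 1.0)
            rows.append((year, x, beta * x + e))
    return rows


def slope_with_standard_errors(rows):
    """OLS slope with the usual i.i.d. standard error and one clustered by year."""
    n = len(rows)
    mean_x = sum(x for _, x, _ in rows) / n
    mean_y = sum(y for _, _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for _, x, _ in rows)
    slope = sum((x - mean_x) * (y - mean_y) for _, x, y in rows) / sxx
    intercept = mean_y - slope * mean_x

    sum_sq_resid = 0.0
    score_by_year = {}
    for year, x, y in rows:
        resid = y - intercept - slope * x
        sum_sq_resid += resid ** 2
        # Sum (x - mean_x) * residual within each year before squaring.
        score_by_year[year] = score_by_year.get(year, 0.0) + (x - mean_x) * resid

    naive_se = math.sqrt(sum_sq_resid / (n - 2) / sxx)
    clustered_se = math.sqrt(sum(s ** 2 for s in score_by_year.values())) / sxx
    return slope, naive_se, clustered_se


slope, naive_se, clustered_se = slope_with_standard_errors(simulate_panel())
print(f"slope {slope:.3f}, naive s.e. {naive_se:.4f}, clustered s.e. {clustered_se:.4f}")
```

With these settings the naive standard error should come out several times smaller than the clustered one, because it treats firms observed in the same year as independent draws. Whether a given paper clustered along the right dimension is exactly the kind of question a reader can only answer with the programs and data in hand.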
I anticipate an interesting exchange in the comments. I especially welcome more links to and summaries of existing writing on the subject.