REED: Replications in Economics are Different from Replications in Psychology, and Other Thoughts

In July 2017, Economics: The Open Access, Open Assessment E-Journal issued a call for papers for a special issue on the practice of replication. The call stated, “This special issue is designed to highlight alternative approaches to doing replications, while also identifying core principles to follow when carrying out a replication. Contributors to the special issue will each select an influential economics article that has not previously been replicated, with each contributor selecting a unique article. Each paper will discuss how they would go about “replicating” their chosen article, and what criteria they would use to determine if the replication study “confirmed” or “disconfirmed” the original study.”
The special issue was published late last year, with an accompanying “Takeaways” commentary appearing early this year. A total of eight articles were published in the special issue. The authors and paper titles are identified below. What follows are some thoughts from that exercise.
Replications in economics are different from replications in psychology
The first takeaway from the special issue is that replications in economics are different from replications in psychology. It is common in psychology to sort replications along a single dimension anchored by two categories: direct and conceptual. A good example of this is provided by the website Curate Science, which identifies a continuum of replications running from “direct” to “conceptual”.
Psychology replications are more easily fitted onto a one-dimensional scale. Replications in psychology generally involve experiments. A typical concern is whether, and how closely, the replication matches the original study’s experimental design and implementation.
In contrast, most empirical economic studies are based on observational rather than experimental data (experimental/behavioral economics being a notable exception). Problems that consume economic studies, such as endogeneity or non-stationarity, are not major concerns in psychology. For psychology, this cuts down on the need for a vast arsenal of econometric procedures and reduces the relative importance of alternative statistical methodologies.
Another major difference is that the numbers of variables and observations that characterize observational studies are large relative to those in studies that use experimental data. Datasets in economics often have hundreds of potential variables and many thousands of observations. As a result, the garden of forking paths is bigger in economics. With more paths to explore, there is greater value in re-analyzing existing data to check for robustness.
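To give a rough sense of scale, here is a minimal sketch in Python. It rests on one simplifying assumption that is mine, not the special issue's: each candidate control variable is either included or excluded, so k candidates yield 2^k possible specifications. The function name is hypothetical, and real forking paths also involve choices of sample, functional form, and estimator, which this count ignores.

```python
def count_specifications(n_candidate_controls: int) -> int:
    """Number of distinct control sets when each candidate is either in or out.

    Assumption (illustrative only): the only analyst choice is which
    control variables to include.
    """
    if n_candidate_controls < 0:
        raise ValueError("number of candidate controls must be non-negative")
    return 2 ** n_candidate_controls


for k in (5, 10, 20):
    print(f"{k} candidate controls -> {count_specifications(k):,} specifications")
```

Even under this narrow assumption, the count doubles with every additional candidate variable, which is one way to see why re-analysis of existing data has more ground to cover in observational work.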
The bottom line is that economic replications are not easily compressed onto a one-dimensional scale. Consider the following two-dimensional taxonomy for replications in economics:
Here the dimension of measurement and analysis is distinguished from the dimension of target population. While I know of no data to support this next statement, I conjecture that a far greater share of replication studies in economics are concerned with the “vertical dimension” of empirical procedures.
In fact, this is exactly what the eight studies of the special issue show. The table below sorts the eight studies across the two dimensions of target population and methodology. Noteworthy is that most of the replications focus on re-analyzing the same data, using either the same or different empirical procedures. Only one study is interested in exploring the “boundaries” that determine the external validity of the original study.
Unless I am mistaken, this is another difference from psychology. It seems to me that psychology has a greater interest in understanding effect heterogeneity. For example, an original study reports that men are more upset than women when their partner commits a sexual rather than an emotional infidelity. The original study found this result for a sample of young people (Buss et al., 1999). A later replication explored whether the result holds for older people (Shackelford et al., 2004). It is my sense, again stated without supporting evidence, that these kinds of replication studies are more common in psychology than in economics. In my opinion, this is a shortcoming of replications in economics.
Compressing replications into the one-dimensional taxonomy common in psychology loses the distinction between replications focused on measurement and empirical procedures, and replications focused on establishing boundaries for external validity. Blurring this distinction may not be a great loss for psychology, but it is for economics, because it can hide “gaps” in the things that economic replications study (e.g., effect heterogeneity).
Whatever you call replications, you should call them replications
As represented in FIGURE 2, the special issue used a taxonomy that identified no fewer than six types of replications: Reproductions, Repetitions, Extensions, and three types of robustness checks. The number of such taxonomies is large and growing. In addition to Direct versus Indirect Replications, other classifications include (i) Verification, Reproduction, Re-analysis, and Extension, (ii) Replication, Reproduction, and Re-analysis, (iii) Reproducibility, Replicability, and Generalization, and (iv) Pure replication, Statistical replication, and Scientific replication.
Does it make a difference? Yes, it makes a huge difference. But not for the reason most people give. Most commentators argue for a particular classification system in order to distinguish different types of replications. Much more important than distinguishing different shades of replications is that the literature be able to distinguish, and identify, replications from other types of empirical studies.
The biggest problem with replications is being able to find them. The confusing tangle of alternative replication vocabularies is not helping. For replications to make a difference, researchers need to know of their existence. Replications need to be easily identifiable by search algorithms. If a study calls itself a “re-analysis” rather than a replication, a researcher who searches for replications may miss it. Who cares about the fine point of distinguishing one type of replication from another when the replication is never read?
I don’t know which taxonomy is best. But I believe that all taxonomies should have the word “replication” in each of the categories so that they can be easily identified by search algorithms. Thus, I don’t care if somebody wants to use “Pure replication”/“Statistical replication”/“Scientific replication”, or “Verification replication”/“Reproduction replication”/“Re-analysis replication”/“Extension replication”, as long as the word “replication” appears in the text, ideally in the abstract. That makes it easy for search algorithms to find the paper, which is crucial if the paper is to be read.
There is no single standard for replication success
The eight papers in the special issue offered a variety of criteria for “replication success”. How one defines replication “success” depends on the goal of the replication. If the goal is to double-check that the numbers in a published study are correct, then, as McCullough emphasizes, anything less than 100% reproduction is a failure: “For linear procedures with moderately-sized datasets, there should be ten digit agreement, for nonlinear procedures there may be as few as four or five digits of agreement” (McCullough, 2018, page 3).
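A digit-agreement rule of this kind can be made operational. The sketch below, in Python, counts agreeing digits with the log relative error, a common measure of numerical accuracy. Whether this is exactly the measure McCullough has in mind in the quoted passage is an assumption on my part, and the function names are hypothetical. The thresholds of ten and four digits are taken from the quote.

```python
import math


def digits_of_agreement(reported: float, reproduced: float) -> float:
    """Approximate number of significant digits on which two numbers agree.

    Uses the log relative error, -log10(|reproduced - reported| / |reported|).
    Assumption: this is one reasonable way to count digits, not necessarily
    the one intended in the quoted passage.
    """
    if reported == reproduced:
        return math.inf  # exact agreement
    if reported == 0.0:
        # Relative error is undefined at zero, so fall back to absolute error.
        return -math.log10(abs(reproduced))
    return -math.log10(abs(reproduced - reported) / abs(reported))


def reproduces(reported: float, reproduced: float, min_digits: float) -> bool:
    """True if the reproduced number agrees to at least min_digits digits."""
    return digits_of_agreement(reported, reproduced) >= min_digits


# A coefficient reported as 0.5 and reproduced as 0.5005 agrees to about
# three digits, so it fails both a ten-digit and a four-digit standard.
print(round(digits_of_agreement(0.5, 0.5005), 2))
print(reproduces(0.5, 0.5005, min_digits=10))
print(reproduces(0.5, 0.5005, min_digits=4))
```

In practice published tables are rounded to a few decimals, so a check like this only makes sense against the full-precision output of the original code, not against the printed table.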
Things become complicated if, instead, the goal is to determine if the claim from an original study is “true.” This is illustrated by the variety of criteria for replication “success” offered by the studies of the special issue. For Hannum, success depends on the significance of the estimated coefficient of a key variable. Owen suggests a battery of tests based upon significance testing, but acknowledges “fallacies of acceptance and rejection” as challenges to interpreting test results. Coupé proposes counting all the parameters that are reproduced exactly and calculating a percentage correct index, perhaps weighted by the importance of the respective parameters. Daniels & Kakar identify success if the replicated parameters have “the same size and significance for all specifications”, though they do not define what constitutes “the same”. Wood & Vasquez shy away from even using the words “success” or “failure”. Instead, they see the purpose of replication as contributing to a “research dialogue”. They advocate a holistic approach, “looking for similar coefficient sizes, direction of coefficients, and statistical significance”.
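Of these proposals, Coupé's percentage correct index is the most mechanical, so it is easy to sketch. The Python below is my own illustration under stated assumptions, not Coupé's implementation: the function name, the tolerance argument, and the example numbers are all hypothetical. It counts the parameters that are reproduced and, optionally, weights them by importance.

```python
from typing import Optional, Sequence


def percent_reproduced(
    original: Sequence[float],
    replicated: Sequence[float],
    weights: Optional[Sequence[float]] = None,
    tol: float = 0.0,
) -> float:
    """Weighted share (0 to 100) of parameters reproduced within tol.

    tol=0.0 demands exact reproduction; weights default to equal importance.
    Both arguments are illustrative assumptions.
    """
    if len(original) != len(replicated):
        raise ValueError("original and replicated must have the same length")
    if weights is None:
        weights = [1.0] * len(original)
    if len(weights) != len(original):
        raise ValueError("weights must match the number of parameters")
    matched = sum(
        w for o, r, w in zip(original, replicated, weights) if abs(o - r) <= tol
    )
    return 100.0 * matched / sum(weights)


# Hypothetical estimates: three of four parameters are reproduced exactly.
original = [0.5, 1.2, -0.3, 2.0]
replicated = [0.5, 1.2, -0.3, 2.1]

print(percent_reproduced(original, replicated))                        # equal weights
print(percent_reproduced(original, replicated, weights=[1, 1, 1, 3]))  # key parameter missed
```

The same replication scores 75 with equal weights and 50 when the one parameter that failed is treated as three times as important, which shows how much the choice of weights can drive the verdict.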
The nut of the problem is illustrated by Reed (2018) in the following example: “Suppose a study reports that a 10% increase in unemployment benefits is estimated to increase unemployment duration by 5%, with a 95% confidence interval of [4%, 6%]. Two subsequent replications are undertaken. Replication #1 finds a mean effect of 2% with corresponding confidence interval of [1%, 3%]. Replication #2 estimates a mean effect of 5%, but the effect is insignificant with a corresponding confidence interval of [0%, 10%]. In other words, consistent with the original study, Replication #1 finds that unemployment durations are positively and significantly associated with unemployment insurance benefits. However, the estimated effect falls significantly short of the effect reported by the original study. Replication #2 estimates a mean effect exactly the same as the original, but due to its imprecision, the effect is statistically insignificant. Did either of the two replications “successfully replicate” the original? Did both? Did none?”
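The example can be run through several candidate criteria to see how they disagree. The Python sketch below uses the numbers from the quoted example. The three criteria are my own illustrative choices: the first follows the “significant effect in the same direction” definition, and the two confidence-interval checks are assumptions rather than criteria proposed in the special issue. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    point: float  # estimated effect, in percent
    lo: float     # lower bound of the 95% confidence interval
    hi: float     # upper bound of the 95% confidence interval

    @property
    def significant(self) -> bool:
        # Significant at the interval's level only if the interval excludes zero.
        return self.lo > 0 or self.hi < 0


def same_sign_and_significant(orig: Estimate, rep: Estimate) -> bool:
    return rep.significant and (rep.point > 0) == (orig.point > 0)


def original_in_replication_ci(orig: Estimate, rep: Estimate) -> bool:
    return rep.lo <= orig.point <= rep.hi


def replication_in_original_ci(orig: Estimate, rep: Estimate) -> bool:
    return orig.lo <= rep.point <= orig.hi


original = Estimate(point=5, lo=4, hi=6)
replications = {
    "Replication #1": Estimate(point=2, lo=1, hi=3),
    "Replication #2": Estimate(point=5, lo=0, hi=10),
}
criteria = {
    "same sign and significant": same_sign_and_significant,
    "original estimate inside replication CI": original_in_replication_ci,
    "replication estimate inside original CI": replication_in_original_ci,
}

verdicts = {
    name: {label: check(original, rep) for label, rep in replications.items()}
    for name, check in criteria.items()
}
for name, result in verdicts.items():
    print(f"{name}: {result}")
```

Under these definitions Replication #1 succeeds only on the first criterion and Replication #2 succeeds only on the other two, so the answer to “did it replicate?” flips with the definition chosen.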
This problem is not unique to economics and observational studies. Despite the fact that many experimental studies define success as “a significant effect in the same direction as the original study” (Camerer et al., 2018), there exist many definitions of “replication success” in the experimental literature. Open Science Collaboration (2015) used five definitions of replication success. And Curate Science identifies six categories for classifying replication outcomes (see below).
This has important implications for assessments of the “reproducibility” of science. For example, the recently announced, DARPA-funded, SCORE Project (“Systematizing Confidence in Open Research and Evidence”) intends to develop algorithms for assessing approximately 30,000 findings from the social-behavioral sciences. Towards that end, experts will “review and score about 3,000 of those claims in surveys, panels, or prediction markets for their likelihood of being reproducible findings.” The criteria used to define “replication success” will have a huge influence on the results of the project, and the interpretation of those results.
The value of pre-registration
Pre-registration has received much attention from the practitioners of open science. There is hope that pre-registration can help solve the “replication crisis.” As part of a series on pre-registration hosted by the Psychonomic Society, Klaus Oberauer argues that our efforts should not be focused on pre-registration, but on making data and code available so other researchers can explore alternative forking paths: “If there are multiple equally justifiable analysis paths, we should run all of them, or a representative sample, to see whether our results are robust. … making the raw data publicly available enables other researchers … to run their own analyses … It seems to me that, once publication of the raw data becomes common practice, we have all we need to guard against bias in the choice of analysis paths without giving undue weight to the outcome of one analysis method that a research team happens to preregister.”
I agree with Oberauer that the bigger issue is making data and code available, along with ensuring that there are outlets to publish the results of replications. However, even if data and code are ubiquitous and replications publishable, there will still be value in pre-registering replication studies. In assessing the results of a replication study, there is a difference in how one interprets “I did one thing that I thought was most important and the results did not replicate” and “I did 10 things looking for problems and found one thing that didn’t replicate.” Pre-registration can establish which of these applies.
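A back-of-the-envelope calculation shows why the two statements carry different weight. The Python sketch below makes two assumptions of my own: each check has a 5% chance of signalling a failure when the original result is in fact sound, and the checks are independent. Neither figure comes from the special issue, and the function name is hypothetical.

```python
def prob_at_least_one_failure(n_checks: int, false_failure_rate: float = 0.05) -> float:
    """Chance that at least one of n independent checks fails by luck alone.

    Assumptions (illustrative only): checks are independent and each one
    wrongly signals a failure with probability false_failure_rate.
    """
    if n_checks < 0:
        raise ValueError("n_checks must be non-negative")
    if not 0.0 <= false_failure_rate <= 1.0:
        raise ValueError("false_failure_rate must be between 0 and 1")
    return 1.0 - (1.0 - false_failure_rate) ** n_checks


for n in (1, 10):
    print(f"{n} check(s): {prob_at_least_one_failure(n):.3f}")
```

Under these assumptions, a single pre-specified check fails spuriously about 5% of the time, while at least one of ten checks fails spuriously about 40% of the time. Real robustness checks on the same data are correlated, so the second figure is only a rough guide, but the gap explains why a reader needs to know how many things were tried.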
Bob Reed is a professor of economics at the University of Canterbury in New Zealand. He is also co-organizer of the blogsite The Replication Network. He can be contacted at 
