[Excerpts are taken from the article “Reproducibility and Replicability in Economics” by Lars Vilhuber, published in Harvard Data Science Review]
“In this overview, I provide a summary description of the history and state of reproducibility and replicability in the academic field of economics.”
“The purpose of the overview is not to propose specific solutions, but rather to provide the context for the multiplicity of innovations and approaches that are currently being implemented and developed, both in economics and elsewhere.”
“In this text, we adopt the definitions of reproducibility and replicability articulated, inter alia, by Bollen et al. (2015) and in the report by NASEM (2019).”
“At the most basic level, reproducibility refers to “to the ability […] to duplicate the results of a prior study using the same materials and procedures as were used by the original investigator.”
“Replicability, on the other hand, refers to “the ability of a researcher to duplicate the results of a prior study if the same procedures are followed but new data are collected”…, and generalizability refers to the extension of the scientific findings to other populations, contexts, and time frames.”
“Much of economics was premised on the use of statistics generated by national statistical agencies as they emerged in the late 19th and early 20th century…Economists were requesting access for research purposes to government microdata through various committees at least as far back as 1959 (Kraus, 2013).”
“Whether using private-sector data, school-district data, or government administrative records, from the United States and other countries, the use of these data for innovative research has been increasing in recent years. In 1960, 76% of empirical AER articles used public-use data. By 2010, 60% used administrative data, presumably none of which is public use.”
“In economics, complaints about the inability to properly conduct reproducibility studies, or about the absence of any attempt to do so by editors, referees, and authors, can be traced back to comments and replies in the 1970s.”
“In the early 2000s, as in other sciences (National Research Council, 2003), journals started to implement “data’ or ‘data availability’ policies. Typically, they required that data and code be submitted to the journal, for publication as ‘supplementary materials.’”
“Journals in economics that have introduced data deposit policies tend to be higher ranked…None of the journals…request that the data be provided before or during the refereeing process, nor does a review of the data or code enter the editorial decision contrast to other domains (Stodden et al., 2013). All make provision of data and code a condition of publication, unless an exemption for data provision is requested.”
“More recently, economics journals have increased the intensity of enforcement of their policies. Historically being mainly focused on basic compliance, associations that publish journals …have appointed staff dedicated to enforcing various aspects of their data and code availability policies…The enforcement varies across journals, and may include editorial monitoring of the contents of the supplementary materials, reexecution of computer code (verification of computational reproducibility), and improved archiving of data.”
If the announcement and implementation of data deposit policies improve the availability of researchers’ code and data…, what has the impact been on overall reproducibility? Table 2B, shows the reproduction rates both conditional on data availability as well as unconditionally, for a number of reproducibility studies”Data that is not provided due to licensing, privacy, or commercial reasons (often incorrectly collectively referred to as ‘proprietary’ data) can still be useful in attempts at reproduction, as long as others can reasonably expect to access the data…Providers will differ in the presence of formal access policies, and this is quite important for reproducibility: only if researchers other than the original author can access the non-public data can an attempt at reproducibility even be made, if it at some cost.
“We made a best effort to classify the access to the confidential data, and the commitment by the author or third parties to provide the data if requested. For instance, a data curator with a well-defined, nonpreferential data access policy would be classified under ‘formal commitment.’…We could identify a formal commitment or process to access the data only for 35% of all nonpublic data sets.”
“One of the more difficult topics to empirically assess is the extent to which reproducibility is taught in economics, and to what extent in turn economic education is helped by reproducible data analyses. The extent of the use of replication exercises in economics classes is anecdotally high, but I am not aware of any study or survey demonstrating this.”
“More recently, explicit training in reproducible methods (Ball & Medeiros, 2012; Berkeley Initiative for Transparency in the Social Sciences, 2015), and participation of economists in data science programs with reproducible methods has increased substantially, but again, no formal and systematic survey has been conducted.”
“Because most reproducibility studies of individual articles ‘only’ confirm existing results, they fail the ‘novelty test’ that most editors apply to submitted articles (Galiani et al., 2017). Berry and coauthors (2017) analyzed all papers in Volume 100 of the AER, identifying how many were referenced as part of replication or cited in follow-on work.”
“While partially confirming earlier findings that strongly cited articles will also be replicated (Hamermesh, 2007), the authors found that 60% of the original articles were referenced in replication or extension work, but only 20% appeared in explicit replications. Of the roughly 1,500 papers that cite the papers in the volume, only about 50 (3.5%) are replications, and of those, only 8 (0.5%) focused explicitly on replicating one paper.”
“Even rarer are studies that conduct replications prior to their publication, of their own volition. Antenucci et al. (2014) predict the unemployment rate from Twitter data. After having written the paper, they continued to update the statistics on their website (“Prediction of Initial Claims for Unemployment Insurance,” 2017), thus effectively replicating their paper’s results on an ongoing basis. Shortly after release of the working paper, the model started to fail. The authors posted a warning on their website in 2015, but continued to publish new data and predictions until 2017, in effect, demonstrating themselves that the originally published model did not generalize.”
“Reproducibility has certainly gained more visibility and traction since Dewald et al.’s (1986) wake-up call…Still, after 30 years, the results of reproducibility studies consistently show problems with about a third of reproduction attempts, and the increasing share of restricted access data in economic research requires new tools, procedures, and methods to enable greater visibility into the reproducibility of such studies. Incorporating consistent training in reproducibility into graduate curricula remains one of the challenges for the (near) future.”
To read the article, click here.
This final instalment on the state of replications in economics, 2020 version, continues the discussion of how to define “replication success” (see here and here for earlier instalments). It then delves further into interpreting the results of a replication. I conclude with an assessment of the potential for replications to contribute to our understanding of economic phenomena.
How should one define “replication success”?
In their seminal article assessing the rate of replication in psychology, Open Science Collaboration (2015) employed a variety of definitions of replication success. One of their measures has come to dominate all others: obtaining a statistically significant estimate with the same sign as the original study (“SS-SS”). For example, this is the definition of replication success employed by the massive SCORE project currently being undertaken by the Center for Open Science.
The reason for the “SS-SS” definition of replication success is obvious. It can easily be applied across a wide variety of circumstances, allowing a one-size, fits-all measure of success. It melds two aspects of parameter estimation – effect size and statistical significance – into a binary measure of success. However, studies differ in the nature of their contributions. For some studies, statistical significance may be all that matters, say when establishing the prediction of a given theory. For others, the size of the effect may be what’s important, say when one is concerned about the effect of a tax cut on government revenues.
The following example illustrates the problem. Suppose a study reports that a 10% increase in unemployment benefits is estimated to increase unemployment duration by 5%, with a 95% confidence interval of [4%, 6%]. Consider two replication studies. Replication #1 estimates a mean effect of 2% with corresponding confidence interval of [1%, 3%]. Replication #2 estimates a mean effect of 5%, but the effect is insignificant with a corresponding confidence interval of [0%, 10%].
Did either of the two replications “successfully replicate” the original? Did both? Did none? The answer to this question largely depends on the motivation behind the original analysis. Was the main contribution of the original study to demonstrate that unemployment benefits affect unemployment durations? Or was the motivation primarily budgetary? So that the size of the effect was the important empirical contribution?
There is no general right or wrong answer to these questions. It is study-specific. Maybe even researcher-specific. For this reason, while I understand the desire to develop one-size-fits-all measures of success, it is not clear how to interpret these “success rates”. This is especially true when one recognizes — and as I discussed in the previous instalment to this blog — that “success rates” below 100%, even well below 100%, are totally compatible with well-functioning science.
How should we interpret the results of a replication?
The preceding discussion might give the impression that replications are not very useful. While measures of the overall “success rate” of replications may not tell us much, they can be very insightful in individual cases.
In a blog I wrote for TRN entitled “The Replication Crisis – A Single Replication Can Make a Big Difference”, I showed how a single replication can substantially impact one’s assessment of a previously published study.
Define “Prior Odds” as the Prob(Treatment is effective):Prob(Treatment is ineffective). Define the “False Positive Rate” (FPR) as the percent of statistically significant estimates in published studies for which the true underlying effect is zero; i.e, the treatment has no effect. If the prior odds of a treatment being effective are relatively low, Type I error will generate a large number of “false” significant estimates that can overwhelm the significant estimates associated with effective treatments, causing the FPR to be high. TABLE 1 below illustrates this.
The FPR values in the table range from 0.24 to 0.91. For example, given 1:10 odds that a randomly chosen treatment is effective, and assuming studies have Power equal to 0.50, the probability that a statistically significant estimate is a false positive is 50%. Alternatively, if we take a Power value of 0.20, which is approximately equal to the value that Ioannidis et al. (2017) report as the median value for empirical research in economics, the FPR rises to 71%.
It needs to be emphasized that these high FPRs have nothing to do with publication bias or file drawer effects. They are the natural outcomes of a world of discovery in which Type I error is combined with a situation where most studied phenomena are non-existent or economically negligible.
TABLE 2 reports what happens when a researcher in this environment replicates a randomly selected significant estimate. The left column reports the researcher’s initial assessment that the finding is a false positive (as per TABLE 1). The table shows how that probability changes as a result of a successful replication.
For example, suppose the researcher thinks there is a 50% chance that a given empirical claim is a false positive (Initial FPR = 50%). The researcher then performs a replication and obtains a significant estimate. If the replication study had 50% Power, the updated FPR would fall from 50% to 9%.
TABLE 2 demonstrates that successful replications produce substantial decreases in false positive rates across a wide range of initial FPRs and Power values. In other words, while discipline-wide measures of “success rates” may not be very informative, replications can have a powerful impact on the confidence that researchers attach to individual estimates in the literature.
Do replications have a unique role to play in contributing to our understanding of economic phenomena?
To date, replications have not had much of an effect on how economists do their business. The discipline has made great strides in encouraging transparency by requiring authors to make their data and code available. However, this greater transparency has not resulted in a meaningful increase in published replications. While there are no doubt many reasons for this, one reason may be that economists do not appreciate the unique role that replications can play in contributing to our understanding of economic phenomena.
The potential for empirical analysis to inform our understanding of the world is conditioned on the confidence researchers have in the published literature. While economists may differ in their assessment of the severity of false positives, the message of TABLE 2 is that, for virtually all values of FPRs, replications substantially impact that assessment. A successful replication lowers, often dramatically lowers, the probability that a given empirical finding is a false positive.
It is worth emphasizing that replications are uniquely positioned to make this contribution. New studies fall under the cloud of uncertainty that hangs over all original findings; namely, the rational suspicion that reported results are merely a statistical artefact. Replications, because of their focus on individual findings, are able to break through the fog. It is hoped that economists will start to recognize the unique role that replications can play in the process of scientific discovery. And that publishing opportunities for well-done replications; and appropriate professional rewards for the researchers who do them, follow.
Bob Reed is a professor of economics at the University of Canterbury in New Zealand. He is also co-organizer of the blogsite The Replication Network and Principal Investigator at UCMeta. He can be contacted at bob.reed@canterbury.ac.nz.
This instalment follows on yesterday’s post where I addressed two questions: Are there more replications in economics than there used to be? And, Which journals publish replications? These questions deal with the descriptive aspect of replications. We saw that replications seemingly constitute a relatively small — arguably negligible – component of the empirical output of economists. And while that component appears to be growing, it is growing at a rate that is, for all practical purposes, inconsequential. I would like to move on to more prescriptive/normative subjects.
Before I can get there, however, I need to acknowledge that the assessment above relies on a very specific definition of a replication, and that the sample of replications on which it is based is primarily drawn from one data source: Replication Wiki. Is it possible that there are a lot more replications “out there” that are not being counted? More generally, is it even physically possible to know how many replications there are?
Is it possible to know how many replications there are?
One of the most comprehensive assessments of the number of replications in economics was done in a study by Frank Mueller-Langer, Benedikt Fecher, Dietmar Harhoff, and Gert Wagner, published in Research Policy in 2019 and blogged about here. ML et al. reviewed all articles published in the top 50 economics journals between 1974 and 2014. They calculated a “replication rate” of 0.1%. That is, 0.1% of all the articles in the top 50 economics journals during this time period were replication studies.
0.1% is likely an understatement of the overall replication rate in economics, as replications are likely to be underrepresented in the top journals. With 400 mainline economics journals, each publishing an average of approximately 100 articles a year, it is a daunting task to assess the replication rate for the whole discipline.
One possibility is to scrape the internet for economics articles and use machine learning algorithms to identify replications. In unpublished work, colleagues of mine at the University of Canterbury used “convolutional neural networks” to perform this task. They compared the texts of the replication studies listed at The Replication Network (TRN) with a random sample of economics articles from RePEc.
Their final analysis produced a false negative error rate (the rate at which replications are mistakenly classified as non-replications) of 17%. The false positive rate (the rate at which non-replications are mistakenly classified as replications) was 5%.
To give a better feel for what these numbers means, consider a scenario where the replication rate is 1%. Suppose we have a sample of 10,000 papers, of which 100 are replications. Applying the false negative and positive rates above produces the numbers in TABLE 1.
Given this sample, a researcher would identify 578 replications, of which 83 would be true replications, and 495 would be “false replications”, that is, non-replication studies falsely categorized as replication studies. One would have to get a false positive rate below 1% before even half of the identified “replications” were true replications. Given a relatively low replication rate (here 1%), it is obvious that it is highly unlikely that machine learning will ever be accurate enough to produce reliable estimates of the overall replication rate in the discipline.
A final alternative is to follow the procedure of ML et al., but choose a set of 50 journals outside the top economics journals. However, as reported in yesterday’s blog, replications tend to be clustered in a relatively small number of journals. Results of replication rates would likely depend greatly on the particular sample of journals that was used.
Putting the above together, the answer to the question “Is it possible to know how many replications there are” appears to be no.
I now move on to assessing what we have learned from the replications that have been done to date. Specifically, have replications uncovered a reproducibility problem in economics?
Is there a replication crisis in economics?
The last decade has seen increasing concern that science has a reproducibility problem. So it is fair to ask, is there a replication crisis in economics? Probably the most famous study of replication rates is the study by Brian Nosek and the Open Science Collaboration (Science, 2015) that assessed the replication rate of 100 experiments in psychology. They reported an overall “successful replication rate” of 39%. Similar studies focused more on economics report higher rates (see TABLE 2).
The next section will delve a little more into the meaning of “replication success”. For now, let’s first ask, what rate of success should we expect to see if science is performing as it is supposed to? In a blog for TRN (“The Statistical Fundamentals of (Non-)Replicability”), Jeff Miller considers the case where a replication is defined to be “successful” when it reproduces a statistically significant estimate reported in a previous study (see FIGURE 1 below).
FIGURE 1 assumes 1000 studies each assess a different treatment. 10% of the treatments are effective. 90% have no effect. Statistical significance is set at 5% and all studies have statistical power of 60%. The latter implies that 60 of the 100 studies with effective treatments produce significant estimates. The Type I error rate implies that 45 of the remaining 900 studies with ineffectual treatments also generate significant estimates. As a result, 105 significant estimates are produced from the initial set of 1000 studies.
If these 105 studies are replicated, one would expect to see approximately 38 significant estimates, leading to a replication “success rate” of 36% (see bottom right of FIGURE 1). Note that there is no publication bias here. No “file drawer effect”. Even when science works as it is supposed to, we should not expect a replication “success rate” of 100%. “Success rates” far less than 100% are perfectly consistent with well-functioning science.
Conclusion
Replications come in many sizes, shapes, and flavors. Even if we could agree on a common definition of a replication, it would be very challenging to make discipline-level conclusions about the number of replications that get published. Given the limitations of machine learning algorithms, there is no substitute for personally assessing each article individually. With approximately 400 mainline economics journals, each publishing approximately 100 articles a year, that is a monumental, seemingly insurmountable, challenge.
Beyond the problem of defining a replication, beyond the problem of defining “replication success”, there is the further problem of interpreting “success rates”. One might think that a 36% replication success rate was an indicator that science was failing miserably. Not necessarily so.
The final instalment of this series will explore these topics further. The goal is to arrive at an overall assessment of the potential for replications to make a substantial contribution to our understanding of economic phenomena (to read the next instalment, click here).
Bob Reed is a professor of economics at the University of Canterbury in New Zealand. He is also co-organizer of the blogsite The Replication Network and Principal Investigator at UCMeta. He can be contacted at bob.reed@canterbury.ac.nz.
This post is based on a keynote presentation I gave at the Editor’s Meeting of the International Journal for Re-Views of Empirical Economics in June 2020. It loosely follows up two previous attempts to summarize the state of replications in economics: (i) An initial paper by Maren Duvendack, Richard Palmer-Jones, and myself entitled “Replications in Economics: A Progress Report”, published in Econ Journal Watch in 2015; and (ii) a blog I wrote for The Replication Network (TRN) entitled “An Update on the Progress of Replications in Economics”, posted in October 2018.
In this instalment, I address two issues:
– Are there more replications in economics than there used to be?
– Which journals publish replications?
Are there more replications in economics than there used to be?
Before we count replications, we need to know what we are counting. Researchers use different definitions of replications, which produce different numbers. For example, at the time of this writing, Replication Wiki reports 670 replications at their website. In contrast, TRN, which relies heavily on Replication Wiki, lists 491 replications.
Why the difference? TRN employs a narrower definition of a replication. Specifically, it defines a replication as “any study published in a peer-reviewed journal whose main purpose is to determine the validity of one or more empirical results from a previously published study.”
Replications come in many sizes and shapes. For example, sometimes a researcher will develop a new estimator and want to see how it compares with another estimator. Accordingly, they replicate a previous study using the new estimator. An example is De Chaisemartin & d’Haultfoeuille’s “Fuzzy differences-in-differences” (Review of Economic Studies, 2018). D&H develop a DID estimator that accounts for heterogeneous treatment effects when the rate of treatment changes over time. To see the difference it makes, they replicate Duflo (2001) which uses a standard DID estimator.
Replication Wiki counts D&H as a replication. TRN does not. The reason TRN does not count D&H as a replication is because the main purpose of D&H is not to determine whether Duflo (2001) is correct. The main purpose of D&H is to illustrate the difference their estimator makes. This highlights the grey area that separates replications from other studies.
Reasonable people can disagree about the “best” definition of replication. I like TRN’s definition because it restricts attention to studies whose main goal is to determine “the truth” of a claim by a previous study. Studies that meet this criterion tend to be more intensive in their analysis of the original study and give it a more thorough empirical treatment. A further benefit is that TRN has consistently applied the same definition of replication over time, facilitating time series comparisons.
FIGURE 1 shows the growth in replications in economics over time. The graph is somewhat misleading because 2019 was an exceptional year, driven by special replication issues at the Journal of Development Studies, the Journal of Development Effectiveness, and, especially, Energy Economics. In contrast, 2020 will likely end up having closer to 20 replications. Even ignoring the big blip in 2019, it is clear that there has been a general upwards creep in the number of replications published in economics over time. It is, however, a creep, and not a leap. Given that there are approximately 40,000 articles published annually in Web of Science economics journals, the increase over time does not indicate a major shift in how the economics discipline values replications.
Which journals publish replications?
TABLE 1 reports the top 10 economics journals in terms of total number of replications published in their journal lifetimes. Over the years, a consistent leader in the publishing of replications has been the Journal of Applied Econometrics. In second place is the American Economic Review. However, an important distinction between these two journals is that JAE publishes both positive and negative replications; that is, replications that both confirm and refute the original studies. In contrast, the AER only very rarely publishes a positive replication.
There have been several new initiatives by journals to publish replications. Notably, the International Journal for Re-Views of Empirical Economics (IREE) was started in 2017 and is solely dedicated to the publishing of replications. It is an open access journal with no author processing charges (APCs), supported by a consortium of private and public funders. As of January 2021, it had published 10 replication studies.
To place the numbers in TABLE 1 in context, there are approximately 400 mainline economics journals. About one fourth (96) have ever published a replication. 2 journals account for approximately 25% of all replications that have ever been published. 9 journals account for over half of all replication studies. Only 25 journals (about 6% of all journals) have ever published more than 5 replications in their lifetimes.
Conclusion
While a little late to the party, economists have recently made noises about the importance of replication in their discipline. Notably, the 2017 Papers and Proceedings issue of the American Economic Review prominently featured 8 articles addressing various aspects of replications in economics. And indeed, there has been an increase in the number of replications over time. However, the growth in replications is best described as an upwards creep rather than a bold leap.
Perhaps the reason replications have not really caught on is because fundamental questions about replications have not been addressed. Is there a replication crisis in economics? How should “replication success” be measured? What is the “success rate” of replications in economics? How should the results of replications be interpreted? Do replications have a unique role to play in contributing to our understanding of economic phenomena? I take these up in subsequent instalments of this blog (to read the next instalment, click here).
Bob Reed is a professor of economics at the University of Canterbury in New Zealand. He is also co-organizer of the blogsite The Replication Network and Principal Investigator at UCMeta. He can be contacted at bob.reed@canterbury.ac.nz.
The Research network on Economic Experiments for the Common Agricultural Policy (REECAP) is an EU-wide informal consortium with the aim to promote and foster economic experimental designs and behavioural analysis in the context of evaluating European agricultural policies.
The network previously organised a roundtable discussion entitled “Reproducibility in Experimental Economics: Crisis or Opportunity?” (see this blog for a nice summary) as part of their annual event in Osnabrueck, Germany. The discussion acknowledged that the principle of reproducibility is a central tenet of experimental research to inform policy relevant decisions in the environmental and agricultural spheres. However, recent studies (e.g. Camerer et al 2018) cast doubts about the replicability of social science experiments, going as far as to say that the social sciences may experience a ‘replication crisis’.
This is the backdrop to the webinar REECAP is offering on October 19, 2020 from 10.30-12.30am CET, details of the programme can be found here. The webinar aims to discuss replications in agricultural economics, and it is of interest to researchers who wish to learn about the ‘replication crisis’, where it comes from and what has been done to tackle it so far. This event will also provide a forum for those interested in exploring participation in REECAP’s replication project that is aiming to coordinate the replication of experiments relevant for shaping agricultural policies.
These replications may then be submitted to a Special Issue in Applied Economic Perspectives and Policy – a call is forthcoming. However, this webinar is not just targeting researchers but also practitioners who may have an interest in research methodologies and who are curious to learn more about how best to improve the robustness of research findings, especially when these findings are used to inform policy relevant decisions in the context of agriculture.
Across scientific disciplines, researchers are increasingly questioning the credibility of empirical research. This research, they argue, is rife with unobserved decisions that aim to produce publishable results rather than accurate results. In fields where the results of empirical research are used to design policies and programs, these critiques are particularly concerning as they undermine the credibility of the science on which policies and programs are designed. In a paper published in the Review of Environmental Economics and Policy, we assess the prevalence of empirical research practices that could lead to a credibility crisis in the field of environmental and resource economics.
We looked at empirical environmental economics papers published between 2015 and 2018 in four top journals: The American Economic Review (AER), Environmental and Resource Economics (ERE), The Journal of the Association of Environmental and Resource Economics (JAERE), and The Journal of Environmental Economics and Management (JEEM). From 307 publications, we collected more than 21,000 test statistics to construct our dataset. We reported four key findings:
1. Underpowered Study Designs and Exaggerated Effect Sizes
As has been observed in other fields, the empirical designs used by environmental and resource economists are statistically underpowered, which implies that the magnitude and sign of the effects reported in their publications are unreliable. The conventional target for adequate statistical power in many fields of science is 80%. We estimated that, in environmental and resource economics, the median power of study designs is 33%, with power less than 80% for nearly two out of the three estimated parameters. When studies are underpowered and when scientific journals are more likely to publish results that pass conventional tests of statistical significance – tests that can only be passed in underpowered designs when the estimated effect is much larger than the true effect size – these journals will tend to be publish exaggerated effect sizes. We estimated that 56% of the reported effect sizes in the environmental and resource economics literature are exaggerated by a factor of two or more; 35% are exaggerated by a factor of four or more.
2. Selective Reporting of Statistical Significance or “p-hacking”
Researchers face strong professional incentives to report statistically significant results, which may lead them to selectively report results from their analyses. One indicator of selective reporting is an unusual pattern in the distribution of test statistics; specifically, a double-humped distribution around conventionally accepted values of statistical significance. In the figure below, we present the distribution of test statistics for the estimates in our sample, where 1.96 is the conventional value for statistical significance (p<0.05). The unusual dip just before 1.96, is consistent with selective reporting of results that are above the conventionally accepted level of statistical significance.
3. Multiple Comparisons and False Discoveries
Repeatedly testing the same data set in multiple ways increases the probability of making false (spurious) discoveries, a statistical issue that is often called the “multiple comparisons problem.” To mitigate the probability of false discoveries when testing more than one related hypothesis, researchers can adopt a range of approaches. For example, they can ensure the false discovery rate is no larger than a pre-specified level. These approaches, however, are rare in the environmental and resource economics literature: 63% of the studies in our sample conducted multiple hypothesis tests, but less than 2% of them used an accepted approach to mitigate the multiple comparisons problem.
4. Questionable Research Practices (QRPs)
To better understand empirical research practices in the field of environmental and resource economics, we also conducted a survey of members of the Association of Environmental and Resource Economists (AERE) and the European Association of Environmental and Resource Economists (EAERE). In the survey, we asked respondents to self-report whether they had engaged in research practices that other scholars have labeled “questionable”. These QRPs include selectively reporting only a subset of dependent variables or analyses conducted, hypothesizing after results are known (also called HaRKing), choosing regressors or re-categorizing data after looking at the results, etc. Although one might assume that respondents would be unlikely to self-report engaging in such practices, 92% admitted to engaging in at least one QRP.
Recommendations for Averting a Replication Crisis
To help improve the credibility of the environmental and resource economics literature, we recommended changes to the current incentive structures for researchers.
– Editors, funders, and peer reviewers should emphasize the designs and research questions more than results, abolish conventional statistical significance cut-offs, and encourage the reporting of statistical power for different effect sizes.
– Authors should distinguish between exploratory and confirmatory analyses, and reviewers should avoid punishing authors for exploratory analyses that yield hypotheses that cannot be tested with the available data.
– Authors should be required to be transparent by uploading to publicly-accessible, online repositories the datasets and code files that reproduce the manuscript’s results, as well as results that may have been generated but not reported in the manuscript because of space constraints or other reasons. Authors should be encouraged to report everything, and reviewers should avoid punishing them for transparency.
– To ensure their discipline is self-correcting, environmental and resource economists should foster a culture of open, constructive criticism and commentary. For example, journals should encourage the publication of comments on recent papers. In a flagship field journal, JAERE, we could find no published comments in the last five years.
– Journals should encourage and reward pre-registration of hypotheses and methodology, not just for experiments, but also for observational studies for which pre-registrations are rare. We acknowledge in our article that pre-registration is no panacea for eliminating QRPs, but we also note that, in other fields, it has been shown to greatly reduce the frequency of large, statistically significant effect estimates in the “predicted” direction.
– Journals should also encourage and reward replications of influential, innovative, or controversial empirical studies. To incentivize such replications, we recommend that editors agree to review a replication proposal as a pre-registered report and, if satisfactory, agree to publish the final article regardless of whether it confirms, qualifies, or contradict the original study.
Ultimately, however, we will continue to rely on researchers to self-monitor their decisions concerning data preparation, analysis, and reporting. To make that self-monitoring more effective, greater awareness of good and bad research practices is critical. We hope that our publication contributes to that greater awareness.
Paul J. Ferraro is the Bloomberg Distinguished Professor of Human Behavior and Public Policy at Johns Hopkins University. Pallavi Shukla is a Postdoctoral Research Fellow at the Department of Environmental Health and Engineering at Johns Hopkins University. Correspondence regarding this blog can be sent to Dr. Shukla at pshukla4@jhu.edu.