Across scientific disciplines, researchers are increasingly questioning the credibility of empirical research. This research, they argue, is rife with unobserved decisions that aim to produce publishable results rather than accurate ones. These critiques are particularly concerning in fields where empirical results are used to design policies and programs, because they undermine the credibility of the science on which those policies and programs rest. In a paper published in the Review of Environmental Economics and Policy, we assess the prevalence of empirical research practices that could lead to a credibility crisis in the field of environmental and resource economics.
We looked at empirical environmental economics papers published between 2015 and 2018 in four top journals: The American Economic Review (AER), Environmental and Resource Economics (ERE), The Journal of the Association of Environmental and Resource Economists (JAERE), and The Journal of Environmental Economics and Management (JEEM). From 307 publications, we collected more than 21,000 test statistics to construct our dataset. We reported four key findings:
1. Underpowered Study Designs and Exaggerated Effect Sizes
As has been observed in other fields, the empirical designs used by environmental and resource economists are statistically underpowered, which implies that the magnitude and sign of the effects reported in their publications are unreliable. The conventional target for adequate statistical power in many fields of science is 80%. We estimated that, in environmental and resource economics, the median power of study designs is 33%, and that power falls below 80% for nearly two out of every three estimated parameters. When studies are underpowered and scientific journals are more likely to publish results that pass conventional tests of statistical significance – tests that an underpowered design can only pass when the estimated effect is much larger than the true effect – these journals will tend to publish exaggerated effect sizes. We estimated that 56% of the reported effect sizes in the environmental and resource economics literature are exaggerated by a factor of two or more; 35% are exaggerated by a factor of four or more.
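The mechanics behind this exaggeration can be illustrated with a short simulation. The sketch below (in Python) draws estimates from a design whose power is roughly the 33% median we report; the true effect and standard error are illustrative choices, not values from our dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

true_effect = 1.0   # hypothetical true effect size (illustrative)
se = 0.66           # standard error chosen so power is roughly 33%
n_sims = 100_000

# Simulate estimates from the sampling distribution of the estimator.
estimates = rng.normal(true_effect, se, n_sims)
z = estimates / se

# Keep only estimates that pass the conventional 5% significance test.
significant = estimates[np.abs(z) > 1.96]

power = significant.size / n_sims
exaggeration = np.mean(np.abs(significant)) / true_effect

print(f"power: {power:.2f}")
print(f"mean exaggeration among significant estimates: {exaggeration:.2f}x")
```

In this design, only about a third of simulated studies reach significance, and the estimates that do are, on average, well over 50% larger than the true effect – the published record would overstate the effect even though every individual analysis was conducted honestly.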
2. Selective Reporting of Statistical Significance or “p-hacking”
Researchers face strong professional incentives to report statistically significant results, which may lead them to selectively report only some of their analyses. One indicator of selective reporting is an unusual pattern in the distribution of test statistics; specifically, a double-humped distribution around conventionally accepted values of statistical significance. In the figure below, we present the distribution of test statistics for the estimates in our sample, where 1.96 is the conventional threshold for statistical significance (p < 0.05). The unusual dip just before 1.96 is consistent with selective reporting of results that clear the conventionally accepted level of statistical significance.
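One simple way to formalize that visual check is a caliper-style comparison: count test statistics in a narrow window just below the threshold and in an equally wide window just above it. Under a smooth distribution the two counts should be similar; a deficit just below 1.96 is consistent with selective reporting. The sketch below uses simulated z-statistics (not our dataset), with a subset of marginal results artificially "nudged" over the threshold to mimic selective reporting:

```python
import numpy as np

def caliper_counts(z_stats, threshold=1.96, width=0.20):
    """Count test statistics just below vs. just above a significance
    threshold. A deficit in the lower window is consistent with
    selective reporting around the threshold."""
    z = np.abs(np.asarray(z_stats))
    below = np.sum((z >= threshold - width) & (z < threshold))
    above = np.sum((z >= threshold) & (z < threshold + width))
    return int(below), int(above)

rng = np.random.default_rng(1)
z = np.abs(rng.normal(1.0, 1.0, 10_000))          # "honest" z-statistics
# Mimic selective reporting: push marginal results past 1.96.
hacked = np.where((z > 1.6) & (z < 1.96), z + 0.4, z)

print("honest:", caliper_counts(z))
print("hacked:", caliper_counts(hacked))
```

For the honest series the two windows hold comparable counts; for the manipulated series the window just below 1.96 empties out while the window just above it swells – the same dip-and-hump pattern visible in the figure.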
3. Multiple Comparisons and False Discoveries
Repeatedly testing the same data set in multiple ways increases the probability of making false (spurious) discoveries, a statistical issue that is often called the “multiple comparisons problem.” To mitigate the probability of false discoveries when testing more than one related hypothesis, researchers can adopt a range of approaches. For example, they can ensure the false discovery rate is no larger than a pre-specified level. These approaches, however, are rare in the environmental and resource economics literature: 63% of the studies in our sample conducted multiple hypothesis tests, but less than 2% of them used an accepted approach to mitigate the multiple comparisons problem.
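As one concrete example of such an approach, the Benjamini–Hochberg step-up procedure keeps the expected false discovery rate at or below a chosen level. A minimal sketch, using hypothetical p-values (not drawn from our sample):

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: return a boolean array
    marking which hypotheses can be rejected while keeping the
    expected false discovery rate at or below alpha."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    sorted_p = p[order]
    # Find the largest k with p_(k) <= (k/m) * alpha; reject 1..k.
    thresholds = alpha * np.arange(1, m + 1) / m
    passing = np.nonzero(sorted_p <= thresholds)[0]
    reject = np.zeros(m, dtype=bool)
    if passing.size:
        k = passing.max()
        reject[order[: k + 1]] = True
    return reject

# Ten p-values from related hypothesis tests (hypothetical numbers).
p_vals = [0.001, 0.008, 0.012, 0.041, 0.048, 0.12, 0.22, 0.45, 0.60, 0.88]
print(benjamini_hochberg(p_vals))
```

In this example, a naive p < 0.05 cutoff would declare five "discoveries," while the procedure rejects only the three smallest p-values – the other two marginal results are exactly the kind of finding most likely to be spurious when many related hypotheses are tested.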
4. Questionable Research Practices (QRPs)
To better understand empirical research practices in the field of environmental and resource economics, we also conducted a survey of members of the Association of Environmental and Resource Economists (AERE) and the European Association of Environmental and Resource Economists (EAERE). In the survey, we asked respondents to self-report whether they had engaged in research practices that other scholars have labeled "questionable". These QRPs include selectively reporting only a subset of the dependent variables or analyses conducted, hypothesizing after the results are known (also called HARKing), and choosing regressors or re-categorizing data after looking at the results. Although one might assume that respondents would be unlikely to self-report engaging in such practices, 92% admitted to engaging in at least one QRP.
Recommendations for Averting a Replication Crisis
To help improve the credibility of the environmental and resource economics literature, we recommended changes to the current incentive structures for researchers.
– Editors, funders, and peer reviewers should emphasize the designs and research questions more than results, abolish conventional statistical significance cut-offs, and encourage the reporting of statistical power for different effect sizes.
– Authors should distinguish between exploratory and confirmatory analyses, and reviewers should avoid punishing authors for exploratory analyses that yield hypotheses that cannot be tested with the available data.
– Authors should be required to be transparent by uploading the datasets and code files that reproduce the manuscript's results to publicly accessible online repositories, along with results that were generated but not reported in the manuscript because of space constraints or other reasons. Authors should be encouraged to report everything, and reviewers should avoid punishing them for transparency.
– To ensure their discipline is self-correcting, environmental and resource economists should foster a culture of open, constructive criticism and commentary. For example, journals should encourage the publication of comments on recent papers. In JAERE, a flagship field journal, we could find no published comments in the last five years.
– Journals should encourage and reward pre-registration of hypotheses and methodology, not just for experiments, but also for observational studies for which pre-registrations are rare. We acknowledge in our article that pre-registration is no panacea for eliminating QRPs, but we also note that, in other fields, it has been shown to greatly reduce the frequency of large, statistically significant effect estimates in the “predicted” direction.
– Journals should also encourage and reward replications of influential, innovative, or controversial empirical studies. To incentivize such replications, we recommend that editors agree to review a replication proposal as a pre-registered report and, if satisfactory, agree to publish the final article regardless of whether it confirms, qualifies, or contradicts the original study.
Ultimately, however, we will continue to rely on researchers to self-monitor their decisions concerning data preparation, analysis, and reporting. To make that self-monitoring more effective, greater awareness of good and bad research practices is critical. We hope that our publication contributes to that greater awareness.
Paul J. Ferraro is the Bloomberg Distinguished Professor of Human Behavior and Public Policy at Johns Hopkins University. Pallavi Shukla is a Postdoctoral Research Fellow at the Department of Environmental Health and Engineering at Johns Hopkins University. Correspondence regarding this blog can be sent to Dr. Shukla at firstname.lastname@example.org.