[From the working paper, “How Often Should We Believe Positive Results? Assessing the Credibility of Research Findings in Development Economics” by Aidan Coville and Eva Vivalt]
Over $140 billion is spent on donor assistance to developing countries annually to promote economic development. To improve the impact of these funds, aid agencies both produce and consume evidence about the effects of development interventions to inform policy recommendations. But how reliable is the evidence that development practitioners use? Given the “replication crisis” in psychology, we may wonder how studies in international development stack up.
There are several reasons that a study could fail to replicate. First, implementation or context may change between the original study and the replication, particularly in field settings, where most applied development economics research takes place. Second, publication bias can distort the published record. Finally, studies may fail to replicate for purely statistical reasons. Our analysis focuses on this last issue, especially as it relates to statistical power.
Ask a researcher what they think a reasonable power level is for a study and, inevitably, the answer will be “at least 80%”. The textbook notion of “reasonable” and the reality are, however, quite different. Reviews of the medical, economics, and general social science literatures estimate median power to be in the range of 8%–24% (Button et al., 2013; Ioannidis et al., forthcoming; Smaldino & McElreath, 2016). Low power reduces the likelihood of identifying an effect when it is present. Importantly, however, it also increases the likelihood that a statistically significant result is spurious or exaggerated (Gelman & Carlin, 2014). In other words, the likelihood of false negatives and false positives depends critically on the power of the study.
To explore this issue, we follow Wacholder et al.’s (2004) “false positive report probability” (FPRP), an application of Bayes’ rule that combines a study’s power, its significance level, and the prior belief that the intervention has a meaningful impact to estimate the likelihood that a statistically significant effect is spurious. Using this approach, Ioannidis (2005) estimates that more than half of the significant published literature in the biomedical sciences could be false. A recent paper by Ioannidis et al. (2017) finds that 90% of the broader economics literature is underpowered. As further measures of study credibility, we explore Gelman and Tuerlinckx’s (2000) errors of sign (Type S errors) and magnitude (Type M errors): respectively, the probability that a given significant result has the wrong sign and the factor by which it is likely exaggerated relative to the true effect.
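The FPRP calculation itself is a one-line application of Bayes’ rule. A minimal sketch follows; the function name and the example inputs are illustrative, not values from the paper:

```python
def fprp(alpha: float, power: float, prior: float) -> float:
    """False positive report probability (Wacholder et al., 2004):
    the chance that a statistically significant finding is spurious,
    given the significance level, the study's power, and the prior
    probability that a true effect exists."""
    false_pos = alpha * (1 - prior)   # true nulls that reach significance
    true_pos = power * prior          # real effects that reach significance
    return false_pos / (false_pos + true_pos)

# Hypothetical inputs: a well-powered study (power = 0.80) ...
print(round(fprp(0.05, 0.80, 0.60), 3))  # 0.04
# ... versus the same study at 20% power: significant results
# become far less trustworthy
print(round(fprp(0.05, 0.20, 0.60), 3))  # 0.143
```

Holding the significance level and prior fixed, the only lever is power: as power falls, a larger share of the significant results we observe come from true nulls.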
In order to calculate these statistics for a particular study, an informed estimate of the underlying “true” effect of the intervention being studied is needed. The standard approach in the literature is to use meta-analysis results as the benchmark. This works in settings where a critical mass of evidence is available, but such evidence often does not exist, and meta-analysis results may themselves be biased depending on which studies are included. As an alternative way to estimate the likely “true” effect size of each study’s intervention, we gathered up to five predictions from each of 125 experts, covering 130 different results across typical interventions in development economics. We used these predictions to estimate the power of each study and, from it, the false positive and false negative report probabilities. To focus on the most well-studied topics within development, we looked at the literature on cash transfers, deworming programs, financial literacy training, microfinance programs, and programs that provided insecticide-treated bed nets.
Our findings in this subset of studies are less dramatic than estimates for other disciplines. The median power was estimated to be between 18% and 59%, driven largely by large-scale conditional cash transfer programs. Across interventions, experts predict a meaningful impact approximately 60% of the time. With these inputs, we calculate the median FPRP to be between 0.001 and 0.008, compared to a median significant p-value of 0.002. The likelihood of a significant effect having the wrong sign (Type S error) is close to 0, while the median exaggeration factor (Type M error) of significant results is estimated to be between 1.2 and 2.2.
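Type S and Type M errors can be approximated by simulation, in the spirit of Gelman and Carlin’s (2014) “retrodesign” exercise: draw estimates around a hypothesized true effect and see what the statistically significant ones look like. A minimal sketch, where the true effect and standard error are illustrative assumptions rather than values from the paper:

```python
import random

def retrodesign(true_effect, se, z_crit=1.96, n_sims=100_000, seed=0):
    """Simulate power, Type S error, and Type M exaggeration for a study
    whose estimate is roughly Normal(true_effect, se)."""
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, se) for _ in range(n_sims)]
    significant = [est for est in estimates if abs(est / se) > z_crit]
    power = len(significant) / n_sims
    # Type S: share of significant results with the wrong sign
    type_s = sum(est * true_effect < 0 for est in significant) / len(significant)
    # Type M: average factor by which significant results overstate the truth
    type_m = sum(abs(est) for est in significant) / len(significant) / abs(true_effect)
    return power, type_s, type_m

# Hypothetical underpowered study: true effect equal to one standard error
power, type_s, type_m = retrodesign(true_effect=1.0, se=1.0)
```

With these inputs, power comes out near 17%, the Type S error is close to zero, and the exaggeration factor is above 2: significant results from an underpowered study rarely get the sign wrong, but they substantially overstate the effect, which is the pattern the Type M estimates above capture.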
In short, the majority of studies reviewed fare exceptionally well, particularly when compared with other disciplines that have performed similar exercises. We must emphasize that other study topics in development economics not covered in this review may be less credible; conditional cash transfer programs, in particular, tend to have very large sample sizes and thus low p-values. The broader contribution of the paper is to highlight how analysis of study power and the systematic collection of priors can help assess the quality of research, and we hope to see more work in this vein in the future.
Aidan Coville is an Economist in the Development Impact Evaluation Team (DIME) of the Development Research Group at the World Bank. Eva Vivalt is a Lecturer in the Research School of Economics at Australian National University. They can be contacted at firstname.lastname@example.org and email@example.com, respectively.