[From the blog “The “80% power” lie” posted by Andrew Gelman in December 2017 at Statistical Modeling, Causal Inference, and Social Science]
“Suppose we really were running studies with 80% power. In that case, the expected z-score is 2.8, and 95% of the time we’d see z-scores between 0.8 and 4.8. Let’s open up the R:”
“> 2*pnorm(-0.8) [1] 0.42”
“> 2*pnorm(-4.8) [1] 1.6e-06”
“So we should expect to routinely see p-values ranging from 0.42 to . . . ummmm, 0.0000016. And those would be clean, pre-registered p-values, no funny business, no researcher degrees of freedom, no forking paths.”
“Let’s explore further . . . the 75th percentile of the normal distribution is 0.67, so if we’re really running studies with 80% power, then one-quarter of the time we’d see z-scores above 2.8 + 0.67 = 3.47.”
“> 2*pnorm(-3.47) [1] 0.00052”
“Dayum. We’d expect to see clean, un-hacked p-values less than 0.0005, at least a quarter of the time, if we were running studies with minimum 80% power, as we routinely claim we’re doing, if we ever want any of that sweet, sweet NIH funding. And, yes, that’s 0.0005, not 0.005. There’s a bunch of zeroes there.”
“And, no, this ain’t happening. We don’t have 80% power. Heck, we’re lucky if we have 6% power.”
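For readers who want to retrace the arithmetic, here is a minimal R sketch of the calculation in the excerpt (the excerpt rounds the interval endpoints to 0.8 and 4.8 before converting them to p-values):

# Expected z-score under a two-sided 5% test with 80% power:
# qnorm(0.975) + qnorm(0.80) = 1.96 + 0.84, i.e., about 2.8.
z_expected <- qnorm(0.975) + qnorm(0.80)
# 95% of observed z-scores fall within 1.96 of the expected value:
z_range <- z_expected + c(-1, 1) * qnorm(0.975)   # roughly 0.8 to 4.8
2 * pnorm(-z_range)                               # two-sided p-values, about 0.4 and 2e-06
# The 75th percentile of observed z-scores and its p-value:
z_q75 <- z_expected + qnorm(0.75)                 # about 3.47
2 * pnorm(-z_q75)                                 # about 0.0005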
[From the working paper, “Methods Matter: P-Hacking and Causal Inference in Economics” by Abel Brodeur, Nikolai Cook, and Anthony Heyes]
“…Applying multiple methods to 13,440 hypothesis tests reported in 25 top economics journals in 2015, we show that selective publication and p-hacking is a substantial problem in research employing DID and (in particular) IV methods. RCT and RDD are much less problematic. Almost 25% of claims of marginally significant results in IV papers are misleading.”
[From the paper, “Practical Tools and Strategies for Researchers to Increase Replicability” by Michele Nuijten, forthcoming in Developmental Medicine & Child Neurology]
“Several large-scale problems are affecting the validity and reproducibility of scientific research. … Many of the suggested solutions are systemic, focused on top-down implementation, or on the training of students, but many other solutions are practical tools and strategies that researchers can immediately implement in their own workflow. Researchers can use online tools to double check reported statistics, follow online courses to improve their statistical inference, or use different statistical frameworks. They can also start to preregister their research plans, engage in multi-lab collaborations, and increase their level of transparency and openness.”
To read the full article, click here.
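The “online tools to double check reported statistics” that the excerpt mentions automate a simple consistency check: recompute the p-value implied by the reported test statistic and degrees of freedom and compare it to the reported p-value. Here is a hand-rolled version of that check in R, with made-up numbers for illustration (this shows the idea only, not any particular tool’s implementation):

# Hypothetical reported result: "t(28) = 2.20, p = .04" (numbers invented for illustration).
t_stat <- 2.20
df     <- 28
p_recomputed <- 2 * pt(-abs(t_stat), df)   # two-sided p-value implied by t and df
round(p_recomputed, 3)                      # about 0.036, consistent with the reported .04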
[From an editorial published in Nature entitled, “Referees should exercise their rights”]
“At Nature, we recognize that our peer reviewers have certain ‘rights’. One of the most well known is the right to anonymity. Less widely known is that referees have the right to view the data and code that underlie a work if it would help in the evaluation, even if these have not been provided with the submission. Yet few referees exercise this right. They should do so.”
[From the abstract of the article, “Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results”, published by Silberzahn et al. in Advances in Methods and Practices in Psychological Science]
[From the abstract of the article “Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015”, published in Nature Human Behaviour by Colin Camerer et al.]
“Being able to replicate scientific findings is crucial for scientific progress. We replicate 21 systematically selected experimental studies in the social sciences published in Nature and Science between 2010 and 2015. The replications follow analysis plans reviewed by the original authors and pre-registered prior to the replications. The replications are high powered, with sample sizes on average about five times higher than in the original studies. We find a significant effect in the same direction as the original study for 13 (62%) studies, and the effect size of the replications is on average about 50% of the original effect size. Replicability varies between 12 (57%) and 14 (67%) studies for complementary replicability indicators. Consistent with these results, the estimated true-positive rate is 67% in a Bayesian analysis. The relative effect size of true positives is estimated to be 71%, suggesting that both false positives and inflated effect sizes of true positives contribute to imperfect reproducibility. Furthermore, we find that peer beliefs of replicability are strongly related to replicability, suggesting that the research community could predict which results would replicate and that failures to replicate were not the result of chance alone.”
To read the article, click here.
The study has generated much buzz in the media:
Buzzfeed: “In the last four years, 125 journals, mostly in the behavioral sciences, have adopted “registered reports,” …Similarly, more than 20,000 studies have been preregistered on the Center for Open Science’s website.”
Science News: “‘Replication crisis’ spurs reforms in how science studies are done”.
The Atlantic: “Fortunately, there are signs of progress. The number of pre-registered experiments—in which researchers lay out all their plans beforehand to obviate the possibility of p-hacking—has been doubling every year since 2012.”
NPR: “‘The social-behavioral sciences are in the midst of a reformation’… Scientists are…announcing in advance the hypothesis they are testing; they are making their data and computer code available so their peers can evaluate and check their results.”
Wired: “Thousands of researchers now pre-register their methodology and hypothesis before publication, to head off concerns that they’ll massage data after the fact.”
Washington Post: “the experiments scrutinized in this latest effort were published prior to a decision several years ago by Science, Nature and other journals to adopt new guidelines designed to increase reproducibility, in part by greater sharing of data”.
[From the blog, “Open science is now the only way forward for psychology” by Chris Chambers and Pete Etchells, posted in the Science Blog Network at http://www.theguardian.com]
“When we launched Head Quarters five years ago, psychology was in a pretty dark place. The field was still reeling from the impact of the Diederik Stapel fraud case – the largest perpetrated in psychology and one of the greatest ever uncovered in science. At the same time, a cascade of failures to replicate major findings was just beginning, and as if to add insult to injury, one of psychology’s most prestigious journals published a study claiming to confirm, of all things, the existence of psychic powers.”
“Psychologists were faced with one inescapable conclusion: that the research culture in the field was fundamentally flawed and needed urgent attention. Five years later, the field has taken some big steps forward toward righting the ship. Let’s take a look at some of those improvements.”
Replication is an important topic in economic research, or any social science for that matter. This issue is most important when an analysis is undertaken to inform decisions by policymakers. Drawing inferences from a null or insignificant finding is particularly problematic because it is often unclear when “not significant” can be interpreted as “no effect.” We recently wrestled with this issue in our paper, “The Effect of the Conservation Reserve Program on Rural Economies: Deriving a Statistical Verdict from a Null Finding,” published in the American Journal of Agricultural Economics. Below is a summary of our findings.
While an inherent bias to publish research with significant findings is widely recognized, there are times when not finding an effect may be more important. For example, suggestive evidence that a policy may not work is arguably more consequential than statistical confirmation that it does. The conundrum produced by null findings is having no statistical basis for determining whether the true effect is close to zero or whether the test is underpowered—that is, unlikely to detect a substantive effect. Our paper developed a method for deriving probabilities for null findings by providing a valid ex post estimate of statistical power. This allows economists and policymakers to more confidently conclude when “not significant” can, in fact, be interpreted as “no substantive effect.”
We demonstrate our method by replicating an analysis from the Economic Research Service’s (ERS) 2004 Report to Congress on the economic implications of the Conservation Reserve Program (CRP). The program, which was signed into law in 1985, was designed to remove environmentally vulnerable land from agricultural production. However, farm-dependent counties experienced both employment and population declines through the economically prosperous 1990s, raising concerns that the program might have cost jobs due to a reduction in agricultural production. Indeed, the ERS report identified worse employment growth in farm-dependent counties with high-CRP enrollments relative to their low-CRP enrollment peers. However, the report was unable to attribute lost employment to CRP enrollments.
While the report failed to identify a statistically significant, negative long-term effect of the program on employment growth, the authors cautioned that the verdict of “no negative employment effect” was only valid if the econometric test was statistically powerful. Replicating the 2004 analysis using new statistical inference methods allowed us to determine whether the tentative 2004 conclusion was correct. Our replication addresses two critical deficiencies that prevent economists from estimating statistical power: 1) we posit a compelling effect size, namely the level of job losses that would raise concerns regarding the trade-off with environmental benefits; and 2) we estimate the variability of an unobserved alternative distribution using simulation methods. We conclude that the test used in the ERS report had high power for detecting employment effects of −1 percent or lower, equivalent to job losses that would reduce the program’s environmental benefits by a third. An unrestricted test in line with Congress’s charge to search for “any effect” had very low power.
In many circumstances, economists do not have the opportunity to conduct a power analysis before research starts. The approaches we suggest can be used to determine power for univariate analyses or multivariate regressions after the fact, provided the data-generating process can be replicated and the effect size of economic significance or policy relevance is stated. Given a range of posited effect sizes, our approach supplements an array of tools to inform decision-making in the event of a null finding.
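As a rough illustration of what such an after-the-fact power calculation can look like, here is a minimal R sketch of simulation-based power against a posited effect size. It assumes a simple one-regressor setting with invented sample size, effect size, and noise level; it is not the CRP analysis or the authors’ actual procedure:

# A minimal sketch of simulation-based ex post power, assuming a simple
# one-regressor regression (illustrative assumptions throughout).
set.seed(1)
n           <- 500     # sample size matching the data at hand
effect_size <- -0.01   # posited economically meaningful effect (e.g., -1 percent)
sigma       <- 0.05    # residual standard deviation estimated from the data
n_sims      <- 2000

rejections <- replicate(n_sims, {
  x <- rnorm(n)                                   # regressor drawn from the replicated DGP
  y <- effect_size * x + rnorm(n, sd = sigma)     # outcome simulated under the posited effect
  fit <- summary(lm(y ~ x))
  fit$coefficients["x", "Pr(>|t|)"] < 0.05        # did the test reject at the 5% level?
})
mean(rejections)   # estimated power against the posited effect size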
In the spirit of replication, you can find our data and code in the supporting documentation of the article. If you are not able to access the article, the supplemental materials are also available here. We hope that others confronted with the “null hypothesis lacking error probability” conundrum will consider using the methods as a tool for making null findings potentially more informative, and for making our toolkit of applied econometric methods more useful for decision-making.
Jason P. Brown is an assistant vice president and economist at the Federal Reserve Bank of Kansas City. Dayton M. Lambert is a professor and Willard Sparks Chair, Department of Agricultural Economics, Oklahoma State University. Timothy R. Wojan is a senior economist, USDA, Economic Research Service. The opinions expressed are those of the authors and are not attributable to the Federal Reserve Bank of Kansas City, the Federal Reserve System, Oklahoma State University, the Economic Research Service, or USDA. Correspondence can be directed to Jason Brown at Jason.Brown@kc.frb.org.
[From the website of the journal Japanese Journal of Political Science, published by Cambridge University Press]
The website of the Japanese Journal of Political Science recently announced that it was allowing authors to select “results-blind” reviewing as an alternative to traditional “full paper” review (see below).

This is in addition to the journal’s policy of encouraging the submission of both “reanalysis” and “replication” manuscripts.

[From the article, “Will Facebook’s New Research Initiative Make The Replication Crisis Worse?” by Kalev Leetaru, published online at Forbes.com]
“The era of “big data” has transformed our understanding of the human world, making it possible for researchers to study billions of users at once, while at the same time making it impossible for other researchers to replicate their work. The ever-larger datasets that increasingly define modern quantitative social science are controlled by an ever-shrinking number of researchers that have exclusive access to the digital riches of our modern world. Facebook’s new academic research initiative with Social Science One was supposed to fix all of this, granting researchers across the world access to the private data of Facebook’s two billion users to mine, but the unanswered question is how initiatives like Social Science One will address the replication crisis.”
“…When it comes to the replication process, however, Social Science One has remained nearly entirely silent.”