[The following is an adaptation of (and in large parts identical to) a recent blog post by Anne Scheel that appeared on The 100% CI.]
Many, probably most, empirical scientists use frequentist statistics to decide if a hypothesis should be rejected or accepted, in particular null hypothesis significance testing (NHST).
NHST works when we have access to all statistical tests that are being conducted. That way, we should at least in theory be able to see the 19 null results accompanying every statistical fluke (assuming an alpha level of 5%) and decide that effect X probably does not exist. But publication bias throws this off-kilter: When only or mainly significant results end up being published, whereas null results get p-hacked, file-drawered, or rejected, it becomes very difficult to tell false positive from true positive findings.
The number of true findings in the published literature depends on something significance tests can’t tell us: The base rate of true hypotheses we’re testing. If only a very small fraction of our hypotheses are true, we could always end up with more false positives than true positives (this is one of the main points of Ioannidis’ seminal 2005 paper).
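To make the base-rate point concrete, here is a minimal sketch. It assumes each test is either of a true or a false hypothesis, with a fixed alpha and a fixed power for all tests. Under those assumptions, expected false positives outnumber expected true positives whenever (1 - base rate) × alpha > base rate × power, which rearranges to base rate < alpha / (alpha + power). The function name and the example power levels are mine, chosen only for illustration.

```python
def break_even_base_rate(power, alpha=0.05):
    """Base rate of true hypotheses below which expected false positives
    outnumber expected true positives.

    Derived from (1 - b) * alpha > b * power  <=>  b < alpha / (alpha + power).
    Assumes a single alpha and a single power for all tests.
    """
    return alpha / (alpha + power)


# Illustrative power levels (hypothetical, not taken from any study)
for power in (0.50, 0.80, 0.95):
    threshold = break_even_base_rate(power)
    print(f"power = {power:.0%}: false positives dominate "
          f"if fewer than {threshold:.1%} of hypotheses are true")
```

With 80% power and a 5% alpha level, the break-even point sits at a base rate of roughly 5.9%.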
When Felix Schönbrodt and Michael Zehetleitner released this great Shiny app a while ago, I remember having some vivid discussions with Felix about what the rate of true hypotheses in psychology may be. In his very nice accompanying blog post, Felix included a flowchart assuming that 30% of all tested hypotheses are true. At the time I found this grossly pessimistic: Surely our ability to develop hypotheses can’t be worse than a coin flip? We spent years studying our subject! We have theories! We are really smart! I honestly believed that the rate of true hypotheses we study should be at least 60%.
A few months ago, this interesting paper by Johnson, Payne, Wang, Asher, & Mandal came out. They re-analysed 73 effects from the Reproducibility Project: Psychology and tried to model publication bias. I have to admit that I’m not maths-savvy enough to understand their model and judge their method, but they estimate that over 700 hypothesis tests were run to produce these 73 significant results. They assume that the statistical power for tests of true hypotheses was 75%, and that 7% of the tested hypotheses were true. Seven percent.
Er, ok, so not 60% then. To be fair to my naive 2015 self: this number refers to all hypothesis tests that were conducted, including p-hacking. That includes the one ANOVA main effect, the other main effect, the interaction effect, the same three tests without outliers, the same six tests with age as covariate, … and so on.
Let’s see what these numbers mean for the rates of true and false findings. For this we will need the positive predictive value (PPV) and the negative predictive value (NPV). I tend to forget what exactly they and their two siblings, FDR and FOR, stand for and how they are calculated, so I added the table above as a cheat sheet.
Ok, now that we’ve got that out of the way, let’s stick the numbers estimated by Johnson et al. into a flowchart. You see that the positive predictive value is shockingly low: Of all significant results, only 53% are true. Wow. I must admit that even after reading Ioannidis (2005) several times, this hadn’t quite sunk in. If the 7% estimate is anywhere near the true rate, that basically means that we can flip a coin any time we see a significant result to estimate if it reflects a true effect.
But I want to draw your attention to the negative predictive value. The chance that a non-significant finding reflects a true null (that is, that there really is no effect) is 98%! Isn’t that amazing and heartening? In this scenario, null results are vastly more informative than significant results.
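For anyone who wants to check the arithmetic, here is a small sketch that recomputes the four quantities from the three inputs used above: a 7% base rate of true hypotheses, 75% power, and a 5% alpha level. The function and variable names are my own, and the calculation assumes every test shares the same alpha and power.

```python
def predictive_values(base_rate, power, alpha=0.05):
    """Return PPV, NPV, FDR and FOR as proportions of all tests run."""
    true_pos = base_rate * power                # true effect, significant
    false_neg = base_rate * (1 - power)         # true effect, non-significant
    false_pos = (1 - base_rate) * alpha         # no effect, significant
    true_neg = (1 - base_rate) * (1 - alpha)    # no effect, non-significant

    ppv = true_pos / (true_pos + false_pos)     # P(effect | significant)
    npv = true_neg / (true_neg + false_neg)     # P(no effect | non-significant)
    return {"PPV": ppv, "NPV": npv, "FDR": 1 - ppv, "FOR": 1 - npv}


result = predictive_values(base_rate=0.07, power=0.75, alpha=0.05)
for name, value in result.items():
    print(f"{name}: {value:.1%}")
# PPV: 53.0%
# NPV: 98.1%
# FDR: 47.0%
# FOR: 1.9%
```

FDR and FOR are simply the complements of PPV and NPV, which is why only two of the four need to be computed directly.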
I know what you’re thinking: 7% is ridiculously low. Who knows what those statisticians put into their Club Mate when they calculated this? For those of you who are more like 2015 me and think psychologists are really smart, I plotted the PPV and NPV for different levels of power across the whole range of the true hypothesis rate, so you can pick your favourite one. I chose five levels of power: 21% (estimate for neuroscience by Button et al., 2013), 75% (Johnson et al. estimate), 80% and 95% (common conventions), and 99% (upper bound of what we can reach).
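If you would rather look at numbers than curves, the following sketch tabulates PPV and NPV for the same five power levels across a coarse grid of true hypothesis rates. It is not the code behind the plots; the 5% alpha level, the grid of base rates, and all names are assumptions I picked for illustration.

```python
ALPHA = 0.05  # assumed significance level
POWER_LEVELS = [0.21, 0.75, 0.80, 0.95, 0.99]
BASE_RATES = [i / 10 for i in range(1, 10)]  # 10% to 90% true hypotheses


def ppv(base_rate, power, alpha=ALPHA):
    """Share of significant results that reflect a true effect."""
    true_pos = base_rate * power
    false_pos = (1 - base_rate) * alpha
    return true_pos / (true_pos + false_pos)


def npv(base_rate, power, alpha=ALPHA):
    """Share of non-significant results that reflect a true null."""
    true_neg = (1 - base_rate) * (1 - alpha)
    false_neg = base_rate * (1 - power)
    return true_neg / (true_neg + false_neg)


header = "base rate | " + " | ".join(f"power {p:.0%}" for p in POWER_LEVELS)
for label, func in (("PPV", ppv), ("NPV", npv)):
    print(label)
    print(header)
    for rate in BASE_RATES:
        cells = " | ".join(f"{func(rate, p):9.1%}" for p in POWER_LEVELS)
        print(f"{rate:9.0%} | {cells}")
    print()
```

Reading down a column shows how each value moves with the base rate at a fixed power: PPV rises and NPV falls as more of the tested hypotheses are true.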