Observed power (or post-hoc power) is the statistical power of the test you have performed, based on the effect size estimate from your data. Statistical power is the probability of finding a statistical difference from 0 in your test (aka a ‘significant effect’), if there is a true difference to be found. Observed power differs from the true power of your test, because the true power depends on the true effect size you are examining. However, the true effect size is typically unknown, and therefore it is tempting to treat post-hoc power as if it is similar to the true power of your study. In this blog, I will explain why you should never calculate the observed power (except for blogs about why you should not use observed power). Observed power is a useless statistical concept. –Daniël Lakens from his blog “Observed power, and what to do if your editor asks for post-hoc power analyses” at The 20% Statistician
Is observed power a useless statistical concept? Consider two researchers, each interested in estimating the effect of a treatment T on an outcome variable Y. Each researcher assembles an independent sample of 100 observations. Half the observations are randomly assigned the treatment, with the remaining half constituting the control group. The researchers estimate the equation Y = a + bT + error.
The first researcher obtains the results:
The estimated treatment effect is relatively small in size, statistically insignificant, and has a p-value of 0.72. A colleague suggests that perhaps the researcher’s sample size is too small and, sure enough, the researcher calculates a post-hoc power value of 5.3%.
The second researcher estimates the treatment effect for his sample, and obtains the following results:
The estimated treatment effect is relatively large and statistically significant with a p-value below 1%. Further, despite having the same number of observations as the first researcher, there is apparently no problem with power here, because the post-hoc power associated with these results is 91.8%.
Would it surprise you to know that both samples were drawn from the same data generating process (DGP): Y = 1.984×T + e, where e ~ N(0, 5)? The associated study has a true power of 50%.
The fact that post-hoc power can differ so substantially from true power is a point that has been previously made by a number of researchers (e.g., Hoenig and Heisey, 2001), and highlighted in Lakens’ excellent blog above.
The figure below presents a histogram of 10,000 simulations of the DGP, Y = 1.984×T + e, where e ~ N(0, 5), each with 100 observations, and each calculating post-hoc power following estimation of the equation. The post-hoc power values are distributed uniformly between 0 and 100%.
So are post-hoc power analyses good for nothing? That would be the case if a finding that an estimated effect was “underpowered” told us nothing more about its true power than a finding that it had high, post-hoc power. But that is not the case. In general, the expected value of a study’s true power will be lower for studies that are calculated to be “underpowered.”
Define “underpowered” as having a post-hoc power less than 80%, with studies having post-hoc power greater than or equal to 80% deemed to be “sufficiently powered.” The table below reports the results of a simulation exercise where “Beta” values are substituted into the DGP, Y = Beta × T + e, e ~ N(0, 5), such that true power values range from 10% to 90%. A 1000 simulations for each Beta value were run and the percent of times recorded that the estimated effects were calculated to be “underpowered.”