*An oft-overlooked detail in the significance debate is the challenge of calculating correct p-values and confidence intervals, the favored statistics of the two sides. Standard methods rely on assumptions about how the data were generated and can be far off when those assumptions don’t hold. Papers on heterogeneous effect sizes by* **Kenny and Judd** *and* **McShane and Böckenholt** *present a compelling scenario where the standard calculations are highly optimistic. Even worse, the errors grow as the sample size increases, negating the usual heuristic that bigger samples are better.*

###### Standard methods like the t-test imagine that we’re repeating a study an infinite number of times, drawing a different sample each time from a population with a fixed true effect size. A competing, arguably more realistic, model is the heterogeneous effect size model (*het*). This assumes that each time we do the study, we’re sampling from a different population with a different true effect size. **Kenny and Judd** suggest that the population differences may be due to “variations in experimenters, participant populations, history, location, and many other factors… we can never completely specify or control.”

###### In the meta-analysis literature, the *het* model is called the “random effects model” and the standard model the “fixed effects model”. While the distinction is well-recognized, its practical implications may not be. The purpose of this post is to illustrate the practical consequences of the *het* model for p-values and confidence intervals.

###### I model the *het* scenario as a two-stage random process. The first stage selects a population effect size, *d*_{pop}, from a normal distribution with mean *d*_{het} and standard deviation *sd*_{het}. The second stage carries out a two-group difference-of-means study with that population effect size: it draws two random samples of size *n* from normal distributions with unit standard deviation, one with *mean=0* and the other with *mean=d*_{pop}, and uses the standardized difference, aka Cohen’s *d*, as the effect size statistic. In other words, the second stage is simply a conventional study with population effect size *d*_{pop}, while *d*_{het}, the first-stage mean, plays the role of the true effect size.
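
The two-stage process can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code behind the figures; the function names, the seed, and the number of simulated studies are assumptions made for the example.

```python
import random
import statistics

def simulate_het_study(n, d_het, sd_het, rng):
    """Run one two-stage het study and return the observed Cohen's d."""
    # Stage 1: draw this study's population effect size.
    d_pop = rng.gauss(d_het, sd_het)
    # Stage 2: a conventional two-group study with unit-variance normal groups.
    group0 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    group1 = [rng.gauss(d_pop, 1.0) for _ in range(n)]
    # Pooled SD; with equal group sizes this is the root of the mean variance.
    pooled_sd = ((statistics.variance(group0) + statistics.variance(group1)) / 2) ** 0.5
    return (statistics.fmean(group1) - statistics.fmean(group0)) / pooled_sd

def simulate_many(n_studies, n, d_het, sd_het, seed=1):
    """Repeat the two-stage study n_studies times (seed is an arbitrary choice)."""
    rng = random.Random(seed)
    return [simulate_het_study(n, d_het, sd_het, rng) for _ in range(n_studies)]

# Null scenario with the settings used for Figure 1: d_het=0, sd_het=0.2, n=200.
d_obs = simulate_many(n_studies=1000, n=200, d_het=0.0, sd_het=0.2)
print("spread (sd) of observed d:", round(statistics.stdev(d_obs), 3))
```

Setting `sd_het=0.0` recovers the conventional scenario, which makes it easy to compare the spread of the two sets of results.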

###### Figure 1 shows a histogram of simulated *het* results under the null (*d*_{het}=0) with *sd*_{het}=0.2 for *n=200*. Overlaid on the histogram is the sampling distribution for the conventional scenario colored by conventional p-value along with the 95% confidence interval. Note that the histogram is wider than the sampling distribution.

###### Recall that the p-value for an effect *d* is the probability of getting a result as or more extreme than *d* under the null. Since the histogram is wider than the sampling distribution, it has more data beyond the point where *p=0.05* (where the color switches from blue to red), and so the correct p-value at that point is more than 0.05. In fact it is much more: 0.38. The confidence interval also depends on the width of the distribution and is wider than in the conventional case: -0.44 to 0.44 rather than -0.20 to 0.20.
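
These numbers can be roughly reproduced with a normal approximation. The sketch below assumes (this is not stated in the post) that the conventional standard error of Cohen’s *d* near zero is about sqrt(2/n) and that the two stages’ variances simply add; `het_p_and_ci` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def het_p_and_ci(d_obs, n, sd_het, level=0.95):
    """Approximate het p-value (null: d_het = 0) and CI half-width around 0."""
    se_conv = sqrt(2.0 / n)                     # assumed SE of Cohen's d near d = 0
    sd_total = sqrt(sd_het ** 2 + se_conv ** 2) # assumed: stage variances add
    z = NormalDist()
    p_het = 2 * (1 - z.cdf(abs(d_obs) / sd_total))   # two-sided p-value
    half_width = z.inv_cdf(0.5 + level / 2) * sd_total
    return p_het, half_width

# The effect size that is barely significant (p = 0.05) conventionally for n = 200.
d_crit = NormalDist().inv_cdf(0.975) * sqrt(2.0 / 200)
p_het, half_width = het_p_and_ci(d_crit, n=200, sd_het=0.2)
print(f"d_crit={d_crit:.2f}  het p-value={p_het:.2f}  het CI=+/-{half_width:.2f}")
```

Under these assumptions the result comes out close to the 0.38 and ±0.44 quoted above; an exact calculation based on the t-distribution may differ slightly, especially for small *n*.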

###### Note that effect size heterogeneity “inflates” both the true p-value and the true confidence interval. In this particular example, *p-value inflation* is 7.6 (0.38/0.05), and *confidence interval inflation* is 2.2 (0.44/0.20). In general, these inflation factors change with *sd*_{het} and *n*. Figures 2 and 3 plot p-value and confidence interval inflation vs. *n* for several values of *sd*_{het}. The p-value results (Figure 2) show inflation when the conventional p-value is barely significant (*p=0.05*); the confidence interval results (Figure 3) are for *d=0* (same as Figure 1).

###### Not surprisingly, the results get worse as heterogeneity increases. For *n=200*, p-value inflation grows from 1.59 when *sd*_{het}=0.05 to 12.68 for *sd*_{het}=0.4; over the same range, confidence interval inflation grows from 1.12 to 4.12.

###### More worrisome is that the problem also gets worse as the sample size increases. For *sd*_{het}=0.05, p-value inflation grows from a negligible 1.05 when *n=20* to 1.59 for *n=200* and 2.19 for *n=400*; the corresponding values for confidence interval inflation are 1.01, 1.12, and 1.22. For *sd*_{het}=0.2, p-value inflation grows from 1.90 for *n=20* to 10.26 for *n=400*, while confidence interval inflation increases from 1.18 to 3.00.
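
A short script can tabulate both inflation factors over a grid of *n* and *sd*_{het}. It is a sketch under the same assumed normal approximation (conventional standard error of about sqrt(2/n), additive variances), so it will not necessarily match the figures to the last digit; `inflation` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def inflation(n, sd_het, alpha=0.05):
    """Return (p-value inflation, CI inflation) under a normal approximation."""
    se_conv = sqrt(2.0 / n)                               # assumed conventional SE
    ci_infl = sqrt(sd_het ** 2 + se_conv ** 2) / se_conv  # ratio of distribution widths
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)          # conventional cutoff in SE units
    # A result sitting exactly at the conventional cutoff is only
    # z_crit / ci_infl standard deviations out under the het model.
    p_het = 2 * (1 - NormalDist().cdf(z_crit / ci_infl))
    return p_het / alpha, ci_infl

for sd_het in (0.05, 0.2, 0.4):
    for n in (20, 200, 400):
        p_infl, ci_infl = inflation(n, sd_het)
        print(f"sd_het={sd_het:<4} n={n:<3} "
              f"p-value inflation={p_infl:5.2f}  CI inflation={ci_infl:4.2f}")
```

For the larger sample sizes the approximation lands very near the values quoted above; for *n=20* it drifts a little, which is what one would expect if the original calculations are t-based.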

###### What’s driving this sample-size-dependent inflation is that increasing *n* tightens up the second stage (where we select samples of size *n*) but not the first (where we select *d*_{pop}). As *n* grows and the second stage becomes narrower, the unchanging width of the first stage becomes proportionally larger.
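
The same argument can be written as a rough formula. Assuming, as an approximation not spelled out in the post, that the second-stage variance of Cohen’s *d* near zero is about 2/n and that the two stages’ variances add, the ratio of the *het* width to the conventional width is:

```latex
\[
\text{CI inflation}
  \approx \frac{\sqrt{sd_{het}^{2} + 2/n}}{\sqrt{2/n}}
  = \sqrt{1 + \frac{n \, sd_{het}^{2}}{2}}
\]
% Example with sd_het = 0.2:
%   n = 20  gives sqrt(1 + 0.4) ~ 1.18
%   n = 400 gives sqrt(1 + 8)   = 3.00
```

The ratio grows without bound as *n* increases for any fixed *sd*_{het} greater than zero, which is the pattern seen in the inflation numbers above.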

###### Another way to see it is to compare the sampling distributions. Figure 4 shows sampling distributions for *n=20* and *n=200* for the conventional scenario (colored by p-value) and the *het* scenario (in grey) for *sd*_{het}=0.2. For *n=20*, the *het* (grey) curve is only slightly wider than the conventional one, while for *n=200* the difference is much greater. In both scenarios, the distributions are tighter for the larger *n*, but the conventional curve gets tighter faster.
