*Confidence intervals get top billing as the alternative to significance. But beware: confidence intervals rely on the same math as significance and share the same shortcominings. Confidence intervals don’t tell where the true effect lies even probabilistically. What they do is delimit a range of true effects that are broadly consistent with the observed effect.*

###### Confidence intervals, like p-values and power, imagine we’re repeating a study an infinite number of times, drawing a different sample each time from the same population. Though unnatural for basic, exploratory research, it’s a useful mathematical trick that let’s us define the concept of *sampling distribution* – the distribution of expected results – which in turn is the basis for many common stats. The math is the same across the board; I’ll start with a pedantic explanation of p-values, then generalize the terminology a bit, and use the new terminology to explain confidence intervals.

###### Recall that the (two-sided) p-value for an observed effect *d*_{obs} is the probability of getting a result as or more extreme than *d*_{obs} *under the null*. “Under the null” means we assume the population effect size *d*_{pop}=0. In math terms, the p-value for *d*_{obs} is the tail probability of the sampling distribution – the area under the curve beyond *d*_{obs} – times *2* to account for the two sides. Recall further that we declare a result to be *significant* and *reject the null* when the tail probability is so low that we deem it implausible that *d*_{obs} came from the null sampling distribution.

_{obs}

_{obs}

_{pop}=0

_{obs}

_{obs}

_{obs}

###### Figure 1a shows a histogram of simulated data overlaid with the sampling distribution for sample size *n=40* and *d*_{pop}=0. I color the sampling distribution by p-value, switching from blue to red at the conventional significance cutoff of *p=0.05*. The studies are simple two group difference-of-mean studies with equal sample size and standard deviation, and the effect size statistic is standardized difference (aka *Cohen’s d*). The line at *d*_{obs}=0.5 falls in the red indicating that we deem the null hypothesis implausible and reject it.

_{pop}=0

_{obs}=0.5

###### Figure 1b shows sampling distributions for *n=40* and several values of *d*_{pop}. The coloring, analogous to Figure 1a, indicates how plausible we deem *d*_{obs} given *n* and *d*_{pop}. The definition of “plausibility value” (*plaus*) is the same as p-value but for arbitrary *d*_{pop}. *d*_{obs}=0.5 is in the red for the outer values of *d*_{pop} but in the blue for the inner ones. This means we deem *d*_{obs}=0.5 implausible for the outer values of *d*_{pop} but plausible for the inner ones. The transition from implausible to plausible happens somewhere between *d*_{pop}=0 and *0.1*; the transition back happens between *0.9* and *1*.

_{pop}

_{obs}

_{pop}

_{pop}

_{obs}=0.5

_{pop}

_{obs}=0.5

_{pop}

_{pop}=0

###### Plausibility is a function of three parameters: *n*, *d*_{pop}, and *d*_{obs}. We can compute and plot plaus-values for any combination of the three parameters.

_{pop}

_{obs}

###### Figure 2a plots *plaus* vs. *d*_{pop} for *d*_{obs}=0.5 and several values of *n*. The figure also shows the limits of plausible *d*_{pop}s at points of interest. As with significance, we can declare “plausibility” at thresholds other than *0.05*. From the figure, we see that the limits gets tighter as we increase *n* or the plausibility cutoff. For *n=40* and the usual *0.05* cutoff, the limits are wide: [0.5, 0.94]. For *n=400* and a stringent cutoff of *0.5*, the limits are narrow: [0.45,0.55]. This makes perfect sense: (1) all things being equal, bigger samples have greater certainty; (2) a higher cutoff means we demand more certainty before deeming a result plausible, ie, we require that *d*_{pop} be closer to *d*_{obs}.

_{pop}

_{obs}=0.5

_{pop}

_{pop}

_{obs}

###### The plausibility limits in Figure 2a are consistent with Figure 1b. Both figures say that with *n=40* and *d*_{obs}=0.5, *d*_{pop} must be a little more than *0* and a little less than *1* to deem the solution plausible.

_{obs}=0.5

_{pop}

###### Now I’ll translate back to standard confidence interval terminology: *confidence level* is *1 – plaus* usually stated as a percentage; *confidence intervals* are plausibility limits expressed in terms of confidence levels. Figure 2b restates 2a using the standard terminology. It’s the same but upside down. This type of plot is called a *consonance curve*.

###### A further property of confidence intervals, called the *coverage* property, states that if we repeatedly sample from a fixed *d*_{pop} and compute the C% confidence intervals, C% of the intervals will contain *d*_{pop}. Figure 3 illustrates the property for 95% confidence intervals, *n=40*, and *d*_{pop}=0.05. The figure shows the sampling distribution colored by plaus-value, and confidence intervals as solid blue or dashed red lines depending on whether the interval covers *d*_{pop}. I arrange the intervals along the sampling distribution for visual separation.

_{pop}

_{pop}

_{pop}=0.05

_{pop}

###### Many texts use the coverage property as the definition of confidence interval and plausibility limits as a derived property. This points the reader in the wrong direction: since C% of intervals cover *d*_{pop}, it’s natural to believe there’s a C% chance that the interval computed from an observed effect size contains *d*_{pop}. This inference is invalid: the interval delimits the range of *d*_{pop}s that are close enough to *d*_{obs} to be deemed plausible. This says nothing about probability.

_{pop}

_{pop}

_{pop}

_{obs}

###### Stats likes strong words: “significance”, “power”, “confidence”. Words matter: “significant” suggests important; “power” suggests the ability to get the right answer; “95% confidence” suggests we’re pretty darn sure; “95% confidence interval” and “95% of intervals cover *d*_{pop}” suggest a 95% chance that the true effect falls in the interval. None of these inferences are valid.

_{pop}