A Must-Read on the Statistical Analysis of Replications

[Excerpts taken from the article, “The Statistics of Replication” by Larry Hedges, published in the journal Methodology]

Background

“Some treatments of replication have defined replication in terms of the conclusions obtained by studies (e.g., did both studies conclude that the treatment effect was positive, or not…).”

“For example, one might say that investigator Smith found an effect (by which we mean that Smith obtained a statistically significant positive treatment effect) while investigator Jones failed to replicate (meaning that Jones did not obtain a statistically significant positive treatment effect).”

“While this definition of replication may be in accord with common language usage, it is not useful as a scientific definition of replication for both conceptual and statistical reasons.”

“…decisions about replication…should be based on effect sizes…”

“When effects are identical (homogeneous across studies) θ₁ = …= θ_k.”

“The Q-statistic is used in testing for heterogeneity of effects across studies in meta-analysis.”

“…the distribution of Q…is determined only by k, the number of studies, and the noncentrality parameter λ…”
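As a minimal sketch of what this statistic computes, assume each study i reports an effect estimate and a known sampling variance. Q is then the precision-weighted sum of squared deviations of the estimates from their precision-weighted mean, and it is compared against a chi-square distribution with k - 1 degrees of freedom. The function name and the example numbers below are illustrative and do not come from the article.

```python
def q_statistic(estimates, variances):
    """Return (Q, df) for k effect estimates with known sampling variances."""
    if len(estimates) != len(variances) or len(estimates) < 2:
        raise ValueError("need k >= 2 estimates, each with a variance")
    # Inverse-variance (precision) weights.
    weights = [1.0 / v for v in variances]
    # Precision-weighted mean of the estimates.
    pooled = sum(w * t for w, t in zip(weights, estimates)) / sum(weights)
    # Weighted sum of squared deviations from the pooled mean.
    q = sum(w * (t - pooled) ** 2 for w, t in zip(weights, estimates))
    return q, len(estimates) - 1


# Hypothetical example: two studies with equal precision.
q, df = q_statistic([0.2, 0.4], [0.01, 0.01])
print(q, df)
```

When the estimates are identical, Q is zero; the further apart they are relative to their sampling variances, the larger Q becomes.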

“The noncentrality parameter λ is a natural way to characterize heterogeneity when studies are assumed to be fixed, but there are alternatives, particularly when the studies themselves are considered a random sample from a universe of studies – the so called random effects model for meta-analysis…”

“If studies are a random sample from a universe of studies, so that their effect parameters are also a sample from a universe of effect parameters with mean μ and variance τ², then τ² (the between-studies variance component of effects) is a natural way to characterize heterogeneity of effects.”
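One common way to estimate τ² from a set of studies is the DerSimonian-Laird method-of-moments estimator. The excerpt does not say which estimator the article uses, so the sketch below is an illustration of the concept under that assumption, with a hypothetical function name and made-up inputs.

```python
def tau_squared_dl(estimates, variances):
    """DerSimonian-Laird moment estimate of the between-studies variance."""
    weights = [1.0 / v for v in variances]
    total_w = sum(weights)
    pooled = sum(w * t for w, t in zip(weights, estimates)) / total_w
    q = sum(w * (t - pooled) ** 2 for w, t in zip(weights, estimates))
    df = len(estimates) - 1
    # Scaling constant that links the expected value of Q to tau^2.
    scale = total_w - sum(w * w for w in weights) / total_w
    # Truncate at zero: a variance cannot be negative.
    return max(0.0, (q - df) / scale)


# Hypothetical example: two studies whose estimates differ widely.
print(tau_squared_dl([0.0, 1.0], [0.1, 0.1]))
```

The estimate is zero whenever Q does not exceed its degrees of freedom, which corresponds to no detectable heterogeneity beyond sampling error.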

How Should Replication Be Defined?

“It is logical to think of defining replication across studies as corresponding to the case when all of the effect parameters are identical, that is, when θ₁ = …= θ_k or equivalently when λ = 0, or when τ² = 0. This situation might be characterized as exact replication.”

“It is also possible to think of that if the θi are quite similar, but not identical then the results of the studies replicate “approximately.” When the value of λ (or τ²) is “small enough,” that is, smaller than some negligible value λ₀ (or τ₀²), we might conclude that the studies approximately replicate.”

“Because replication is a concern of essentially all sciences, it is possible to examine empirical evidence about replication in various sciences to provide a context for understanding replication in the social sciences.”

“The example of physics is particularly illuminating because it is among the most respected sciences and because it has a long tradition of examining empirical evidence about replication…”

“Determining the values of the so-called fundamental constants of mathematical physics is a continuing interest in physics. Theory suggests that the speed of light in a vacuum is a universal constant and there has been a considerable amount of empirical work to determine its value.”

“Figure 1 shows the values of the studies estimating the speed of light.”

“Each determination is given by a dot surrounded by one standard error bars…It is clear that…the estimates differ by more than would be expected by chance due to their estimation error. In fact, the Q-statistic is Q = 36.92 with a p-value of less than .01.”

“This has led to an understanding that reasonable scientific practice should be to tolerate a small amount of heterogeneity in estimates as negligible for scientific purposes.”

“The principle used to arrive at an acceptable level of heterogeneity in physics is that competent experimenters attempt to reduce the biases in their experiments to a point that they are small compared to the…variance due to estimation errors (ν).”

“…conventions in physics, medicine, and personnel psychology provide a range of definitions of negligible heterogeneity from λ₀ = (k-1)/4 to λ₀ = 2(k-1)/3 or alternatively, τ₀²/ν = 1/4 to τ₀²/ν = 2/3.”

“Note that all of these definitions of negligible heterogeneity are social conventions among a group of scientists as all conventions for interpretation must be.”
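The two ways of stating the conventions in the excerpt line up if λ₀ = (k - 1)·τ₀²/ν, which is the relation the quoted endpoints imply when every study has the same estimation variance ν. Taking that relation as an assumption, the following sketch turns the quoted ratios into thresholds on λ for a given number of studies. The function and constant names are hypothetical.

```python
from fractions import Fraction

# Endpoints of the quoted range for tau_0^2 / nu.
RATIO_LOW = Fraction(1, 4)
RATIO_HIGH = Fraction(2, 3)


def negligible_lambda_range(k):
    """Return (low, high) thresholds for lambda_0 given k studies.

    Assumes lambda_0 = (k - 1) * tau_0^2 / nu, as implied by the quoted range.
    """
    if k < 2:
        raise ValueError("need at least k = 2 studies")
    return (k - 1) * RATIO_LOW, (k - 1) * RATIO_HIGH


low, high = negligible_lambda_range(5)
print(low, high)  # prints: 1 8/3
```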

Conclusion about Defining Replication

“The definition of replication is more complex than it first appears. While exact replication is logically appealing, it is too strict to be useful, even in well-established sciences like physics, chemistry, or medicine.”

“Approximate replication has proven more scientifically useful in these sciences and in personnel psychology. However, it requires establishment of conventions of negligible heterogeneity among groups of scientists. The fact that conventions have been established in these sciences shows that it is possible to do so.”

Statistical Analysis of Replication

“The statistical test for heterogeneity typically used in meta-analysis…[is based on]…the Q-statistic…”

“…the details of the statistical approaches to studying replication depend on three considerations that are largely conceptual: How the hypotheses are structured (whether the burden of proof lies with replication or with failure to replicate), how replication is defined (as exact or approximate replication), and whether the studies are conceived as a fixed set or a random sample from a universe of studies…”

“…evaluation of the sensitivity (e.g., statistical power) of tests based on Q is somewhat different when studies are considered fixed than when they are considered random…”

“…tests of approximate replication have the same test statistic Q, but larger critical values than tests of exact replication and, therefore, have lower statistical power.”
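To see why the critical value grows, consider the simplest case of k = 2 studies, where Q has one degree of freedom. A noncentral chi-square variable with one degree of freedom is the square of a normal variable with mean √λ and unit variance, so its distribution function can be written with the standard normal CDF alone. The sketch below uses that fact, plus bisection, to compare the 5% critical value for exact replication (λ₀ = 0) with one for approximate replication at λ₀ = 2(k - 1)/3 = 2/3. Function names are hypothetical, and the code covers only k = 2.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def ncx2_cdf_1df(x, lam):
    """CDF of a noncentral chi-square with 1 df and noncentrality lam."""
    if x <= 0:
        return 0.0
    root, delta = sqrt(x), sqrt(lam)
    return _STD_NORMAL.cdf(root - delta) - _STD_NORMAL.cdf(-root - delta)


def critical_value_1df(lam0, alpha=0.05):
    """Upper-alpha critical value of Q for k = 2, located by bisection."""
    lo, hi = 0.0, 1.0
    # Expand the bracket until it contains the (1 - alpha) quantile.
    while ncx2_cdf_1df(hi, lam0) < 1.0 - alpha:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if ncx2_cdf_1df(mid, lam0) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return hi


exact = critical_value_1df(0.0)          # exact replication
approx = critical_value_1df(2.0 / 3.0)   # approximate replication
print(round(exact, 3), round(approx, 3))
```

Because the approximate-replication test rejects only when Q exceeds the larger value, the same data yield fewer rejections, which is the loss of power the excerpt describes.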

“…the results of the analysis can only be conclusive if the test has high power to detect meaningful amounts of heterogeneity. The inherent problem with this formulation is that concluding that studies replicate involves accepting the null hypothesis…”

“A different way to structure the test is to alter the burden of proof so that it lies on replication (not failure to replicate).”
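One way to write out the two ways of placing the burden of proof, using the negligible value λ₀ from the discussion of approximate replication, is sketched below. This is a reading of the excerpt rather than a quotation of the article's formulas.

```latex
\begin{align*}
\text{Burden on failure to replicate:}\quad
  & H_0\colon \lambda \le \lambda_0 \ \text{ vs. } \ H_1\colon \lambda > \lambda_0,
  && \text{reject } H_0 \text{ when } Q \text{ is large;} \\
\text{Burden on replication:}\quad
  & H_0\colon \lambda \ge \lambda_0 \ \text{ vs. } \ H_1\colon \lambda < \lambda_0,
  && \text{reject } H_0 \text{ when } Q \text{ is small.}
\end{align*}
```

In the first structure, concluding that studies replicate means failing to reject the null hypothesis. In the second, replication is the conclusion reached by rejecting it, which requires λ₀ > 0.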

Conclusion about Statistical Analyses for Replication

“The major conclusion about testing hypotheses about replication is that different tests are possible and the choice among them is not automatic, but a principled analytic decision that requires some care.”

Design of Replication Studies

“The design of an ensemble of two or more studies to investigate replication might seem straightforward, but quite different designs have been used with little justification of why that design was appropriate.”

“For example, the Open Science Collaborative (2016) and Camerer et al. (2018) chose to use a total of k = 2 studies (the original and one replication), while the Many Labs Project (Klein et al., 2014) used as many as k = 36 studies (the original and 35 replications) of each result. One might ask which, if either design is adequate and why.”

“The simplest conception of the design to test whether Study 1 can be replicated is to simply repeat the study, so that the ensemble is two studies (Study 1 and Study 2).”

“The statistical power of a test for replication based on a total of k = 2 studies is limited by the study with the least statistical power. This means that it will be virtually impossible to achieve a high power test of replication unless both studies have very high power.”
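A short calculation illustrates the point. With k = 2 and known variances, Q reduces to (T₁ - T₂)²/(v₁ + v₂), so the noncentrality parameter is (θ₁ - θ₂)²/(v₁ + v₂). Even if the replication were infinitely precise (v₂ = 0), the denominator cannot fall below v₁, and the power stays capped by the original study. The sketch below assumes a two-sided 5% test of exact replication; the effect difference and variances are made-up numbers.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def power_exact_replication_k2(theta_diff, v1, v2, alpha=0.05):
    """Power of the Q test of exact replication for two studies.

    theta_diff is the true difference between the two effect parameters;
    v1 and v2 are the sampling variances of the two estimates.
    """
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    # Standardized true difference; its square is the noncentrality parameter.
    delta = abs(theta_diff) / sqrt(v1 + v2)
    return 1.0 - (_STD_NORMAL.cdf(z_crit - delta) - _STD_NORMAL.cdf(-z_crit - delta))


# Hypothetical original study with variance 0.04 and a true difference of 0.2.
for v2 in (0.04, 0.01, 0.0001):
    print(v2, round(power_exact_replication_k2(0.2, 0.04, v2), 3))

# Ceiling on power as the replication becomes arbitrarily precise.
print("ceiling", round(power_exact_replication_k2(0.2, 0.04, 0.0), 3))
```

Shrinking v₂ raises the power only up to the ceiling set by v₁, so a very large replication cannot compensate for an imprecise original.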

“Moreover, this analysis was based on a test for exact replication. Tests for approximate replication have lower power than the corresponding test for exact replication, so they would have even lower power in this situation than a test for exact replication.”

“…one design that might seem appealing is to use several replication studies (i.e., more than one replication of the original), and then to compare the results of replication studies (as a group) to the original study.”

“…such a strategy is mathematically equivalent to combining the estimates from all of the replication studies into one “synthetic study” and computing an effect size estimate (and its variance) for that synthetic replication study.”

“The analysis of the difference between the original study and the synthetic replication study is subject to exactly the same limitations of analyses comparing two studies that are described in this article. In other words, the sensitivity of that analysis is limited by the least sensitive of the two studies being compared (which will usually be the original study).”

“Thus, no matter how many replication studies are conducted, it may be impossible to obtain a design of this type with adequate sensitivity.”
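The argument can be sketched numerically. Assume the replication estimates are pooled by inverse-variance weighting, which is one standard way to combine them; the excerpt does not spell out the weighting. The synthetic study's variance is then the reciprocal of the summed weights, and the variance of the original-versus-synthetic difference is the original study's variance plus that quantity. Function names and numbers are hypothetical.

```python
def synthetic_study(estimates, variances):
    """Pool replication studies into one (estimate, variance) pair."""
    weights = [1.0 / v for v in variances]
    total_w = sum(weights)
    estimate = sum(w * t for w, t in zip(weights, estimates)) / total_w
    return estimate, 1.0 / total_w


def comparison_variance(original_variance, replication_variances):
    """Variance of (original estimate - synthetic replication estimate)."""
    synthetic_variance = 1.0 / sum(1.0 / v for v in replication_variances)
    return original_variance + synthetic_variance


# Hypothetical original study with variance 0.04, replications with 0.02 each.
for n_reps in (1, 10, 1000):
    print(n_reps, comparison_variance(0.04, [0.02] * n_reps))
```

Adding replications drives the synthetic variance toward zero, but the comparison variance never drops below the original study's 0.04, which is why the sensitivity of this design is bounded by the original study.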

Overall Conclusions

“Exact replication is logically appealing, but appears to be too strict a definition to be satisfied even in the most mature sciences like physics or medicine.”

“Approximate replication is a more useful concept, but requires the development of social conventions in each area of science. Moreover, tests of approximate replication are less powerful than those of exact replication, leading to lower sensitivity in analyses of approximate replication.”

“For any particular definition of (exact or approximate) replication, several different, but perfectly valid, analyses are possible. They differ depending on whether a studies-fixed or studies-random framework is used and whether the burden of proof is imposed on failure to replicate (so that rejection of the null hypothesis leads to rejection of replication) or on replication (so that rejection of the null hypothesis leads to rejection of failure to replicate).”

“Finally there have been unappreciated problems in the design of a replication investigation (an ensemble of studies to study replication). The sensitivity of an ensemble of two studies is limited by the least sensitive of the studies, so that an ensemble of two studies will almost never be adequate to evaluate replication.”

To read the article, click here.

The Replication Network

Furthering the Practice of Replication in Economics
