REED: EiR* – More on Heterogeneity in Two-Way Fixed Effects Models

[* EiR = Econometrics in Replications, a feature of TRN that highlights useful econometrics procedures for re-analysing existing research. The material for this blog is primarily drawn from the recent working paper “Difference-in-differences with variation in treatment timing” by Andrew Goodman-Bacon, available from his webpage at Vanderbilt University]
In a recent blog at TRN, I discussed research by Clément de Chaisemartin and Xavier D’Haultfoeuille (C&H) that pointed out how heterogeneity in treatment effects causes two-way fixed effects (2WFE) estimation to produce biased estimates of Average Treatment Effects on the Treated (ATT).
This paper by Andrew Goodman-Bacon (GB) provides a nice complement to C&H. In particular, it decomposes the 2WFE estimate into mutually exclusive components. One of these can be used to identify the change in treatment effects over time. An accompanying Stata module (“bacondecomp”) allows researchers to apply GB’s procedure.
In this blog, I summarize GB’s decomposition result and reproduce his example demonstrating how his Stata command can be applied.
Conventional difference-in-differences with homogeneous treatment effects
The canonical DD example consists of two groups, “Treatment” and “Control”, and two time periods, “Pre” and “Post”. The treatment is simultaneously applied to all members of the treatment group. The control group never receives treatment. The treatment effect is homogeneous both across the treated individuals and “within” individuals over time. If there are time trends, we assume they are identical across both groups (“common trends assumption”).
FIGURE 1 motivates the corresponding DD estimator.
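Let δ be the ATT (which is the same for everybody and constant over time). Note that the ATT is given by the double difference DD, where

$$DD = \left(\bar{y}_{\text{Treatment}}^{\text{Post}} - \bar{y}_{\text{Treatment}}^{\text{Pre}}\right) - \left(\bar{y}_{\text{Control}}^{\text{Post}} - \bar{y}_{\text{Control}}^{\text{Pre}}\right)$$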
The first difference sweeps out any unobserved fixed effects that characterize Treatment individuals. This leaves δ plus the time trend for the Treatment group.
The second difference (in parentheses) sweeps out unobserved effects associated with Control individuals. This leaves the time trend for the Control group.
The first difference minus the second difference then leaves δ, the ATT, assuming both groups have a common time trend. (Note how the “common trends” assumption is key to identifying δ.)
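It is easily shown that, given the above assumptions, OLS estimation of the standard DD regression specification (the outcome regressed on a Treatment dummy, a Post dummy, and their interaction) produces an unbiased estimate of δ, the coefficient on the interaction term.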
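As a rough illustration of this equivalence, the Python sketch below simulates a two-group, two-period dataset and compares the double difference of cell means with the OLS coefficient on the Treatment × Post interaction. All numbers (cell size, fixed effects, trend, and δ = 2) are hypothetical values chosen for the example, not taken from GB.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters, chosen only for illustration.
n = 2000                              # observations per group-period cell
delta = 2.0                           # homogeneous treatment effect (the ATT)
trend = 1.0                           # common time trend shared by both groups
alpha_treat, alpha_ctrl = 5.0, 3.0    # group-level fixed effects

treat = np.repeat([1, 1, 0, 0], n)    # 1 = Treatment group
post = np.repeat([0, 1, 0, 1], n)     # 1 = Post period
y = (np.where(treat == 1, alpha_treat, alpha_ctrl)
     + trend * post
     + delta * treat * post
     + rng.normal(0.0, 1.0, size=4 * n))


def cell_mean(t, p):
    """Mean outcome for group t (1 = Treatment) in period p (1 = Post)."""
    return y[(treat == t) & (post == p)].mean()


# Double difference: (Treatment Post - Pre) minus (Control Post - Pre).
dd = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

# OLS of y on a constant, Treatment, Post, and Treatment x Post.
X = np.column_stack([np.ones_like(y), treat, post, treat * post])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
delta_hat = coef[3]

print(f"double difference: {dd:.4f}")
print(f"OLS interaction:   {delta_hat:.4f}")
```

Because the regression is saturated in the four group-period cells, the two numbers coincide up to floating-point error, and both should land close to the assumed δ.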
A more realistic, three-period example
Now consider a more realistic example, close in spirit to what researchers actually encounter in practice. Let there be three groups, “Early Treatment”, “Late Treatment” and “Never Treated”; and three time periods, “Pre”, “Mid”, and “Post”.
FIGURE 2 motivates the following discussion.
The Early Treatment group receives treatment at t*k (GB uses the k subscript to indicate early treatees).
The Late Treatment group receives treatment at t*l, where t*l > t*k.
Suppose a researcher were to estimate the following 2WFE regression equation, where Dit is a dummy variable indicating whether individual i was treated at or before time t. For example, Dit = 0 and 1 for Late treatees at times “Mid” and “Post”, respectively, while Dit = 1 for Early treatees at both “Mid” and “Post”:

$$y_{it} = \alpha_i + \alpha_t + \beta^{DD} D_{it} + \varepsilon_{it}$$

GB shows that the OLS estimate of βDD is a weighted average of all possible DD paired differences. One of those paired differences (cf. Equation 6 in GB) is

$$\left(\bar{y}_{l}^{\text{Post}} - \bar{y}_{l}^{\text{Mid}}\right) - \left(\bar{y}_{k}^{\text{Post}} - \bar{y}_{k}^{\text{Mid}}\right)$$

Note that in this case, the Early Treatment group (subscripted by k) can serve as a control group for the Late Treatment group (subscripted by l) because its treatment status does not change over the “Mid”/“Post” period. This particular paired difference ends up being important.
GB goes on to derive the following decomposition result: The probability limit of the OLS estimator of βDD consists of three components:

$$\operatorname{plim}\ \hat{\beta}^{DD} = VWATT + VWCT - \Delta ATT$$

VWATT is the Variance Weighted ATT, VWCT is the Variance Weighted Common Trends, and ΔATT is the change in individuals’ treatment effects that occurs over time.
When the common trends assumption is valid (VWCT=0), and the treatment effect is homogeneous both across individuals and within individuals over time, then the probability limit equals δ, the homogeneous treatment effect.
However, if treatment effects are heterogeneous, then even if the common trends assumption holds, estimation of the 2WFE specification will not equal the ATT. There are two sources of bias.
The first bias arises because OLS weights individual treatment effects differently depending on (i) the number of people who are treated and (ii) the timing of the treatment. This will introduce a bias if the size of the treatment effect is associated with either of these. However, this bias is not necessarily a bad thing. It is the byproduct of minimizing the variance of the estimator, so there are some efficiency gains that accompany this bias.
The second bias is associated with changes in the treatment effect over time, ΔATT. It’s entirely a bad thing.
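Consider again the paired difference

$$\left(\bar{y}_{l}^{\text{Post}} - \bar{y}_{l}^{\text{Mid}}\right) - \left(\bar{y}_{k}^{\text{Post}} - \bar{y}_{k}^{\text{Mid}}\right)$$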
The second term is the difference in outcomes for the Early treatees between the Post and Mid periods. Because Early treatees are treated for both of these periods, this difference should sweep away everything but the time trend if the treatment effect stays constant over time.
However, if treatment effects vary over time, say because the benefits depreciate (or, alternatively, accumulate), the change in the Early treatees’ treatment effect will not be swept out. Instead, it carries through to the DD estimate, which will then over-estimate (under depreciation) or under-estimate (under accumulation) the true treatment effect.
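A small noise-free numerical sketch may make the direction of this bias concrete. The Python function below builds group means for the Early and Late treatees under a common trend and computes the Late-versus-Early paired difference over the Mid/Post window. The function name, fixed effects, trend, and effect sizes are all hypothetical; the treatment effect in a unit's first treated period is set to 2, which is therefore the effect the Late group actually experiences in “Post”.

```python
def late_vs_early_dd(effect_first, effect_second, trend=1, fe_early=10, fe_late=5):
    """Paired DD that uses Early treatees as controls for Late treatees.

    effect_first:  treatment effect in a unit's first treated period
    effect_second: treatment effect in a unit's second treated period
    Periods are indexed Pre = 0, Mid = 1, Post = 2; all values are illustrative.
    """
    # Early treatees are treated in both Mid and Post.
    early_mid = fe_early + 1 * trend + effect_first
    early_post = fe_early + 2 * trend + effect_second
    # Late treatees are untreated in Mid and in their first treated period in Post.
    late_mid = fe_late + 1 * trend
    late_post = fe_late + 2 * trend + effect_first
    return (late_post - late_mid) - (early_post - early_mid)


true_effect = 2  # effect experienced by Late treatees in Post

dd_constant = late_vs_early_dd(2, 2)      # effect stays at 2
dd_accumulating = late_vs_early_dd(2, 3)  # effect grows from 2 to 3
dd_depreciating = late_vs_early_dd(2, 1)  # effect shrinks from 2 to 1

print(dd_constant, dd_accumulating, dd_depreciating)  # prints: 2 1 3
```

With a constant effect the paired difference recovers 2. When the Early group's effect accumulates, the comparison understates the effect (1), and when it depreciates, the comparison overstates it (3), even though common trends holds exactly by construction.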
GB’s decomposition allows one to investigate this last type of bias. Towards that end, GB (along with Thomas Goldring and Austin Nichols) has written a Stata module called bacondecomp.
Application: Replication of Stevenson and Wolfers (2006)
To demonstrate bacondecomp, GB replicates a result from the paper “Bargaining in the Shadow of the Law: Divorce Laws and Family Distress” by Betsey Stevenson and Justin Wolfers (S&W), published in The Quarterly Journal of Economics in 2006.
Among other things, S&W estimate the effect of state-level, no-fault divorce laws on female suicide. Over their sample period of 1964–1996, 37 US states gradually adopted no-fault divorce. 8 states had already done so, and 5 states never did.
[NOTE: The data and code to reproduce the results below are taken directly from the examples in the help file accompanying bacondecomp. They can be obtained by installing bacondecomp and then accessing the help documentation through Stata].
GB does not exactly reproduce S&W’s result, but uses a similar specification and obtains a similar result. In particular, he estimates the following regression:

$$asmrs_{it} = \alpha_i + \alpha_t + \beta^{DD}\, post_{it} + \gamma_1\, pcinc_{it} + \gamma_2\, asmrh_{it} + \gamma_3\, cases_{it} + \varepsilon_{it}$$

where asmrs is the suicide mortality rate per one million women; post is the treatment dummy variable (i.e., Dit); pcinc, asmrh, and cases are control variables for income, the homicide mortality rate, and the welfare case load; and αi and αt are individual and time fixed effects.
The 2WFE estimate of βDD is -2.52. In other words, this specification estimates that no-fault divorce reform reduced female suicides by 2.52 fatalities per million women.
The bacondecomp command decomposes the 2WFE estimate of -2.52 into three separate components by treatment (T) and control (C) groups.
Timing_groups: Early treatees (T) versus Late treatees (C) & Late treatees (T) versus Early treatees (C).
Always_v_timing: Treatees (T) versus Always treated/Pre-reform states (C).
And Never_v_timing: Treatees (T) versus Never treated (C).
bacondecomp produces the following table, where “Beta” is the DD estimate for the respective group and “TotalWeight” represents its share in the overall estimated effect (-2.52). Notice that the sum of the products “Beta” × “TotalWeight” approximately equals the 2WFE estimate.
[Table: bacondecomp output, listing “Beta” and “TotalWeight” for each comparison group]
Conspicuously, the first group (Timing_groups) finds that no-fault divorce reform is associated with an increase in the female suicide rate (+2.60). In contrast, the latter two groups find a decrease (-7.02 and -5.26). This suggests that treatment effects may be changing over time. If so, this would invalidate the difference-in-differences estimation framework.
Unfortunately, bacondecomp does not produce a corrected estimate of ATT. It is primarily useful for identifying a potential problem with time-varying treatment effects. As a result, it should be seen as complementing other approaches, such as the estimation procedures of de Chaisemartin and D’Haultfoeuille (see here), or an alternative approach such as an event study framework that includes dummies for each post-treatment period (see here).
Bob Reed is a professor of economics at the University of Canterbury in New Zealand. He is also co-organizer of the blogsite The Replication Network. He can be contacted at
Goodman-Bacon, A. (2018). Difference-in-differences with variation in treatment timing. National Bureau of Economic Research, No. w25018.
Goodman-Bacon, A., Goldring, T., & Nichols, A. (2019).  bacondecomp: Stata module for decomposing difference-in-differences estimation with variation in treatment timing.
Stevenson, B. & Wolfers, J. (2006). Bargaining in the shadow of the law: Divorce laws and family distress. The Quarterly Journal of Economics, 121(1):267-288.



Reproducibility and Meta-Analyses: Two Great Concepts That Apparently Don’t Mix

[Excerpts taken from the report “Examining the Reproducibility of Meta-Analyses in Psychology: A Preliminary Report” by Daniel Lakens et al., posted at MetaArXiv Preprints ]
“…given the broad array of problems that make it difficult to evaluate the evidential value of single studies, it seems more imperative than ever to use meta-analytic techniques…”
“…some researchers have doubted whether meta-analyses can actually produce objective knowledge when the selection, inclusion, coding, and interpretation of studies leaves at least as much room for flexibility in the analysis and conclusions …as is present in single studies…”
“Our goals were to…assess the similarity between published and reproduced meta-analyses and identifying specific challenges that thwart reproducibility…”
“We collected all (54) meta-analyses that were published in Psychological Bulletin, Perspectives on Psychological Science, and Psychonomic Bulletin & Review in 2013 and 2014.”
“From these 54 meta-analyses, only…67% of the meta-analyses (36) included a table where each individual effect size, and the study it was calculated from, was listed.”
“Twenty out of the 54 meta-analyses were randomly selected to be reproduced. Teams attempted to reproduce these meta-analyses as completely as possible based on the specifications in the original articles.”
“Five meta-analyses proved so difficult to reproduce that it was not deemed worthwhile to continue with the attempt.”
“For these five meta-analyses, two did not contain a table with all effect sizes, making it impossible to know which effects were coded, and to check whether effects were coded correctly. Two other meta-analyses could not be reproduced from articles in the published literature because the relevant information was missing, and thus needed access to raw data, and in another meta-analysis not enough details were provided on how effect sizes were selected and calculated.”
“Coding for 9 meta-analyses was completed, and coding for 6 meta-analyses is completed to some extent, but still ongoing.”
“All reproduced meta-analyses can be found on the Open Science Framework:”
“Overall, there was unanimous agreement across all seven teams involved in extracting data from the literature that reproducing published meta-analyses was much more difficult than we expected.”
“Many effect sizes in meta-analyses proved impossible to reproduce due to a combination of a lack of available raw data, difficulty locating the relevant effect size from the article, unclear rules about how multiple effect sizes were combined or averaged, and lack of specification of the effect size formula used to convert effect sizes.”
“A frequent source of disagreement between the meta-analysis and the recoded effect sizes concerned the sample size in the study, or the sample sizes in each group…sample sizes are used to calculate the variance for effect sizes…due to the fact that sample sizes for specific analyses are often not clearly provided in original articles, this information turned out to be surprisingly difficult to code.”
“Doing a meta-analysis is a great way to learn which information one needs to report to make sure the article can be included in a future meta-analysis.”
“Similarly, reproducing a meta-analysis is a great way to learn which information one needs to report in a meta-analysis to make sure a meta-analysis can be reproduced.”
“We can highly recommend researchers interested in performing a meta-analysis to start out by (in part) reproducing a published meta-analysis.”
“Because meta-analyses play an important role in cumulative science, they should be performed with great transparency, and be reproducible. With this project we hope to have provided some important preliminary observations that stress the need for improvement, and provided some practical suggestions to make meta-analyses more reproducible in the future.”
To read the report, click here.

Registered Reports 2.0

[Excerpt taken from the article “What’s next for Registered Reports?” by Chris Chambers, published in Nature]
“For the past six years, I have championed Registered Reports (RRs), a type of research article that is radically different from conventional papers. The 30 or so journals that were early adopters have together published some 200 RRs, and more than 200 journals are now accepting submissions in this format (see ‘Rapid rise’).”
“With RRs on the rise, now is a good time to take stock of their potential and limitations.”
“The Registered Report format splits conventional peer review in half. First, authors write an explanation of how they will probe an important question. …Before researchers do the studies, peer reviewers assess the value and validity of the research question, the rationale of the hypotheses and the rigour of the proposed methods. They might reject the Stage 1 manuscript, accept it or accept it pending revisions to the study design and rationale. This ‘in-principle acceptance’ means that the research will be published whatever the outcome, as long as the authors adhere closely to their protocol and interpret the results according to the evidence.”
“After the Stage 1 manuscript is accepted, the authors formally preregister it in a recognized repository such as the Open Science Framework, either publicly or under a temporary embargo. They then collect and analyse data and submit a completed ‘Stage 2’ manuscript that includes results and a discussion.”
“The Stage 2 submission is sent back to the original reviewers, who cannot question the study rationale or design now that the results are known. Whether the results are judged by reviewers to be new, groundbreaking or exciting is irrelevant to acceptance.”
“An analysis this year suggests that RRs are more likely to report null findings than are conventional articles: 66% of RRs for replication studies did not support initial hypotheses; for RRs of novel studies, the figure was 55%. Estimates for conventional papers range from 5 to 20%.”
“RRs are not a panacea — the format needs constant refinement. It currently sits rather awkwardly between the old world of scientific publishing and the new. Innovations over the next few years should make this format even more powerful, and stimulate wider reforms.”
“Transparency. When RRs first launched, some journals published Stage 2 manuscripts but not those for Stage 1, making it impossible for readers to see whether the completed protocol matched the planned one. In 2018, the Center for Open Science launched a simple tool that places submitted Stage 1 manuscripts in a public registry. This is now used by many journals…”
“Standardization. …Currently, submitted manuscripts are often prepared in word-processing software and contain insufficient methodological detail or linking between predictions and analyses. The next generation of RRs — ‘Registered Reports 2.0’ — is likely to be template-based and could integrate tools such as Code Ocean. This would ensure that analyses are immutable within a stable, self-contained software environment. With standardized metadata and badging, RRs will become useful for systematic reviews and meta-analyses.”
“Efficiency. The review process can be extended even further back in the research life cycle. Under the emerging RR grant model, reviewers award funding and signal in-principle acceptance of a research publication simultaneously or in rapid succession. The Children’s Tumor Foundation and PLoS ONE have pioneered such a partnership, as have Cancer Research UK and the journal Nicotine & Tobacco Research. More are in the works.”
“The lesson of RRs speaks to all areas of science reform.…Instead of pitting what is best for the individual against what is best for all, create a model that benefits everyone — the scientist, their community and the taxpayer — and the rest will come naturally.”
To read the article, click here.

A Major New Initiative in Meta-Science

[Excerpts taken from the article “Research on research gains steam” by Dalmeet Singh Chawla, published in Chemical and Engineering News]
“The use of scientific methodology to study science itself is called metascience. The discipline has become mainstream in recent years, tackling some of the thorniest problems science faces, including a lack of reproducibility of academic literature, biases in peer review, and the fair allocation of research funding.”
“On Sept. 30, metaresearch got another boost when an international coalition of policymakers, funders, universities, publishers, and researchers launched the Research on Research Institute (RoRI), which will be dedicated to tackling metascience questions on a mass scale.”
“James Wilsdon, a research policy scholar at the University of Sheffield and the institute’s founding director, says that the scientific community is “woefully underinvesting” in metascience given the benefits it could have for the “efficiency, dynamism and sustainability” of the research enterprise.”
“The four founding partners of the institute are the biomedical research charity Wellcome Trust, Leiden University’s Centre for Science and Technology Studies, research technology firm Digital Science, and the University of Sheffield.”
“Wilsdon says a consortium of additional strategic partners, including public and private organizations from 11 nations, are already onboard, and conversations are underway with more potential partners.”
“The three broad headings under which RoRI is spearheading its initial projects are decision-making, careers, and culture. Wilsdon thinks the first few initiatives are likely to focus on investigating how funding applications pass peer review and how research money is allocated, including testing different funding models.”
To read the article, click here.

A Nice “Shoutout” to TRN in Nature Sustainability

[Excerpts taken from the article “Economists run experiments, too” by Aiora Zabala, published at Nature Sustainability]
“For most who don’t know this literature, it’s easy to lightly recommend that we should implement taxes, subsidies or regulations to tackle environmental challenges…Economists and behavioural scientists are trying to find the answer by running experiments (much like in life sciences labs). With these, researchers test policy variations before they are implemented…”
“Two years ago, scholars concerned with these questions formed the Research Network on Economic Experiments for the Common Agricultural Policy (REECAP). In early September this year, they gathered in Osnabrück (Germany) to discuss their latest studies, future directions for the field and common concerns.”
“Among the sessions, they included speakers at the interface between science and policy, directly advising European institutions on their agricultural policies, doing environmental lobbying in Brussels, or directly making decisions, for example.”
“Another topic concerned the so-called crisis of reproducibility of published studies, first discussed in psychology and in life sciences, and put on the spot among economists by studies like this and this.”
“To a panel on the topic, I brought the view on the challenges and opportunities of such crisis from an editors’ perspective. Some areas for action are policy checklists and code and data statements (as currently done across Nature journals).”
“Replication studies are increasing and interesting initiatives are ongoing (like the Replication Network)…”
To read the article, click here.



GOODMAN: What’s the True Effect Size? It Depends What You Think

What’s the true effect size? That’s my bottom line question when doing a study or reading a paper. I don’t expect an exact answer, of course. What I want is a probability distribution telling where the true effect size probably lies. I used to think confidence intervals answered this question, but they don’t except under artificial conditions. A better answer comes from Bayes’s formula. But beware of the devil in the priors.
Confidence intervals, like other standard methods such as the t-test, imagine we’re repeating a study an infinite number of times, drawing a different sample each time from the same population. That seems unnatural for basic, exploratory research, where the usual practice is to run a study once (or maybe twice for confirmation).
As I looked for a general way to estimate true effect size from studies done once, I fell into Bayesian analysis. Much to my surprise, this proved to be simple and intuitive. The code for the core Bayesian analysis (available here) is simple, too: just a few lines of R.
The main drawback is the answer depends on your prior expectation. Upon reflection, this drawback may really be a strength, because it forces you to articulate key assumptions.
Being a programmer, I always start with simulation when learning a new statistical method. I model the scenario as a two-stage random process. The first stage selects a population (aka “true”) effect size, dpop, from a distribution; the second carries out a study with that population effect size, yielding an observed effect size, dobs. The studies are simple two-group difference-of-mean studies with equal sample size and standard deviation, and the effect size statistic is the standardized difference (aka Cohen’s d).
I record dpop and dobs from each simulation, producing a table showing which values of dpop give rise to which values of dobs. Then I pick a target value for dobs, say 0.5, and limit the table to rows where dobs is near 0.5. The distribution of dpop from this subset is the answer to my question. In Bayesian-speak, the first-stage distribution is the prior, and the final distribution is the posterior.
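The two-stage procedure just described can be sketched in a few lines of Python with NumPy. This is a re-creation for illustration, not the R code linked above: the function names, the number of simulations, the ±0.05 window around the target, and the N(0.3, 0.1) first-stage distribution are all assumptions. To keep it fast, the sketch draws the difference in sample means and the pooled standard deviation directly from their sampling distributions (valid for normal data with unit standard deviation) instead of generating raw observations.

```python
import numpy as np


def simulate(n, prior_mean=0.3, prior_sd=0.1, n_sims=400_000, seed=1):
    """Two-stage simulation: draw d_pop, then an observed Cohen's d for each study."""
    rng = np.random.default_rng(seed)
    # Stage 1: population ("true") effect sizes drawn from the first-stage distribution.
    d_pop = rng.normal(prior_mean, prior_sd, n_sims)
    # Stage 2: two groups of size n, unit SD, means 0 and d_pop.
    # The difference in sample means is N(d_pop, 2/n) ...
    mean_diff = rng.normal(d_pop, np.sqrt(2.0 / n))
    # ... and the pooled variance is chi-square(2n - 2) / (2n - 2), independent of the means.
    df = 2 * n - 2
    pooled_sd = np.sqrt(rng.chisquare(df, n_sims) / df)
    d_obs = mean_diff / pooled_sd
    return d_pop, d_obs


def posterior_median(n, target=0.5, tol=0.05, **kwargs):
    """Median of d_pop among simulated studies whose d_obs is near the target."""
    d_pop, d_obs = simulate(n, **kwargs)
    keep = np.abs(d_obs - target) < tol
    return np.median(d_pop[keep])


median_small = posterior_median(10)    # n = 10 per group
median_large = posterior_median(200)   # n = 200 per group
print(f"n=10:  median d_pop given d_obs near 0.5 = {median_small:.2f}")
print(f"n=200: median d_pop given d_obs near 0.5 = {median_large:.2f}")
```

Under these assumptions the n=10 median should stay close to the first-stage mean of 0.3, while the n=200 median should move noticeably toward the observed 0.5.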
Now for the cool bit. The Bayesian approach lets us pick a prior that represents our assumptions about the distribution of effect sizes in our research field. From what I read in the blogosphere, the typical population effect size in social science research is 0.3. I model this as a normal distribution with mean=0.3 and small standard deviation, 0.1. I also do simulations with a bigger prior, mean=0.7, to illustrate the impact of the choice.
Figures 1a-d show the results for small and large samples (n=10 or 200) and small and big priors for dobs=0.5. Each figure shows a histogram of simulated data, the prior and posterior distributions (blue and red curves), the medians of the two distributions (blue and red dashed vertical lines), and dobs (gray dashed vertical line).
The posteriors and histograms match pretty well, indicating my software works. For n=10 (left column), the posterior is almost identical to the prior, while for n=200 (right column), it’s shifted toward the observed. It’s a tug-of-war: for small samples, the prior wins, while for large samples, the data is stronger and keeps the posterior closer to the observation. The small prior (top row) pulls the posterior down; the big one (bottom row) pushes it up. Completely intuitive.
But wait. I forgot an important detail: some of the problems we study are “false” (“satisfy the null”). No worries. I model the null effect sizes as normal with mean=0 and very small standard deviation, 0.05, and the totality of effect sizes as a mixture of this distribution and the “true” ones as above. To complete the model, I have to specify the proportion of true vs. null problems. To illustrate the impact, I use 25% true for the small prior and 75% for the big one.
Figures 2a-d show the results.
The priors have two peaks, reflecting the two classes. With 25% true (top row), the false peak is tall and the true peak short; with 75% true (bottom row), it’s the opposite though not as extreme. For small samples (left column), the posterior also has two peaks, indicating that the data does a poor job of distinguishing true from null cases. For big samples (right column), the posterior has a single peak, which is clearly in true territory. As in Figure 1, the small prior (top row) pulls the result down, while the bigger one (bottom row) pushes it up. Again completely intuitive.
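The mixture case can be sketched the same way. The Python code below is an illustrative re-creation and not the linked R code: the function names, simulation count, and ±0.05 window are assumptions, while the mixture components (nulls as N(0, 0.05), “true” effects as N(0.3, 0.1), 25% true) follow the description above. Instead of plotting the posterior, it reports the share of “true” problems among simulated studies whose observed effect is near 0.5, which is a rough proxy for how much mass sits in each of the two peaks.

```python
import numpy as np


def simulate_mixture(n, prop_true, true_mean=0.3, true_sd=0.1, null_sd=0.05,
                     n_sims=400_000, seed=2):
    """Draw d_pop from a true/null mixture, then an observed Cohen's d per study."""
    rng = np.random.default_rng(seed)
    is_true = rng.random(n_sims) < prop_true
    d_pop = np.where(is_true,
                     rng.normal(true_mean, true_sd, n_sims),
                     rng.normal(0.0, null_sd, n_sims))
    # Sampling distributions for two unit-SD groups of size n (normal data assumed).
    mean_diff = rng.normal(d_pop, np.sqrt(2.0 / n))
    df = 2 * n - 2
    d_obs = mean_diff / np.sqrt(rng.chisquare(df, n_sims) / df)
    return is_true, d_pop, d_obs


def share_true_given_obs(n, prop_true, target=0.5, tol=0.05):
    """Fraction of 'true' problems among studies with d_obs near the target."""
    is_true, _, d_obs = simulate_mixture(n, prop_true)
    keep = np.abs(d_obs - target) < tol
    return is_true[keep].mean()


share_small = share_true_given_obs(10, 0.25)    # n = 10, 25% true
share_large = share_true_given_obs(200, 0.25)   # n = 200, 25% true
print(f"n=10:  share of true problems given d_obs near 0.5 = {share_small:.2f}")
print(f"n=200: share of true problems given d_obs near 0.5 = {share_large:.3f}")
```

With n=10 a substantial fraction of the retained studies are still nulls, consistent with a two-peaked posterior; with n=200 almost all of them are “true”, consistent with a single peak.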
What’s not to like?
The devil is in the priors. Small priors yield small answers; big priors yield bigger ones. Assumptions about the proportion of true vs. null problems amplify the impact. Reasonable scientists might choose different priors making it hard to compare results across studies. Unscrupulous scientists may choose priors that make their answers look better, akin to p-hacking.
For this elegant method to become the norm, something has to be done about the priors. Perhaps research communities could adopt standard priors for specific types of studies. Maybe we can use data from reproducibility projects to inform these choices. It seems technically feasible. No doubt I’ve missed some important details, but this seems a promising way to move beyond p-values.
Please post comments on Twitter or Facebook, or contact me by email.
Nat Goodman is a retired computer scientist living in Seattle, Washington. His working years were split between mainstream CS and bioinformatics and orthogonally between academia and industry. As a retiree, he’s working on whatever interests him, stopping from time to time to write papers and posts on aspects that might interest others.


ReproducibiliTeas: Bottoms Up.

[Excerpts taken from the article, “A journal club to fix science” by Amy Orben, published in Nature]
“If science had generations, mine would not be defined by war or Woodstock, but by reproducibility and open science….Early-career researchers do not need to wait passively for coveted improvements. We can create communities and push for bottom-up change. ReproducibiliTea is one way to do this.”
“Sam Parsons, Sophia Crüwell and I (all trainees) started this grass-roots journal club in early 2018, at the experimental-psychology department at the University of Oxford, UK. We hoped to promote a stronger open-science community and more prominent conversations about reproducibility. The initiative soon spread, and is now active at more than 27 universities in 8 countries.”
“During each meeting, a scientific paper lays the groundwork for a conversation. Concerns vary from field to field and institution to institution, so each club focuses on aspects of scientific methods and systems that concern them most. Topics for my group ranged from discussions on replicability (Open Science Collaboration Science 349, aac4716; 2015), to debates about open-access publishing (J. P. Tennant et al. F1000 Res. 5, 632; 2016), the problems of analytical flexibility (J. P. Simmons et al. Psychol. Sci. 22, 1359–1366; 2011) and the potential of Registered Reports, a publication format in which papers are reviewed primarily on the research question and protocol, before results are known (C. D. Chambers Cortex 49, 609–610; 2013).”
“To launch their own ReproducibiliTea group, motivated researchers need only to select some articles and set a time and a place. No minimum group size or meeting frequency is required. They will then join a community of ReproducibiliTea journal clubs that continually discuss improvements and support each other. (For more information, see”
To read the article, click here.

Down With Confidence Intervals. Up With Uncertainty Intervals? Compatibility Intervals?

[Excerpts taken from the article “Are confidence intervals better termed ‘uncertainty intervals’?” by Andrew Gelman and Sander Greenland, published in the BMJ.]
Are confidence intervals better termed “uncertainty intervals?”
Yes—Andrew Gelman
“Confidence intervals can be a useful summary in model based inference. But the term should be “uncertainty interval,” not “confidence interval”…”
“Officially, all that can be interpreted are the long term average properties of the procedure that’s used to construct the interval, but people tend to interpret each interval implicitly in a bayesian way—that is, by acting as though there’s a 95% probability that any given interval contains the true value.”
“Using confidence intervals to rule out zero (or other parameter values) involves all of the well known problems of significance testing. So, rather than constructing this convoluted thing called a confidence procedure, which is defined to have certain properties on average but can’t generally be interpreted for individual cases, I prefer to aim for an uncertainty interval, using the most appropriate statistical methods to get there.”
“Let’s use the term “uncertainty interval” instead of “confidence interval.” The uncertainty interval tells us how much uncertainty we have.”
No—Sander Greenland
“The label “95% confidence interval” evokes the idea that we should invest the interval with 95/5 (19:1) betting odds that the observed interval contains the true value…”
“…the 95% is overconfident because it takes no account of procedural problems and model uncertainties that should reduce confidence in statistical results. Those possibilities include uncontrolled confounding, selection bias, measurement error, unaccounted-for model selection, and outright data corruption.”
“…no conventional interval adequately accounts for procedural problems that afflict data generation or for uncertainties about the statistical assumptions.”
“Nonetheless, all values in a conventional 95% interval can be described as highly compatible with data under the background statistical assumptions, in the very narrow sense of having P>0.05 under those assumptions.”
“In equivalent terms: given any value in the interval and the background assumptions, the data should not seem very surprising. This leads to the intentionally modest term “compatibility interval” as a replacement for ‘confidence interval.’”
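[TRN note: The “P>0.05” reading of a 95% interval can be checked numerically. The Python sketch below assumes the simplest textbook setting, a normally distributed estimate with known standard error, and uses a made-up estimate and standard error; it is an illustration of the idea, not a calculation from the article.]

```python
from statistics import NormalDist

std_normal = NormalDist()


def p_value(theta, estimate, se):
    """Two-sided P value for the hypothesis that the true value equals theta."""
    z = abs(estimate - theta) / se
    return 2.0 * (1.0 - std_normal.cdf(z))


# Hypothetical estimate and standard error, for illustration only.
estimate, se = 1.2, 0.5
z_crit = std_normal.inv_cdf(0.975)
lower, upper = estimate - z_crit * se, estimate + z_crit * se

p_at_limit = p_value(lower, estimate, se)        # value exactly on the interval limit
p_inside = p_value(1.0, estimate, se)            # a value inside the interval
p_outside = p_value(lower - 0.1, estimate, se)   # a value just outside the interval

print(f"95% interval: ({lower:.2f}, {upper:.2f})")
print(f"P at limit = {p_at_limit:.3f}, inside = {p_inside:.3f}, outside = {p_outside:.3f}")
```

[Under these assumptions, every value inside the interval has P above 0.05, the limits have P equal to 0.05, and values outside have P below 0.05.]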
“In summary, both “confidence interval” and “uncertainty interval” are deceptive terms, for they insinuate that we have achieved valid quantification of confidence or uncertainty despite omitting important uncertainty sources.”
“Replacing “significance” and “confidence” labels with “compatibility” is a simple step to encourage honest reporting of how little we can confidently conclude from our data.”
To read the full article, click here. (NOTE: Article is behind a paywall.)

Predicting Reproducibility. No PhD Required.

[Excerpts taken from the article “Laypeople Can Predict Which Social Science Studies Replicate” by Suzanne Hoogeveen, Alexandra Sarafoglou, and Eric-Jan Wagenmakers, posted at PsyArXiv Preprints]
“…we assess the extent to which a finding’s replication success relates to its intuitive plausibility. Each of 27 high-profile social science findings was evaluated by 233 people without a PhD in psychology. Results showed that these laypeople predicted replication success with above-chance performance (i.e., 58%). In addition, when laypeople were informed about the strength of evidence from the original studies, this boosted their prediction performance to 67%.”
“Participants were presented with 27 studies, a subset of the studies included in the Social Sciences Replication Project…and the Many Labs 2 Project.”
“For each study, participants read a short description of the research question, its operationalization, and the key finding…In the Description Only condition, solely the descriptive texts were provided; in the Description Plus Evidence condition, the Bayes factor and its verbal interpretation (e.g., “moderate evidence”) for the original study were added to the descriptions.”
“After the instructions, participants…indicated whether they believed that this study would replicate or not (yes / no), and expressed their confidence in their decision on a slider ranging from 0 to 100.”
“Figure 1 displays participants’ confidence ratings concerning the replicability of each of the 27 included studies, ordered according to the averaged confidence score.”
“Positive ratings reflect confidence in replicability, and negative ratings reflect confidence in non-replicability, with −100 denoting extreme confidence that the effect would fail to replicate. Note that these data are aggregated across the Description Only and the Description Plus Evidence condition.”
“The top ten rows indicate studies for which laypeople showed relatively high agreement that the associated studies would replicate. Out of these ten studies, nine replicated and only one did not (i.e., the study by Anderson, Kraus, Galinsky, & Keltner, 2012; note that light-grey indicates a successful replication, and dark-grey indicates a failed replication).”
“The bottom four rows indicate studies for which laypeople showed relatively high agreement that the associated studies would fail to replicate. Consistent with laypeople’s predictions, none of these four studies replicated.”
“For the remaining 13 studies in the middle rows, the group response was relatively ambiguous, as reflected by a bimodal density that is roughly equally distributed between the negative and positive end of the scale. Out of these 13 studies, five replicated successfully and eight failed to replicate successfully.”
“Overall, Figure 1 provides a compelling demonstration that laypeople are able to predict whether or not high-profile social science findings will replicate successfully.”
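The counts quoted above can be tallied in a few lines. This is only an arithmetic sketch of the group-level pattern in Figure 1 as described in the excerpts. The group labels are mine, and the resulting share is not the same quantity as the individual-level 58% and 67% prediction rates reported earlier.

```python
# Counts as quoted in the excerpts; the group labels are illustrative.
groups = {
    "high agreement: will replicate": {"studies": 10, "replicated": 9},
    "high agreement: will not replicate": {"studies": 4, "replicated": 0},
    "ambiguous": {"studies": 13, "replicated": 5},
}

total_studies = sum(g["studies"] for g in groups.values())
total_replicated = sum(g["replicated"] for g in groups.values())

# For the two clear groups, a group prediction is "correct" when the outcome
# matches the direction of agreement.
yes = groups["high agreement: will replicate"]
no = groups["high agreement: will not replicate"]
clear_studies = yes["studies"] + no["studies"]
clear_correct = yes["replicated"] + (no["studies"] - no["replicated"])

ambiguous = groups["ambiguous"]
ambiguous_rate = ambiguous["replicated"] / ambiguous["studies"]

print(f"studies: {total_studies}, replicated: {total_replicated}")
print(f"clear group predictions correct: {clear_correct} of {clear_studies}")
print(f"replication rate among ambiguous studies: {ambiguous_rate:.2f}")
```

When the group response was clear, it matched the replication outcome in 13 of 14 studies. The 13 ambiguous studies split roughly in the ratio one would expect if the outcome were genuinely uncertain.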
“…the relative ordering of laypeople’s confidence in replicability for a given set of studies may provide estimations of the relative probabilities of replication success.”
“If a replicator’s goal is to purge the literature of unreliable effects, he or she may start by conducting replications of the studies for which replication failure is predicted by naive forecasters.”
“Alternatively, if the goal is to clarify the reliability of studies for which replication outcomes are most uncertain, one could select studies for which the distribution of the expected replicability is characterized by a bi-modal shape.”
“As such, prediction surveys may serve as ‘decision surveys’, instrumental in the selection stage of replication research (cf. Dreber et al., 2015). These informed decisions could not only benefit the replicator, but also optimize the distribution of funds and resources for replication projects.”
To read the article, click here.



New Cambridge University Press Journal Seeks Non-Novel Research

[Excerpts taken from the article “New Journal Focused on Reproducibility” by Colleen Flaherty, published at]
“Cambridge University Press is launching a new open-access journal to help address science’s reproducibility issues and glacial peer-review timelines. Experimental Results, announced today, gives researchers a “place to publish valid, standalone experimental results, regardless of whether those results are novel, inconclusive, negative or supplementary to other published work,” according to the press. It will also publish work about attempts to reproduce previously published experiments.”
To read more, click here.