REED & WU: EiR* – Missing Data

[* EiR = Econometrics in Replications, a feature of TRN that highlights useful econometrics procedures for re-analysing existing research.]

NOTE: This blog uses Stata for its estimation. All the data and code necessary to reproduce the results in the tables below are available at Harvard’s Dataverse.

Missing data is ubiquitous in economics. Standard practice is to drop observations for which any variables have missing values. At best, this can result in diminished power to identify effects. At worst, it can generate biased estimates. Old-fashioned ways of addressing missing data assigned values using some form of interpolation or imputation. For example, gaps in time series data might be filled in using linear interpolation, while missing values in cross-sectional data might be replaced with predicted values from a regression. These procedures are now known to be flawed (Allison, 2001; Enders, 2010).

The preferred way to deal with missing data is to use maximum likelihood (ML) or multiple imputation (MI), assuming the data are “missing at random”. Missing at random (MAR) essentially means that, once the observed data are taken into account, the probability that a variable is missing does not depend on the value of that variable. For example, if a question about illicit drug use is more likely to go unanswered for respondents who use drugs, then those data would not be MAR. Assuming that the data are MAR, both ML and MI will produce estimates that are consistent and asymptotically efficient.

ML is in principle the easiest to perform. In Stata, one can use the structural equation modelling command (“sem”) with the option “method(mlmv)”. That’s it! Unfortunately, the simplicity of ML is also its biggest disadvantage. For linear models, ML simultaneously estimates means, variances, and covariances while also accounting for the incomplete records associated with missing data. Not infrequently, this causes convergence problems. This is particularly a problem for panel data where one might have a large number of fixed effects.

In this blog, we illustrate how to apply both ML and MI to a well-cited study on mortality and inequality by Andrew Leigh and Christopher Jencks (Journal of Health Economics, 2007). Their analysis focused on the relationship between life expectancy and income inequality, measured by the share of pre-tax income going to the richest 10% of the population. Their data consisted of annual observations from 1960-2004 for Australia, Canada, France, Germany, Ireland, the Netherlands, New Zealand, Spain, Sweden, Switzerland, the UK, and the US. We use their study both because their data and code are publicly available, and because much of the original data were missing.

The problem is highlighted in TABLE 1, which uses a reconstruction of L&J’s original dataset. The full dataset has 540 observations. The dependent variable, “Life expectancy”, has approximately 11 percent missing values. The focal variable, “Income share of the richest 10%”, has approximately 24 percent missing values. The remaining control variables vary widely in their missingness. Real GDP has no missing values. Education has the most missing values, with fully 80% of the variable’s values missing. This is driven by the fact that the Barro and Lee data used to measure education only reports values at five-year intervals.
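To see where the 80% figure for education comes from, the following minimal Python sketch (not part of the original Stata code) computes the share of missing values for a hypothetical single-country panel. The variable names, the placeholder values, and the assumption that education is recorded only in years divisible by five are ours, for illustration only.

```python
# Share of missing values for one variable; None marks a missing value.
def missing_share(rows, variable):
    values = [row[variable] for row in rows]
    return sum(v is None for v in values) / len(values)

# Hypothetical panel for one country, 1960-2004 (45 annual observations).
# Assumption: education is observed only in years divisible by five.
rows = [
    {"year": year, "gdp": 1.0, "edu": 8.0 if year % 5 == 0 else None}
    for year in range(1960, 2005)
]

edu_missing = missing_share(rows, "edu")  # 36 of the 45 years are missing
gdp_missing = missing_share(rows, "gdp")  # no gaps
print(f"edu: {edu_missing:.0%}, gdp: {gdp_missing:.0%}")
```

With nine observed years out of 45, four out of every five education values are missing, in line with the share reported above.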

In fact, the problem is more serious than TABLE 1 indicates. If we run the regression using L&J’s specification (cf. Column 7, Table 4 in their study), we obtain the results in Column (1) of TABLE 2. The estimates indicate that a one-percentage point increase in the income share of the richest 10% is associated with an increase in life expectancy of 0.003 years, a negligible effect in terms of economic significance, and statistically insignificant. Notably, this estimate is based on a mere 64 observations (out of 540).

However, these are not the results that L&J reported in their study. No doubt because of the small number of observations, they used linear interpolation on some (but not all) of their data to fill in missing values. Applying their approach to our data yields the results in Column (2) of TABLE 2. There are two problems with using their approach.

First, for various reasons, L&J did not fill in values for all the missing values. They ended up using only 430 out of a possible 540 observations. As a result, their estimates did not exploit all the information that was available to them. Second, interpolation replaces missing values with their predicted values without accounting for the randomness that occurs in real data. This biases standard errors, usually downwards. ML and MI allow one to do better.
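The second problem can be made concrete with a small Python sketch of linear interpolation. This is an illustration under our own assumptions, not L&J's code, and the life-expectancy values are hypothetical. The filled-in points lie exactly on a straight line between their observed neighbours, so they carry none of the noise that real observations would have.

```python
def linear_interpolate(values):
    """Fill interior runs of None with a straight line between observed neighbours.

    Gaps at the start or end of the series are left as None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled

series = [70.0, None, None, None, 72.0]  # hypothetical values
filled = linear_interpolate(series)

# Second differences are all zero: the imputed points add no variation
# around the line joining the two observed values.
second_diffs = [
    filled[i + 1] - 2 * filled[i] + filled[i - 1]
    for i in range(1, len(filled) - 1)
]
print(filled, second_diffs)
```

Treating such values as if they were real observations overstates the amount of information in the sample, which is one way to see why the resulting standard errors tend to be too small.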

ML is the easiest method to apply. To estimate the regression in Table 2 requires a one-line command:

sem (le <- ts10 gdp gdpsq edu phealth thealth id2-id12 year2-year45), method(mlmv) vce(cluster id)

The “sem” command calls up Stata’s structural equation modelling procedure. The option “method(mlmv)” tells Stata to use maximum likelihood to accommodate missing values. If this option is omitted from the above, then the command will produce results identical to those in Column (1) of TABLE 2, except that the standard errors will be slightly smaller.

While the simplicity of ML is a big advantage, it also introduces complications. Specifically, ML estimates all the parameters simultaneously. The inclusion of 11 country fixed effects and 44 year dummies makes the number of elements in the variance-covariance matrix huge. This, in combination with the fact that ML simultaneously integrates over distributions of variables to account for missing values, creates computational challenges. The ML procedure called up by the command above did not converge after 12 hours. As a result, we next turn to MI.

Unlike ML, MI fills in missing values with actual data. The imputed values are created to incorporate the randomness that occurs in real data. The most common MI procedure assumes that all of the variables are distributed multivariate normal. It turns out that this is a serviceable assumption even if the regression specification includes variables that are not normally distributed, like dummy variables (Horton et al., 2003; Allison, 2006).

As the name suggests, MI creates multiple datasets using a process of Monte Carlo simulation. Each of the datasets produces a separate set of estimates. These are then combined to produce one overall set of estimation results. Because each data set is created via a simulation process that depends on randomness, each dataset will be different. Furthermore, unless a random seed is set, different attempts will produce different results. This is one disadvantage of MI versus ML.
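The combining step is usually done with Rubin's rules, which to our understanding are what Stata's “mi estimate” applies. The Python sketch below shows the calculation for a single coefficient. The function name and the numbers are hypothetical and are not taken from our results.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Combine one coefficient across M imputed datasets using Rubin's rules.

    estimates: the M point estimates
    variances: the M squared standard errors
    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)        # average sampling variance
    between = variance(estimates)   # variance across imputations (divisor M - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total

# Hypothetical estimates from three imputed datasets.
pooled, total = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled, total ** 0.5)  # pooled coefficient and pooled standard error
```

The between-imputation term is what carries the extra uncertainty due to the missing values. It is zero only when every imputed dataset gives the same estimate.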

A second disadvantage is that MI requires a number of subjective assessments to set key parameters. The key parameters are (i) the “burnin”, the number of datasets that are initially discarded in the simulation process; (ii) the “burnbetween”, the number of intervening datasets that are discarded between retained datasets to maintain dataset independence; and (iii) the total number of imputed datasets that are used for analysis.

The first two parameters are related to the properties of “stationarity” and “independence”. The analogue to convergence in estimated parameters in ML is convergence in distributions in MI. To assess these two properties we first do a trial run of imputations.

In the command “mi impute mvn”, the variables with missing values are listed to the left of the “=” sign, while the variables listed to the right are complete.

mi impute mvn le ts10 edu phealth thealth im  = gdp gdpsq id2-id12 year2-year45, prior(jeffreys)  mcmconly rseed(123) savewlf(wlf, replace)

The option “mcmconly” lets Stata know that we are not retaining the datasets for subsequent analysis, but only using them to assess their characteristics.

The option “rseed(123)” ensures that we will obtain the same data every time we run this command.

The option “prior(jeffreys)” specifies a “noninformative” (Jeffreys) prior for the posterior predictive distribution used to generate the imputed datasets. This makes the distribution used to impute the missing values solely determined by the estimates from the most recent regression step, with no additional information supplied by the prior.

Lastly, the option “savewlf(wlf, replace)” creates an aggregate variable called the “worst linear function” that allows one to investigate whether the imputed datasets are stationary and independent.

Note that Stata sets the default values for “burnin” and “burnbetween” at 100 and 100.
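To make the roles of the two parameters concrete, the Python sketch below lists which iterations of the simulation would be retained as imputed datasets. It assumes that the first retained dataset is drawn immediately after the burn-in and that each later one follows after a further “burnbetween” iterations. This is our reading of the procedure, offered as an illustration rather than a description of Stata's internals.

```python
def retained_iterations(burnin, burnbetween, n_datasets):
    """Iteration numbers at which imputed datasets are kept (illustrative)."""
    return [burnin + k * burnbetween for k in range(n_datasets)]

# Default settings and 10 imputed datasets.
default_schedule = retained_iterations(100, 100, 10)

# A longer burn-in of 500 shifts every retained draw later in the chain.
longer_schedule = retained_iterations(500, 100, 10)

print(default_schedule[0], default_schedule[-1])
print(longer_schedule[0], longer_schedule[-1])
```

Under these assumptions, a longer burn-in discards more of the early, possibly non-stationary part of the chain, while “burnbetween” controls how far apart the retained datasets are.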

The next set of key commands are given below.

use wlf, clear

tsset iter

tsline wlf, ytitle(Worst linear function) xtitle(Burn-in period) name(stable2,replace)

ac wlf, title(Worst linear function) ytitle(Autocorrelations) note("") name(ac2,replace)

The “tsline” command produces a “time series” graph of the “worst linear function” where “time” is measured by number of simulated datasets. We are looking for trends in the data. That is, do the estimated parameters (which include elements of the variance-covariance matrix) tend to systematically depart from the overall mean?

The graph above is somewhat concerning because it appears to first trend up and then trend down. As a result, we increase the “burnin” value to 500 from its default value of 100 with the following command. Why 500? We somewhat arbitrarily choose a number that is substantially larger than the previous “burnin” value.

mi impute mvn le ts10 edu phealth thealth im  = gdp gdpsq id2-id12 year2-year45, prior(jeffreys) mcmconly burnin(500) rseed(123) savewlf(wlf, replace)

use wlf, clear

tsset iter

tsline wlf, ytitle(Worst linear function) xtitle(Burn-in period) name(stable2,replace)

ac wlf, title(Worst linear function) ytitle(Autocorrelations) note("") name(ac2,replace)

This looks a lot better. The trending that is apparent in the first half of the graph is greatly reduced in the second half. We subjectively determine that this demonstrates sufficient “stationarity” to proceed. Note that there is no formal test to determine stationarity.

The next thing is to check for independence. The posterior distributions used to impute the missing values rely on Bayesian updating. While our use of the Jeffreys prior reduces the degree to which contiguous imputed datasets are related, there is still the opportunity for correlations across datasets. The “ac” command produces a correlogram of the “worst linear function” that allows us to assess independence. This is produced below.

This correlogram indicates that as long as we retain imputed datasets that are at least “10 datasets apart”, we should be fine. The default value of 100 for “burnbetween” is thus more than sufficient.
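For readers who want to see what a correlogram measures, the Python sketch below computes sample autocorrelations for a simulated chain and reports the first lag at which the autocorrelation becomes small. The chain is a hypothetical AR(1) series standing in for the worst linear function, and the 0.1 cut-off is an arbitrary choice of ours. Stata's “ac” command instead plots the autocorrelations with confidence bands.

```python
import random

def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    centre = sum(series) / n
    denom = sum((x - centre) ** 2 for x in series)
    numer = sum(
        (series[t] - centre) * (series[t - lag] - centre)
        for t in range(lag, n)
    )
    return numer / denom

def first_small_lag(series, threshold, max_lag):
    """Smallest lag whose absolute autocorrelation falls below the threshold."""
    for lag in range(1, max_lag + 1):
        if abs(autocorrelation(series, lag)) < threshold:
            return lag
    return None

# Hypothetical chain: AR(1) with coefficient 0.6 (not our actual WLF values).
rng = random.Random(123)
chain = [0.0]
for _ in range(1999):
    chain.append(0.6 * chain[-1] + rng.gauss(0.0, 1.0))

thin = first_small_lag(chain, threshold=0.1, max_lag=50)
print("first lag with |autocorrelation| < 0.1:", thin)
```

Any value of “burnbetween” comfortably above such a lag should leave the retained datasets close to independent.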

The remaining parameter to be set is the total number of imputed datasets to use for analysis. For this we use a handy, user-written Stata (and SAS) command from von Hippel (2020) called “how_many_imputations”.

The problem with random data is that it produces different results each time. “how_many_imputations” allows us to set the number of imputations so that the variation in estimates will remain within some pre-determined threshold value. By default, the number of imputations is set so that the coefficient of variation of the estimated standard errors is equal to 5%.

It works like this. First we create a small initial set of imputed datasets. The command below imputes 10 datasets (“add(10)”).

mi impute mvn le ts10 edu phealth thealth im  = gdp gdpsq id2-id12 year2-year45, prior(jeffreys) burnin(500) add(10) rseed(123)

We then estimate a fixed effects regression for each of the 10 datasets. Note that we use the standard Stata command for “xtreg, fe” after “mi estimate:”

mi xtset id year

mi estimate: xtreg le ts10 gdp gdpsq edu phealth thealth year2-year45, fe vce(cluster id)


The command “how_many_imputations” then calculates the number of imputed datasets needed for the estimated standard errors to have a coefficient of variation of 5%. In this particular case, the output is given by:

The output says to create 182 more imputed datasets.
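The calculation behind this recommendation is, to our reading of von Hippel (2020), a quadratic rule: the required number of imputations is roughly 1 + 0.5 × (FMI / CV)², where FMI is the fraction of missing information and CV is the target coefficient of variation of the standard error. In the two-stage version, a conservative upper bound for the FMI is estimated from the pilot imputations. The Python sketch below illustrates the arithmetic. The FMI value is hypothetical and is not the value from our data.

```python
import math

def imputations_needed(fmi, cv_se=0.05):
    """Quadratic rule: imputations needed for a target CV of the standard error."""
    raw = 1 + 0.5 * (fmi / cv_se) ** 2
    # Round first so that floating-point noise cannot push an exact
    # integer up to the next whole number.
    return math.ceil(round(raw, 9))

pilot = 10        # imputations already created in the first stage
fmi_upper = 0.6   # hypothetical upper bound on the FMI

total = imputations_needed(fmi_upper)
additional = max(total - pilot, 0)
print(total, additional)
```

Because the rule is quadratic in the FMI, datasets with a lot of missing information call for many more imputations. In our application, the 182 additional datasets bring the total to 192 once the 10 pilot imputations are included.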

We can feed this number directly into the “mi impute” command using the “add(`r(add_M)')” option:

mi impute mvn le ts10 edu phealth thealth im  = gdp gdpsq id2-id12 year2-year45, prior(jeffreys) burnin(500) add(`r(add_M)')

After running the command above, our stored data now consists of 104,220 observations: The initial set of 540 observations plus 192 imputed datasets × 540 observations/dataset. To combine the individual estimates from each dataset to get an overall estimate, we use the following command:

mi xtset id year

mi estimate: xtreg le ts10 gdp gdpsq edu phealth thealth year2-year45, fe vce(cluster id)

The results are reported in Column (3) of Table 2. A further check re-runs the program using different seed numbers. These show little variation, confirming the robustness of the results.

In calculating standard errors, Table 2 follows L&J’s original procedure of estimating standard errors clustered on countries. One might want to improve on this given that their sample only included 12 countries.

An alternative to the “mi estimate” command above is to use a user-written program that does wild cluster bootstrapping. One such package is “wcbregress”. While not all user-written programs can be accommodated by Stata’s “mi estimate”, one can use “wcbregress” by modifying the “mi estimate” command as follows:

mi estimate, cmdok: wcbregress le ts10 gdp gdpsq edu phealth thealth year2-year45, group(id)

A comparison of Columns (3) and (1) reveals what we have to show for all our work. Increasing the number of observations substantially reduced the sizes of the standard errors. The standard error of the focal variable, “Income share of the richest 10%”, decreased from 0.051 to 0.035.

While the estimated coefficient remained statistically insignificant for this variable, the smaller standard errors boosted two other variables into significance: “Real GDP per capita squared” and “Log public health spending pc”. Furthermore, the larger sample provides greater confidence that the estimated coefficients are representative of the population from which we are sampling.

Overall, the results provide further support for the claim of Leigh & Jencks (2007) that the relationship between inequality and mortality is small and statistically insignificant.

Given that ML and MI estimation procedures are now widely available in standard statistical packages, they should be part of the replicator’s standard toolkit for robustness checking of previously published research. 

Weilun (Allen) Wu is a PhD student in economics at the University of Canterbury. This blog covers some of the material that he has researched for his thesis. Bob Reed is Professor of Economics and the Director of UCMeta at the University of Canterbury.


Allison, P. D. (2001). Missing data. Sage publications.

Enders, C. K. (2010). Applied missing data analysis. Guilford press.

Leigh, A., & Jencks, C. (2007). Inequality and mortality: Long-run evidence from a panel of countries. Journal of Health Economics, 26(1), 1-24.

Horton, N. J., Lipsitz, S. R., & Parzen, M. (2003). A potential for bias when rounding in multiple imputation. The American Statistician, 57(4), 229-232.

Allison, P. (2006, August). Multiple imputation of categorical variables under the multivariate normal model. In Annual Meeting of the American Sociological Association, Montreal.

Von Hippel, P. T. (2020). How many imputations do you need? A two-stage calculation using a quadratic rule. Sociological Methods & Research, 49(3), 699-718.

BRODEUR: Launching the Institute for Replication (I4R)

Replication is key to the credibility of, and confidence in, research findings. As falsification checks of past evidence, replication efforts contribute in essential ways to the production of scientific knowledge. They allow us to assess which findings are robust, making science a self-correcting system, with major downstream effects on policy-making. Despite these benefits, reproducibility and replicability rates are surprisingly low, and direct replications are rarely published. Addressing these challenges requires innovative approaches in how we conduct, reward, and communicate the outcomes of reproductions and replications.

That is why we are excited to announce the official launch of the Institute for Replication (I4R), an institute working to improve the credibility of science by systematically reproducing and replicating research findings in leading academic journals. Our team of collaborators supports researchers and pursues this aim by

– Reproducing, conducting sensitivity analysis and replicating results of studies published in leading journals.

– Establishing an open access website to serve as a central repository containing the replications, responses by the original authors and documentation.

– Developing and providing access to educational material on replication and open science.

– Preparing standardized file structure and code and documentation aimed at facilitating reproducibility and replicability by the broader community.

How I4R works

Our primary goal is to promote and generate replications. Replications may be achieved using the same or different data and procedures/codes, and a variety of definitions are being used.

While I4R is not a journal, we are actively looking for replicators and have an ongoing list of studies we’re looking to be replicated. Once a set of studies has been selected by I4R, our team of collaborators will confirm that the codes and data provided by the selected studies are sufficient to reproduce their results. Once that has been established, our team recruits replicators to test the robustness of the main results of the selected studies.

For their replication, replicators may use the Social Science Reproduction Platform. We have also developed a template for writing replications, which provides examples of robustness checks and shows how to report the replication results. Once the replication is completed, we will send a copy to the original author(s), who will have the opportunity to provide an answer. Both the replication and the answer from the original author(s) will be released simultaneously on our website and in our working paper series.

Replicators may decide to remain anonymous. This decision can be made at any point during the process: at the outset, once the replication is completed, or once the original author(s) have provided an answer. See the Conflict of Interest page for more details.

We will help replicators publish their work. Replicators will also be invited to co-author a large meta-analysis paper which will combine the work of all replicators and answer questions such as which types of studies replicate and what characterizes results that replicate. For more on publishing replications, keep reading!

We need your help

I4R is open to all researchers interested in advancing the reproducibility and replicability of research. We need your help reproducing and replicating as many studies as possible. Please contact us if you are interested in helping out. We are also actively looking for researchers with large networks to serve on the editorial board, especially in macroeconomics and, within political science, in international relations.

Beyond helping out with replication efforts, you can help our community by bringing replication to your classroom. If you want to teach replication in class assignments, our team has developed some resources that might be of interest. A list of educational resources is available here.

A very useful resource is the Social Science Reproduction Platform (SSRP) which was developed by our collaborators at the Berkeley Initiative for Transparency in the Social Sciences in collaboration with the AEA Data Editor. This is a platform for systematically conducting and recording reproductions of published social science research. The SSRP can be easily incorporated as a module in applied social science courses at graduate and undergraduate levels. Students can use the platform and materials with little to no supervision, covering learning activities such as assessing and improving the reproducibility of published work and applying good coding and data management practices. Guidance for instructors such as how to select a paper, timelines and grading strategy is available here.

Reach out to us if you want to learn more about the SSRP and other teaching resources. We are here to help!

Where to publish replications

Incentives for replications are currently limited, with a small number of replications published in top journals. Moreover, reproducing or replicating others’ work can lead to disagreements with the original author(s) whose work is re-analyzed. One of I4R’s main objectives is to address these challenges and help researchers conduct and disseminate reproductions and replications.

As a first step to better understand publication possibilities for replicators, our collaborators (Jörg Peters and Nathan Fiala) and the Chair, Abel Brodeur, have been contacting the editors of top economics, finance and political science journals, asking whether they are willing to publish comments on papers published in their own journal and/or comments on studies published elsewhere. The answers are made publicly available on our website. We also highlight special issues/symposiums dedicated to replications and journals which strictly publish comments. Please contact us if you want to advertise other replication efforts or special issues related to open science and replications.

We will continue developing new and exciting features based on input from the community. Do not hesitate to reach out to us!

RESOURCES: Twitter @I4Replication

Abel Brodeur is Associate Professor in the Department of Economics, University of Ottawa, and founder and chair of the Institute for Replication (I4R).

REPLICATION: Losing is NOT the Key to Winning

[Excerpts are taken from the article “Does Losing Lead to Winning? An Empirical Analysis for Four Sports” by Bouke Klein Teeselink, Martijn J. van den Assem, and Dennie van Dolder, forthcoming in Management Science.]

“In an influential paper, Berger and Pope (2011, henceforth BP) argue that lagging behind halfway through a competition does not necessarily imply a lower likelihood of winning, and that being slightly behind can actually increase the chance of coming out on top.”

“To test this hypothesis, BP analyze more than sixty thousand professional and collegiate basketball matches. Their main analyses focus on the score difference at half-time because the relatively long break allows players to reflect on their position relative to their opponent.” 

“BP find that National Basketball Association (NBA) teams that are slightly behind are between 5.8 and 8.0 percentage points more likely to win the match than those that are slightly ahead.”

“The present paper…extends the analysis of BP to large samples of Australian football, American football, and rugby matches, and then revisits the analysis of basketball.”

“In our main analyses, the running variable is the score difference at half-time and the cutoff value is zero. We estimate the following regression model:”
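[The equation itself did not survive in this excerpt. A specification consistent with the definitions that follow would be the one below, where the symbols α, β, γ and ε are labels we have assumed for the intercept, the two slope terms and the error term.]

```latex
Y_i = \alpha + \tau T_i + \beta X_i + \gamma \,(T_i \times X_i) + \varepsilon_i
```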

“where Yi is an indicator variable that takes the value of 1 if team i wins the match, and Xi is the half-time score difference between team i and the opposing team.”

“The treatment variable Ti takes the value of 1 if team i is behind at half-time. The coefficient τ represents the discontinuity in the winning probability at a zero score difference. This coefficient is positive under the hypothesis that being slightly behind improves performance. The interaction term Ti × Xi allows for different slopes above and below the cutoff.”
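Because the model allows a separate intercept and slope on each side of the cutoff, the discontinuity can also be obtained by fitting one straight line to the trailing teams and another to the leading teams, and then comparing the two intercepts at a score difference of zero. The Python sketch below illustrates this with made-up win rates. The function names and the data are hypothetical, and dropping ties at half-time is our assumption rather than something stated in the excerpt.

```python
def ols_line(xs, ys):
    """Intercept and slope of a simple least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rd_discontinuity(score_diff, outcome):
    """Jump in the outcome at zero: intercept when behind minus intercept when ahead."""
    behind = [(x, y) for x, y in zip(score_diff, outcome) if x < 0]
    ahead = [(x, y) for x, y in zip(score_diff, outcome) if x > 0]
    intercept_behind, _ = ols_line(*zip(*behind))
    intercept_ahead, _ = ols_line(*zip(*ahead))
    return intercept_behind - intercept_ahead

# Hypothetical win rates by half-time score difference, with a built-in
# jump of 0.05 for teams that are behind.
score_diff = [-5, -4, -3, -2, -1, 1, 2, 3, 4, 5]
win_rate = [0.5 + 0.02 * x + (0.05 if x < 0 else 0.0) for x in score_diff]

tau = rd_discontinuity(score_diff, win_rate)
print(round(tau, 3))
```

A positive value corresponds to the hypothesis that being slightly behind improves the chance of winning. A value near zero corresponds to no discontinuity.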

“If the assumption of a piecewise linear relationship between the winning probability and the half-time score difference is violated, then the regression model will generate a biased estimate of the treatment effect. Hahn et al. (2001) propose the use of local-linear regression to solve this problem.”

“Even if the true relationship is non-linear, a linear specification can provide a close approximation within a limited bandwidth around the cutoff. A downside of this solution is that it reduces the effective number of observations and therefore the precision of the estimate.”

“To strike the appropriate balance between bias and precision, we use the local-linear method proposed by Calonico et al. (2014). This method selects the bandwidth that minimizes the mean squared error, corrects the estimated treatment effect for any remaining non-linearities within the bandwidth, and linearly downweights observations that are farther away from the cutoff.”
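The linear downweighting described in this passage corresponds to what is commonly called a triangular kernel. The Python sketch below shows the weights it would assign. The bandwidth of 10 points is a hypothetical value chosen for illustration, whereas the method quoted above selects the bandwidth from the data.

```python
def triangular_weight(score_diff, bandwidth):
    """Weight that falls linearly from 1 at the cutoff to 0 at the bandwidth."""
    return max(0.0, 1.0 - abs(score_diff) / bandwidth)

# Hypothetical bandwidth of 10 points around a half-time tie.
weights = {x: triangular_weight(x, 10) for x in (-12, -5, 0, 5, 10)}
print(weights)
```

Matches outside the bandwidth receive zero weight and so drop out of the estimation entirely, which is the source of the loss in precision mentioned in the excerpt.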

“We find no supporting evidence for [BP’s result that marginally trailing improves the odds of winning in Australian football, American football, and rugby]: the estimated effects are sometimes positive and sometimes negative, and statistically always insignificant.”

“We then also revisit the phenomenon for basketball. We replicate the finding that half-time trailing improves the chances of winning in NBA matches from the period analyzed in BP, but consistently find null results for NBA matches from outside this period, for the sample of NCAA matches analyzed in BP, for more recent NCAA matches, and for WNBA matches.”

“Moreover, our high-powered meta-analyses across the different sports and competitions cannot reject the hypothesis of no effect of marginally trailing on winning, and the confidence intervals suggest that the true effect, if existent at all, is likely relatively small.”
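The excerpt does not say which meta-analytic estimator the authors use. As a generic illustration only, the Python sketch below pools several estimates with inverse-variance (fixed-effect) weights and forms a 95% confidence interval. The estimates and standard errors are hypothetical.

```python
import math

def fixed_effect_pool(estimates, std_errors):
    """Inverse-variance weighted average and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Hypothetical discontinuity estimates from four competitions.
estimates = [0.02, -0.01, 0.00, 0.01]
std_errors = [0.02, 0.02, 0.01, 0.01]

pooled, pooled_se = fixed_effect_pool(estimates, std_errors)
lower, upper = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"pooled = {pooled:.4f}, 95% CI = [{lower:.4f}, {upper:.4f}]")
```

Pooling shrinks the standard error below that of any single study, which is why a meta-analysis can rule out effects that individual samples cannot.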

“In our view, the performance-enhancing effect documented in BP is most likely a chance occurrence.”

To read the full article, click here.

Results from the “Reproducibility Project: Cancer Biology”

[Excerpts are taken from the article “Investigating the replicability of preclinical cancer biology” by Errington et al., published in eLife.]

“Large-scale replication studies in the social and behavioral sciences provide evidence of replicability challenges (Camerer et al., 2016; Camerer et al., 2018; Ebersole et al., 2016; Ebersole et al., 2020; Klein et al., 2014; Klein et al., 2018; Open Science Collaboration, 2015).”

“In psychology, across 307 systematic replications and multisite replications, 64% reported statistically significant evidence in the same direction and effect sizes 68% as large as the original experiments (Nosek et al., 2021).”

“In the biomedical sciences, the ALS Therapy Development Institute observed no effectiveness of more than 100 potential drugs in a mouse model in which prior research reported effectiveness in slowing down disease, and eight of those compounds were tried and failed in clinical trials costing millions and involving thousands of participants (Perrin, 2014).”

“Of 12 replications of preclinical spinal cord injury research in the FORE-SCI program, only two clearly replicated the original findings – one under constrained conditions of the injury and the other much more weakly than the original (Steward et al., 2012).”

“And, in cancer biology and related fields, two drug companies (Bayer and Amgen) reported failures to replicate findings from promising studies that could have led to new therapies (Prinz et al., 2011; Begley and Ellis, 2012). Their success rates (25% for the Bayer report, and 11% for the Amgen report) provided disquieting initial evidence that preclinical research may be much less replicable than recognized.”

“In the Reproducibility Project: Cancer Biology, we sought to acquire evidence about the replicability of preclinical research in cancer biology by repeating selected experiments from 53 high-impact papers published in 2010, 2011, and 2012 (Errington et al., 2014). We describe in a companion paper (Errington et al., 2021b) the challenges we encountered while repeating these experiments. … These challenges meant that we only completed 50 of the 193 experiments (26%) we planned to repeat. The 50 experiments that we were able to complete included a total of 158 effects that could be compared with the same effects in the original paper.”

“There is no single method for assessing the success or failure of replication attempts (Mathur and VanderWeele, 2019; Open Science Collaboration, 2015; Valentine et al., 2011), so we used seven different methods to compare the effect reported in the original paper and the effect observed in the replication attempt…”

“…136 of the 158 effects (86%) reported in the original papers were positive effects – the original authors interpreted their data as showing that a relationship between variables existed or that an intervention had an impact on the biological system being studied. The other 22 (14%) were null effects – the original authors interpreted their data as not showing evidence for a meaningful relationship or impact of an intervention.”

“Furthermore, 117 of the effects reported in the original papers (74%) were supported by a numerical result (such as graphs of quantified data or statistical tests), and 41 (26%) were supported by a representative image or similar. For effects where the original paper reported a numerical result for a positive effect, it was possible to use all seven methods of comparison. However, for cases where the original paper relied on a representative image (without a numerical result) as evidence for a positive effect, or when the original paper reported a null effect, it was not possible to use all seven methods.”

Summarizing replications across five criteria

“To provide an overall picture, we combined the replication rates by five of these criteria, selecting criteria that could be meaningfully applied to both positive and null effects … The five criteria were: (i) direction and statistical significance (p < 0.05); (ii) original effect size in replication 95% confidence interval; (iii) replication effect size in original 95% confidence interval; (iv) replication effect size in original 95% prediction interval; (v) meta-analysis combining original and replication effect sizes is statistically significant (p < 0.05).”
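To make the five criteria more tangible, the Python sketch below evaluates them for one original effect and its replication, using simple normal approximations. It is an illustration under our own assumptions (effect sizes with known standard errors, a fixed-effect meta-analysis, and an original positive effect). It is not the authors' code, and their exact calculations may differ.

```python
import math

Z95 = 1.96  # normal critical value for 95% intervals

def two_sided_p(estimate, se):
    """Two-sided p-value under a normal approximation."""
    return math.erfc(abs(estimate / se) / math.sqrt(2))

def interval(centre, se):
    return centre - Z95 * se, centre + Z95 * se

def inside(value, bounds):
    return bounds[0] <= value <= bounds[1]

def five_criteria(orig, se_orig, rep, se_rep):
    """Evaluate the five criteria for an original positive effect."""
    w_orig, w_rep = 1 / se_orig ** 2, 1 / se_rep ** 2
    pooled = (w_orig * orig + w_rep * rep) / (w_orig + w_rep)
    pooled_se = math.sqrt(1 / (w_orig + w_rep))
    prediction_se = math.sqrt(se_orig ** 2 + se_rep ** 2)
    return {
        "i_direction_and_significance":
            orig * rep > 0 and two_sided_p(rep, se_rep) < 0.05,
        "ii_original_in_replication_ci": inside(orig, interval(rep, se_rep)),
        "iii_replication_in_original_ci": inside(rep, interval(orig, se_orig)),
        "iv_replication_in_original_pi": inside(rep, interval(orig, prediction_se)),
        "v_meta_analysis_significant": two_sided_p(pooled, pooled_se) < 0.05,
    }

# Hypothetical effect sizes: a close replication and a null replication.
close = five_criteria(0.5, 0.1, 0.45, 0.1)
null_result = five_criteria(0.5, 0.1, 0.0, 0.1)
print(sum(close.values()), sum(null_result.values()))
```

The second example shows why the criteria need to be read together: a replication that finds nothing can still pass the meta-analytic criterion when the original estimate is large and precise.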

FIGURE 6. Investigating the replicability of preclinical cancer biology

“For replications of original positive effects, 13 of 97 (13%) replications succeeded on all five criteria, 15 succeeded on four, 11 succeeded on three, 22 failed on three, 15 failed on four, and 21 (22%) failed on all five (… Figure 6).”

“For original null effects, 7 of 15 (47%) replications succeeded on all five criteria, 2 succeeded on four, 3 succeeded on three, 0 failed on three, 2 failed on four, and 1 (7%) failed on all five.”

“… Combining positive and null effects, 51 of 112 (46%) replications succeeded on more criteria than they failed, and 61 (54%) replications failed on more criteria than they succeeded.”
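The combined figures follow directly from the counts quoted above, as this short Python check shows. A replication that succeeded on three or more of the five criteria is counted as succeeding on more criteria than it failed.

```python
# Number of replications by how many of the five criteria they met,
# taken from the counts quoted above ("failed on three" means two met).
positive = {5: 13, 4: 15, 3: 11, 2: 22, 1: 15, 0: 21}
null = {5: 7, 4: 2, 3: 3, 2: 0, 1: 2, 0: 1}

def majority_success(counts):
    return sum(n for met, n in counts.items() if met >= 3)

total = sum(positive.values()) + sum(null.values())
succeeded = majority_success(positive) + majority_success(null)
failed = total - succeeded

print(f"{succeeded} of {total} ({succeeded / total:.0%}) succeeded on more criteria")
print(f"{failed} of {total} ({failed / total:.0%}) failed on more criteria")
```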

“We explored five candidate moderators of replication success and did not find strong evidence to indicate that any of them account for variation in replication rates we observed in our sample. The clearest indicator of replication success was that smaller effects were less likely to replicate than larger effects … Research into replicability in other disciplines has also found that findings with stronger initial evidence (such as larger effect sizes and/or smaller p-values) is more likely to replicate (Nosek et al., 2021; Open Science Collaboration, 2015).”

“The present study provides substantial evidence about the replicability of findings in a sample of high-impact papers published in the field of cancer biology in 2010, 2011, and 2012. The evidence suggests that replicability is lower than one might expect of the published literature. Causes of non-replicability could be due to factors in conducting and reporting the original research, conducting the replication experiments, or the complexity of the phenomena being studied. The present evidence cannot parse between these possibilities…”

“Stakeholders from across the research community have been raising concerns and generating evidence about dysfunctional incentives and research practices that could slow the pace of discovery. This paper is just one contribution to the community’s self-critical examination of its own practices.”

“Science pursuing and exposing its own flaws is just science being science. Science is trustworthy because it does not trust itself. Science earns that trustworthiness through publicly, transparently, and continuously seeking out and eradicating error in its own culture, methods, and findings. Increasing awareness and evidence of the deleterious effects of reward structures and research practices will spur one of science’s greatest strengths, self-correction.”

To read the full article, click here.


Begley CG, Ellis LM. 2012. Drug development: Raise standards for preclinical cancer research. Nature 483: 531–533.

Camerer CF, Dreber A, Forsell E, Ho TH, Huber J, Johannesson M, Kirchler M, Almenberg J, Altmejd A, Chan T, Heikensten E, Holzmeister F, Imai T, Isaksson S, Nave G, Pfeiffer T, Razen M, Wu H. 2016. Evaluating replicability of laboratory experiments in economics. Science 351: 1433–1436.

Camerer CF, Dreber A, Holzmeister F, Ho T-H, Huber J, Johannesson M, Kirchler M, Nave G, Nosek BA, Pfeiffer T, Altmejd A, Buttrick N, Chan T, Chen Y, Forsell E, Gampa A, Heikensten E, Hummer L, Imai T, Isaksson S, et al. 2018. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour 2: 637–644.

Ebersole CR, Mathur MB, Baranski E, Bart-Plange D-J, Buttrick NR, Chartier CR, Corker KS, Corley M, Hartshorne JK, IJzerman H, Lazarević LB, Rabagliati H, Ropovik I, Aczel B, Aeschbach LF, Andrighetto L, Arnal JD, Arrow H, Babincak P, Bakos BE, et al. 2020. Many Labs 5: Testing Pre-Data-Collection Peer Review as an Intervention to Increase Replicability. Advances in Methods and Practices in Psychological Science 3: 309–331.

Ebersole CR, Atherton OE, Belanger AL, Skulborstad HM, Allen JM, Banks JB, Baranski E, Bernstein MJ, Bonfiglio DBV, Boucher L, Brown ER, Budiman NI, Cairo AH, Capaldi CA, Chartier CR, Chung JM, Cicero DC, Coleman JA, Conway JG, Davis WE, et al. 2016. Many Labs 3: Evaluating participant pool quality across the academic semester via replication. Journal of Experimental Social Psychology 67: 68–82.

Errington TM, Iorns E, Gunn W, Tan FE, Lomax J, Nosek BA. 2014. An open investigation of the reproducibility of cancer biology research. eLife 3: e04333.

Errington TM, Denis A, Perfito N, Iorns E, Nosek BA. 2021b. Challenges for assessing replicability in preclinical cancer biology. eLife 10: e67995.

Klein RA, Ratliff KA, Vianello M, Adams RB, Bahník Š, Bernstein MJ, Bocian K, Brandt MJ, Brooks B, Brumbaugh CC, Cemalcilar Z, Chandler J, Cheong W, Davis WE, Devos T, Eisner M, Frankowska N, Furrow D, Galliani EM, Nosek BA. 2014. Investigating Variation in Replicability: A “Many Labs” Replication Project. Social Psychology 45: 142–152.

Klein RA, Vianello M, Hasselman F, Adams BG, Adams RB, Alper S, Aveyard M, Axt JR, Babalola MT, Bahník Š, Batra R, Berkics M, Bernstein MJ, Berry DR, Bialobrzeska O, Binan ED, Bocian K, Brandt MJ, Busching R, Rédei AC, et al. 2018. Many Labs 2: Investigating Variation in Replicability Across Samples and Settings. Advances in Methods and Practices in Psychological Science 1: 443–490.

Mathur MB, VanderWeele TJ. 2019. Challenges and suggestions for defining replication “success” when effects may be heterogeneous: Comment on Hedges and Schauer (2019). Psychological Methods 24: 571–575.

Nosek BA, Hardwicke TE, Moshontz H, Allard A, Corker KS, Dreber A, Fidler F, Hilgard J, Struhl MK, Nuijten MB, Rohrer JM, Romero F, Scheel AM, Scherer LD, Schönbrodt FD, Vazire S. 2021. Replicability, Robustness, and Reproducibility in Psychological Science. Annual Review of Psychology 73: 114157.

Open Science Collaboration. 2015. Estimating the reproducibility of psychological science. Science 349: aac4716.

Perrin S. 2014. Preclinical research: Make mouse studies work. Nature 507: 423–425.

Prinz F, Schlange T, Asadullah K. 2011. Believe it or not: how much can we rely on published data on potential drug targets? Nature Reviews Drug Discovery 10: 712.

Steward O, Popovich PG, Dietrich WD, Kleitman N. 2012. Replication and reproducibility in spinal cord injury research. Experimental Neurology 233: 597–605.

Valentine JC, Biglan A, Boruch RF, Castro FG, Collins LM, Flay BR, Kellam S, Mościcki EK, Schinke SP. 2011. Replication in prevention science. Prevention Science 12: 103–117.

Want to Be In on the Final Stage of the SCORE Project? A Call for Collaborators from COS

The SCORE project is entering its final phase of conducting reproductions (repeating the original analysis with original data) and replications (testing the same claim with new data) on a stratified random sample of claims from papers across the social-behavioral sciences.

Here is your opportunity to contribute to these efforts for non-human subjects research (non-HSR) work, meaning projects that use existing data and do not require any additional IRB review steps. We’d love to have your collaboration! 

This spreadsheet contains all of the non-HSR projects available, with different tabs corresponding to different project categories.

This announcement outlines the different project categories at a high level. More details can be found in this explanation of our terminology, as well as specific instructions tailored for replication projects (DARs) and reproduction projects.

There are six project categories, depending on two factors: the type of data used and the number of claims selected.

Data types: (i) Datasets provided by the author (PBR/ADR), (ii) Original observations reconstructed from the underlying data sources (SDR), and (iii) New observations that were not already analyzed in the original article (DAR)

Number of claims: (i) Just one claim per article (single-trace) and (ii) A minimum of five claims per article (bushel), unless fewer are available.

We have identified the most feasible projects in separate tabs. We highly encourage collaborators to review these projects first. Projects are highly feasible if we already have the data in hand, or if we’ve identified the data sources as relatively simple to obtain.

Finally, we have included a column in each tab labeled ‘high_incentive,’ coded as yes or no. If a project is coded as ‘yes,’ it means the payment for completing the project will be higher than other projects from the same category. The full set of payments can be found here.

Your next steps

Select one or more projects you anticipate being able to complete, by adding your name in the respective signup cell. Please consider whether you’ll be able to obtain the necessary materials before signing up.

After signing up but before submitting the commitment form, please confirm you can access all of the necessary materials to complete the project. If this requires author data that COS does not currently have but that you think could be made available, please do not reach out to the authors directly. Instead, please contact COS for assistance in getting in touch with the authors.

After you have obtained all of the materials necessary to begin your project, please complete the commitment form linked in the spreadsheet, after which someone from COS will provide you access to an OSF project and preregistration form.

Please keep the following privacy statement in mind as you complete these steps: Other teams are making predictions about the outcomes of many different studies, not knowing which studies have been selected for replication/reproduction. As a consequence, the success of this project requires full confidentiality of the research process. This includes privacy about which studies have been selected for replication and all aspects of the discussion about these replication designs.

Et tu, Finance?

[Excerpts are taken from the article “The hidden ‘replication crisis’ of finance”, by Robin Wigglesworth, published at Financial Times online.]

“It may sound like a low-budget Blade Runner rip-off, but over the past decade the scientific world has been gripped by a “replication crisis” — the findings of many seminal studies cannot be repeated, with huge implications. Is investing suffering from something similar?”

“That is the incendiary argument of Campbell Harvey, professor of finance at Duke University. He reckons that at least half of the 400 supposedly market-beating strategies identified in top financial journals over the years are bogus. Worse, he worries that many fellow academics are in denial about this.”

“Harvey is not some obscure outsider or performative contrarian attempting to gain attention through needless controversy. He is the former editor of the Journal of Finance, a former president of the American Finance Association, and an adviser to investment firms like Research Affiliates and Man Group. He has written more than 150 papers on finance, several of which have won prestigious prizes.”

“To understand what the ‘replication crisis’ is, how it has happened and its implications for finance, it helps to start at its broader genesis. In 2005, Stanford medical professor John Ioannidis published a bombshell essay titled “Why Most Published Research Findings Are False”, which noted that the results of many medical research papers could not be replicated by other researchers. Subsequently, several other fields have turned a harsh eye on themselves and come to similar conclusions. The heart of the issue is a phenomenon that researchers call “p-hacking”.”

“P-hacking is when researchers overtly or subconsciously twist the data to find a superficially compelling but ultimately spurious relationship between variables. It can be done by cherry-picking what metrics to measure, or subtly changing the time period used. Just because something is narrowly statistically significant, does not mean it is actually meaningful. A trading strategy that looks golden on paper might turn up nothing but lumps of coal when actually implemented.”
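A back-of-the-envelope calculation shows why cherry-picking among many metrics or time periods is so effective. The sketch below assumes the candidate tests are independent and that no true effect exists in any of them, which is a simplification (real specifications are usually correlated); the function name is illustrative.

```python
def prob_any_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n independent true-null tests has p < alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


for n in (1, 5, 20, 100):
    print(n, round(prob_any_false_positive(n), 3))
```

Under these assumptions, trying 20 specifications gives roughly a 64% chance of at least one "significant" result even when nothing is there.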

“AQR, a prominent quant investment group, is also sceptical that there are hundreds of durable and successful factors that can help investors beat markets, but argues that the “replication crisis” brouhaha is overdone. Earlier this year it published a paper that concluded that not only could the majority of the studies it examined be replicated, they still worked “out of sample” — in actual live trading — and were actually further corroborated by international data.”

“Harvey is unconvinced by the riposte, and will square up to the AQR paper’s authors at the American Finance Association’s annual meeting in early January. “That’s going to be a very interesting discussion,” he promises.”

To read the full article, click here.

IREE Scores a Top Score in TOP Factor

The International Journal for Re-Views in Empirical Economics (IREE) is the only journal in economics solely dedicated to publishing replications. Recently, IREE was evaluated by TOP Factor. TOP Factor is an initiative launched by the Center for Open Science to assess journals according to “a values-aligned rating of journal policies as a counterweight to metrics that incentivize journals to publish exciting results regardless of credibility” (see here). The assessment of IREE’s journal policies resulted in a journal score of 13 points. This puts IREE in 4th place among all 136 economics journals rated by TOP Factor, ahead of the American Economic Review, Econometrica, PLOS ONE, and Science.

TOP Factor provides an alternative to metrics such as the journal impact factor (JIF). It constitutes a first step towards evaluating journals based on their quality of process and implementation of scholarly values. “Too often, journals are compared using metrics that have nothing to do with their quality,” says Evan Mayo-Wilson, Associate Professor in the Department of Epidemiology and Biostatistics at Indiana University School of Public Health-Bloomington. “The TOP Factor measures something that matters. It compares journals based on whether they require transparency and methods that help reveal the credibility of research findings.” (see COS announcement of TOP factor, 2020).

TOP Factor is based on the Transparency and Openness Promotion (TOP) Guidelines, a framework of eight standards that summarize behaviors that can improve transparency and reproducibility of research such as transparency of data, materials, code, and research design, preregistration, and replication.

Editor Martina Grunow announced that she was very pleased with this rating, as TOP Factor reflects exactly what IREE stands for: reducing the publication bias towards literally incredible and non-reproducible results and the resulting “publish-or-perish” culture. Like TOP Factor, IREE promotes the reproducibility and transparency of published results and scientific discourse in economics based on high-quality and credible research results.

AIMOS 2021 Is Happening. You Can Be a Part of It.

Registration is now open for the 3rd annual Association for Interdisciplinary Metaresearch and Open Science Conference (aimos 2021), to be held Tuesday 30 Nov – Friday 3 Dec 2021!

aimos 2021 will offer the opportunity for researchers from many fields – psychology, ecology, medicine, biology, economics, statistics, philosophy, social studies of science, and more, to talk about how we do research, and how we can improve it.  

This year’s conference includes plenary lectures opened by Professor Brian Nosek reflecting on the last 10 years of metaresearch, and closed by Dr. Rose O’Dea on the future of metaresearch. The program will also explore other areas of metaresearch including notable plenary lectures from Prof. Sarah de Rijcke and Dr. Julia Rohrer.

In addition to invited speaker sessions, aimos 2021 will open submissions for: discussions about norms, practices, and cultures within science and scholarship more broadly; learning and development, through practical skills workshops in open science (e.g., R, lab notebooks, pre-registration); and getting things done in hackathons.

Visit the Conference website to check out the schedule and speakers, submit your proposal and express your interest in attending!

Let’s connect at aimos 2021!

Fudging Data About Dishonesty

[Excerpts are taken from the blog “Evidence of Fraud in an Influential Field Experiment About Dishonesty” posted by Uri Simonsohn, Joe Simmons, Leif Nelson and anonymous researchers at Data Colada]

“This post is co-authored with a team of researchers who have chosen to remain anonymous. They uncovered most of the evidence reported in this post.”

“In 2012, Shu, Mazar, Gino, Ariely, and Bazerman published a three-study paper in PNAS reporting that dishonesty can be reduced by asking people to sign a statement of honest intent before providing information (i.e., at the top of a document) rather than after providing information (i.e., at the bottom of a document).”

“In 2020, Kristal, Whillans, and the five original authors published a follow-up in PNAS entitled, “Signing at the beginning versus at the end does not decrease dishonesty”.”

“Our focus here is on Study 3 in the 2012 paper, a field experiment (N = 13,488) conducted by an auto insurance company … under the supervision of the fourth author. Customers were asked to report the current odometer reading of up to four cars covered by their policy.”

“The authors of the 2020 paper did not attempt to replicate that field experiment, but they did discover an anomaly in the data…our story really starts from here, thanks to the authors of the 2020 paper, who posted the data of their replication attempts and the data from the original 2012 paper.”

“A team of anonymous researchers downloaded it, and discovered … very strong evidence that the data were fabricated.”

“Let’s start by describing the data file. Below is a screenshot of the first 12 observations:”

“You can see variables representing the experimental condition, a masked policy number, and two sets of mileages for up to four cars. The “baseline_car[x]” columns contain the mileage that had been previously reported for the vehicle x (at Time 1), and the “update_car[x]” columns show the mileage reported on the form that was used in this experiment (at Time 2).”

“On to the anomalies.”

Anomaly #1: Implausible Distribution of Miles Driven

“Let’s first think about what the distribution of miles driven should look like…we might expect…some people drive a whole lot, some people drive very little, and most people drive a moderate amount.”

“As noted by the authors of the 2012 paper, it is unknown how much time elapsed between the baseline period (Time 1) and their experiment (Time 2), and it was reportedly different for different customers. … It is therefore hard to know what the distribution of miles driven should look like in those data.”

“It is not hard, however, to know what it should not look like. It should not look like this:”

“First, it is visually and statistically (p=.84) indistinguishable from a uniform distribution ranging from 0 miles to 50,000 miles. Think about what that means. Between Time 1 and Time 2, just as many people drove 40,000 miles as drove 20,000 as drove 10,000 as drove 1,000 as drove 500 miles, etc. This is not what real data look like, and we can’t think of a plausible benign explanation for it.”
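One common way to compare a sample with a uniform distribution is a one-sample Kolmogorov-Smirnov test. The excerpt does not say which test produced p=.84, so the choice of test here is an assumption, and the data below are deterministic stand-ins rather than the insurance data. The sketch uses only the standard library and an asymptotic approximation for the p-value.

```python
from math import exp, sqrt


def ks_uniform(values, low, high):
    """One-sample Kolmogorov-Smirnov test against Uniform(low, high).

    Returns (D, approximate p-value). The p-value uses the asymptotic
    Kolmogorov series with a standard small-sample correction.
    """
    xs = sorted(values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        # Uniform CDF, clipped to [0, 1] for values outside the range
        cdf = min(1.0, max(0.0, (x - low) / (high - low)))
        # Largest gap between the empirical and the theoretical CDF
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    lam = (sqrt(n) + 0.12 + 0.11 / sqrt(n)) * d
    if lam < 0.3:
        # The alternating series converges slowly here; p is essentially 1
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(1.0, max(0.0, p))


n = 1000
# Deterministic stand-ins for real data (not the insurance data):
even = [50000 * (i - 0.5) / n for i in range(1, n + 1)]           # evenly spread
skewed = [50000 * ((i - 0.5) / n) ** 2 for i in range(1, n + 1)]  # mostly low mileage

print(ks_uniform(even, 0, 50000))
print(ks_uniform(skewed, 0, 50000))
```

A large p-value, as for the evenly spread sample, means the test cannot distinguish the data from a uniform distribution; the skewed sample, which is closer to what one might expect of real driving, is rejected decisively.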

“Second, there is some weird stuff happening with rounding…”

Anomaly #2: No Rounded Mileages At Time 2

“The mileages reported in this experiment … are what people wrote down on a piece of paper. And when real people report large numbers by hand, they tend to round them.”

“Of course, in this case some customers may have looked at their odometer and reported exactly what it displayed. But undoubtedly many would have ballparked it and reported a round number.”

“In fact, as we are about to show you, in the baseline (Time 1) data, there are lots of rounded values.”

“But random number generators don’t round. And so if, as we suspect, the experimental (Time 2) data were generated with the aid of a random number generator (like RANDBETWEEN(0,50000)), the Time 2 mileage data would not be rounded.”

“The figure shows that while multiples of 1,000 and 100 were disproportionately common in the Time 1 data, they weren’t more common than other numbers in the Time 2 data.”

“These data are consistent with the hypothesis that a random number generator was used to create the Time 2 data.”
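The rounding check described above amounts to counting how often values are exact multiples of 100 or 1,000. If integers were drawn uniformly at random, only about 1 in 100 would be a multiple of 100 and about 1 in 1,000 a multiple of 1,000. The sketch below uses made-up mileages and a hypothetical helper name; it is not the authors' code.

```python
def share_multiples(values, base):
    """Fraction of values that are exact multiples of `base`."""
    return sum(1 for v in values if v % base == 0) / len(values)


# Made-up mileages for illustration only (not the insurance data).
hand_reported = [42000, 61500, 38217, 75000, 12300, 90000, 54321, 33000]
generated = [41873, 61549, 38217, 74996, 12311, 89952, 54321, 33017]

for name, data in (("hand_reported", hand_reported), ("generated", generated)):
    print(name, share_multiples(data, 1000), share_multiples(data, 100))
```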

“In the next section we will see that even the Time 1 data were tampered with.”

Interlude: Calibri and Cambria

“Perhaps the most peculiar feature of the dataset is the fact that the baseline data for Car #1 in the posted Excel file appears in two different fonts. Specifically, half of the data in that column are printed in Calibri, and half are printed in Cambria.”

“The analyses we have performed on these two fonts provide evidence of a rather specific form of data tampering.”

“We believe the dataset began with the observations in Calibri font. Those were then duplicated using Cambria font. In that process, a random number from 0 to 1,000 (e.g., RANDBETWEEN(0,1000)) was added to the baseline (Time 1) mileage of each car, perhaps to mask the duplication.”

“In the next two sections, we review the evidence for this particular form of data tampering.”

Anomaly #3: Near-Duplicate Calibri and Cambria Observations

“…the baseline mileages for Car #1 appear in Calibri font for 6,744 customers in the dataset and Cambria font for 6,744 customers in the dataset. So exactly half are in one font, and half are in the other. For the other three cars, there is an odd number of observations, such that the split between Cambria and Calibri is off by exactly one (e.g., there are 2,825 Calibri rows and 2,824 Cambria rows for Car #2).”

“… each observation in Calibri tends to match an observation in Cambria.”

“To understand what we mean by “match” take a look at these two customers:”

“The top customer has a “baseline_car1” mileage written in Calibri, whereas the bottom’s is written in Cambria. For all four cars, these two customers have extremely similar baseline mileages.”

“Indeed, in all four cases, the Cambria’s baseline mileage is (1) greater than the Calibri mileage, and (2) within 1,000 miles of the Calibri mileage. Before the experiment, these two customers were like driving twins.”

“Obviously, if this were the only pair of driving twins in a dataset of more than 13,000 observations, it would not be worth commenting on. But it is not the only pair.”

“There are 22 four-car Calibri customers in the dataset. All of them have a Cambria driving twin…there are twins throughout the data, and you can easily identify them for three-car, two-car, and unusual one-car customers, too.”
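The "driving twins" pattern can be searched for mechanically: for each Calibri customer, look for a Cambria customer with the same number of cars whose every baseline mileage is higher by no more than 1,000 miles. The sketch below is one simple way to do that on made-up rows; the row layout and function name are assumptions, and it is not the procedure the anonymous researchers used.

```python
def find_twins(calibri_rows, cambria_rows, max_gap=1000):
    """Pair each Calibri row with Cambria rows whose every baseline mileage
    is higher by between 0 and `max_gap` miles.

    Rows are tuples of baseline mileages, one entry per car (hypothetical layout).
    """
    pairs = []
    for a in calibri_rows:
        for b in cambria_rows:
            if len(a) == len(b) and all(0 <= y - x <= max_gap for x, y in zip(a, b)):
                pairs.append((a, b))
    return pairs


# Made-up four-car customers for illustration only.
calibri = [(59000, 13200, 87500, 40210), (22000, 71000, 15500, 9800)]
cambria = [(59412, 13871, 87593, 41044), (30150, 64020, 15512, 10790)]

print(find_twins(calibri, cambria))
```

The nested loop compares every pair of rows, which is fine for a handful of four-car customers but would be slow across all 13,488 observations; sorting or bucketing by the first mileage would be a natural refinement.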

“To see a fuller picture of just how similar these Calibri and Cambria customers are, take a look at Figure 5, which shows the cumulative distributions of baseline miles for Car #1 and Car #4.”

“Within each panel, there are two lines, one for the Calibri distribution and one for the Cambria distribution. The lines are so on top of each other that it is easy to miss the fact that there are two of them:”

Anomaly #4: No Rounding in Cambria Observations

“As mentioned above, we believe that a random number between 0 and 1,000 was added to the Calibri baseline mileages to generate the Cambria baseline mileages. And as we have seen before, this process would predict that the Calibri mileages are rounded, but that the Cambria mileages are not.”

“This is indeed what we observe:”


“The evidence presented in this post indicates that the data underwent at least two forms of fabrication: (1) many Time 1 data points were duplicated and then slightly altered (using a random number generator) to create additional observations, and (2) all of the Time 2 data were created using a random number generator that capped miles driven, the key dependent variable, at 50,000 miles.”

“We have worked on enough fraud cases in the last decade to know that scientific fraud is more common than is convenient to believe… There will never be a perfect solution, but there is an obvious step to take: Data should be posted.” 

“The fabrication in this paper was discovered because the data were posted. If more data were posted, fraud would be easier to catch. And if fraud is easier to catch, some potential fraudsters may be more reluctant to do it. … All of our journals should require data posting.”

“Until that day comes, all of us have a role to play. As authors (and co-authors), we should always make all of our data publicly available. And as editors and reviewers, we can ask for data during the review process, or turn down requests to review papers that do not make their data available.”

“A field that ignores the problem of fraud, or pretends that it does not exist, risks losing its credibility. And deservedly so.”

To read the full blog, click here.