Replication is one key to ensuring the credibility of and confidence in research findings. Yet replication papers are rare in political science and related fields. Research & Politics welcomes replications as regular submissions and is happy to announce a call for abstracts for a special issue on methods, practices and ethics for replication.
This special issue is in collaboration with the Institute for Replication and particularly welcomes three types of papers, without excluding other forms of replication.
First, we invite reproduction or replication of prominent research, especially studies that are frequently cited or used in policy making. We particularly encourage replication/reproduction manuscripts that advance theory or methods in some way beyond just the specific application. For example, papers of this type might examine whether new data and methods lead to new insights, and if so, why.
Second, we welcome multiple replications or reproductions on a given topic. This type of meta-replication may, for instance, place the replicability of an individual study in the context of related studies. Another possibility is to combine studies that use the same data but different methods, or the same methods but different data, and investigate whether differences in findings can be explained.
Third, we encourage theoretical papers that consider incentives, norms and ethics that can guide the practice of replication.
Abstracts should be submitted as a normal submission through the Research & Politics ScholarOne submission site. In the cover letter, please indicate that you wish the abstract to be considered for the special issue on replication.
Abstracts should be less than 1 page long. Proposals that do not meet our editorial criteria will receive only a brief reply. Proposals that appear to have potential for publication will receive more detailed feedback and may be invited to submit a complete paper.
If selected for this second stage, full papers will be peer reviewed, with the process managed by the guest editors.
The submission deadline is September 15, 2022.
For more information, contact Abel Brodeur, abrodeur@uottawa.ca.
The Multi100 project is a crowdsourced empirical project aiming to estimate how robust published results and conclusions in the social and behavioral sciences are to analysts’ choices of analytical approach. The project will involve more than 200 researchers.
The Center for Open Science is currently looking to finalize teams of researchers to undertake the work. They are particularly interested in recruiting grad and post-grad data analysts in Economics, International Relations, and Political Science to join the project.
And in case contributing to science isn’t reward enough, analysts can become co-authors and receive compensation.
For more info, click here.
[* EiR = Econometrics in Replications, a feature of TRN that highlights useful econometrics procedures for re-analysing existing research.]
NOTE: This blog uses Stata for its estimation. All the data and code necessary to reproduce the results in the tables below are available at Harvard’s Dataverse: click here.
Missing data is ubiquitous in economics. Standard practice is to drop observations for which any variables have missing values. At best, this can result in diminished power to identify effects. At worst, it can generate biased estimates. Old-fashioned ways to address missing data assigned values using some form of interpolation or imputation. For example, gaps in time series data might be filled using linear interpolation, and missing values in cross-sectional data might be replaced with predicted values from a regression. These procedures are now known to be flawed (Allison, 2001; Enders, 2010).
The preferred way to deal with missing data is to use maximum likelihood (ML) or multiple imputation (MI), assuming the data are “missing at random”. Missing at random (MAR) essentially means that, conditional on the observed data, the probability that a value is missing does not depend on the (unobserved) value itself. For example, if a question about illicit drug use is more likely to go unanswered by respondents who use drugs, then those data would not be MAR. Assuming the data are MAR, both ML and MI produce estimates that are consistent and asymptotically efficient.
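In standard missing-data notation, with R the missingness indicator, Y_obs the observed data, and Y_mis the unobserved values, MAR can be written as Pr(R | Y_obs, Y_mis) = Pr(R | Y_obs). Missing completely at random (MCAR) is the stronger condition that missingness does not depend on the data at all, while the drug-use example above is a case of data missing not at random (MNAR), where missingness depends on the unobserved values themselves.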
ML is in principle the easiest to perform. In Stata, one can use the structural equation modelling command (“sem”) with the option “method(mlmv)”. That’s it! Unfortunately, the simplicity of ML is also its biggest disadvantage. For linear models, ML simultaneously estimates means, variances, and covariances while also accounting for the incomplete records associated with missing data. Not infrequently, this causes convergence problems. This is particularly a problem for panel data where one might have a large number of fixed effects.
In this blog, we illustrate how to apply both ML and MI to a well-cited study on mortality and inequality by Andrew Leigh and Christopher Jencks (L&J; Journal of Health Economics, 2007). Their analysis focused on the relationship between life expectancy and income inequality, measured by the share of pre-tax income going to the richest 10% of the population. Their data consisted of annual observations from 1960-2004 for Australia, Canada, France, Germany, Ireland, the Netherlands, New Zealand, Spain, Sweden, Switzerland, the UK, and the US. We use their study both because their data and code are publicly available, and because much of the original data were missing.
The problem is highlighted in TABLE 1, which uses a reconstruction of L&J’s original dataset. The full dataset has 540 observations. The dependent variable, “Life expectancy”, has approximately 11 percent missing values. The focal variable, “Income share of the richest 10%”, has approximately 24 percent missing values. The remaining control variables vary widely in their missingness. Real GDP has no missing values. Education has the most missing values, with fully 80% of the variable’s values missing. This is driven by the fact that the Barro and Lee data used to measure education only reports values at five-year intervals.
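A quick way to see this missingness pattern in Stata is the built-in “misstable” report. A minimal sketch, using the variable names that appear in the estimation and imputation commands later in this post:
* Report the number of missing values for each analysis variable
misstable summarize le ts10 gdp gdpsq edu phealth thealth im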
In fact, the problem is more serious than TABLE 1 indicates. If we run the regression using L&J’s specification (cf. Column 7, Table 4 in their study), we obtain the results in Column (1) of TABLE 2. The estimates indicate that a one-percentage-point increase in the income share of the richest 10% is associated with an increase in life expectancy of 0.003 years, an effect that is both economically negligible and statistically insignificant. Notably, this estimate is based on a mere 64 observations (out of 540).
These are not, however, the results that L&J reported in their study. No doubt because of the small number of observations, they used linear interpolation on some (but not all) of their data to fill in missing values. Applying their approach to our data yields the results in Column (2) of Table 2 below. There are two problems with this approach.
First, for various reasons, L&J did not fill in all the missing values. They ended up using only 430 out of a possible 540 observations. As a result, their estimates did not exploit all the information that was available to them. Second, interpolation replaces missing values with their predicted values without accounting for the randomness that occurs in real data. This biases standard errors, usually downwards. ML and MI allow one to do better.
ML is the easiest method to apply. To estimate the regression in Table 2 requires a one-line command:
sem (le <- ts10 gdp gdpsq edu phealth thealth id2-id12 year2-year45), method(mlmv) vce(cluster id)
The “sem” command calls up Stata’s structural equation modelling procedure. The option “method(mlmv)” tells Stata to use maximum likelihood to accommodate missing values. If this option is omitted, the command will produce results identical to those in Column (1) of Table 2, except that the standard errors will be slightly smaller.
While the simplicity of ML is a big advantage, it also introduces complications. Specifically, ML estimates all the parameters simultaneously. The inclusion of 11 country fixed effects and 44 year dummies makes the number of elements in the variance-covariance matrix huge. This, combined with the fact that ML simultaneously integrates over the distributions of variables to account for missing values, creates computational challenges. The ML procedure called up by the command above did not converge after 12 hours. As a result, we next turn to MI.
Unlike ML, MI fills in missing values with actual data. The imputed values are created to incorporate the randomness that occurs in real data. The most common MI procedure assumes that all of the variables are distributed multivariate normal. It turns out that this is a serviceable assumption even if the regression specification includes variables that are not normally distributed, like dummy variables (Horton et al., 2003; Allison, 2006).
As the name suggests, MI creates multiple datasets using a process of Monte Carlo simulation. Each of the datasets produces a separate set of estimates. These are then combined to produce one overall set of estimation results. Because each data set is created via a simulation process that depends on randomness, each dataset will be different. Furthermore, unless a random seed is set, different attempts will produce different results. This is one disadvantage of MI versus ML.
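The combination step follows Rubin’s rules (a standard result, stated here for reference): the overall point estimate is the average of the estimates across the M imputed datasets, and its total variance adds the between-imputation variance to the average within-imputation variance,
\bar{Q} = \frac{1}{M}\sum_{m=1}^{M}\hat{Q}_m, \qquad T = \bar{U} + \Big(1 + \frac{1}{M}\Big)B, \qquad B = \frac{1}{M-1}\sum_{m=1}^{M}\big(\hat{Q}_m - \bar{Q}\big)^2,
where \hat{Q}_m and U_m are the estimate and its sampling variance from the m-th imputed dataset and \bar{U} is the average of the U_m. Stata’s “mi estimate” applies these rules automatically; the between-imputation component B is also why results differ across runs unless a seed is set.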
A second disadvantage is that MI requires a number of subjective judgments to set key parameters. The key parameters are (i) the “burnin”, the number of initial MCMC iterations (imputed datasets) discarded before the first dataset is retained; (ii) the “burnbetween”, the number of intervening iterations discarded between retained datasets to maintain their independence; and (iii) the total number of imputed datasets retained for analysis.
The first two parameters are related to the properties of “stationarity” and “independence”. The analogue to convergence in estimated parameters in ML is convergence in distributions in MI. To assess these two properties we first do a trial run of imputations.
The command “mi impute mvn” identifies the variables with missing values to the left of the “=” sign, while the variables to the right are identified as being complete.
mi impute mvn le ts10 edu phealth thealth im = gdp gdpsq id2-id12 year2-year45, prior(jeffreys) mcmconly rseed(123) savewlf(wlf, replace)
The option “mcmconly” lets Stata know that we are not retaining the datasets for subsequent analysis, but only using them to assess their characteristics.
The option “rseed(123)” ensures that we will obtain the same data every time we run this command.
The option “prior(jeffreys)” sets the prior for the posterior predictive distribution used to generate the imputed datasets to be “noninformative”. This makes the distribution used to impute the missing values determined by the observed data rather than by prior information.
Lastly, the option “savewlf(wlf, replace)” creates an aggregate variable called the “worst linear function” that allows one to investigate whether the imputed datasets are stationary and independent.
Note that Stata sets the default values for “burnin” and “burnbetween” at 100 and 100.
The next set of key commands are given below.
use wlf, clear
tsset iter
tsline wlf, ytitle(Worst linear function) xtitle(Burn-in period) name(stable2,replace)
ac wlf, title(Worst linear function) ytitle(Autocorrelations) note("") name(ac2,replace)
The “tsline” command produces a “time series” graph of the “worst linear function”, where “time” is measured by the number of simulated datasets. We are looking for trends in the data. That is, do the estimated parameters (which include elements of the variance-covariance matrix) tend to depart systematically from their overall mean?
The graph above is somewhat concerning because it appears to first trend up and then trend down. As a result, we increase the “burnin” value to 500 from its default value of 100 with the following command. Why 500? We somewhat arbitrarily choose a number that is substantially larger than the previous “burnin” value.
mi impute mvn le ts10 edu phealth thealth im = gdp gdpsq id2-id12 year2-year45, prior(jeffreys) mcmconly burnin(500) rseed(123) savewlf(wlf, replace)
…
use wlf, clear
tsset iter
tsline wlf, ytitle(Worst linear function) xtitle(Burn-in period) name(stable2,replace)
ac wlf, title(Worst linear function) ytitle(Autocorrelations) note("") name(ac2,replace)
This looks a lot better. The trending that is apparent in the first half of the graph is greatly reduced in the second half. We subjectively determine that this demonstrates sufficient “stationarity” to proceed. Note that there is no formal test to determine stationarity.
The next thing is to check for independence. The posterior distributions used to impute the missing values rely on Bayesian updating. While our use of the Jeffreys prior reduces the degree to which contiguous imputed datasets are related, there is still the opportunity for correlation across datasets. The “ac” command produces a correlogram of the “worst linear function” that allows us to assess independence. This is produced below.
This correlogram indicates that as long as we retain imputed datasets that are at least “10 datasets apart”, we should be fine. The default value of 100 for “burnbetween” is thus more than sufficient.
The remaining parameter to be set is the total number of imputed datasets to use for analysis. For this we use a handy, user-written Stata (and SAS) command from von Hippel (2020) called “how_many_imputations”.
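“how_many_imputations” is not part of official Stata, so it needs to be installed first. We assume here that it is available from SSC (if not, “findit how_many_imputations” will locate it):
* Install the von Hippel (2020) command for choosing the number of imputations
ssc install how_many_imputations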
The problem with random data is that it produces different results each time. “how_many_imputations” allows us to set the number of imputations so that the variation in estimates will remain within some pre-determined threshold value. The default value is to set the number of imputations so that the coefficient of variation of the standard error of the “worst linear function” is equal to 5%.
It works like this. First we create a small initial set of imputed datasets. The command below imputes 10 datasets (“add(10)”).
mi impute mvn le ts10 edu phealth thealth im = gdp gdpsq id2-id12 year2-year45, prior(jeffreys) burnin(500) add(10) rseed(123)
We then estimate a fixed effects regression for each of the 10 datasets. Note that we use the standard Stata command “xtreg, fe” after the “mi estimate:” prefix:
mi xtset id year
mi estimate: xtreg le ts10 gdp gdpsq edu phealth thealth year2-year45, fe vce(cluster id)
how_many_imputations
The command “how_many_imputations” determines the number of imputed datasets needed for the standard errors to have a coefficient of variation of 5%. In this particular case, the output is given by:
The output says to create 182 more imputed datasets.
We can feed this number directly into the “mi impute” command using the “add(`r(add_M)')” option:
mi impute mvn le ts10 edu phealth thealth im = gdp gdpsq id2-id12 year2-year45, prior(jeffreys) burnin(500) add(`r(add_M)')
After running the command above, our stored data consist of 104,220 observations: the original 540 observations plus 192 imputed datasets (the initial 10 plus the additional 182) × 540 observations per dataset. To combine the individual estimates from each dataset to get an overall estimate, we use the following commands:
mi xtset id year
mi estimate: xtreg le ts10 gdp gdpsq edu phealth thealth year2-year45, fe vce(cluster id)
The results are reported in Column (3) of Table 2. A further check re-runs the program using different seed numbers. These show little variation, confirming the robustness of the results.
In calculating standard errors, Table 2 follows L&J’s original procedure of estimating standard errors clustered on countries. One might want to improve on this given that their sample only included 12 countries.
An alternative to the “mi estimate” command above is to use a user-written program that does wild cluster bootstrapping. One such package is “wcbregress”. While not all user-written programs can be accommodated by Stata’s “mi estimate”, one can use “wcbregress” by modifying the “mi estimate” command as follows:
mi estimate, cmdok: wcbregress le ts10 gdp gdpsq edu phealth thealth year2-year45, group(id)
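The “cmdok” option is what allows “mi estimate” to run an estimation command it does not officially support. Because “wcbregress” is user-written, it must be installed first; we assume here that it is hosted on SSC (if not, “findit wcbregress” will locate it):
* Install the wild cluster bootstrap regression command (user-written)
ssc install wcbregress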
A comparison of Columns (3) and (1) reveals what we have to show for all our work. Increasing the number of observations substantially reduced the sizes of the standard errors. The standard error of the focal variable, “Income share of the richest 10%”, decreased from 0.051 to 0.035.
While the estimated coefficient remained statistically insignificant for this variable, the smaller standard errors boosted two other variables into significance: “Real GDP per capita squared” and “Log public health spending pc”. Furthermore, the larger sample provides greater confidence that the estimated coefficients are representative of the population from which we are sampling.
Overall, the results provide further support for L&J’s claim that the relationship between inequality and mortality is small and statistically insignificant.
Given that ML and MI estimation procedures are now widely available in standard statistical packages, they should be part of the replicator’s standard toolkit for robustness checking of previously published research.
Weilun (Allen) Wu is a PhD student in economics at the University of Canterbury. This blog covers some of the material that he has researched for his thesis. Bob Reed is Professor of Economics and the Director of UCMeta at the University of Canterbury. They can be contacted at weilun.wu@pg.canterbury.ac.nz and bob.reed@canterbury.ac.nz, respectively.
REFERENCES
Allison, P. D. (2001). Missing data. Sage Publications.
Enders, C. K. (2010). Applied missing data analysis. Guilford Press.
Leigh, A., & Jencks, C. (2007). Inequality and mortality: Long-run evidence from a panel of countries. Journal of Health Economics, 26(1), 1-24.
Horton, N. J., Lipsitz, S. R., & Parzen, M. (2003). A potential for bias when rounding in multiple imputation. The American Statistician, 57(4), 229-232.
Allison, P. (2006, August). Multiple imputation of categorical variables under the multivariate normal model. In Annual Meeting of the American Sociological Association, Montreal.
Von Hippel, P. T. (2020). How many imputations do you need? A two-stage calculation using a quadratic rule. Sociological Methods & Research, 49(3), 699-718.
[Excerpts are taken from the article “Does Losing Lead to Winning? An Empirical Analysis for Four Sports” by Bouke Klein Teeselink, Martijn J. van den Assem, and Dennie van Dolder, forthcoming in Management Science.]
“In an influential paper, Berger and Pope (2011, henceforth BP) argue that lagging behind halfway through a competition does not necessarily imply a lower likelihood of winning, and that being slightly behind can actually increase the chance of coming out on top.”
“To test this hypothesis, BP analyze more than sixty thousand professional and collegiate basketball matches. Their main analyses focus on the score difference at half-time because the relatively long break allows players to reflect on their position relative to their opponent.”
“BP find that National Basketball Association (NBA) teams that are slightly behind are between 5.8 and 8.0 percentage points more likely to win the match than those that are slightly ahead.”
“The present paper…extends the analysis of BP to large samples of Australian football, American football, and rugby matches, and then revisits the analysis of basketball.”
“In our main analyses, the running variable is the score difference at half-time and the cutoff value is zero. We estimate the following regression model:”
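[The equation is not reproduced in this excerpt; based on the definitions that follow, the specification is the standard sharp regression-discontinuity model: Yi = α + τTi + βXi + γ(Ti × Xi) + εi.]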
“where Yi is an indicator variable that takes the value of 1 if team i wins the match, and Xi is the half-time score difference between team i and the opposing team.”
“The treatment variable Ti takes the value of 1 if team i is behind at half-time. The coefficient τ represents the discontinuity in the winning probability at a zero score difference. This coefficient is positive under the hypothesis that being slightly behind improves performance. The interaction term Ti × Xi allows for different slopes above and below the cutoff.”
“If the assumption of a piecewise linear relationship between the winning probability and the half-time score difference is violated, then the regression model will generate a biased estimate of the treatment effect. Hahn et al. (2001) propose the use of local-linear regression to solve this problem.”
“Even if the true relationship is non-linear, a linear specification can provide a close approximation within a limited bandwidth around the cutoff. A downside of this solution is that it reduces the effective number of observations and therefore the precision of the estimate.”
“To strike the appropriate balance between bias and precision, we use the local-linear method proposed by Calonico et al. (2014). This method selects the bandwidth that minimizes the mean squared error, corrects the estimated treatment effect for any remaining non-linearities within the bandwidth, and linearly downweights observations that are farther away from the cutoff.”
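For readers who want to apply the same estimator themselves, the Calonico et al. (2014) procedure is implemented in the user-written Stata command “rdrobust”. A minimal sketch with hypothetical variable names (win for the match outcome, scorediff for the half-time score difference), rather than the authors’ own code:
* Local-linear RD estimate with MSE-optimal bandwidth, triangular kernel, and bias-corrected inference (rdrobust defaults)
rdrobust win scorediff, c(0)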
“We find no supporting evidence for [BP’s result that marginally trailing improves the odds of winning in Australian football, American football, and rugby]: the estimated effects are sometimes positive and sometimes negative, and statistically always insignificant.”
“We then also revisit the phenomenon for basketball. We replicate the finding that half-time trailing improves the chances of winning in NBA matches from the period analyzed in BP, but consistently find null results for NBA matches from outside this period, for the sample of NCAA matches analyzed in BP, for more recent NCAA matches, and for WNBA matches.”
“Moreover, our high-powered meta-analyses across the different sports and competitions cannot reject the hypothesis of no effect of marginally trailing on winning, and the confidence intervals suggest that the true effect, if existent at all, is likely relatively small.”
“In our view, the performance-enhancing effect documented in BP is most likely a chance occurrence.”
To read the full article, click here.
[Excerpts are taken from the article “Investigating the replicability of preclinical cancer biology” by Errington et al., published in eLife.]
“Large-scale replication studies in the social and behavioral sciences provide evidence of replicability challenges (Camerer et al., 2016; Camerer et al., 2018; Ebersole et al., 2016; Ebersole et al., 2020; Klein et al., 2014; Klein et al., 2018; Open Science Collaboration, 2015).”
“In psychology, across 307 systematic replications and multisite replications, 64% reported statistically significant evidence in the same direction and effect sizes 68% as large as the original experiments (Nosek et al., 2021).”
“In the biomedical sciences, the ALS Therapy Development Institute observed no effectiveness of more than 100 potential drugs in a mouse model in which prior research reported effectiveness in slowing down disease, and eight of those compounds were tried and failed in clinical trials costing millions and involving thousands of participants (Perrin, 2014).”
“Of 12 replications of preclinical spinal cord injury research in the FORE-SCI program, only two clearly replicated the original findings – one under constrained conditions of the injury and the other much more weakly than the original (Steward et al., 2012).”
“And, in cancer biology and related fields, two drug companies (Bayer and Amgen) reported failures to replicate findings from promising studies that could have led to new therapies (Prinz et al., 2011; Begley and Ellis, 2012). Their success rates (25% for the Bayer report, and 11% for the Amgen report) provided disquieting initial evidence that preclinical research may be much less replicable than recognized.”
“In the Reproducibility Project: Cancer Biology, we sought to acquire evidence about the replicability of preclinical research in cancer biology by repeating selected experiments from 53 high-impact papers published in 2010, 2011, and 2012 (Errington et al., 2014). We describe in a companion paper (Errington et al., 2021b) the challenges we encountered while repeating these experiments. … These challenges meant that we only completed 50 of the 193 experiments (26%) we planned to repeat. The 50 experiments that we were able to complete included a total of 158 effects that could be compared with the same effects in the original paper.”
“There is no single method for assessing the success or failure of replication attempts (Mathur and VanderWeele, 2019; Open Science Collaboration, 2015; Valentine et al., 2011), so we used seven different methods to compare the effect reported in the original paper and the effect observed in the replication attempt…”
“…136 of the 158 effects (86%) reported in the original papers were positive effects – the original authors interpreted their data as showing that a relationship between variables existed or that an intervention had an impact on the biological system being studied. The other 22 (14%) were null effects – the original authors interpreted their data as not showing evidence for a meaningful relationship or impact of an intervention.”
“Furthermore, 117 of the effects reported in the original papers (74%) were supported by a numerical result (such as graphs of quantified data or statistical tests), and 41 (26%) were supported by a representative image or similar. For effects where the original paper reported a numerical result for a positive effect, it was possible to use all seven methods of comparison. However, for cases where the original paper relied on a representative image (without a numerical result) as evidence for a positive effect, or when the original paper reported a null effect, it was not possible to use all seven methods.”
Summarizing replications across five criteria
“To provide an overall picture, we combined the replication rates by five of these criteria, selecting criteria that could be meaningfully applied to both positive and null effects … The five criteria were: (i) direction and statistical significance (p < 0.05); (ii) original effect size in replication 95% confidence interval; (iii) replication effect size in original 95% confidence interval; (iv) replication effect size in original 95% prediction interval; (v) meta-analysis combining original and replication effect sizes is statistically significant (p < 0.05).”
“For replications of original positive effects, 13 of 97 (13%) replications succeeded on all five criteria, 15 succeeded on four, 11 succeeded on three, 22 failed on three, 15 failed on four, and 21 (22%) failed on all five (… Figure 6).”
“For original null effects, 7 of 15 (47%) replications succeeded on all five criteria, 2 succeeded on four, 3 succeeded on three, 0 failed on three, 2 failed on four, and 1 (7%) failed on all five.”
“… Combining positive and null effects, 51 of 112 (46%) replications succeeded on more criteria than they failed, and 61 (54%) replications failed on more criteria than they succeeded.”
“We explored five candidate moderators of replication success and did not find strong evidence to indicate that any of them account for variation in replication rates we observed in our sample. The clearest indicator of replication success was that smaller effects were less likely to replicate than larger effects … Research into replicability in other disciplines has also found that findings with stronger initial evidence (such as larger effect sizes and/or smaller p-values) are more likely to replicate (Nosek et al., 2021; Open Science Collaboration, 2015).”
“The present study provides substantial evidence about the replicability of findings in a sample of high-impact papers published in the field of cancer biology in 2010, 2011, and 2012. The evidence suggests that replicability is lower than one might expect of the published literature. Causes of non-replicability could be due to factors in conducting and reporting the original research, conducting the replication experiments, or the complexity of the phenomena being studied. The present evidence cannot parse between these possibilities…”
“Stakeholders from across the research community have been raising concerns and generating evidence about dysfunctional incentives and research practices that could slow the pace of discovery. This paper is just one contribution to the community’s self-critical examination of its own practices.”
“Science pursuing and exposing its own flaws is just science being science. Science is trustworthy because it does not trust itself. Science earns that trustworthiness through publicly, transparently, and continuously seeking out and eradicating error in its own culture, methods, and findings. Increasing awareness and evidence of the deleterious effects of reward structures and research practices will spur one of science’s greatest strengths, self-correction.”
To read the full article, click here.
References
Begley CG, Ellis LM. 2012. Drug development: Raise standards for preclinical cancer research. Nature 483:531–533.
Camerer CF, Dreber A, Forsell E, Ho TH, Huber J, Johannesson M, Kirchler M, Almenberg J, Altmejd A, Chan T, Heikensten E, Holzmeister F, Imai T, Isaksson S, Nave G, Pfeiffer T, Razen M, Wu H. 2016. Evaluating replicability of laboratory experiments in economics. Science 351: 1433–1436.
Camerer CF, Dreber A, Holzmeister F, Ho T-H, Huber J, Johannesson M, Kirchler M, Nave G, Nosek BA, Pfeiffer T, Altmejd A, Buttrick N, Chan T, Chen Y, Forsell E, Gampa A, Heikensten E, Hummer L, Imai T, Isaksson S, et al. 2018. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour 2: 637–644.
Ebersole CR, Mathur MB, Baranski E, Bart-Plange D-J, Buttrick NR, Chartier CR, Corker KS, Corley M, Hartshorne JK, IJzerman H, Lazarević LB, Rabagliati H, Ropovik I, Aczel B, Aeschbach LF, Andrighetto L, Arnal JD, Arrow H, Babincak P, Bakos BE, et al. 2020. Many Labs 5: Testing Pre-Data-Collection Peer Review as an Intervention to Increase Replicability. Advances in Methods and Practices in Psychological Science 3:309–331.
Ebersole CR, Atherton OE, Belanger AL, Skulborstad HM, Allen JM, Banks JB, Baranski E, Bernstein MJ, Bonfiglio DBV, Boucher L, Brown ER, Budiman NI, Cairo AH, Capaldi CA, Chartier CR, Chung JM, Cicero DC, Coleman JA, Conway JG, Davis WE, et al. 2016. Many Labs 3: Evaluating participant pool quality across the academic semester via replication. Journal of Experimental Social Psychology 67: 68–82.
Errington TM, Iorns E, Gunn W, Tan FE, Lomax J, Nosek BA. 2014. An open investigation of the reproducibility of cancer biology research. eLife 3: e04333.
Errington TM, Denis A, Perfito N, Iorns E, Nosek BA. 2021b. Challenges for assessing replicability in preclinical cancer biology. eLife 10: e67995.
Klein RA, Ratliff KA, Vianello M, Adams RB, Bahník Š, Bernstein MJ, Bocian K, Brandt MJ, Brooks B, Brumbaugh CC, Cemalcilar Z, Chandler J, Cheong W, Davis WE, Devos T, Eisner M, Frankowska N, Furrow D, Galliani EM, Nosek BA. 2014. Investigating Variation in Replicability: A “Many Labs” Replication Project. Social Psychology 45: 142–152.
Klein RA, Vianello M, Hasselman F, Adams BG, Adams RB, Alper S, Aveyard M, Axt JR, Babalola MT, Bahník Š, Batra R, Berkics M, Bernstein MJ, Berry DR, Bialobrzeska O, Binan ED, Bocian K, Brandt MJ, Busching R, Rédei AC, et al. 2018. Many Labs 2: Investigating Variation in Replicability Across Samples and Settings. Advances in Methods and Practices in Psychological Science 1: 443–490.
Mathur MB, VanderWeele TJ. 2019. Challenges and suggestions for defining replication “success” when effects may be heterogeneous: Comment on Hedges and Schauer (2019). Psychological Methods 24: 571–575.
Nosek BA, Hardwicke TE, Moshontz H, Allard A, Corker KS, Dreber A, Fidler F, Hilgard J, Struhl MK, Nuijten MB, Rohrer JM, Romero F, Scheel AM, Scherer LD, Schönbrodt FD, Vazire S. 2021. Replicability, Robustness, and Reproducibility in Psychological Science. Annual Review of Psychology 73: 114157.
Open Science Collaboration. 2015. Estimating the reproducibility of psychological science. Science 349:aac4716.
Perrin S. 2014. Preclinical research: Make mouse studies work. Nature 507: 423–425.
Prinz F, Schlange T, Asadullah K. 2011. Believe it or not: how much can we rely on published data on potential drug targets? Nature Reviews Drug Discovery 10: 712.
Steward O, Popovich PG, Dietrich WD, Kleitman N. 2012. Replication and reproducibility in spinal cord injury research. Experimental Neurology 233: 597–605.
Valentine JC, Biglan A, Boruch RF, Castro FG, Collins LM, Flay BR, Kellam S, Mościcki EK, Schinke SP. 2011. Replication in prevention science. Prevention Science 12: 103–117.
The SCORE project is entering its final phase of conducting reproductions (repeating the original analysis with original data) and replications (testing the same claim with new data) on a stratified random sample of claims from papers across the social-behavioral sciences.
Here is your opportunity to contribute to these efforts for non-human subjects research (non-HSR) work, meaning projects that use existing data and do not require any additional IRB review steps. We’d love to have your collaboration!
This spreadsheet contains all of the non-HSR projects available, with different tabs corresponding to different project categories.
This announcement outlines the different project categories at a high level. More details can be found in this explanation of our terminology, as well as specific instructions tailored for replication projects (DARs) and reproduction projects.
There are six project categories, depending on two factors: the type of data used and the number of claims selected.
Data types: (i) Datasets provided by the author (PBR/ADR), (ii) Original observations reconstructed from the underlying data sources (SDR), and (iii) New observations that were not already analyzed in the original article (DAR)
Number of claims: (i) Just one claim per article (single-trace) and (ii) A minimum of five claims per article (bushel), unless fewer are available.
We have identified the most feasible projects in separate tabs. We highly encourage collaborators to review these projects first. Projects are highly feasible if we already have the data in hand, or if we’ve identified the data sources as relatively simple to obtain.
Finally, we have included a column in each tab labeled ‘high_incentive,’ coded as yes or no. If a project is coded as ‘yes,’ it means the payment for completing the project will be higher than other projects from the same category. The full set of payments can be found here.
Your next steps
Select one or more projects you anticipate being able to complete, by adding your name in the respective signup cell. Please consider whether you’ll be able to obtain the necessary materials before signing up.
After signing up but before submitting the commitment form, please confirm you can access all of the necessary materials to complete the project. If this requires author data that COS does not currently have but that you think could be made available, please do not reach out to the authors directly. Instead, please contact COS for assistance in getting in touch with the authors.
After you have obtained all of the materials necessary to begin your project, please complete the commitment form linked in the spreadsheet, after which someone from COS will provide you access to an OSF project and preregistration form.
Please keep the following privacy statement in mind as you complete these steps: Other teams are making predictions about the outcomes of many different studies, not knowing which studies have been selected for replication/reproduction. As a consequence, the success of this project requires full confidentiality of the research process. This includes privacy about which studies have been selected for replication and all aspects of the discussion about these replication designs.
[Excerpts are taken from the article “The hidden ‘replication crisis’ of finance”, by Robin Wigglesworth, published at Financial Times online.]
“It may sound like a low-budget Blade Runner rip-off, but over the past decade the scientific world has been gripped by a “replication crisis” — the findings of many seminal studies cannot be repeated, with huge implications. Is investing suffering from something similar?”
“That is the incendiary argument of Campbell Harvey, professor of finance at Duke University. He reckons that at least half of the 400 supposedly market-beating strategies identified in top financial journals over the years are bogus. Worse, he worries that many fellow academics are in denial about this.”
“Harvey is not some obscure outsider or performative contrarian attempting to gain attention through needless controversy. He is the former editor of the Journal of Finance, a former president of the American Finance Association, and an adviser to investment firms like Research Affiliates and Man Group. He has written more than 150 papers on finance, several of which have won prestigious prizes.”
“To understand what the ‘replication crisis’ is, how it has happened and its implications for finance, it helps to start at its broader genesis. In 2005, Stanford medical professor John Ioannidis published a bombshell essay titled “Why Most Published Research Findings Are False”, which noted that the results of many medical research papers could not be replicated by other researchers. Subsequently, several other fields have turned a harsh eye on themselves and come to similar conclusions. The heart of the issue is a phenomenon that researchers call “p-hacking”.”
“P-hacking is when researchers overtly or subconsciously twist the data to find a superficially compelling but ultimately spurious relationship between variables. It can be done by cherry-picking what metrics to measure, or subtly changing the time period used. Just because something is narrowly statistically significant, does not mean it is actually meaningful. A trading strategy that looks golden on paper might turn up nothing but lumps of coal when actually implemented.”
“AQR, a prominent quant investment group, is also sceptical that there are hundreds of durable and successful factors that can help investors beat markets, but argues that the “replication crisis” brouhaha is overdone. Earlier this year it published a paper that concluded that not only could the majority of the studies it examined be replicated, they still worked “out of sample” — in actual live trading — and were actually further corroborated by international data.”
“Harvey is unconvinced by the riposte, and will square up to the AQR paper’s authors at the American Finance Association’s annual meeting in early January. “That’s going to be a very interesting discussion,” he promises.”
To read the full article, click here.
The SCORE team at the Center for Open Science (COS) is looking for committed individuals to help conduct data-analytic replications (DARs) and reproductions.
In general, DARs involve using new data and the same methodological and analytic approach that was used in the original study to replicate the claim identified by SCORE, producing the statistical evidence found in “claim 4” (one or more inferential tests or pieces of statistical evidence). For DARs, collaborators may use different data sources or the same data sources as the original study (e.g., a longitudinal dataset, the U.S. Census, etc.); however, the observations used in the replication must be distinct from the observations used in the original study (e.g., newer waves of the same longitudinal dataset, a newer version of the U.S. Census, etc.).
Reproductions involve using the original data and the same analytic approach that was used in the original study to reproduce the inferential test(s) or statistical evidence identified by SCORE in “claim 4.”
Reproduction types
Within the SCORE program, there are three types of reproductions:
1) Push Button Reproduction (PBR): Uses the original data and the original analytic code (either shared from the original authors or collected from an online repository/journal website). If a PBR fails to produce sensible output, you will conduct an Author-Data reproduction.
2) Author Data Reproduction (ADR): Uses the original data (either shared from the original authors or collected from an online repository/journal website) and new/revised analytic code generated by the SCORE collaborator.
3) Source Data Reproduction (SDR), applicable when the original study used existing data: The SCORE collaborator reconstructs the dataset used in the original analysis (by using information from the original paper and any additional information from the original author) and generates new analytic code.
The data-analytic replications and each of the three types of reproductions are further broken down based on the method of claim extraction:
– Single-trace papers: Only a single claim trace is extracted from the article which includes exactly one statistically significant inferential test result.
– Bushel papers: As many independent claim traces are extracted as possible, which may include non-inferential quantitative evidence, non-significant evidence, and multiple inferential test results in the same claim.
How to get involved
You will self-select into a project-analysis type using the sign-up sheet linked below before completing a form to confirm your interest and timeline feasibility. You will see that the commitment form corresponding to each project is linked directly in the spreadsheet.
If you sign up for a bushel reproduction/replication, you will commit to reproducing/replicating as many claims as possible, aiming for at least 5 unless fewer claims are included in the bushel claims spreadsheet.
If you are interested in executing a data-analytic replication (DAR) or a reproduction, please review the in-depth instructions linked below and claim papers using this SIGN-UP SHEET. When you claim a project, be sure to also complete the commitment form linked in the sign-up sheet. High priority projects are highlighted in green.
For bushel papers, the columns ‘has replication’ and ‘has reproduction’ indicate whether or not at least one analysis has already been performed within the context of the SCORE program. Those projects with ‘TRUE’ in this field will be easier to complete because we likely have relevant materials in hand.
If you would like to review what data and materials we’ve already collected for a given project, if anything, please let us know and we will provide a view-only link.
You may review general instructions and expectations for each project type linked below. When you complete the commitment form and are matched to a project, you will receive access to the corresponding OSF project, your preregistration form, and any other relevant materials.
Bushel claim spreadsheets can be found in the OSF project linked in the sign-up sheet. If you are interested, please follow the link, review the project wiki, and click the paper from among the full list of bushel papers included in the project.
Note that you should attempt to access all of the necessary data after signing up but before completing the commitment form. Please do not reach out to any of the original authors directly; if you suspect that the data is readily available but require assistance to access it (e.g., original author contact, funding to access the data, etc.) please reach out to us after you’ve added your name to the sign-up sheet.
Privacy Statement: Other teams are making predictions about the outcomes of many different studies, not knowing which studies have been selected for replication/reproduction. As a consequence, the success of this project requires full confidentiality of the research process. This includes privacy about which studies have been selected for replication and all aspects of the discussion about these replication designs.
The International Journal for Re-Views in Empirical Economics (IREE) is the only journal in economics solely dedicated to publishing replications. Recently, IREE was evaluated by TOP Factor. TOP Factor is an initiative launched by the Center for Open Science to assess journals according to “a values-aligned rating of journal policies as a counterweight to metrics that incentivize journals to publish exciting results regardless of credibility” (see here). The assessment of IREE‘s journal policies resulted in a journal score of 13 points. This puts IREE in 4th place among all 136 economic journals rated by TOP Factor, ahead of the American Economic Review, Econometrica, Plos One, and Science.
TOP Factor provides an alternative to metrics such as the journal impact factor (JIF). It constitutes a first step towards evaluating journals based on their quality of process and implementation of scholarly values. “Too often, journals are compared using metrics that have nothing to do with their quality,” says Evan Mayo-Wilson, Associate Professor in the Department of Epidemiology and Biostatistics at Indiana University School of Public Health-Bloomington. “The TOP Factor measures something that matters. It compares journals based on whether they require transparency and methods that help reveal the credibility of research findings.” (see COS announcement of TOP factor, 2020).
TOP Factor is based on the Transparency and Openness Promotion (TOP) Guidelines, a framework of eight standards that summarize behaviors that can improve transparency and reproducibility of research such as transparency of data, materials, code, and research design, preregistration, and replication.
Editor Martina Grunow announced that she was very pleased with this rating, as TOP Factor reflects exactly what IREE stands for: reducing the publication bias towards literally incredible and non-reproducible results and the resulting “publish-or-perish” culture. Like TOP Factor, IREE promotes the reproducibility and transparency of published results and scientific discourse in economics based on high-quality and credible research results.