This is an invitation to join a project that aims to improve the quality of research in applied microeconomics by examining the choices that researchers make. We are hoping to recruit up to 200 participants. Full participants in this project will receive coauthorship on the final paper, as well as a $2,000 payment.
Participation is open broadly to anyone with a published or forthcoming paper in the applied microeconomics literature, or who holds a PhD and works in a job where they write non-academic reports using tools to estimate causal effects from applied microeconomics. In addition to academic faculty, this includes researchers in the non-academic private or public sectors, as well as graduate students who have a published or forthcoming paper. Researchers are encouraged to participate from any country, of any gender, race, ethnicity, or sexual or gender identity, and at any stage in their careers.
This project is a follow-up to Huntington-Klein et al. (2021), where multiple researchers each replicated the same studies (a “many-analyst study”). The study examined differences in the analytic and data-preparation choices that researchers made, and how those choices affected their results. This was the Economic Inquiry Paper of the Year in 2021.
In this new project, a larger number of researchers, up to 200, will independently complete the same research task. Then, there will be several rounds where you will revise your original results. These will follow either peer review, or a change in the research task that standardizes some of the choices made, for example providing a data set in which key variables are pre-prepared instead of having researchers create their own. Without revealing the specific hypotheses of the project, these multiple rounds of revision will allow us to better understand the parts of research that are least standardized across researchers, whether standardization of research methods is desirable, and what tools might be most effective in standardizing research decisions, if that is desirable.
Full participation in the project means completing all rounds of revision. The payment of $2,000, and coauthorship on the paper describing the results of the many-analyst study, are contingent on full participation. You will complete the research task several times, and may occasionally be asked to provide peer review. Your first replication task should take you about as long as you’d expect to spend on creating the results section for a short Letters-style publication. After that, the project is expected to take several hours of your time every few months, concluding in early 2024.
This project is generously supported by the Alfred P. Sloan Foundation, and approved by the Seattle University IRB.
This blog is based on the book of the same name by Norbert Hirschauer, Sven Grüner, and Oliver Mußhoff that was published in SpringerBriefs in Applied Statistics and Econometrics in August 2022. Starting from the premise that a lacking understanding of the probabilistic foundations of statistical inference is responsible for the inferential errors associated with the conventional routine of null-hypothesis-significance-testing (NHST), the book provides readers with an effective intuition and conceptual understanding of statistical inference. It is a resource for statistical practitioners who are confronted with the methodological debate about the drawbacks of “significance testing” but do not know what to do instead. It is also targeted at scientists who have a genuine methodological interest in the statistical reform debate.
Data-based scientific propositions about the world are extremely important for sound decision-making in organizations and society as a whole. Think of climate change or the Covid-19 pandemic with questions such as how face masks, vaccines, or restrictions on people’s movements work. That said, it becomes clear that the debate on p-values and statistical significance tests addresses probably the most fundamental question of the data-based sciences: How can we learn from data and come to the most reasonable belief (proposition) regarding a real-world state of interest given the available evidence (data) and the remaining uncertainty? Answering that question and understanding when and how statistical measures (i.e., summary statistics of the given dataset) can help us evaluate the knowledge gain that can be obtained from a particular sample of data is extremely important in any field of science.
In 2016, the American Statistical Association (ASA) issued an unprecedented methodological warning on p-values that set out what p-values are, and what they can and can’t tell us. It also contained a clear statement that, despite the delusive “hypothesis-testing” terminology of conventional statistical routines, p-values can neither be used to determine whether a hypothesis is true nor whether a finding is important. Against a background of persistent inferential errors associated with significance testing, the ASA felt compelled to pursue the issue further. In October 2017, it organized a symposium on the future of statistical inference whose major outcome was a special issue “Statistical Inference in the 21st Century: A World Beyond p < 0.05” in The American Statistician. Expressing their hope that this special issue would lead to a major rethink of statistical inference, the guest editors concluded that it is time to stop using the term “statistically significant” entirely. Almost simultaneously, a widely supported call to retire statistical significance was published in Nature.
Empirically working economists might be perplexed by this fundamental debate. They are usually not trained statisticians but statistical practitioners. As such, they have a keen interest in their respective field of research but “only apply statistics” – usually by following the unquestioned routine of reporting p-values and asterisks. Due to the thousands of critical papers that have been written over the last decades, even most statistical practitioners will, by now, be aware of the severe criticism of NHST-practices that many used to follow much like an automatic routine. Nonetheless, all those without a methodological bent in statistics – and this is likely to be the majority of empirical researchers – are likely to be puzzled and ask the question: What is going on here and what should I do now?
While the debate is highly visible now, many empirical researchers are likely unaware that fundamental criticisms of NHST have been voiced for decades – basically ever since significance testing became the standard routine in the 1950s (see Key References below).
KEY REFERENCES: Reforming statistical practices
2022 – Why and how we should join the shift from significance testing to estimation: Journal of Evolutionary Biology (Berner and Amrhein)
2019 – Embracing uncertainty: The days of statistical significance are numbered: Pediatric Anesthesia (Davidson)
2019 – Call to retire statistical significance: Nature (Amrhein et al.)
2019 – Special issue editorial: “[I]t is time to stop using the term ‘statistically significant’ entirely. Nor should variants such as ‘significantly different,’ ‘p < 0.05,’ and ‘nonsignificant’ survive, […].” The American Statistician (Wasserstein et al.)
2018 – Statistical Rituals: The Replication Delusion and How We Got There: Advances in Methods and Practices in Psychological Science (Gigerenzer)
2016 – ASA warning: “The p-value can neither be used to determine whether a scientific hypothesis is true nor whether a finding is important.” The American Statistician (Wasserstein and Lazar)
2016 – Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations: European Journal of Epidemiology (Greenland et al.)
2015 – Editorial ban on using NHST: Basic and Applied Social Psychology (Trafimow and Marks)
2014 – The Statistical Crisis in Science: American Scientist (Gelman and Loken)
2011 – The Cult of Statistical Significance – What Economists Should and Should not Do to Make their Data Talk: Schmollers Jahrbuch (Krämer)
2008 – The Cult of Statistical Significance. How the Standard Error Costs Us Jobs, Justice, and Lives: University of Michigan Press (Ziliak and McCloskey).
2017 – Statistical Significance and the Dichotomization of Evidence: Journal of the American Statistical Association (McShane and Gal)
2004 – The Null Ritual: What You Always Wanted to Know About Significance Testing but Were Afraid to Ask: SAGE handbook of quantitative methodology for the social sciences (Gigerenzer et al.)
2000 – Null hypothesis significance testing: A review of an old and continuing controversy: Psychological Methods (Nickerson)
1996 – A task force on statistical inference of the American Psychological Association dealt with calls for banning p-values but rejected the idea as too extreme: American Psychologist (Wilkinson and Taskforce on Statistical Inference)
1994 – The earth is round (p < 0.05): “[A p-value] does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does!” American Psychologist (Cohen)
1964 – How should we reform the teaching of statistics? “[Significance tests] are popular with non-statisticians, who like to feel certainty where no certainty exists.” Journal of the Royal Statistical Society (Yates and Healy)
1960 – The fallacy of the null-hypothesis significance test: Psychological Bulletin (Rozeboom)
1959 – Publication Decisions and Their Possible Effects on Inferences Drawn from Tests of Significance–Or Vice Versa: Journal of the American Statistical Association (Sterling)
1951 – The Influence of Statistical Methods for Research Workers on the Development of […] Statistics: Journal of the American Statistical Association (Yates)
Already back in the 1950s, scientists started expressing severe criticisms of NHST and called for reforms that would shift reporting practices away from the dichotomy of significance tests to the estimation of effect sizes and uncertainty. It is safe to say that the core criticisms and reform suggestions remained largely unchanged over those last seven decades. This is because – unfortunately – the inferential errors that they address have remained the same. Nonetheless, after the intensified debate in the last decade and some institutional-level efforts, such as the revision of author guidelines in some journals, some see signs that a paradigm shift from testing to estimation is finally under way. Unfortunately, however, reforms seem to lag behind in economics compared to many other fields.
II. THE METHODOLOGICAL DEBATE IN A NUTSHELL
The present debate is concerned with the usefulness of p-values and statistical significance declarations for making inferences about a broader context based only on a limited sample of data. Simply put, two crucial questions can be discerned in this debate:
Question 1 – Transforming information: What we can extract – at best – from a sample is an unbiased point estimate (“signal”) of an unknown population effect size (e.g., the relationship between education and income) and an unbiased estimate of the uncertainty (“noise”), caused by random error, of that point estimate (i.e., the standard error). We can, of course, go through various mathematical manipulations. But why should we transform two intelligible and meaningful pieces of information – point estimate and standard error – into a p-value or even a dichotomous significance statement?
Question 2 – Drawing inferences from non-random samples: Statistical inference is based on probability theory and a formal chance model that links a randomly generated dataset to a broader target population. More pointedly, statistical assumptions are empirical commitments and acting as if one obtained data through random sampling does not create a random sample. How should we then make inferences about a larger population in the many cases where there is only a convenience sample, such as a group of haphazardly recruited survey respondents, that researchers could get hold of in one way or the other?
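To make Question 1 concrete, the following minimal Python sketch shows that a p-value is just a deterministic transformation of the point estimate and the standard error, and that the dichotomous "significance" label can separate two nearly identical results. The two studies and their numbers are hypothetical, and the normal approximation is an assumption of the sketch.

```python
import math

def two_sided_p_value(estimate: float, std_error: float) -> float:
    """Two-sided p-value for H0: effect = 0, using a normal approximation."""
    z = abs(estimate / std_error)
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return 2.0 * (1.0 - cdf)

def is_significant(estimate: float, std_error: float, alpha: float = 0.05) -> bool:
    """Dichotomous label derived from the same two numbers."""
    return two_sided_p_value(estimate, std_error) < alpha

# Hypothetical studies: (point estimate, standard error).
study_a = (0.50, 0.25)
study_b = (0.48, 0.25)

for name, (est, se) in (("A", study_a), ("B", study_b)):
    print(name, est, se, round(two_sided_p_value(est, se), 4), is_significant(est, se))
```

Both studies carry essentially the same signal and noise, yet only one of them earns an asterisk. Nothing in the p-value or the label was not already contained in the estimate and its standard error.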
It seems that, given the loss of information and the inferential errors that are associated with the NHST-routine, no convincing answer to the first question can be provided by its advocates. Even worse prospects arise when looking at the second question. Severe violations of the assumptions regarding data generation are quite common in empirical research and, particularly, in survey-based research. From a logical point of view, using inferential statistical procedures for non-random samples would have to be justified by running sample selection models that remedy selection bias. Alternatively, one would have to postulate that those samples are approximately random samples, which is often a heroic but deceptive assumption. This is evident from the mere fact that other probabilistic sampling designs (e.g., cluster sampling) can lead to standard errors that are several times larger than the default which presumes simple random sampling. Therefore, standard errors and p-values that are just based on a bold assumption of random sampling – contrary to how data were actually collected – are virtually worthless. Or more bluntly: Proceeding with the conventional routine of displaying p-values and statistical significance even when the random sampling assumption is grossly violated is tantamount to pretending to have better evidence than one has. This is a breach of good scientific practice that provokes overconfident generalizations beyond the confines of the sample.
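The remark about cluster sampling can be illustrated with the textbook design-effect approximation for a sample mean, 1 + (m - 1) × ICC, where m is the (equal) cluster size and ICC is the intra-cluster correlation. The Python sketch below is not taken from the book; the cluster size and correlation are hypothetical values chosen only to show the order of magnitude.

```python
import math

def design_effect(cluster_size: float, icc: float) -> float:
    """Approximate variance inflation of a mean under equal-sized clusters."""
    return 1.0 + (cluster_size - 1.0) * icc

def se_inflation(cluster_size: float, icc: float) -> float:
    """Factor by which the simple-random-sampling standard error is too small."""
    return math.sqrt(design_effect(cluster_size, icc))

# Hypothetical survey: 50 respondents per cluster, intra-cluster correlation 0.2.
print(design_effect(50, 0.2))   # variance is roughly 10.8 times larger
print(se_inflation(50, 0.2))    # standard error is roughly 3.3 times larger
```

Under these assumed values, a standard error computed as if the data came from simple random sampling would understate the uncertainty by a factor of more than three.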
III. WHAT SHOULD WE DO?
Peer-reviewed journals should be the gatekeepers of good scientific practice because they are key to what is publicly available as the body of knowledge. Therefore, the most decisive statistical reform is to revise journal guidelines and include adequate inferential reporting standards. Around 2015, six prestigious economics journals (Econometrica, the American Economic Review, and the four American Economic Journals) adopted guidelines that require authors to refrain from using asterisks or other symbols to denote statistical significance. Instead, they are asked to report point estimates and standard errors. It seems that it would also make sense for other journals to reform their guidelines based on the understanding that the assumptions regarding random data generation must be met and that, if they are met, reporting point estimates and standard errors is a better summary of the evidence than p-values and statistical significance declarations. In particular, the Don’ts and Do’s listed in the box below should be communicated to authors and reviewers.
Journal guidelines with inferential reporting standards similar to the ones in the box above would have several benefits: (1) They would effectively communicate necessary standards to authors. (2) They would help reviewers assess the credibility of inferential claims. (3) They would provide authors with an effective defense against unqualified reviewer requests. The latter is arguably the most important benefit because it would also mitigate the publication bias that results from the fact that many reviewers still prefer statistically significant results and pressure researchers to report p-values and “significant novel discoveries” without even taking account of whether data were randomly generated or not.
Despite the previous focus on random sampling, a last comment on inferences from randomized controlled trials (RCTs) seems appropriate: reporting standards similar to those in the box above should also be specified for RCTs. In addition, researchers should be required to communicate that, in RCTs, the standard error deals with the uncertainty caused by “randomization variation.” Therefore, in common settings where the experimental subjects are not randomly drawn from a larger population, the standard error only quantifies the uncertainty associated with the estimation of the sample average treatment effect, i.e., the effect that the treatment produces in the given group of experimental subjects. Only if the randomized experimental subjects have also been randomly sampled from a population can statistical inference be used as an auxiliary tool for making inferences about that population based on the sample. In this case, and only in this case, the adequately estimated standard error can be used to assess the uncertainty associated with the estimation of the population average treatment effect.
Prof. Norbert Hirschauer, Dr. Sven Grüner, and Prof. Oliver Mußhoff are agricultural economists in Halle (Saale) and Göttingen, Germany. The authors are interested in statistical reforms that help shift inferential reporting practices away from the dichotomy of significance tests to the estimation of effect sizes and uncertainty.
Replication is one key to ensuring the credibility of and confidence in research findings. Yet replication papers are rare in political science and related fields. Research & Politics welcomes replications as regular submissions and is happy to announce a call for abstracts for a special issue on methods, practices and ethics for replication.
This special issue is in collaboration with the Institute for Replication and particularly welcomes three types of papers, without excluding other forms of replication.
First, we invite reproduction or replication of prominent research, especially studies that are frequently cited or used in policy making. We particularly encourage replication/reproduction manuscripts that advance theory or methods in some way beyond just the specific application. For example, papers of this type might examine whether new data and methods lead to new insights, and if so, why.
Second, we welcome multiple replications or reproductions on a given topic. This type of meta-replication may, for instance, situate the replicability of an individual study within a set of related studies. Another possibility is to combine studies using the same data (methods) with different methods (data) and investigate whether the difference in findings may be explained.
Third, we encourage theoretical papers that consider incentives, norms and ethics that can guide the practice of replication.
Abstracts should be submitted as a normal submission through the Research & Politics ScholarOne submission site. In the cover letter please indicate that you wish the abstract to be considered for the special issue on replication.
Abstracts should be less than 1 page long. Proposals that do not meet our editorial criteria will receive only a brief reply. Proposals that appear to have potential for publication will receive more detailed feedback, and their authors may be invited to submit a complete paper.
If selected to this second stage, full papers will be peer reviewed and managed by the guest editors.
The submission deadline is September 15, 2022.
For more information, contact Abel Brodeur, firstname.lastname@example.org.
The Multi100 project is a crowdsourced empirical project aiming to estimate how robust published results and conclusions in social and behavioral sciences are to analysts’ analytical choices. The project will involve more than 200 researchers.
The Center for Open Science is currently looking to finalize teams of researchers to undertake the work. They are particularly interested in recruiting grad and post-grad data analysts in Economics, International Relations, and Political Science to join the project.
And in case contributing to science isn’t reward enough, analysts can become co-authors and receive compensation.
[* EiR = Econometrics in Replications, a feature of TRN that highlights useful econometrics procedures for re-analysing existing research.]
NOTE: This blog uses Stata for its estimation. All the data and code necessary to reproduce the results in the tables below are available at Harvard’s Dataverse: click here.
Missing data is ubiquitous in economics. Standard practice is to drop observations for which any variables have missing values. At best, this can result in diminished power to identify effects. At worst, it can generate biased estimates. Old-fashioned ways to address missing data assigned values using some form of interpolation or imputation. For example, time series data might fill in gaps in the record using linear interpolation. Cross-sectional data might use regression to replace missing values with their predicted values. These procedures are now known to be flawed (Allison, 2001; Enders, 2010).
The preferred way to deal with missing data is to use maximum likelihood (ML) or multiple imputation (MI), assuming the data are “missing at random”. Missing at random (MAR) essentially means that, once the observed variables are taken into account, the probability that a variable is missing is independent of the value of that variable. For example, if a question about illicit drug use is more likely to go unanswered for respondents who use drugs, then those data would not be MAR. Assuming that the data are MAR, both ML and MI will produce estimates that are consistent and asymptotically efficient.
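Why the missingness mechanism matters can be seen in a small simulation. The Python sketch below is only an illustration with made-up numbers: it compares a variable whose values go missing for reasons unrelated to the value (the simplest special case of MAR) with one whose high values are more likely to go missing, as in the drug-use example.

```python
import random
import statistics

def observed_mean(values, prob_missing):
    """Mean of the values that remain; prob_missing(v) is the chance v goes missing."""
    kept = [v for v in values if random.random() >= prob_missing(v)]
    return statistics.fmean(kept)

random.seed(12345)  # fixed seed so the illustration is reproducible
# Hypothetical variable with a true mean of zero.
population = [random.gauss(0.0, 1.0) for _ in range(20000)]

# Missingness unrelated to the value: 30% of answers missing across the board.
mean_unrelated = observed_mean(population, lambda v: 0.3)

# Missingness depends on the value itself: high values are missing far more often.
mean_value_dependent = observed_mean(population, lambda v: 0.6 if v > 0 else 0.1)

print(round(mean_unrelated, 3), round(mean_value_dependent, 3))
```

In the first case the observed mean stays close to zero. In the second it is pulled clearly below zero, and neither ML nor MI can be expected to repair that without further modelling of the missingness itself.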
ML is in principle the easiest to perform. In Stata, one can use the structural equation modelling command (“sem”) with the option “method(mlmv)”. That’s it! Unfortunately, the simplicity of ML is also its biggest disadvantage. For linear models, ML simultaneously estimates means, variances, and covariances while also accounting for the incomplete records associated with missing data. Not infrequently, this causes convergence problems. This is particularly a problem for panel data where one might have a large number of fixed effects.
In this blog, we illustrate how to apply both ML and MI to a well-cited study on mortality and inequality by Andrew Leigh and Christopher Jencks (Journal of Health Economics, 2007). Their analysis focused on the relationship between life expectancy and income inequality, measured by the share of pre-tax income going to the richest 10% of the population. Their data consisted of annual observations from 1960-2004 for Australia, Canada, France, Germany, Ireland, the Netherlands, New Zealand, Spain, Sweden, Switzerland, the UK, and the US. We use their study both because their data and code are publicly available, and because much of the original data were missing.
The problem is highlighted in TABLE 1, which uses a reconstruction of L&J’s original dataset. The full dataset has 540 observations. The dependent variable, “Life expectancy”, has approximately 11 percent missing values. The focal variable, “Income share of the richest 10%”, has approximately 24 percent missing values. The remaining control variables vary widely in their missingness. Real GDP has no missing values. Education has the most missing values, with fully 80% of the variable’s values missing. This is driven by the fact that the Barro and Lee data used to measure education only reports values at five-year intervals.
In fact, the problem is more serious than TABLE 1 indicates. If we run the regression using L&J’s specification (cf. Column 7, Table 4 in their study), we obtain the results in Column (1) of TABLE 2. The estimates indicate that a one-percentage point increase in the income share of the richest 10% is associated with an increase in life expectancy of 0.003 years, a negligible effect in terms of economic significance, and statistically insignificant. Notably, this estimate is based on a mere 64 observations (out of 540).
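The gap between per-variable missingness and the number of usable observations follows from how listwise deletion works: a row is dropped if any one of its variables is missing. The Python sketch below uses a handful of invented records (the variable names mirror the Stata code later in this post, but the values are not from L&J's data) to show the mechanics.

```python
# Toy records; None marks a missing value. All numbers are invented.
rows = [
    {"le": 70.1, "ts10": 30.2, "edu": None},
    {"le": None, "ts10": 31.0, "edu": 9.5},
    {"le": 71.0, "ts10": None, "edu": None},
    {"le": 71.4, "ts10": 32.1, "edu": 9.9},
    {"le": 71.9, "ts10": 32.5, "edu": None},
]

def share_missing(records, var):
    """Fraction of records in which `var` is missing."""
    return sum(r[var] is None for r in records) / len(records)

def complete_cases(records):
    """Records that survive listwise deletion (no variable missing)."""
    return [r for r in records if all(v is not None for v in r.values())]

for var in ("le", "ts10", "edu"):
    print(var, share_missing(rows, var))
print("rows kept:", len(complete_cases(rows)), "of", len(rows))
```

Here no single variable is missing in more than 60% of the rows, yet only one row in five survives, because the gaps fall in different rows for different variables.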
In fact, these are not the results that L&J reported in their study. No doubt because of the small number of observations, they used linear interpolation on some (but not all) of their data to fill in missing values. Applying their approach to our data yields the results in Column (2) of Table 2 below. There are two problems with using their approach.
First, for various reasons, L&J did not fill in values for all the missing values. They ended up using only 430 out of a possible 540 observations. As a result, their estimates did not exploit all the information that was available to them. Second, interpolation replaces missing values with their predicted values without accounting for the randomness that occurs in real data. This biases standard errors, usually downwards. ML and MI allow one to do better.
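The second problem can be made tangible with a toy series. The Python sketch below is an illustration only (the numbers are invented and the helper assumes the first and last values are observed): it removes every other observation from a noisy series, fills the gaps by linear interpolation, and compares the variability of the filled series with that of the original.

```python
import statistics

def linear_interpolate(series):
    """Fill None gaps on a straight line between the nearest observed neighbours.

    Assumes the first and last entries are observed.
    """
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            weight = (i - left) / (right - left)
            filled[i] = series[left] + weight * (series[right] - series[left])
    return filled

# Invented "true" series and the same series with every other value missing.
true_series = [0.0, 4.0, 1.0, 5.0, 2.0, 6.0, 3.0]
with_gaps = [0.0, None, 1.0, None, 2.0, None, 3.0]

filled = linear_interpolate(with_gaps)
print(filled)
print(statistics.variance(true_series), statistics.variance(filled))
```

The interpolated values sit exactly on a line, so the filled series looks far smoother than the real one. Treating such values as if they were observed data overstates how much information the sample contains.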
ML is the easiest method to apply. To estimate the regression in Table 2 requires a one-line command:
sem (le <- ts10 gdp gdpsq edu phealth thealth id2-id12 year2-year45), method(mlmv) vce(cluster id)
The “sem” command calls up Stata’s structural equation modelling procedure. The option “method(mlmv)” tells Stata to use maximum likelihood to accommodate missing values. If this option is omitted from the above, then the command will produce results identical to those in Column (1) of Table 2, except that the standard errors will be slightly smaller.
While the simplicity of ML is a big advantage, it also introduces complications. Specifically, ML estimates all the parameters simultaneously. The inclusion of 11 country fixed effects and 44 year dummies makes the number of elements in the variance-covariance matrix huge. This, in combination with the fact that ML simultaneously integrates over distributions of variables to account for missing values, creates computational challenges. The ML procedure called up by the command above did not converge after 12 hours. As a result, we next turn to MI.
Unlike ML, MI fills in missing values with actual data. The imputed values are created to incorporate the randomness that occurs in real data. The most common MI procedure assumes that all of the variables are distributed multivariate normal. It turns out that this is a serviceable assumption even if the regression specification includes variables that are not normally distributed, like dummy variables (Horton et al., 2003; Allison, 2006).
As the name suggests, MI creates multiple datasets using a process of Monte Carlo simulation. Each of the datasets produces a separate set of estimates. These are then combined to produce one overall set of estimation results. Because each data set is created via a simulation process that depends on randomness, each dataset will be different. Furthermore, unless a random seed is set, different attempts will produce different results. This is one disadvantage of MI versus ML.
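The combining step is usually done with Rubin's rules: the pooled coefficient is the average across datasets, and its variance adds the average within-dataset variance to the between-dataset variance, the latter inflated by (1 + 1/M). The Python sketch below illustrates the arithmetic for a single coefficient; the five estimates and standard errors are invented.

```python
import math
import statistics

def pool_rubin(estimates, std_errors):
    """Combine one coefficient across M imputed datasets using Rubin's rules."""
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean([se ** 2 for se in std_errors])  # average sampling variance
    between = statistics.variance(estimates)                   # variance across imputations
    total = within + (1.0 + 1.0 / m) * between
    return pooled, math.sqrt(total)

# Invented results from M = 5 imputed datasets.
estimates = [0.10, 0.12, 0.08, 0.11, 0.09]
std_errors = [0.05, 0.05, 0.05, 0.05, 0.05]

pooled_estimate, pooled_se = pool_rubin(estimates, std_errors)
print(round(pooled_estimate, 4), round(pooled_se, 4))
```

The pooled standard error is larger than any single dataset's standard error because it also reflects the uncertainty about the imputed values themselves.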
A second disadvantage is that MI requires a number of subjective assessments to set key parameters. The key parameters are (i) the “burnin”, the number of datasets that are initially discarded in the simulation process; (ii) the “burnbetween”, the number of intervening datasets that are discarded between retained datasets to maintain dataset independence; and (iii) the total number of imputed datasets that are used for analysis.
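How the three settings interact can be pictured as thinning a long simulation chain. The Python sketch below is a schematic of that idea, not a description of Stata's internals: it lists which iterations of the chain would be kept as imputed datasets for given values of the three parameters.

```python
def retained_iterations(burnin: int, burnbetween: int, n_imputations: int):
    """Iterations at which an imputed dataset is kept.

    The first dataset is kept after `burnin` iterations; each later one
    is kept after a further `burnbetween` iterations.
    """
    return [burnin + k * burnbetween for k in range(n_imputations)]

# With a burn-in of 100 and 100 iterations between retained datasets.
print(retained_iterations(100, 100, 3))

# Lengthening the burn-in to 500 shifts every retained dataset later in the chain.
print(retained_iterations(500, 100, 3))
```

A longer burn-in gives the chain more time to settle before anything is kept, while a larger gap between retained datasets makes them less dependent on each other. Both come at the cost of more computation.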
The first two parameters are related to the properties of “stationarity” and “independence”. The analogue to convergence in estimated parameters in ML is convergence in distributions in MI. To assess these two properties we first do a trial run of imputations.
The command “mi impute mvn” identifies the variables with missing values to the left of the “=” sign, while the variables to the right are identified as being complete.
mi impute mvn le ts10 edu phealth thealth im = gdp gdpsq id2-id12 year2-year45, prior(jeffreys) mcmconly rseed(123) savewlf(wlf, replace)
The option “mcmconly” lets Stata know that we are not retaining the datasets for subsequent analysis, but only using them to assess their characteristics.
The option “rseed(123)” ensures that we will obtain the same data every time we run this command.
The option “prior(jeffreys)” specifies a noninformative (Jeffreys) prior for the model used to generate the imputed datasets. This makes the distribution used to impute the missing values depend solely on the estimates from the last regression rather than on prior beliefs.
Lastly, the option “savewlf(wlf, replace)” creates an aggregate variable called the “worst linear function” that allows one to investigate whether the imputed datasets are stationary and independent.
Note that Stata sets the default values for “burnin” and “burnbetween” at 100 and 100.
The next set of key commands are given below.
use wlf, clear
tsline wlf, ytitle(Worst linear function) xtitle(Burn-in period) name(stable2,replace)
ac wlf, title(Worst linear function) ytitle(Autocorrelations) note("") name(ac2,replace)
The “tsline” command produces a “time series” graph of the “worst linear function” where “time” is measured by number of simulated datasets. We are looking for trends in the data. That is, do the estimated parameters (which include elements in the variance-covariance matrix) tend to systematically depart from the overall mean?
The graph above is somewhat concerning because it appears to first trend up and then trend down. As a result, we increase the “burnin” value to 500 from its default value of 100 with the following command. Why 500? We somewhat arbitrarily choose a number that is substantially larger than the previous “burnin” value.
mi impute mvn le ts10 edu phealth thealth im = gdp gdpsq id2-id12 year2-year45, prior(jeffreys) mcmconly burnin(500) rseed(123) savewlf(wlf, replace)
use wlf, clear
tsline wlf, ytitle(Worst linear function) xtitle(Burn-in period) name(stable2,replace)
ac wlf, title(Worst linear function) ytitle(Autocorrelations) note("") name(ac2,replace)
This looks a lot better. The trending that is apparent in the first half of the graph is greatly reduced in the second half. We subjectively determine that this demonstrates sufficient “stationarity” to proceed. Note that there is no formal test to determine stationarity.
The next thing is to check for independence. The posterior distributions used to impute the missing values rely on Bayesian updating. While our use of the Jeffreys prior reduces the degree to which contiguous imputed datasets are related, there is still the opportunity for correlations across datasets. The “ac” command produces a correlogram of the “worst linear function” that allows us to assess independence. This is produced below.
This correlogram indicates that as long as we retain imputed datasets that are at least “10 datasets apart”, we should be fine. The default value of 100 for “burnbetween” is thus more than sufficient.
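What a correlogram measures can be sketched in a few lines of Python. The example below is an illustration only: it simulates a persistent chain (an AR(1) process with a coefficient of 0.8, which is an assumption and not a property of the imputation chain in this post) and computes sample autocorrelations at increasing lags.

```python
import random
import statistics

def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = statistics.fmean(series)
    denom = sum((x - mean) ** 2 for x in series)
    numer = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return numer / denom

random.seed(2024)  # fixed seed so the illustration is reproducible
# Hypothetical chain: each draw carries over 80% of the previous one.
chain = [0.0]
for _ in range(19999):
    chain.append(0.8 * chain[-1] + random.gauss(0.0, 1.0))

for lag in (1, 5, 10, 30):
    print(lag, round(autocorrelation(chain, lag), 3))
```

Neighbouring draws are strongly correlated, but the correlation dies out as the lag grows. Keeping only draws that are far enough apart is what makes the retained datasets approximately independent.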
The remaining parameter to be set is the total number of imputed datasets to use for analysis. For this we use a handy, user-written Stata (and SAS) command from von Hippel (2020) called “how_many_imputations”.
The problem with random data is that it produces different results each time. “how_many_imputations” allows us to set the number of imputations so that the variation in estimates will remain within some pre-determined threshold value. The default value is to set the number of imputations so that the coefficient of variation of the standard error of the “worst linear function” is equal to 5%.
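The logic behind such a threshold can be sketched with the quadratic rule from von Hippel (2020), under which the required number of imputations is roughly 1 + 0.5 × (FMI / CV)², where FMI is the fraction of missing information and CV is the target coefficient of variation of the standard error. The Python function below is a simplified illustration of that rule, not a reimplementation of the command, and the FMI values are hypothetical.

```python
import math

def imputations_needed(fmi: float, cv_se: float = 0.05) -> int:
    """Approximate number of imputations from the quadratic rule.

    fmi   : fraction of missing information (between 0 and 1)
    cv_se : target coefficient of variation of the standard error
    """
    raw = 1.0 + 0.5 * (fmi / cv_se) ** 2
    # Round away floating-point noise before taking the ceiling.
    return math.ceil(round(raw, 6))

# Hypothetical fractions of missing information.
for fmi in (0.2, 0.5, 0.9):
    print(fmi, imputations_needed(fmi))
```

The required number grows with the square of the fraction of missing information, which is why heavily incomplete data can call for well over a hundred imputations. The command itself estimates the fraction of missing information from a pilot set of imputations, so its recommendation can differ from this back-of-the-envelope figure.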
It works like this. First we create a small initial set of imputed datasets. The command below imputes 10 datasets (“add(10)”).
mi impute mvn le ts10 edu phealth thealth im = gdp gdpsq id2-id12 year2-year45, prior(jeffreys) burnin(500) add(10) rseed(123)
We then estimate a fixed effects regression for each of the 10 datasets. Note that we use the standard Stata command for “xtreg, fe” after “mi estimate:”
mi xtset id year
mi estimate: xtreg le ts10 gdp gdpsq edu phealth thealth year2-year45, fe vce(cluster id)
The command “how_many_imputations” then calculates the number of imputed datasets needed for the standard errors to have a coefficient of variation of 5%. In this particular case, the output is given by:
The output says to create 182 more imputed datasets.
We can feed this number directly into the “mi impute” command using the “add(`r(add_M)')” option:
mi impute mvn le ts10 edu phealth thealth im = gdp gdpsq id2-id12 year2-year45, prior(jeffreys) burnin(500) add(`r(add_M)')
After running the command above, our stored data now consists of 104,220 observations: The initial set of 540 observations plus 192 imputed datasets × 540 observations/dataset. To combine the individual estimates from each dataset to get an overall estimate, we use the following command:
mi xtset id year
mi estimate: xtreg le ts10 gdp gdpsq edu phealth thealth year2-year45, fe vce(cluster id)
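Behind the scenes, “mi estimate” pools the per-dataset results using Rubin's combination rules. A minimal Python sketch of that pooling for a single coefficient is given below. The function name and the three example estimates are made up for illustration and are not taken from the L&J data.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, std_errors):
    """Pool one coefficient across imputed datasets (hypothetical helper).

    Returns the pooled point estimate and its pooled standard error.
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need at least two estimates with matching standard errors")
    pooled_estimate = mean(estimates)
    within = mean(se ** 2 for se in std_errors)   # average sampling variance
    between = variance(estimates)                 # variance across imputations (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled_estimate, math.sqrt(total)

# Made-up estimates from three imputed datasets.
coef, se = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(coef, se)
```

The between-imputation term is what makes the pooled standard error larger than the average of the individual standard errors: it carries the extra uncertainty that comes from not observing the missing values.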
The results are reported in Column (3) of Table 2. As a further check, we re-ran the program using different seed numbers. The estimates show little variation across seeds, confirming the robustness of the results.
In calculating standard errors, Table 2 follows L&J’s original procedure of estimating standard errors clustered on countries. One might want to improve on this given that their sample only included 12 countries.
An alternative to the “mi estimate” command above is to use a user-written program that does wild cluster bootstrapping. One such package is “wcbregress”. While not all user-written programs can be accommodated by Stata’s “mi estimate”, one can use wcb by modifying the “mi estimate” command as follows:
mi estimate, cmdok: wcbregress le ts10 gdp gdpsq edu phealth thealth year2-year45, group(id)
A comparison of Columns (3) and (1) reveals what we have to show for all our work. Increasing the number of observations substantially reduced the sizes of the standard errors. The standard error of the focal variable, “Income share of the richest 10%”, decreased from 0.051 to 0.035.
While the estimated coefficient remained statistically insignificant for this variable, the smaller standard errors boosted two other variables into significance: “Real GDP per capita squared” and “Log public health spending pc”. Furthermore, the larger sample provides greater confidence that the estimated coefficients are representative of the population from which we are sampling.
Overall, the results provide further support for Leigh & Jencks’s (2007) claim that the relationship between inequality and mortality is small and statistically insignificant.
Given that ML and MI estimation procedures are now widely available in standard statistical packages, they should be part of the replicator’s standard toolkit for robustness checking of previously published research.
Weilun (Allen) Wu is a PhD student in economics at the University of Canterbury. This blog covers some of the material that he has researched for his thesis. Bob Reed is Professor of Economics and the Director of UCMeta at the University of Canterbury. They can be contacted at email@example.com and firstname.lastname@example.org, respectively.
Allison, P. D. (2001). Missing data. Sage Publications.
Enders, C. K. (2010). Applied missing data analysis. Guilford Press.
Leigh, A., & Jencks, C. (2007). Inequality and mortality: Long-run evidence from a panel of countries. Journal of Health Economics, 26(1), 1-24.
Horton, N. J., Lipsitz, S. R., & Parzen, M. (2003). A potential for bias when rounding in multiple imputation. The American Statistician, 57(4), 229-232.
Allison, P. (2006, August). Multiple imputation of categorical variables under the multivariate normal model. In Annual Meeting of the American Sociological Association, Montreal.
Von Hippel, P. T. (2020). How many imputations do you need? A two-stage calculation using a quadratic rule. Sociological Methods & Research, 49(3), 699-718.
[Excerpts are taken from the article “Does Losing Lead to Winning? An Empirical Analysis for Four Sports” by Bouke Klein Teeselink, Martijn J. van den Assem, and Dennie van Dolder, forthcoming in Management Science.]
“In an influential paper, Berger and Pope (2011, henceforth BP) argue that lagging behind halfway through a competition does not necessarily imply a lower likelihood of winning, and that being slightly behind can actually increase the chance of coming out on top.”
“To test this hypothesis, BP analyze more than sixty thousand professional and collegiate basketball matches. Their main analyses focus on the score difference at half-time because the relatively long break allows players to reflect on their position relative to their opponent.”
“BP find that National Basketball Association (NBA) teams that are slightly behind are between 5.8 and 8.0 percentage points more likely to win the match than those that are slightly ahead.”
“The present paper…extends the analysis of BP to large samples of Australian football, American football, and rugby matches, and then revisits the analysis of basketball.”
“In our main analyses, the running variable is the score difference at half-time and the cutoff value is zero. We estimate the following regression model:”
“where Yi is an indicator variable that takes the value of 1 if team i wins the match, and Xi is the half-time score difference between team i and the opposing team.”
“The treatment variable Ti takes the value of 1 if team i is behind at half-time. The coefficient τ represents the discontinuity in the winning probability at a zero score difference. This coefficient is positive under the hypothesis that being slightly behind improves performance. The interaction term Ti × Xi allows for different slopes above and below the cutoff.”
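The regression equation itself is not reproduced in this excerpt. Going by the variable definitions quoted above, the model would take a form like the one below. Only τ is named in the excerpt; the symbols α, β, γ, and ε_i are placeholder names assumed here.

```latex
Y_i = \alpha + \tau T_i + \beta X_i + \gamma \,(T_i \times X_i) + \varepsilon_i
```

Here α + τ is the predicted winning probability just behind the cutoff and α the predicted probability just ahead of it, so τ is the jump at a zero score difference.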
“If the assumption of a piecewise linear relationship between the winning probability and the half-time score difference is violated, then the regression model will generate a biased estimate of the treatment effect. Hahn et al. (2001) propose the use of local-linear regression to solve this problem.”
“Even if the true relationship is non-linear, a linear specification can provide a close approximation within a limited bandwidth around the cutoff. A downside of this solution is that it reduces the effective number of observations and therefore the precision of the estimate.”
“To strike the appropriate balance between bias and precision, we use the local-linear method proposed by Calonico et al. (2014). This method selects the bandwidth that minimizes the mean squared error, corrects the estimated treatment effect for any remaining non-linearities within the bandwidth, and linearly downweights observations that are farther away from the cutoff.”
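To make the idea concrete, here is a minimal Python sketch of a local-linear estimate of the discontinuity with triangular (linearly declining) weights. It is not the Calonico et al. (2014) procedure: the bandwidth is supplied by hand rather than chosen to minimize mean squared error, and no bias correction is applied. The function names, the bandwidth, the decision to drop tied observations, and the toy data are all assumptions made for illustration.

```python
def weighted_intercept(xs, ys, ws):
    """Intercept at x = 0 of a weighted least-squares line through (xs, ys)."""
    total_w = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total_w
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total_w
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x

def local_linear_jump(score_diff, win, bandwidth):
    """Jump in the winning probability for teams just behind versus just ahead.

    Fitting separate weighted lines on each side of the cutoff is equivalent
    to the interacted model with different slopes below and above zero.
    """
    behind, ahead = [], []
    for x, y in zip(score_diff, win):
        weight = 1 - abs(x) / bandwidth      # triangular kernel
        if weight <= 0 or x == 0:            # outside bandwidth, or tied (dropped here)
            continue
        (behind if x < 0 else ahead).append((x, y, weight))
    return weighted_intercept(*zip(*behind)) - weighted_intercept(*zip(*ahead))

# Toy, noise-free data with a built-in jump of 0.05 for teams that are behind.
xs = [x for x in range(-10, 11) if x != 0]
ys = [0.5 + 0.03 * x + ((0.05 - 0.01 * x) if x < 0 else 0.0) for x in xs]
tau_hat = local_linear_jump(xs, ys, bandwidth=6)
print(round(tau_hat, 6))
```

With real match data the outcome is a 0/1 win indicator, so the estimate is noisy and the choice of bandwidth matters: a narrower window reduces bias from curvature but leaves fewer matches to estimate from.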
“We find no supporting evidence for [BP’s result that marginally trailing improves the odds of winning in Australian football, American football, and rugby]: the estimated effects are sometimes positive and sometimes negative, and statistically always insignificant.”
“We then also revisit the phenomenon for basketball. We replicate the finding that half-time trailing improves the chances of winning in NBA matches from the period analyzed in BP, but consistently find null results for NBA matches from outside this period, for the sample of NCAA matches analyzed in BP, for more recent NCAA matches, and for WNBA matches.”
“Moreover, our high-powered meta-analyses across the different sports and competitions cannot reject the hypothesis of no effect of marginally trailing on winning, and the confidence intervals suggest that the true effect, if existent at all, is likely relatively small.”
“In our view, the performance-enhancing effect documented in BP is most likely a chance occurrence.”
To read the full article, click here.
[Excerpts are taken from the article “Investigating the replicability of preclinical cancer biology” by Errington et al., published in eLife.]
“Large-scale replication studies in the social and behavioral sciences provide evidence of replicability challenges (Camerer et al., 2016; Camerer et al., 2018; Ebersole et al., 2016; Ebersole et al., 2020; Klein et al., 2014; Klein et al., 2018; Open Science Collaboration, 2015).”
“In psychology, across 307 systematic replications and multisite replications, 64% reported statistically significant evidence in the same direction and effect sizes 68% as large as the original experiments (Nosek et al., 2021).”
“In the biomedical sciences, the ALS Therapy Development Institute observed no effectiveness of more than 100 potential drugs in a mouse model in which prior research reported effectiveness in slowing down disease, and eight of those compounds were tried and failed in clinical trials costing millions and involving thousands of participants (Perrin, 2014).”
“Of 12 replications of preclinical spinal cord injury research in the FORE-SCI program, only two clearly replicated the original findings – one under constrained conditions of the injury and the other much more weakly than the original (Steward et al., 2012).”
“And, in cancer biology and related fields, two drug companies (Bayer and Amgen) reported failures to replicate findings from promising studies that could have led to new therapies (Prinz et al., 2011; Begley and Ellis, 2012). Their success rates (25% for the Bayer report, and 11% for the Amgen report) provided disquieting initial evidence that preclinical research may be much less replicable than recognized.”
“In the Reproducibility Project: Cancer Biology, we sought to acquire evidence about the replicability of preclinical research in cancer biology by repeating selected experiments from 53 high-impact papers published in 2010, 2011, and 2012 (Errington et al., 2014). We describe in a companion paper (Errington et al., 2021b) the challenges we encountered while repeating these experiments. … These challenges meant that we only completed 50 of the 193 experiments (26%) we planned to repeat. The 50 experiments that we were able to complete included a total of 158 effects that could be compared with the same effects in the original paper.”
“There is no single method for assessing the success or failure of replication attempts (Mathur and VanderWeele, 2019; Open Science Collaboration, 2015; Valentine et al., 2011), so we used seven different methods to compare the effect reported in the original paper and the effect observed in the replication attempt…”
“…136 of the 158 effects (86%) reported in the original papers were positive effects – the original authors interpreted their data as showing that a relationship between variables existed or that an intervention had an impact on the biological system being studied. The other 22 (14%) were null effects – the original authors interpreted their data as not showing evidence for a meaningful relationship or impact of an intervention.”
“Furthermore, 117 of the effects reported in the original papers (74%) were supported by a numerical result (such as graphs of quantified data or statistical tests), and 41 (26%) were supported by a representative image or similar. For effects where the original paper reported a numerical result for a positive effect, it was possible to use all seven methods of comparison. However, for cases where the original paper relied on a representative image (without a numerical result) as evidence for a positive effect, or when the original paper reported a null effect, it was not possible to use all seven methods.”
Summarizing replications across five criteria
“To provide an overall picture, we combined the replication rates by five of these criteria, selecting criteria that could be meaningfully applied to both positive and null effects … The five criteria were: (i) direction and statistical significance (p < 0.05); (ii) original effect size in replication 95% confidence interval; (iii) replication effect size in original 95% confidence interval; (iv) replication effect size in original 95% prediction interval; (v) meta-analysis combining original and replication effect sizes is statistically significant (p < 0.05).”
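As a rough illustration of how the five criteria can be checked for a single original positive effect, consider the Python sketch below. It rests on several simplifying assumptions that are not taken from the article: effects are summarized by an estimate and a standard error, intervals use a normal approximation, the prediction interval combines the two sampling variances, and the meta-analysis is a fixed-effect inverse-variance average. The function name and the example numbers are hypothetical.

```python
import math

Z95 = 1.959963984540054  # 97.5th percentile of the standard normal distribution

def five_criteria(orig, se_orig, rep, se_rep):
    """Return a dict of pass/fail flags for one original positive effect."""
    def inside(value, centre, half_width):
        return abs(value - centre) <= half_width

    # Fixed-effect (inverse-variance) combination of the two estimates.
    weight_sum = 1 / se_orig ** 2 + 1 / se_rep ** 2
    pooled = (orig / se_orig ** 2 + rep / se_rep ** 2) / weight_sum
    pooled_se = math.sqrt(1 / weight_sum)

    return {
        "i_same_direction_and_significant": orig * rep > 0 and abs(rep) > Z95 * se_rep,
        "ii_original_in_replication_ci": inside(orig, rep, Z95 * se_rep),
        "iii_replication_in_original_ci": inside(rep, orig, Z95 * se_orig),
        "iv_replication_in_prediction_interval": inside(
            rep, orig, Z95 * math.sqrt(se_orig ** 2 + se_rep ** 2)
        ),
        "v_meta_analysis_significant": abs(pooled) > Z95 * pooled_se,
    }

# Made-up example: the replication effect is smaller than the original.
flags = five_criteria(orig=0.8, se_orig=0.2, rep=0.3, se_rep=0.2)
passed = sum(flags.values())
print(flags, passed)
```

In this made-up case the replication passes only criteria (iv) and (v), which shows how the same pair of results can count as a success or a failure depending on the yardstick.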
“For replications of original positive effects, 13 of 97 (13%) replications succeeded on all five criteria, 15 succeeded on four, 11 succeeded on three, 22 failed on three, 15 failed on four, and 21 (22%) failed on all five (… Figure 6).”
“For original null effects, 7 of 15 (47%) replications succeeded on all five criteria, 2 succeeded on four, 3 succeeded on three, 0 failed on three, 2 failed on four, and 1 (7%) failed on all five.”
“… Combining positive and null effects, 51 of 112 (46%) replications succeeded on more criteria than they failed, and 61 (54%) replications failed on more criteria than they succeeded.”
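The combined figures can be checked directly from the counts quoted above. In the short Python tally below, the counts come from the excerpts, while the dictionary layout and function name are just one convenient way to organize them; “failed on three” is read as succeeding on two of the five criteria.

```python
# Number of criteria met (out of five) -> number of replications.
positive_effects = {5: 13, 4: 15, 3: 11, 2: 22, 1: 15, 0: 21}
null_effects = {5: 7, 4: 2, 3: 3, 2: 0, 1: 2, 0: 1}

def majority_split(counts):
    """Return (succeeded on more criteria than failed, failed on more than succeeded)."""
    succeeded = sum(n for met, n in counts.items() if met >= 3)
    failed = sum(n for met, n in counts.items() if met <= 2)
    return succeeded, failed

pos_ok, pos_fail = majority_split(positive_effects)
null_ok, null_fail = majority_split(null_effects)
total = pos_ok + pos_fail + null_ok + null_fail
succeeded = pos_ok + null_ok
failed = pos_fail + null_fail

print(total, succeeded, failed)
print(round(100 * succeeded / total), round(100 * failed / total))
```

The tally also shows how different the two groups are: 39 of 97 positive effects succeeded on a majority of criteria, compared with 12 of 15 null effects.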
“We explored five candidate moderators of replication success and did not find strong evidence to indicate that any of them account for variation in replication rates we observed in our sample. The clearest indicator of replication success was that smaller effects were less likely to replicate than larger effects … Research into replicability in other disciplines has also found that findings with stronger initial evidence (such as larger effect sizes and/or smaller p-values) [are] more likely to replicate (Nosek et al., 2021; Open Science Collaboration, 2015).”
“The present study provides substantial evidence about the replicability of findings in a sample of high-impact papers published in the field of cancer biology in 2010, 2011, and 2012. The evidence suggests that replicability is lower than one might expect of the published literature. Causes of non-replicability could be due to factors in conducting and reporting the original research, conducting the replication experiments, or the complexity of the phenomena being studied. The present evidence cannot parse between these possibilities…”
“Stakeholders from across the research community have been raising concerns and generating evidence about dysfunctional incentives and research practices that could slow the pace of discovery. This paper is just one contribution to the community’s self-critical examination of its own practices.”
“Science pursuing and exposing its own flaws is just science being science. Science is trustworthy because it does not trust itself. Science earns that trustworthiness through publicly, transparently, and continuously seeking out and eradicating error in its own culture, methods, and findings. Increasing awareness and evidence of the deleterious effects of reward structures and research practices will spur one of science’s greatest strengths, self-correction.”
To read the full article, click here.
Begley CG, Ellis LM. 2012. Drug development: Raise standards for preclinical cancer research. Nature 483:531–533.
Camerer CF, Dreber A, Forsell E, Ho TH, Huber J, Johannesson M, Kirchler M, Almenberg J, Altmejd A, Chan T, Heikensten E, Holzmeister F, Imai T, Isaksson S, Nave G, Pfeiffer T, Razen M, Wu H. 2016. Evaluating replicability of laboratory experiments in economics. Science 351: 1433–1436.
Camerer CF, Dreber A, Holzmeister F, Ho T-H, Huber J, Johannesson M, Kirchler M, Nave G, Nosek BA, Pfeiffer T, Altmejd A, Buttrick N, Chan T, Chen Y, Forsell E, Gampa A, Heikensten E, Hummer L, Imai T, Isaksson S, et al. 2018. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour 2: 637–644.
Ebersole CR, Mathur MB, Baranski E, Bart-Plange D-J, Buttrick NR, Chartier CR, Corker KS, Corley M, Hartshorne JK, IJzerman H, Lazarević LB, Rabagliati H, Ropovik I, Aczel B, Aeschbach LF, Andrighetto L, Arnal JD, Arrow H, Babincak P, Bakos BE, et al. 2020. Many Labs 5: Testing Pre-Data-Collection Peer Review as an Intervention to Increase Replicability. Advances in Methods and Practices in Psychological Science 3:309–331.
Ebersole CR, Atherton OE, Belanger AL, Skulborstad HM, Allen JM, Banks JB, Baranski E, Bernstein MJ, Bonfiglio DBV, Boucher L, Brown ER, Budiman NI, Cairo AH, Capaldi CA, Chartier CR, Chung JM, Cicero DC, Coleman JA, Conway JG, Davis WE, et al. 2016. Many Labs 3: Evaluating participant pool quality across the academic semester via replication. Journal of Experimental Social Psychology 67: 68–82.
Errington TM, Iorns E, Gunn W, Tan FE, Lomax J, Nosek BA. 2014. An open investigation of the reproducibility of cancer biology research. eLife 3: e04333.
Errington TM, Denis A, Perfito N, Iorns E, Nosek BA. 2021b. Challenges for assessing replicability in preclinical cancer biology. eLife 10: e67995.
Klein RA, Ratliff KA, Vianello M, Adams RB, Bahník Š, Bernstein MJ, Bocian K, Brandt MJ, Brooks B, Brumbaugh CC, Cemalcilar Z, Chandler J, Cheong W, Davis WE, Devos T, Eisner M, Frankowska N, Furrow D, Galliani EM, Nosek BA. 2014. Investigating Variation in Replicability: A “Many Labs” Replication Project. Social Psychology 45: 142–152.
Klein RA, Vianello M, Hasselman F, Adams BG, Adams RB, Alper S, Aveyard M, Axt JR, Babalola MT, Bahník Š, Batra R, Berkics M, Bernstein MJ, Berry DR, Bialobrzeska O, Binan ED, Bocian K, Brandt MJ, Busching R, Rédei AC, et al. 2018. Many Labs 2: Investigating Variation in Replicability Across Samples and Settings. Advances in Methods and Practices in Psychological Science 1: 443–490.
Mathur MB, VanderWeele TJ. 2019. Challenges and suggestions for defining replication “success” when effects may be heterogeneous: Comment on Hedges and Schauer (2019). Psychological Methods 24: 571–575.
Nosek BA, Hardwicke TE, Moshontz H, Allard A, Corker KS, Dreber A, Fidler F, Hilgard J, Struhl MK, Nuijten MB, Rohrer JM, Romero F, Scheel AM, Scherer LD, Schönbrodt FD, Vazire S. 2021. Replicability, Robustness, and Reproducibility in Psychological Science. Annual Review of Psychology 73: 114157.
Open Science Collaboration. 2015. Estimating the reproducibility of psychological science. Science 349:aac4716.
Perrin S. 2014. Preclinical research: Make mouse studies work. Nature 507: 423–425.
Prinz F, Schlange T, Asadullah K. 2011. Believe it or not: how much can we rely on published data on potential drug targets? Nature Reviews Drug Discovery 10: 712.
Steward O, Popovich PG, Dietrich WD, Kleitman N. 2012. Replication and reproducibility in spinal cord injury research. Experimental Neurology 233: 597–605.
Valentine JC, Biglan A, Boruch RF, Castro FG, Collins LM, Flay BR, Kellam S, Mościcki EK, Schinke SP. 2011. Replication in prevention science. Prevention Science 12: 103–117.
The SCORE project is entering its final phase of conducting reproductions (repeating the original analysis with original data) and replications (testing the same claim with new data) on a stratified random sample of claims from papers across the social-behavioral sciences.
Here is your opportunity to contribute to these efforts for non-human subjects research (non-HSR) work, meaning projects that use existing data and do not require any additional IRB review steps. We’d love to have your collaboration!
This spreadsheet contains all of the non-HSR projects available, with different tabs corresponding to different project categories.
This announcement outlines the different project categories at a high level. More details can be found in this explanation of our terminology, as well as specific instructions tailored for replication projects (DARs) and reproduction projects.
There are six project categories, depending on two factors: the type of data used and the number of claims selected.
Data types: (i) Datasets provided by the author (PBR/ADR), (ii) Original observations reconstructed from the underlying data sources (SDR), and (iii) New observations that were not already analyzed in the original article (DAR)
Number of claims: (i) Just one claim per article (single-trace) and (ii) A minimum of five claims per article (bushel), unless fewer are available.
We have identified the most feasible projects in separate tabs. We highly encourage collaborators to review these projects first. Projects are highly feasible if we already have the data in hand, or if we’ve identified the data sources as relatively simple to obtain.
Finally, we have included a column in each tab labeled ‘high_incentive,’ coded as yes or no. If a project is coded as ‘yes,’ it means the payment for completing the project will be higher than other projects from the same category. The full set of payments can be found here.
Your next steps
Select one or more projects you anticipate being able to complete, by adding your name in the respective signup cell. Please consider whether you’ll be able to obtain the necessary materials before signing up.
After signing up but before submitting the commitment form, please confirm you can access all of the necessary materials to complete the project. If this requires author data that COS does not currently have but that you think could be made available, please do not reach out to the authors directly. Instead, please contact COS for assistance in getting in touch with the authors.
After you have obtained all of the materials necessary to begin your project, please complete the commitment form linked in the spreadsheet, after which someone from COS will provide you access to an OSF project and preregistration form.
Please keep the following privacy statement in mind as you complete these steps: Other teams are making predictions about the outcomes of many different studies, not knowing which studies have been selected for replication/reproduction. As a consequence, the success of this project requires full confidentiality of the research process. This includes privacy about which studies have been selected for replication and all aspects of the discussion about these replication designs.