Nature Human Behaviour: “Replications Do Not Fail”

[Excerpts taken from the editorial, “Replications do not fail” published by Nature Human Behaviour]
“For a very long time, replication studies had a lowly status in the hierarchy of publishable research. The equation of scientific advance with novelty and discovery relegated replication studies to second-class status in the scientific world. Over the past few years, this has fortunately been fast changing.”
“Since the launch of this journal, we have wanted to redefine what constitutes a significant scientific advance (https://www.nature.com/articles/s41562-016-0033). Science moves forward not only through discovery, but also through confirmation or disconfirmation of existing findings that have shaped a field. We placed replication studies on the same level as studies that report new findings and adopted the following principle: if the original study was highly influential, then a study that replicates its methodology is of equal value.”
“Still, replication studies have been hard to find in our pages—they constitute a very small proportion of the submissions we receive. Nonetheless, the current issue, by design, features four replication studies, all of high value in terms of their contribution to the scientific record.”
“This issue is a celebration of replication, and we hope that, by featuring these contributions together and writing this editorial, other scientists will also be inspired to devote their efforts in replication projects and more funders will be inclined to prioritize the funding of replication research. We are certainly looking forward to receiving and publishing many more replications in our pages.”
To read the full editorial, click here.

Pre-Analysis Plans in Economics: A Help or a Hindrance for Publication Success and Impact?

[Excerpts taken from the article “Do Pre-analysis Plans Hamper Publication?” by George Ofosu and Daniel Posner, published in the AER: Papers and Proceedings]
“Pre-analysis plans (PAPs) have been criticized… [on the grounds] that PAPs generate dull, lab-report-style papers that are disfavored by reviewers and journal editors and thus hampered in the publication process.”
“To the extent that scholars who register and adhere to PAPs are disadvantaged in publishing their papers, researchers may be disincentivized from preregistration. This risks undermining the benefits for research credibility that the broader adoption of PAPs is thought to offer.”
I. Publication Outcomes of NBER Working Papers with and without PAPs
“We analyze papers issued as NBER working papers between 2011 and 2018, the period corresponding with the rise of preregistration in the economics discipline.”
“During this time span, NBER issued 8,706 working papers, of which 973 (11 percent) were experimental and thus were plausible candidates for preregistration. Fifty-three percent of these experimental working papers were subsequently published in peer-reviewed journals, with 13 percent landing in top-five outlets.”
“To assess whether PAPs affect the likelihood of publication, we coded whether each of these papers mentioned a PAP. We then calculated the publication rates of papers with and without PAPs.”
“Papers reporting the results of studies that followed PAPs were 10 percentage points less likely to be published by December 2019 than papers that did not mention a PAP (44 percent versus 54 percent; p < 0.1). However, conditional on being published, papers with PAPs were 39 percentage points more likely to land in a top-five journal (61 percent versus 22 percent; p < 0.01).”
II. Do Studies with PAPs Generate More Citations?
“…we collected data from Google Scholar on the number of citations to the 82 experimental NBER working papers that mention a PAP and a sample of 100 of the 477 published and 100 of the 414 unpublished experimental NBER working papers that do not mention a PAP.”
“Controlling for the number of years since being issued as an NBER working paper, whether the paper was published, and whether it was published in a top-five outlet…we estimate that having a PAP is associated with 14 additional citations…This represents more than a 40 percent increase over the 32 citations achieved by the median NBER working paper in our sample.”
III. Conclusion
“In keeping with the concerns of some PAP critics, who worry that fidelity to a PAP will lead to an uninteresting, mechanical paper that will be disadvantaged in the review process, we find that papers with PAPs are in fact slightly less likely to be published. However, we also find that, conditional on being published, papers with PAPs are more likely to land in top-five journals and are more likely to be cited.”
To read the article, click here.

Predicting Reproducibility: Rise of the Machines

[Excerpts are taken from the article “Estimating the deep replicability of scientific findings using human and artificial intelligence” by Yang Yang, Wu Youyou, and Brian Uzzi, published in PNAS]
“…we trained an artificial intelligence model to estimate a paper’s replicability using ground truth data on studies that had passed or failed manual replication tests, and then tested the model’s generalizability on an extensive set of out-of-sample studies.”
“Three methods to estimate a study’s risk of passing or failing replication have been assessed: the statistics reported in the original study (e.g., P values and effect size, also known as “reviewer metrics”), prediction markets, and surveys.”
“Our model estimates differences in predictive power according to the information used to construct the model—the paper’s narrative, paper’s reported statistics, or both.”
“After calibrating the model with the training data, we test it on hundreds of out-of-sample, manually replicated studies. Table 1 presents the study’s data sources, which include 2 million abstracts from scientific papers…and 413 manually replicated studies from 80 journals.”
[Table 1 omitted: the study’s data sources]
“Our methodology involves three stages of development, and in stage 1 of this analysis, we obtained the manuscripts of all 96 [RPP] studies and stripped each paper of all nontext content…In stage 2, a neural-network-based method—word2vec (29) with standard settings—was used to quantitatively represent a paper’s narrative content by defining the quantitative relationship (co-occurrence) of each word with every other word in the corpus of words in the training set.”
“In stage 3, we predicted a study’s manually replicated outcome (pass or fail) from its paper-level vector using a simple ensemble model of bagging with random forests and bagging with logistic [regression]… This simple ensemble model generates predictions of a study’s likelihood of replicating [0.0, 1.0] using threefold repeated cross-validation. It is trained on a random draw of 67% of the papers to predict outcome for the remaining 33%, and the process is repeated 100 times with different random splits of training sets vs. test sets.”
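As a rough illustration of the kind of pipeline described above (a sketch under assumed settings, not the authors’ code), the snippet below builds paper-level vectors with gensim’s word2vec and scores them with a simple ensemble of bagged random forests and bagged logistic regression over repeated 67/33 train/test splits. The library choices, the averaging of word vectors into a paper vector, and the probability-averaging ensemble are all my assumptions.

```python
# Sketch only: not the authors' implementation.
import numpy as np
from gensim.models import Word2Vec
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def paper_vectors(tokenized_papers, dim=200):
    """Stage 2 (approximated): train word2vec on the corpus, then represent
    each paper by the average of its word vectors."""
    w2v = Word2Vec(sentences=tokenized_papers, vector_size=dim, min_count=1, seed=0)
    return np.array([np.mean([w2v.wv[w] for w in doc], axis=0) for doc in tokenized_papers])

def mean_test_accuracy(X, y, n_repeats=100):
    """Stage 3 (approximated): bagged random forest + bagged logistic regression,
    averaged over repeated random 67/33 train/test splits."""
    accuracies = []
    for rep in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=rep)
        rf = BaggingClassifier(RandomForestClassifier(random_state=rep), random_state=rep)
        lr = BaggingClassifier(LogisticRegression(max_iter=1000), random_state=rep)
        rf.fit(X_tr, y_tr)
        lr.fit(X_tr, y_tr)
        # Combine the two models by averaging predicted replication probabilities.
        prob = (rf.predict_proba(X_te)[:, 1] + lr.predict_proba(X_te)[:, 1]) / 2
        accuracies.append(np.mean((prob >= 0.5).astype(int) == y_te))
    return float(np.mean(accuracies))

# Usage (hypothetical): X = paper_vectors(tokenized_papers); y = 0/1 replication outcomes
# print(mean_test_accuracy(X, y))
```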
“A typical standard for evaluating accuracy, which is assessed relative to a threshold selected according to the evaluator’s interest, is the base rate of failure in the ground-truth data. At the base rate threshold we chose, 59 of 96 studies failed manual replication (61%). We found that the average accuracy of the machine learning narrative-only model was 0.69…In other words, on average, our model correctly predicted the true pass or fail outcome for 69% of the studies, a 13% increase over the performance of a dummy predictor with 0.61 accuracy…”
“A second standard approach to assessing predictive accuracy is top-k precision, which measures the number of actual failures among the k lowest-ranked studies based on a study’s average prediction. When k is equal to the true failure base rate, the machine learning model’s top-k precision is 0.74…”
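Top-k precision itself takes only a few lines to compute. Here is a minimal sketch with hypothetical variable names, where lower model scores mean a study is judged less likely to replicate.

```python
import numpy as np

def top_k_precision(replication_scores, failed, k):
    """Share of actual replication failures among the k studies the model
    ranks as least likely to replicate."""
    lowest_k = np.argsort(replication_scores)[:k]      # indices of the k lowest scores
    return float(np.mean(np.asarray(failed)[lowest_k]))

# Toy example (not data from the paper):
# top_k_precision([0.2, 0.9, 0.4, 0.1], [True, False, True, True], k=2)  -> 1.0
```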
“To compare the accuracy of the narrative-only model with conventionally used reviewer metrics, we designed a statistics-only model using the same procedure used to design the narrative-only model…The reviewer metrics-only model achieved an average accuracy and top-k precision of 0.66…and 0.72…, respectively.”
“To investigate whether combining the narrative-only and reviewer metrics-only models provides more explanatory power than either model alone, we trained a model on a paper’s narrative and reviewer metrics. The combined narrative and reviewer-metrics model achieved an average accuracy of 0.71…and top-k precision of 0.76…The combined model performed significantly better in terms of accuracy and top-k precision than either the narrative or reviewer metrics model alone…”
“We ran robustness tests of our machine learning model on five currently available out-of-sample datasets that report pass or fail outcomes…Fig. 3 summarizes the out-of-sample testing results for narrative (blue), reviewer metrics (green), and narrative-plus-reviewer metrics (orange) models.”
[Figure 3 omitted: out-of-sample test results for the narrative, reviewer-metrics, and combined models]
“Test set I…consists of eight similarly conducted published psychology replication datasets (n = 117). The machine learning model generated an out-of-sample accuracy and top-k precision of 0.69 and 0.76, respectively.”
“Test set II…consists of one set of 57 psychology replications done primarily by students as class projects, suggesting more noise in the “ground truth” data. Under these conditions of relatively high noise in the data, the machine learning model yielded an out-of-sample accuracy and top-k precision of 0.65 and 0.69, respectively.”
“Test sets III and IV are notable because they represent out-of-sample tests in the discipline of economics, a discipline that uses different jargon and studies different behavioral topics than does psychology—the discipline on which the model was trained…Test set III includes 18 economics experiments…Test set IV includes 122 economics studies compiled by the Economics Replication Wiki. The accuracy scores were 0.78 and 0.66, and the top-k precision scores were 0.71 and 0.73, respectively.”
“In another type of out-of-sample test, we compared the machine learning model with prediction markets…To construct our test, we collected the subsample of 100 papers from test sets I to IV that were included in prediction markets and ranked papers from least to most likely to replicate per the reported results of each prediction market and each associated survey. We then ranked the machine learning model’s predictions of the same papers from least to most likely to replicate.”
“In comparing prediction markets, survey, and our machine learning model, we operated under the assumption that the most important papers to correctly identify for manual replication tests are the papers predicted to be least and most likely to replicate.”
“With respect to the 10 most confident predictions of passing, the machine learning model predicts 90% of the studies correctly; the market or survey methods correctly classify 90% of the studies. Among the 10 most confident predictions of failing, the market or survey methods correctly classify 100% of the studies, and the machine learning model correctly classifies 90% of the studies.”
“Machine learning appears to have the potential to aid the science of replication. Used alone, it offers accurate predictions at levels similar to prediction markets or surveys.”
“Though the findings should be taken as preliminary given the necessarily limited datasets with ground-truth data, our out-of-sample tests offer initial results that machine learning produces consistent predictions across studies having diverse methods, topics, disciplines, journals, replication procedures, and periods.”
To read the article, click here.

Assess Business and Economics Claims. Help Science! Make Money!

The repliCATS project is part of a research program called SCORE, funded by DARPA, that eventually aims to build automated tools that can rapidly and reliably assign confidence scores to social science research claims.
Our method – the IDEA protocol – harnesses the power of structured group discussion in predicting likely replicability. We have built a custom web-based platform to do this.
What we ask you to do is to evaluate the replicability of a claim. That is, if one were to follow the methods of the original study with a high degree of similarity, do you think one would find similar results to the original study?
We are now in Round 6, which runs from 1 June to 30 June. Round 6 focuses on business and economics* claims.
Prizes? Yes there are prizes!
– US$500: for the person who completes the most business or economics claims**
– US$250: for the people who complete the 2nd and 3rd most business or economics claims**
Claims must be submitted by midnight, AEST.
Interested in participating or learning more? Click here.
* Claims are considered Business or Economics claims if they are from the journals specified in the table in the FAQs under “From which journals are the 3000 claims being chosen?”
** A claim counts as completed once a final assessment has been submitted.

 

How Much Power Does the Average Social Science Study Have? Less Than You Think

[Excerpts taken from the blog post, “No, average statistical power is not as high as you think: Tracing a statistical error as it spreads through the literature”, by Andrew Gelman, posted at Statistical Modeling, Causal Inference, and Social Science]
“I was reading this recently published article by Sakaluk et al. and came across a striking claim:”
“Despite recommendations that studies be conducted with 80% power for the expected effect size, recent reviews have found that the average social science study possesses only a 44% chance of detecting an existing medium-sized true effect (Szucs & Ioannidis, 2017).”
“I noticed this not because the claimed 44% was so low but because it was so high! I strongly doubt that the average social science study possesses a power of anything close to 44%. Why? Because 44% is close to 50%, and a study will have power of 50% if the true effect is 2 standard errors away from zero. I doubt that typical studies have such large effects.”
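Gelman’s back-of-the-envelope point is easy to verify numerically. The sketch below (my own, using scipy; the 1.8 standard error value is just an illustrative addition) computes the power of a two-sided α = 0.05 test as a function of how many standard errors the true effect sits from zero.

```python
from scipy.stats import norm

def power_two_sided(effect_in_se, alpha=0.05):
    """Power of a two-sided z-test when the true effect is `effect_in_se`
    standard errors away from zero."""
    z_crit = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    return (1 - norm.cdf(z_crit - effect_in_se)) + norm.cdf(-z_crit - effect_in_se)

print(round(power_two_sided(1.96), 2))  # ~0.50: effect two standard errors from zero
print(round(power_two_sided(1.80), 2))  # ~0.44: the average power claimed in the review
```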
“…if researchers come into a study with the seemingly humble expectation of 44% power, then they’ll expect that they’ll get “p less than 0.05” about half the time, and if they don’t they’ll think that something went wrong. Actually, though, the only way that researchers have been having such a high apparent success rate in the past is from forking paths. The expectation of 44% power has bad consequences.”
To read the full blog, click here.

When Peer Review is Too Slow – How About a Red Team?

[Excerpts taken from the article, “Pandemic researchers — recruit your own best critics” by Daniël Lakens, published in Nature]
“As researchers rush to find the best ways to quell the COVID-19 crisis, they want to get results out ultra-fast. Preprints — public but unvetted studies — are getting lots of attention…To keep up the speed of research and reduce sloppiness, scientists must find ways to build criticism into the process. … [Such criticism is a] scientific ideal, but it is rarely scientific practice.”
“An initial version of a recent preprint by researchers at Stanford University in California estimated that COVID-19’s fatality rate was 0.12–0.2% (E. Bendavid et al. Preprint at medRxiv http://doi.org/dskd; 2020). This low estimate was removed from a subsequent version, but it had already received widespread attention and news coverage. Many immediately pointed out flaws in how the sample was obtained and the statistics were calculated. Everyone would have benefited if the team had received this criticism before the data were collected and the results were shared.”
“It is time to adopt a ‘red team’ approach in science that integrates criticism into each step of the research process. A red team is a designated ‘devil’s advocate’ charged to find holes and errors in ongoing work and to challenge dominant assumptions, with the goal of improving project quality.”
“With research moving faster than ever, scientists should invest in reducing their own bias and allowing others to transparently evaluate how much pushback their ideas have been subjected to. A scientific claim is only as reliable as the most severe criticism it has been able to withstand.”
To read the article, click here.

Debunking Three Common Claims of Scientific Reform

[Excerpts are taken from the article, “The case for formal methodology in scientific reform” by Berna Devezer, Danielle Navarro, Joachim Vandekerckhove, and Erkan Buzbas, posted at bioRxiv]
“Methodologists have criticized empirical scientists for: (a) prematurely presenting unverified research results as facts; (b) overgeneralizing results to populations beyond the studied population; (c) misusing or abusing statistics; and (d) lack of rigor in the research endeavor that is exacerbated by incentives to publish fast, early, and often. Regrettably, the methodological reform literature is affected by similar practices.”
“In this paper we advocate for the necessity of statistically rigorous and scientifically nuanced arguments to make proper methodological claims in the reform literature. Toward this aim, we evaluate three examples of methodological claims that have been advanced and well-accepted (as implied by the large number of citations) in the reform literature:”
1. “Reproducibility is the cornerstone of, or a demarcation criterion for, science.”
2. “Using data more than once invalidates statistical inference.”
3. “Exploratory research uses “wonky” statistics.”
“Each of these claims suffers from some of the problems outlined earlier and as a result, has contributed to methodological half-truths (or untruths). We evaluate each claim using statistical theory against a broad philosophical and scientific background.”
Claim 1: Reproducibility is the cornerstone of, or a demarcation criterion for, science.
“A common assertion in the methodological reform literature is that reproducibility is a core scientific virtue and should be used as a standard to evaluate the value of research findings…This view implies that if we cannot reproduce findings…we are not practicing science.”
“The focus on reproducibility of empirical findings has been traced back to the influence of falsificationism and the hypothetico-deductive model of science. Philosophical critiques highlight limitations of this model. For example, there can be true results that are by definition not reproducible…science does—rather often, in fact—make claims about non-reproducible phenomena and deems such claims to be true in spite of the non-reproducibility.”
“We argue that even in scientific fields that possess the ability to reproduce their findings in principle, reproducibility cannot be reliably used as a demarcation criterion for science because it is not necessarily a good proxy for the discovery of true regularities. To illustrate this, consider the following two unconditional propositions: (1) reproducible results are true results and (2) non-reproducible results are false results.”
True results are not necessarily reproducible
“Our first proposition is that true results are not always reproducible…for finite sample studies involving uncertainty, the true reproducibility rate must necessarily be smaller than one for any result. This point seems trivial and intuitive. However, it also implies that if the uncertainty in the system is large, true results can have reproducibility close to 0.”
False results might be reproducible
“In well-cited articles in methodological reform literature, high reproducibility of a result is often interpreted as evidence that the result is true…The rationale is that if a result is independently reproduced many times, it must be a true result. This claim is not always true. To see this, it is sufficient to note that the true reproducibility rate of any result depends on the true model and the methods used to investigate the claim.”
“We consider model misspecification under a measurement error model in simple linear regression…The blue belt in Figure 2 shows that as measurement error variability grows with respect to sampling error variability, effects farther away from the true effect size become perfectly reproducible. At point F in Figure 2, the measurement error variability is ten times as large as the sampling error variability, and we have perfect reproducibility of a null effect when the true underlying effect size is in fact large.”
[Figure 2 omitted: reproducibility of effect size estimates under increasing measurement error]
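A short simulation illustrates the mechanism behind this point (my own sketch with assumed settings, not the authors’ Figure 2 code): heavy measurement error in the regressor attenuates the estimated slope toward zero, so most samples “find” a null effect even though the true effect is large, making the false null result highly reproducible.

```python
# Illustrative sketch only: measurement error makes a false null result reproducible.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
beta, n, n_replications = 1.0, 50, 2000
null_findings = 0
for _ in range(n_replications):
    x_true = rng.normal(0, 1, n)
    y = beta * x_true + rng.normal(0, 1, n)      # true model: large, real effect
    x_obs = x_true + rng.normal(0, 10, n)        # measurement error SD 10x the sampling error SD
    slope, intercept, r, p_value, se = stats.linregress(x_obs, y)
    null_findings += (p_value > 0.05)            # study concludes "no effect"

print(f"Share of replications concluding a null effect: {null_findings / n_replications:.2f}")
```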
Claim 2: Using data more than once invalidates statistical inference
“A well-known claim in the methodological reform literature regards the (in)validity of using data more than once, which is sometimes colloquially referred to as double-dipping or data peeking…This rationale has been used in reform literature to establish the necessity of preregistration for “confirmatory” statistical inference. In this section, we provide examples to show that it is incorrect to make these claims in overly general terms.”
“The key to validity is not how many times the data are used, but appropriate application of the correct conditioning…”
“Figure 4 provides an example of how conditioning can be used to ensure that nominal error rates are achieved. We aim to test whether the mean of Population 1 is greater than the mean of Population 2, where both populations are normally distributed with known variances. An appropriate test is an upper-tail two-sample z-test. For a desired level of test, we fix the critical value at z, and the test is performed without performing any prior analysis on the data. The sum of the dark green and dark red areas under the black curve is the nominal Type I error rate for this test.”
“Now, imagine that we perform some prior analysis on the data and use it only if it obeys an exogenous criterion: We do not perform our test unless “the mean of the sample from Population 1 is larger than the mean of the sample from Population 2.” This is an example of us deriving our alternative hypothesis from the data. The test can still be made valid, but proper conditioning is required.”
“If we do not condition on the information given within double quotes and we still use z as the critical value, we have inflated the observed Type I error rate by the sum of the light green and light red areas because the distribution of the test statistic is now given by the red curve. We can, however, adjust the critical value from z to z* such that the sum of the light and dark red areas is equal to the nominal Type I error rate, and the conditional test will be valid.”
[Figure 4 omitted: conditional vs. unconditional critical values and Type I error rates]
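The conditioning argument can be checked with a toy simulation (my own sketch, not the authors’ Figure 4 code; the sample size and α are arbitrary). Under H0, running the upper-tail test only when the sample means already point in the hypothesized direction roughly doubles the Type I error rate at the naive critical value z, while adjusting the critical value to z* restores the nominal rate.

```python
# Toy simulation of conditioning on a data-driven direction before testing.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, alpha, n_sims = 50, 0.05, 100_000
z_naive = norm.ppf(1 - alpha)       # critical value ignoring the peek
z_star = norm.ppf(1 - alpha / 2)    # adjusted critical value for the conditional test

rejections_naive = rejections_adjusted = tests_run = 0
for _ in range(n_sims):
    x1 = rng.normal(0, 1, n)        # H0 is true: both population means are 0
    x2 = rng.normal(0, 1, n)
    z = (x1.mean() - x2.mean()) / np.sqrt(2 / n)
    if x1.mean() > x2.mean():       # only test when the data suggest mu1 > mu2
        tests_run += 1
        rejections_naive += (z > z_naive)
        rejections_adjusted += (z > z_star)

print("Type I error, naive critical value:   ", rejections_naive / tests_run)     # ~0.10
print("Type I error, adjusted critical value:", rejections_adjusted / tests_run)  # ~0.05
```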
“We have shown that using data multiple times per se does not present a statistical problem. The problem arises if proper conditioning on prior information or decisions is skipped.”
Claim 3: Exploratory Research Uses “Wonky” Statistics
“A large body of reform literature advances the exploratory-confirmatory research dichotomy from an exclusively statistical perspective. Wagenmakers et al. (2012) argue that purely exploratory research is one that finds hypotheses in the data by post-hoc theorizing and using inferential statistics in a “wonky” manner where p-values and error rates lose their meaning: ‘In the grey area of exploration, data are tortured to some extent, and the corresponding statistics is somewhat wonky.’”
“Whichever method is selected for EDA [exploratory data analysis]…it needs to be implemented rigorously to maximize the probability of true discoveries while minimizing the probability of false discoveries…. [By] repeatedly misusing statistical methods, it is possible to generate an infinite number of patterns from the same data set, but most of them will be what Good (1983, p. 290) calls a kinkus—‘a pattern that has an extremely small prior probability of being potentially explicable, given the particular context’.”
“The above discussion should make two points clear, regarding Claim 3: First, exploratory research cannot be reduced to exploratory data analysis and thereby to the absence of a preregistered data analysis plan, and second, when exploratory data analysis is used for scientific exploration, it needs rigor. Describing exploratory research as though it were synonymous with or accepting of “wonky” procedures that misuse or abuse statistical inference not only undermines the importance of systematic exploration in the scientific process but also severely handicaps the process of discovery.”
Conclusion
“Simple fixes to complex scientific problems rarely exist. Simple fixes motivated by speculative arguments, lacking rigor and proper scientific support might appear to be legitimate and satisfactory in the short run, but may prove to be counter-productive in the long run.”
To read the article, click here.

Letter in Nature Announces New Reproduction and Replication Journal

[Excerpt taken from “New journal for reproduction and replication results”, correspondence by Etienne Roesch and Nicolas Rougier published in Nature]
“More incentive is needed to spur investigation into replication issues and null results (see Nature 578, 489–490; 2020). For example, experienced scientists could encourage junior researchers to allocate part of their time to verifying other researchers’ results, which would also provide them with essential insights into the scientific method. To support such ventures, we are launching ReScienceX, a free-to-publish and free-to-read, peer-reviewed journal that will be devoted to reproduction and replication experiments.”
To read the rest of the letter, click here.

Pre-Registration as a Severe Testing Device

[Excerpts are taken from the preprint, “The Value of Preregistration for Psychological Science: A Conceptual Analysis” by Daniël Lakens, posted at PsyArXiv Preprints]
What is Preregistration For?
“If the only goal of a researcher is to prevent bias, it suffices to verbally agree upon the planned analysis with collaborators as long as everyone will perfectly remember the agreed upon analysis. In the conceptual analysis presented here, researchers preregister to allow future readers of the preregistration (which might include the researchers themselves) to evaluate whether the research question was tested in a way that could have falsified the prediction.”
“Mayo (1996) carefully develops arguments for the role that prediction plays in science and arrives at an error statistical philosophy based on a severity requirement.”
Severe Tests
“A test is severe when it is highly capable of demonstrating a claim is false.”
“Figure 1A visualizes a null hypothesis test, where only one specific state of the world (namely an effect of exactly zero) will falsify our prediction. All other possible states of the world are in line with our prediction.”
“Figure 1B represents a one-sided null-hypothesis test, where differences larger than zero are predicted, and the prediction is falsified when the difference is either equal to zero, or smaller than zero. This prediction is slightly riskier than a two-sided test, in that there are more ways in which our prediction could be wrong, because 50% of all possible outcomes falsify the prediction, and 50% corroborate it.”
“Finally, Figure 1C visualizes a range prediction where only differences between 0.5 and 2.5 support the prediction. Since there are many more ways this prediction could be wrong, it is an even more severe test.”
“If we observe a difference of 1.5, with a 95% confidence interval from 1 to 2, all three predictions are confirmed with an alpha level of 0.05, but the prediction in Figure 1C has passed the most severe test since it was confirmed in a test that had a higher capacity of demonstrating the prediction is false. Note that the three tests differ in severity even when they are tested with the same Type 1 error rate.”
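One simple way to make the example concrete (my own sketch; treating “the confidence interval falls entirely inside the predicted range” as the support criterion for Figure 1C is an assumption):

```python
# Which of the three preregistered predictions does a difference of 1.5
# with a 95% CI of [1, 2] corroborate?
ci_low, ci_high = 1.0, 2.0

two_sided = not (ci_low <= 0 <= ci_high)             # Fig. 1A: some nonzero effect
one_sided = ci_low > 0                               # Fig. 1B: effect larger than zero
range_prediction = ci_low >= 0.5 and ci_high <= 2.5  # Fig. 1C: effect between 0.5 and 2.5

print(two_sided, one_sided, range_prediction)  # True True True: all pass, but the
# range prediction rules out the most outcomes, so it is the most severe test.
```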
“As far as I am aware, Mayo’s severity argument currently provides one of the few philosophies of science that allows for a coherent conceptual analysis of the value of preregistration.”
Examples of Practices that Reduce the Severity of Tests
“One example of such a practice is optional stopping, where researchers collect data, analyze their data, and continue the data collection only if the result is not statistically significant. In theory, a researcher who is willing to continue collecting data indefinitely will always observe a statistically significant result. By repeatedly looking at the data, the Type 1 error rate can inflate to 100%. In this extreme case the prediction can no longer be falsified, and the test has no severity.”
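The inflation Lakens describes is easy to reproduce in a toy simulation (my own sketch; the batch size, maximum sample size, and number of simulations are arbitrary): peeking after every batch and stopping at the first p < .05 pushes the false positive rate well above the nominal 5%.

```python
# Toy optional-stopping simulation under a true null effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, batch, max_n, alpha = 5000, 10, 200, 0.05

false_positives = 0
for _ in range(n_sims):
    data = np.empty(0)
    while data.size < max_n:
        data = np.append(data, rng.normal(0, 1, batch))   # H0 is true: mean is 0
        if stats.ttest_1samp(data, 0).pvalue < alpha:     # peek; stop if "significant"
            false_positives += 1
            break

print("Type I error rate with optional stopping:", false_positives / n_sims)  # well above 0.05
```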
“The severity of a test can also be compromised by selecting a hypothesis based on the observed results. In this practice, known as Hypothesizing After the Results are Known (HARKing, Kerr, 1998) researchers look at their data, and then select a prediction. This reversal of the typical hypothesis testing procedure makes the test incapable of demonstrating the claim was false.”
“As a final example…think about the scenario [where a] researcher makes multitudes of observations and selects out of all these tests only those that support their prediction. Choosing to selectively report tests from among many tests that were performed strongly reduces the capability of a test to demonstrate the claim was false.”
“…a preregistration document should give us all the information that allows future readers to evaluate the severity of the test. This includes the theoretical and empirical basis for predictions, the experimental design, the materials, and the analysis code. Having access to this information should allow readers to see whether any choices were made during the research process that reduced the severity of a test.”
“Researchers should also specify when they will conclude their prediction is not supported. As De Groot (1969) writes: ‘The author of a theory should himself state…what potential outcomes would, if actually found, lead him to regard his theory as disproven.’”
Preregistration Makes it Possible to Evaluate the Severity of a Test
“The severity of a test could in theory be unrelated to whether it is preregistered. However, in practice there will almost always be a correlation between the ability to transparently evaluate the severity of a test and preregistration, both because researchers can often selectively report results, use optional stopping, or come up with a plausible hypothesis after the results are known, and because theories rarely completely constrain the test of predictions.”
“As this conceptual analysis of preregistration makes clear, the practice of specifying the design, data collection, and planned analyses in advance is based on a philosophy of science that values tests of predictions and puts more trust in claims that have passed severe tests (Lakatos, 1978; Mayo, 2018; Meehl, 1990; Platt, 1964; Popper, 1959).”
To read the article, click here.

REED: EiR* – What’s Supporting that Fixed Effects Estimate?

[* EiR = Econometrics in Replications, a feature of TRN that highlights useful econometrics procedures for re-analysing existing research.]
NOTE: All the data and code necessary to produce the results in the tables below are available at Harvard’s Dataverse: click here.
Fixed effects estimators are often used when researchers are concerned about omitted variable bias from unobserved, time-invariant variables. Fixed effects estimates can be informative when there is substantial within-variation to support them. However, they can be misleading when there is not.
Stata has several commands that can help the researcher gauge the extent of within-variation. In this example, we use the “wagepan” dataset that is bundled with Jeffrey Wooldridge’s text, “Introductory Econometrics: A Modern Approach, 6e”. The dataset consists of annual observations of 545 workers over the years 1980-1987. It is described here.
In this example we use fixed effects to regress log(wage) on education, labor market experience, labor market experience squared, dummy variables for marital and union status, and annual time dummies.
The table below reports the fixed effects (within) estimate for the “married” variable. For the sake of comparison, it also reports the between estimate for “married”, calculated using the Mundlak version of the Random Effects Within-Between estimator (Bell, Fairbrother, and Jones, 2019).
[Table omitted: within and between estimates of the marriage premium]
The within-estimate of the marriage premium is smaller than the between-estimate. This is consistent with marital status being positively associated with unobserved, time-invariant productivity characteristics of the worker. However, we want to know how much variation there is in marital status for the workers in our sample. If it is just a few workers who are changing marital status over time, then our estimate may not be representative of the effect of marriage in the population.
Stata provides two commands that can be helpful in this regard. The command xttab reports, among other things, a measure of variable stability across time periods. In the table below, workers who ever reported being unmarried were unmarried for an average of 64.8% of the years in the sample.
Workers who ever reported being married were married for an average of 62.5% of the years in the sample. In this case, changes in marital status are somewhat common. Note that a time-invariant variable would have a “Within Percent” value of 100%.
[Table omitted: xttab output for marital status]
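For readers who work outside Stata, here is a rough pandas analogue of xttab’s “Within Percent” column (a sketch: it assumes a long-format DataFrame df with worker identifier nr and a 0/1 married indicator, and for a balanced panel like wagepan it should come close to the 64.8% and 62.5% figures above).

```python
import pandas as pd

def within_percent(df, id_col, var):
    """For each value of `var`: among workers who ever report that value, the
    average share of their observed years spent at that value (in percent)."""
    out = {}
    for value in sorted(df[var].unique()):
        share_by_worker = (df[var] == value).groupby(df[id_col]).mean()
        ever_at_value = share_by_worker[share_by_worker > 0]   # workers who ever report `value`
        out[value] = 100 * ever_at_value.mean()
    return pd.Series(out, name="within percent")

# Usage (hypothetical): within_percent(df, "nr", "married")
```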
Stata provides another command, xttrans, that gives detail about year-to-year variable transitions.
[Table omitted: xttrans output for marital status]
The rows represent the values in year t, and the columns represent the values in the following year. In this case, 86% of observations that were unmarried at time t were also unmarried at time t+1, while 14% changed status to “married” at time t+1.
Among other things, the xttrans command provides a reminder that the fixed effects estimate of the marriage premium includes the effect of transitioning from married to unmarried: 5% of observations that were married at time t were unmarried at time t+1. The implied assumption is that the effect of marriage on wages is symmetric, something that could be further explored in the data.
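A rough pandas analogue of xttrans is also straightforward (again a sketch against the same assumed DataFrame, not part of the posted Stata code): shift each worker’s marital status forward one year and cross-tabulate.

```python
import pandas as pd

df = df.sort_values(["nr", "year"])
next_status = df.groupby("nr")["married"].shift(-1)   # status in the following year, within worker

transition_pct = pd.crosstab(df["married"], next_status, normalize="index") * 100
print(transition_pct.round(1))  # rows: status at t; columns: status at t+1 (row percentages)
```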
While these analyses are useful, they are based on observations, not workers. If there is concern about sample selection biasing the fixed effects estimates (so that “movers” are different from “stayers”), it would be useful to know how many of the 545 workers had experienced a marital status change, since it is the changes that support the fixed effects estimate.
The following set of commands calculates the minimum and maximum values of the explanatory variables for each worker in the sample. It then creates a dummy variable with the prefix “change” that takes the value 1 whenever the max and min values differ. Finally, it collapses the dataset so that there is one observation per worker and takes averages of the change variables.
[Stata code omitted; available in the Dataverse materials linked above]
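For reference, here is a rough pandas analogue of these steps (a sketch only; the actual Stata code is in the Dataverse materials linked above, and vars_to_check is an illustrative subset of the regressors):

```python
import pandas as pd

vars_to_check = ["married", "union"]   # illustrative subset of the explanatory variables

per_worker = df.groupby("nr")[vars_to_check].agg(["min", "max"])
changed = pd.DataFrame({
    v: (per_worker[(v, "max")] != per_worker[(v, "min")]).astype(int)  # 1 if the variable ever changed
    for v in vars_to_check
})

print(changed.mean())  # share of workers whose value changed during the sample period
```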
The results below indicate that 56.9% of the workers changed their marital status during the sample period. Whether this is a sufficient number of “changers” to be representative of “changers” in the population is an open question. However, if the number were only 5 or 10% of workers, the argument for representativeness would be much weaker.
[Output omitted: share of workers whose explanatory variables changed over the sample period]
What does this have to do with replication? Oftentimes treatments in panel datasets are administered over time (say, microcredit loans), and fixed effects estimates are used to identify the causal effect of the treatment. Sample statistics, when they are reported, typically give only the percent of observations receiving treatment. Consider the two samples below.

[Table omitted: two illustrative treatment samples with identical treatment shares]

In both samples, 30% of the observations are treatment observations. Thus a table of sample statistics would show identical means for the treatment variable in the two samples.
However, in the first sample, 100% of the workers received treatment, and 75% of year-to-year transitions involved a change in treatment status. In the second sample, only 50% of the workers experienced treatment, and 25% of year-to-year transitions involved a change in treatment status.
These are the kinds of differences that the procedures described above can be used to identify.
Bob Reed is a professor of economics at the University of Canterbury in New Zealand. He is also co-organizer of the blogsite The Replication Network. He can be contacted at bob.reed@canterbury.ac.nz.