Old Boys Network: 1; Open Science: 0.

[Excerpts taken from the article “The Stewart Retractions: A Quantitative and Qualitative Analysis”, by Justin Pickett, published in Econ Journal Watch]
“This study analyzes the recent retraction of five articles from three sociology journals—Social Problems, Criminology, and Law & Society Review.”
“The only coauthor on all five retracted articles was Dr. Eric Stewart. He was the data holder and analyst for each article. I coauthored one of the retracted articles…”
“I organize my analysis of the quantitative and qualitative data into three sections: (1) what happened in the articles, (2) what happened among the coauthors, and (3) what happened at the journals. Everything—data, code, emails, text messages, Excel files, drafts, and university documents—needed to verify my claims is provided online.”
What happened in the five articles
— “Consistently incorrect means and standard deviations”
— “Non-uniform terminal-digit distributions”
— “Unverifiable surveys”
— “Identical statistics after changes in…everything else”
— “Inexplicable sample sizes and statistics”
— “Unreported, implausible county clusters”
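One of the checks listed above, the terminal-digit check, is simple enough to sketch. The snippet below is a generic illustration, not Pickett's code: it assumes that the last digits of reported statistics should be roughly uniform over 0 to 9 and compares the observed counts to that benchmark with a chi-square statistic. The function name and the example values are hypothetical.

```python
from collections import Counter

CHI2_CRIT_05_DF9 = 16.919  # 5% critical value of chi-square with 9 df


def terminal_digit_chi2(reported):
    """Chi-square statistic for uniformity of last digits (9 df).

    `reported` holds statistics as strings so trailing zeros are preserved.
    Entries that do not end in a digit are skipped.
    """
    digits = [s.strip()[-1] for s in reported if s.strip()[-1:].isdigit()]
    n = len(digits)
    if n == 0:
        raise ValueError("no values ending in a digit")
    counts = Counter(digits)
    expected = n / 10  # uniform benchmark: each digit equally likely
    stat = sum((counts[str(d)] - expected) ** 2 / expected for d in range(10))
    return stat, n


# Hypothetical inputs: one evenly spread set, one where every value ends in 5.
even = [f"0.{i:03d}" for i in range(100)]
heaped = ["0.125", "1.345", "2.015", "0.995"] * 10

for label, values in [("even", even), ("heaped", heaped)]:
    stat, n = terminal_digit_chi2(values)
    print(label, n, round(stat, 2), stat > CHI2_CRIT_05_DF9)
```

With small samples the chi-square approximation is rough, so a large statistic is better read as a prompt to ask for the data than as proof of anything.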
What happened among the coauthors
“The coauthors of the five retracted articles include two past editors of Criminology, the flagship journal of the American Society of Criminology (ASC), as well as three ASC Fellows and two ASC vice presidents.”
“Two coauthors, Brian Johnson and Eric Baumer, have written articles about the importance of research ethics (e.g., “What Scholars Should Know about ‘Self-Plagiarism’” (Lauritsen et al. 2019); “Salami-Slicing, Peek-a-Boo, and LPUS: Addressing the Problem of Piecemeal Publication” (Gartner et al. 2012)).”
“To my knowledge, none of the coauthors have spoken publicly about what happened in the retracted articles, except to insist in the retraction notices that the irregularities resulted from “coding mistakes” and “transcription errors” (Law & Society Review 2020; Criminology 2020a; b), and to defend the accuracy of the retracted findings (Law & Society Review 2020).”
“Scientific fraud occurs all too frequently—approximately 1 in 50 scientists admit to fabricating or falsifying data (Fanelli 2009)—and I believe it is the most likely explanation for the data irregularities in the five retracted articles…The retraction notices say honest error, not fraud, is the explanation. Fortunately, if that is true, Dr. Stewart could easily prove it: recreate the original sample (N = 1,184) that produces the findings in Johnson et al. (2011) and then publicly explain how he did it.”
“…many authors are reluctant to share data publicly, and sometimes there are legitimate privacy concerns or externally imposed restrictions. Sharing data with coauthors, however, should be uncontroversial and feasible.”
“Yet without institutional support, coauthors may feel uncomfortable requesting data. For example, once irregularities were identified in their articles, Dr. Stewart’s coauthors were reluctant to press him for the data, probably because of concerns related to friendship and loyalty.”
What happened at the journals
“None of the editors followed COPE’s guidelines when alerted to the irregularities in Dr. Stewart’s articles. One editor seemingly tried to coordinate a collective response of ignoring the allegations, even though she recognized their potential seriousness…At Criminology, what seems to have driven how the co-editors responded was sympathy for some of the authors and a low opinion of critics.”
“Connections between the co-editors and authors are likely to blame; Dr. Johnson, the lead author of one article, was a co-editor, and Dr. Stewart, the lead author of the other, was to become a co-editor.”
Conclusion and recommendations
“The Stewart scandal took place over five months, and it required considerable time and effort from the editors involved. The editors corresponded extensively with each other and with other parties…the Criminology co-editors wrote multiple public statements about the steps they were taking to address the problems. Why did they not simply ask Dr. Stewart for his data?”
To read the article, click here.
NOTE FROM TRN: The excerpts above do not do justice to the article. The article should be read!

No Respect for Replication Even on the Big Bang Theory

[Excerpts taken from the transcript to Series 2, Episode 15 of the Big Bang Theory – “The Maternal Capacitance”]
LEONARD’S MOTHER: Leonard, it’s one o’clock, weren’t you going to show me your laboratory at one o’clock?
LEONARD: There’s no hurry, Mother…
LEONARD’S MOTHER: But it’s one o’clock, you were going to show me your laboratory at one o’clock.
SHELDON: Her reasoning is unassailable. It is one o’clock.
LEONARD: Fine. Let’s go. I think you’ll find my work pretty interesting. I’m attempting to replicate the dark matter signal found in sodium iodide crystals by the Italians.
LEONARD’S MOTHER: So, no original research?
LEONARD’S MOTHER: Well, what’s the point of my seeing it? I could just read the paper the Italians wrote.
To read the full transcript, click here.

Is It a Replication? A Reproduction? A Robustness Check? You’re Asking the Wrong Question

[Excerpts taken from the article, “What is replication?” by Brian Nosek and Tim Errington, published in PloS Biology]
“Credibility of scientific claims is established with evidence for their replicability using new data. This is distinct from retesting a claim using the same analyses and same data (usually referred to as reproducibility or computational reproducibility) and using the same data with different analyses (usually referred to as robustness).”
“Prior commentators have drawn distinctions between types of replication such as “direct” versus “conceptual” replication and argue in favor of valuing one over the other…By contrast, we argue that distinctions between “direct” and “conceptual” are at least irrelevant and possibly counterproductive for understanding replication and its role in advancing knowledge.”
“We propose an alternative definition for replication that is more inclusive of all research and more relevant for the role of replication in advancing knowledge. Replication is a study for which any outcome would be considered diagnostic evidence about a claim from prior research. This definition reduces emphasis on operational characteristics of the study and increases emphasis on the interpretation of possible outcomes.”
“To be a replication, 2 things must be true: outcomes consistent with a prior claim would increase confidence in the claim, and outcomes inconsistent with a prior claim would decrease confidence in the claim.”
“Because replication is defined based on theoretical expectations, not everyone will agree that one study is a replication of another.”
“Because there is no exact replication, every replication test assesses generalizability to the new study’s unique conditions. However, every generalizability test is not a replication…there are many conditions in which the claim might be supported, but failures would not discredit the original claim.”
“This exposes an inevitable ambiguity in failures-to-replicate. Was the original evidence a false positive or the replication a false negative, or does the replication identify a boundary condition of the claim?”
“We can never know for certain that earlier evidence was a false positive. It is always possible that it was “real,” and we cannot identify or recreate the conditions necessary to replicate successfully…Accumulating failures-to-replicate could result in a much narrower but more precise set of circumstances in which evidence for the claim is replicable, or it may result in failure to ever establish conditions for replicability and relegate the claim to irrelevance.”
“The term “conceptual replication” has been applied to studies that use different methods to test the same question as a prior study. This is a useful research activity for advancing understanding, but many studies with this label are not replications by our definition.”
Recall that “to be a replication, 2 things must be true: outcomes consistent with a prior claim would increase confidence in the claim, and outcomes inconsistent with a prior claim would decrease confidence in the claim.”
“Many “conceptual replications” meet the first criterion and fail the second…“conceptual replications” are often generalizability tests. Failures are interpreted, at most, as identifying boundary conditions. A self-assessment of whether one is testing replicability or generalizability is answering—would an outcome inconsistent with prior findings cause me to lose confidence in the theoretical claims? If no, then it is a generalizability test.”
To read the article, click here.

Raise Your Hand If You’ve Messed Up

[Excerpts taken from the article “When We’re Wrong, It’s Our Responsibility as Scientists to Say So” by  Ariella Kristal et al., published in Scientific American.]
“What simple, costless interventions can we use to try to reduce tax fraud? As behavioral scientists, we tried to answer this question using what we already know from psychology: People want to see themselves as good.”
“…we thought that by reminding people of being truthful before reporting their income, they would be more honest. Building on this idea, in 2012, we came up with a seemingly costless simple intervention: Get people to sign a tax or insurance audit form before they reported critical information (versus after, the common business practice).”
“While our original set of studies found that this intervention worked in the lab and in one field experiment, we no longer believe that signing before versus after is a simple costless fix….Seven years and hundreds of citations and media mentions later, we want to update the record.”
“Based on research we recently conducted—with a larger number of people—we found abundant evidence that signing a veracity statement at the beginning of a form does not increase honesty compared to signing at the end.”
“Why are we updating the record? In an attempt to replicate and extend our original findings, three people on our team (Kristal, Whillans and Bazerman) found no evidence for the observed effects across five studies with 4,559 participants.”
“We brought the original team together and reran an identical lab experiment from the original paper (Experiment 1). The only thing we changed was the sample size: we had 20 times more participants per condition. And we found no difference in the amount of cheating between signing at the top of the form and signing at the bottom.”
“This matters because governments worldwide have spent considerable money and time trying to put this intervention into practice with limited success.”
“We also hope that this collaboration serves as a positive example, whereby upon learning that something they had been promoting for nearly a decade may not be true, the original authors confronted the issue directly by running new and more rigorous studies, and the original journal was open to publishing a new peer-reviewed article documenting the correction.”
“We believe that incentives need to continue to change in research, such that researchers are able to publish what they find and that the rigor and usefulness of their results, not their sensationalism, is what is rewarded.”
To read the article, click here.

Announcing a New Replication Column at Data Colada (Albeit a Little Late)

[From the blog “[81] Data Replicada” by Joe Simmons and Leif Nelson, posted in December at Data Colada]
“With more than mild trepidation, we are introducing a new column called Data Replicada. In this column, we will report the results of exact (or close) preregistered replications of recently published findings.”
“…why does the world need Data Replicada? Allow us to justify our existence.”
“Even though it is much easier than ever to publish exact replications, it is still extremely hard to publish exact replications. We know for a fact that many high-quality replications remain unreported. You can’t publish failures to replicate unless you convince a review team both that the failure is robust and that it is really important. And you can’t publish a successful replication, well, ever…”
“Many published replications focus on studies that were conducted a long time ago, or at least before the field became aware of the perils of p-hacking and the importance of better research practices. We will focus on trying to replicate recently published findings, so as to get a sense for whether non-obvious research published after the “renaissance” is in fact reliable.”
“…replications have not caught on in our subfield of behavioral marketing. So at least at the outset, we intend to focus our efforts on trying to replicate studies recently published in the two top behavioral marketing journals: the Journal of Consumer Research and the Journal of Marketing Research.”
“Conducting replications is a great way to learn about the messy business of scientific discovery, and we hope to communicate what we learn to you.”
“We are hoping to make Data Replicada a regular, monthly feature of Data Colada. But since it will be effortful and expensive, we only feel comfortable committing to doing it for six months or so, at which point we will re-assess.”
To read the blog, click here.

REED: EiR* — Replications and DAGs

[* EiR = Econometrics in Replications, a feature of TRN that highlights useful econometrics procedures for re-analysing existing research.]
In recent years, DAGs (Directed Acyclic Graphs) have received increased attention in the medical and social sciences as a tool for determining whether causal effects can be estimated. A brief introduction can be found here. While DAGs are commonly used to guide model specification, they can also be used in the post-publication assessment of studies.
Despite widespread recognition of the dangers of drawing causal inferences from observational studies, and with general, nominal acknowledgement that “correlation does not imply causation”, it is still standard practice for researchers to discuss estimated relationships from observational studies as if they represent causal effects.
In this blog, we show how one can apply DAGs to previously published studies to assess whether implied claims of causal effects are justified. For our example, we use the Mincer earnings regression, which has appeared in hundreds, if not thousands, of economic studies. The associated wage equation relates individuals’ observed wages to a number of personal characteristics:
 ln(wage) = b0 + b1 Educ + b2 Exp + b3 Black + b4 Female + error,
where ln(wage) is the natural log of wages, Educ is a measure of years of formal education, Exp is a measure of years of labor market experience, and Black and Female are dummy variables indicating an individual’s race (black) and sex.
The parameters b1 and b2 are commonly interpreted as the rate of return to education and labor market experience, respectively. The coefficients on Black and Female are commonly interpreted as measuring labor market discrimination against blacks and women.
Suppose one came across an estimated Mincer wage regression like the one above in a published study. Suppose further that the author of that study attached causal interpretations to the respective estimated parameters. One could use DAGs to determine whether those interpretations were justified.
To do that, one would first hypothesize a DAG that summarized all the common cause relationships between the variables. By way of illustration, consider the DAG in the figure below, where U is an unobserved confounder.1
[Figure: hypothesized DAG for the Mincer wage regression, with U as an unobserved confounder (TRN, 20200310)]
In this DAG, Educ affects Wage through a direct channel, Educ -> Wage, and an indirect channel, Educ -> Exp -> Wage. The Mincerian regression specification captures the first of these channels. However, it omits the second because the inclusion of Exp in the specification blocks the indirect channel. Assuming both channels carry positive associations, the estimated rate of return to education in the Mincerian wage regression will be downwardly biased.
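The blocking argument can be checked with a small simulation. The sketch below is an illustration under stated assumptions, not an analysis of real wage data: it keeps only Educ, Exp and ln(wage), drops Black, Female and U, and uses made-up coefficients (0.08 for the direct channel, 0.5 for Educ -> Exp, 0.04 for Exp -> Wage), so the total effect of education is 0.08 + 0.5 × 0.04 = 0.10. The helper `ols` is a hypothetical pure-Python routine written for this example.

```python
import random


def ols(y, columns):
    """OLS coefficients (intercept first) from the normal equations."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for p in range(k):
        pivot = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[pivot] = A[pivot], A[p]
        for r in range(k):
            if r != p:
                f = A[r][p] / A[p][p]
                A[r] = [a - f * b for a, b in zip(A[r], A[p])]
    return [A[r][k] / A[r][r] for r in range(k)]


rng = random.Random(0)
n = 5000
educ = [rng.gauss(12, 2) for _ in range(n)]
# Assumed indirect channel: Educ -> Exp (positive, as in the text's assumption).
exper = [0.5 * e + rng.gauss(0, 1) for e in educ]
# Assumed wage equation: direct effect 0.08, effect of experience 0.04.
lnwage = [1.0 + 0.08 * e + 0.04 * x + rng.gauss(0, 0.3)
          for e, x in zip(educ, exper)]

total = ols(lnwage, [educ])[1]           # Exp omitted: both channels open
direct = ols(lnwage, [educ, exper])[1]   # Exp included: indirect channel blocked

print("Educ coefficient without Exp:", round(total, 3))
print("Educ coefficient with Exp:   ", round(direct, 3))
```

In this setup the coefficient on Educ falls once Exp enters the regression, which is the downward bias described above when the quantity of interest is the total return to education.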
We can use the same DAG to assess the other estimated parameters. Consider the estimated rate of return on labor market experience. The DAG identifies both a direct causal path (Exp -> Wage) and a number of non-causal paths. Exp <- Female -> Wage is one non-causal path, as is Exp <- Educ -> Wage. Including the variables Educ and Female in the regression equation blocks these non-causal paths. As a result, the specification solely estimates the direct causal effect, and thus provides an unbiased estimate of the rate of return to labor market experience.
In a similar fashion, one can show that given the DAG above, one cannot interpret the estimated values of b3 and b4 as estimates of the causal effects of racial and sex discrimination in the labor market.
DAGs also have the benefit of suggesting tests that allow one to assess the validity of a given DAG. In particular, the DAG above implies the following independences:2
1) Educ ⊥ Female
2) Exp ⊥ Black | Educ
3) Female ⊥ Black
Rejection of one or more of these would indicate that the DAG is not supported by the data.
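A rough way to run such checks is with (partial) correlations and a Fisher z-statistic. The sketch below is one possible approach, not the procedure built into any particular tool. It assumes roughly linear relationships, which is only an approximation for dummy variables like Black and Female, and it uses simulated data constructed so that the three independences hold. All coefficients and variable names are hypothetical.

```python
import math
import random


def corr(x, y):
    """Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after linearly conditioning on z."""
    rxy, rxz, ryz = corr(x, y), corr(x, z), corr(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def fisher_z(r, n, n_cond=0):
    """Approximate z-statistic for H0: (partial) correlation is zero."""
    return math.atanh(r) * math.sqrt(n - n_cond - 3)


rng = random.Random(1)
n = 5000
black = [1.0 if rng.random() < 0.3 else 0.0 for _ in range(n)]
female = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(n)]
# Hypothetical structure: Black affects Exp only through Educ.
educ = [12 - 1.0 * b + rng.gauss(0, 2) for b in black]
exper = [0.5 * e - 2.0 * f + rng.gauss(0, 1) for e, f in zip(educ, female)]

z1 = fisher_z(corr(educ, female), n)                        # Educ ⊥ Female
z2 = fisher_z(partial_corr(exper, black, educ), n, 1)       # Exp ⊥ Black | Educ
z3 = fisher_z(corr(female, black), n)                       # Female ⊥ Black
z_marginal = fisher_z(corr(exper, black), n)                # no conditioning

for name, z in [("1)", z1), ("2)", z2), ("3)", z3), ("Exp, Black", z_marginal)]:
    print(name, round(z, 2))
```

Values of |z| well above 2 would count against an implied independence. The last line shows why the conditioning in 2) matters: without Educ in the conditioning set, Exp and Black are correlated in this simulated data even though the DAG-style structure holds.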
In practice, there are likely to be many possible DAGs for a given estimated equation. If a replicating researcher can obtain the data and code for an original study, he/she could then posit a variety of DAGs that seemed appropriate given current knowledge about the subject.
For each DAG, one could determine whether the conditions exist such that the estimated specification allows for a causal interpretation of the key parameters. If so, one could then use the model implications to assess whether the DAG was “reasonable”, as evidenced by non-conflicting data.
If no DAGs can be found that support a causal interpretation, or if adequacy tests cause one to eliminate all such DAGs, one could then request that the original author provide a DAG that would support their causal interpretations. In this fashion, existing studies could be assessed to determine if there is an evidentiary basis for causal interpretation of the estimated effects.
1 This DAG is taken from Felix Elwert’s course, Directed Acyclic Graphs for Causal Inference, taught through Statistical Horizons.
2 A useful, free online tool for drawing and assessing DAGs is DAGitty, which can be found here.
Bob Reed is a professor of economics at the University of Canterbury in New Zealand. He is also co-organizer of the blogsite The Replication Network. He can be contacted at bob.reed@canterbury.ac.nz.

Suppose Researchers Were Offered Money for Replications and Null Results. Suppose Few Took Up the Offer.

[Excerpts taken from the article “In praise of replication studies and null results”, an editorial published in Nature]
“The Berlin Institute of Health last year launched an initiative with the words, ‘Publish your NULL results — Fight the negative publication bias! Publish your Replication Study — Fight the replication crises!’”
“The institute is offering its researchers €1,000 (US$1,085) for publishing either the results of replication studies — which repeat an experiment — or a null result, in which the outcome is different from that expected.”
“…Twitter, it seems, took more notice than the thousands of eligible scientists at the translational-research institute. The offer to pay for such studies has so far attracted only 22 applicants — all of whom received the award.”
“Replication studies are important….But publishing this work is not always a priority for researchers, funders or editors — something that must change.”
“Aside from offering cash upfront, the Berlin Institute of Health has an app and advisers to help researchers to work out which journals, preprint servers and other outlets they should be contacting to publish replication studies and data.” 
“…more journals need to emphasize to the research community the benefits of publishing replications and null results.” 
“At Nature, replication studies are held to the same high standards as all our published papers. We welcome the submission of studies that provide insights into previously published results; those that can move a field forwards and those that might provide evidence of a transformative advance.”
“Not all null results and replications are equally important or informative, but, as a whole, they are undervalued. If researchers assume that replications or null results will be dismissed, then it is our role as journals to show that this is not the case. At the same time, more institutions and funders must step up and support replications — for example, by explicitly making them part of evaluation criteria.”
“We can all do more. Change cannot come soon enough.”
To read the article, click here.

Center for Open Science Funding Research on the DARPA SCORE Dataset

[Excerpts taken from the RFP “Data Enhancement of the DARPA SCORE Claims Dataset” posted at the Center for Open Science website]
“The DARPA SCORE Dataset contains claims from about 3,000 empirical papers published between 2009 and 2018 in approximately 60 journals in the social and behavioral sciences.”
“We seek proposals from research teams to enhance the Dataset with information about the papers that may be relevant to assessing the credibility of the coded claims. Such enhancements could include information such as:”
– “Extraction of statistical variables or reporting errors in the original papers.”
– “Identification of the public availability of data, materials, code, or preregistrations associated with the studies reported in the papers.”
– “Citations or other altmetrics associated with the papers.”
– “Identification of replications or meta-analytic results of findings associated with the paper or claims.”
– “Indicators of credibility, productivity, or other features of the authors of the papers.”
– “Extraction of design features, reporting styles, quality indicators, or language use from the original papers.”
“Possible data enhancements for the Database are not limited to these examples.”
“The key criteria for proposed data enhancements are:”
– “Relevance of the proposed variable additions to assessment of credibility of the papers and claims.”
– “Potential applicability of credibility/validity variable additions to other/broader targets or levels of abstraction besides claims or papers. For example, averaging or aggregation of claim or paper-level scores to generate “average”/expected credibility of authors, journals, or sub-disciplines.”
– “The proportion of papers that are likely to benefit from the data enhancement (i.e., minimization of missing data).”
– “The extent to which the data enhancement is automated.”
– “Evidence that extraction or identification of new data is valid, reliable, and easy to integrate with the Dataset.”
– “Description of data quality control practices that will be employed in proposed data-generation activity (e.g., interrater agreement tests, data sample auditing/revision processes).”
– “Cost for conducting the work.”
– “Completing the proposed work by June 30, 2020.”
“A total of $100,000 is available for data enhancement awards and we expect to make 4 to 15 total awards. Proposals should be no more than 2 pages and address the selection criteria above.”
To learn more, click here.

Which Journals in Your Discipline Score Best on the TOP Criteria? There’s an App for That

[Excerpts taken from the article “New Measure Rates Quality of Research Journals’ Policies to Promote Transparency and Reproducibility”, published by the Center for Open Science at their website.]
“Today, the Center for Open Science launches TOP Factor, an alternative to journal impact factor (JIF) to evaluate qualities of journals. TOP Factor assesses journal policies for the degree to which they promote core scholarly norms of transparency and reproducibility.”
“TOP Factor is based primarily on the Transparency and Openness Promotion (TOP) Guidelines, a framework of eight standards that summarize behaviors that can improve transparency and reproducibility of research such as transparency of data, materials, code, and research design, preregistration, and replication.”
“Journals can adopt policies for each of the eight standards that have increasing levels of stringency. For example, for the data transparency standard, a score of 0 indicates that the journal policy fails to meet the standard, 1 indicates that the policy requires that authors disclose whether data are publicly accessible, 2 indicates that the policy requires authors to make data publicly accessible unless it qualifies for an exception (e.g., sensitive health data, proprietary data), and 3 indicates that the policy includes both a requirement and a verification process for the data’s correspondence with the findings reported in the paper.”
“TOP Factor also includes indicators of whether journals offer Registered Reports, a publishing model that reduces publication bias of ignoring negative and null results, and badging to acknowledge open research practices to facilitate visibility of open behaviors.”
“At the TOP Factor website, users can filter TOP Factor scores by discipline, publisher, or by subsets of the standards to see how journal policies compare.”
“‘Disciplines are evolving in different ways toward improving rigor and transparency,’ noted Brian Nosek, Executive Director of the Center for Open Science. ‘TOP Factor makes that diversity visible and comparable across research communities. For example, economics journals are at the leading edge of requiring transparency of data and code whereas psychology journals are among the most assertive for promoting preregistration.’”
“So far, over 250 journal policies have been evaluated and are presented on the TOP Factor website. …Journals will be added continuously over time.”
“Editors and community members can complete a journal evaluation form on the TOP Factor website to accelerate the process. Center for Open Science staff review those submissions and confirm with the journal’s publicly posted policies before posting the scores to the TOP Factor website.”
To read the article and access TOP Factor, click here.

Have Registered Reports Uncovered Massive Publication Bias? Evidence from Psychology

[Excerpts taken from the preprint, “An excess of positive results: Comparing the standard Psychology literature with Registered Reports” by Anne Scheel, Mitchell Schijen, and Daniël Lakens, posted at PsyArXiv]
“Registered Reports (RRs) are a new publication format…Before collecting data, authors submit a study protocol containing their hypotheses, planned procedures, and analysis pipeline…to a journal. The protocol undergoes peer review, and, if successful, receives ‘in-principle acceptance’, meaning that the journal commits to publishing the final article following data collection, regardless of the statistical significance of the results.”
“The authors then collect and analyse the data and complete the final report. The final report undergoes another round of peer review, but this time only to ensure that the authors adhered to the registered plan and did not draw unjustified conclusions…”
“Registered Reports thus combine an antidote to QRPs (preregistration) with an antidote to publication bias, because studies are selected for publication before their results are known.”
“The goal of our study was to test if Registered Reports in Psychology show a lower positive result rate than articles published in the traditional way (henceforth referred to as ‘standard reports’, SRs), and to estimate the size of this potential difference.”
“For standard reports we downloaded a current version of the Essential Science Indicators (ESI) database…and used Web of Science to search for articles published between 2013 and 2018 with a Boolean search query containing the phrase ‘test* the hypothes*’ and the ISSNs of all 633 journals listed in the ESI Psychiatry/Psychology category. Using the same sample size as Fanelli (2010), we randomly selected 150 papers…”
“For Registered Reports we aimed to include all published Registered Reports in the field of Psychology that tested at least one hypothesis, regardless of whether or not they used the phrase ‘test* the hypothes*’. We downloaded a database of published Registered Reports curated by the Center for Open Science…and excluded papers published in journals that were listed in categories other than ‘Psychiatry/Psychology’ or ‘Multidisciplinary’ in the ESI.”
“Of the 151 entries in the COS Registered Reports database, 55 were excluded because they belonged to a non-Psychology discipline, 12 because we could not verify that they were Registered Reports, and 13 because they did not test hypotheses or contained insufficient information, leaving 71 Registered Reports for the final analysis.”
“146 out of 152 standard reports and 31 out of 71 Registered Reports had positive results…see Fig. 2…this difference…was statistically significant…p < .001.”
“We thus accept our hypothesis that the positive result rate in Registered Reports is lower than in standard reports.”
“To explain the 52.39% gap between standard reports and Registered Reports, we must assume some combination of differences in bias, statistical power, or the proportion of true hypotheses researchers choose to examine.”
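The quoted counts are enough to recompute the two positive result rates and the gap between them. The sketch below does that and adds a pooled two-proportion z-test as a rough plausibility check. The z-test is a generic textbook test chosen for illustration and is not necessarily the test the authors ran.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Rates, pooled z-statistic and two-sided normal p-value."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return p1, p2, z, p_two_sided


# Counts taken from the excerpt: 146 of 152 SRs, 31 of 71 RRs.
sr_rate, rr_rate, z, p = two_proportion_z(146, 152, 31, 71)
gap_points = 100 * (sr_rate - rr_rate)

print(f"standard reports:   {100 * sr_rate:.2f}%")
print(f"Registered Reports: {100 * rr_rate:.2f}%")
print(f"gap: {gap_points:.2f} percentage points, z = {z:.2f}, p = {p:.2g}")
```

The recomputed gap matches the 52.39 figure in the excerpt, read as a difference in percentage points.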
“Figure 3 visualises the combinations of statistical power and proportion of true hypotheses that would produce the observed positive result rates if the literature were completely unbiased.”
“For example, assuming no publication bias and no QRPs, even if all hypotheses authors of standard reports tested were true, their study designs would need to have more than 90% power for the true effect size. This is highly unlikely, meaning that the standard literature is unlikely to reflect reality.”
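The logic of this argument can be made concrete with a simplified model. It is assumed here for illustration and may differ in detail from the one behind the preprint's Figure 3: with no bias, the expected positive result rate is (share of true hypotheses × power) + (share of false hypotheses × alpha). Solving for power shows what study designs would be needed to reach the observed rate. The function names and the 5% alpha are assumptions.

```python
ALPHA = 0.05  # assumed significance threshold


def positive_rate(power, share_true, alpha=ALPHA):
    """Expected share of positive results in an unbiased literature."""
    return share_true * power + (1 - share_true) * alpha


def required_power(rate, share_true, alpha=ALPHA):
    """Power needed to reach `rate`; values above 1 are infeasible."""
    return (rate - (1 - share_true) * alpha) / share_true


observed_sr = 146 / 152  # positive result rate of standard reports

for share_true in (1.0, 0.9, 0.5):
    needed = required_power(observed_sr, share_true)
    print(f"share of true hypotheses {share_true:.1f}: required power {needed:.3f}")
```

Under these assumptions the required power is already above 0.9 when every tested hypothesis is true, and it exceeds 1 (so it cannot be attained) once even a tenth of the hypotheses are false.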
“It is a-priori plausible that Registered Reports are currently used for a population of hypotheses that are less likely to be true: For example, authors may use the format strategically for studies they expect to yield negative results (which would be difficult to publish otherwise).”
“However, assuming over 90% true hypotheses in the standard literature is neither realistic, nor would it be desirable for a science that wants to advance knowledge beyond trivial facts. We thus believe that this factor alone is not sufficient to explain the gap between the positive result rates in Registered Reports and standard reports. Rather, the numbers strongly suggest a reduction of publication bias and/or Type-1 error inflation in the Registered Reports literature.”
To read the article, click here.