[From the abstract of a new working paper by DANIEL SIMONS, YUICHI SHODA, and D. STEPHEN LINDSAY entitled “Constraints on Generality (COG): A Proposed Addition to All Empirical Papers”]
“A cumulative science depends on accurately characterizing the generality of findings, but current publishing standards do not require authors to constrain their inferences, leaving readers to assume the broadest possible generalizations. We propose that the discussion section of all primary research articles specify Constraints on Generality (a “COG” statement), identifying and justifying target populations for the reported findings. Explicitly defining the target populations will help other researchers to sample from the same populations when conducting a direct replication, and it will encourage follow-up studies that test the boundary conditions of the original finding. Universal adoption of COG statements would change publishing incentives to favor a more cumulative science.”
[From the article, “The ASA’s p-value statement, one year on”, which appeared in the online journal Significance, a publication of the American Statistical Association]
“A little over a year ago now, in March 2016, the American Statistical Association (ASA) took the unprecedented step of issuing a public warning about a statistical method. …From clinical trials to epidemiology, educational research to economics, p-values have long been used to back claims for the discovery of real effects amid noisy data. By serving as the acid test of “statistical significance”, they have underpinned decisions made by everyone from family doctors to governments. Yet according to the ASA’s statement, p-values and significance testing are routinely misunderstood and misused, resulting in “insights” which are more likely to be meaningless flukes. … Yet a year on, it is not clear that the ASA’s statement has had any substantive effect at all.”
The journal Economics: The Open Access, Open Assessment E-Journal is publishing a special issue on “The Practice of Replication.” This is how the journal describes it:
“The last several years have seen increased interest in replications in economics. This was highlighted by the most recent meetings of the American Economic Association, which included three sessions on replications (see here, here, and here). Interestingly, there is still no generally acceptable procedure for how to do a replication. This is related to the fact that there is no standard for determining whether a replication study “confirms” or “disconfirms” an original study. This special issue is designed to highlight alternative approaches to doing replications, while also identifying core principles to follow when carrying out a replication.”
“Contributors to the special issue will each select an influential economics article that has not previously been replicated, with each contributor selecting a unique article. Each paper will discuss how they would go about “replicating” their chosen article, and what criteria they would use to determine if the replication study “confirmed” or “disconfirmed” the original study.”
“Note that papers submitted to this special issue will not actually do a replication. They will select a study that they think would be a good candidate for replication; and then they would discuss, in some detail, how they would carry out the replication. In other words, they would lay out a replication plan.”
“Submitted papers will consist of four parts: (i) a general discussion of principles about how one should do a replication, (ii) an explanation of why the “candidate” paper was selected for replication, (iii) a replication plan that applies these principles to the “candidate” article, and (iv) a discussion of how to interpret the results of the replication (e.g., how does one know when the replication study “replicates” the original study).”
“The contributions to the special issue are intended to be short papers, approximately Economics Letters-length (though there would not be a length limit placed on the papers).”
“The goal is to get a fairly large number of short papers providing different approaches on how to replicate. These would be published by the journal at the same time, so as to maintain independence across papers and approaches. Once the final set of articles are published, a summary document will be produced, the intent of which is to provide something of a set of guidelines for future replication studies.”
Despite all the attention that economics, and other disciplines, have devoted to research transparency, data sharing, open science, reproducibility, and the like, much remains to be done on best-practice guidelines for conducting replications. Further, there is much confusion about how to interpret the results of replications. Perhaps this is not surprising: there is still much controversy about how to interpret tests of hypotheses! At the very least, it is helpful to have a better understanding of the current state of replication practice, and of how replicators understand their own research. It is hoped that this special issue will help advance our understanding of these subjects.
To read more about the special issue, and how to contribute, click here.
[From an interview with Christopher Chartier and Randy McCarthy at Retraction Watch]
“Do researchers need a new “Craigslist?” We were recently alerted to a new online platform called StudySwap by one of its creators … The platform creates an “online marketplace” that previous researchers have called for, connecting scientists with willing partners – such as a team looking for someone to replicate its results, and vice versa.”
[From the article “Stop binning negative results, researchers told” at www.timeshighereducation.com] “A new Europe-wide code of research conduct has ordered academics and journals to treat negative experimental results as being equally worthy of publication as positive ones. … The new European Code of Conduct for Research Integrity frames the bias against negative results as an issue of research conduct, stipulating that ‘authors and publishers [must] consider negative results to be as valid as positive findings for publication and dissemination’.”
Reproducibility is not just an issue in economics. In a new book, “Rigor Mortis: How Sloppy Science Creates Worthless Cures, Crushes Hope, and Wastes Billions”, NPR science reporter Richard Harris describes and elaborates on the problem of irreproducibility for medical science. The website Retraction Watch caught up with Richard Harris to discuss his new book and his thoughts on the larger problem of “sloppy science.” To read the interview, click here.
[NOTE: This entry is based on the book “Corrupt Research: The Case for Reconceptualizing Empirical Management and Social Science” by Raymond Hubbard]
Psychology’s “reproducibility crisis” (Open Science Collaboration, 2015) has drawn attention to the need for replication research. However, focusing on the reproducibility of findings, while clearly important, is a much too narrow interpretation of replication’s role in the scientific enterprise. This account outlines some additional roles.
Based on the two dimensions of (1) data sources and (2) research methods, the taxonomy below identifies six different kinds of replications, each with its own part to play.
(1) Checking of Analysis: Determining the Accuracy of Results
Independent reexaminations of the original data, using the same methods of analysis. Are the results error-free?
(2) Reanalysis of Data: Determining Whether Results Hold Up Using Different Analytical Methods
Independent reexaminations of the original data, using different methods of analysis. Are the results the “same”?
Using the above approaches, many “landmark” results—e.g., the Hawthorne effect, J.B. Watson’s conditioning of Little Albert, Sir Cyril Burt’s “twins” research, and Durkheim’s theory of suicide—have been found to be invalid.
I do not consider (1) and (2) to be authentic forms of replication. However, they clearly play a vital role in protecting the integrity of the empirical literature.
(3) Exact Replications: Determining Whether Results are Reproducible
An authentic form of replication, one which most people see as THE definition of replication. Here, we follow as closely as possible the same procedures used in the earlier study on a new sample drawn from the same population. This was the approach adopted by the Open Science Collaboration (2015) project.
(4) Conceptual Extensions: Determining Whether Results Hold Up When Constructs and Their Interrelationships are Measured/Analyzed Differently
These differences lie in how theoretical constructs are measured, and how they interrelate with other constructs. Conceptual extensions address the issue of the construct validity of the entities involved. This can only be done by replications assessing a construct's (a) convergent, (b) discriminant, and (c) nomological validities.
Otherwise expressed, replication research is crucial to theory development. First, it is replication research which is essential to the initial measurement, and further refinement, of the theoretical constructs themselves. Second, it is replication research which is responsible for monitoring the linkages (theoretical consistency) between these constructs. Third, it is replication research which judges the adequacy of this system of constructs for explaining some of what we see in the world around us.
(5) Empirical Generalizations: Determining Whether Results Hold Up in New Domains
Here the focus is on the external validity, or generalizability, of results when changes are made to persons, settings, treatments, outcomes, and time periods (Shadish, Cook, and Campbell, 2002). An example is Helmig et al.'s (2012) successful replication, using Swiss data, of Jacobs and Glass's (2002) U.S. study on media publicity and nonprofit organizations.
(6) Generalizations and Extensions: Determining Whether Results Hold Up in New Domains and With New Methods of Measurement and/or Analysis
Typically, these do not constitute authentic replications. Many of them are mainstream studies dealing with theory testing. That is, the emphasis is on theory extension, and not on extensions to previous empirical findings (Hubbard and Lindsay, 2002, p. 399).
Replication and Validity Generalization
Replication research underlies the validity generalization process.
Exact Replications allow appraisal of the internal validity of a study. They also enable the establishment of facts and the causal theories underlying them.
Conceptual Replications extend the development of causal theory by examining the validity of hypothetical constructs and their interrelationships. Specifically, they make possible the evaluation of a construct’s convergent, discriminant, and nomological validity. What could be more important than this?
Empirical Generalizations permit investigations of whether the same (or similar) findings hold up across (sub)populations, thus addressing the neglected topic of a study's external validity.
It is for good reason that replication research is said to be at the heart of scientific progress.
Raymond Hubbard is Thomas F. Sheehan Distinguished Professor of Marketing, Emeritus, at Drake University. Correspondence about this blog should be addressed to firstname.lastname@example.org.
Helmig, B., Spraul, K., & Tremp, K. (2012). Replication studies in nonprofit research: A generalization and extension of findings regarding the media publicity of nonprofit organizations. Nonprofit and Voluntary Sector Quarterly, 41, 360‑385.
Hubbard, R. (2016). Corrupt Research: The Case for Reconceptualizing Empirical Management and Social Science. Sage Publications: Thousand Oaks, CA.
Hubbard, R. & Lindsay, R.M. (2002). How the emphasis on “original” empirical marketing research impedes knowledge development. Marketing Theory, 2, 381‑402.
Jacobs, R.N. & Glass, D.J. (2002). Media publicity and the voluntary sector: The case of nonprofit organizations in New York City. Voluntas: International Journal of Voluntary and Nonprofit Organizations, 13, 235‑252.
Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349, aac4716‑1‑8.
Shadish, W.R., Cook, T.D., & Campbell, D.T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Houghton Mifflin: Boston, MA.
Tsang, E.W.K. & Kwan, K.-M. (1999). Replication and theory development in organizational science: A critical realist perspective. Academy of Management Review, 24, 759‑780.
It is well known that there is a bias towards publication of statistically significant results. In fact, we have known this for at least 25 years since the publication of De Long and Lang (JPE 1992):
“Economics articles are sprinkled with very low t-statistics – marginal significance levels very close to one – on nuisance coefficients. […] Very low t-statistics appear to be systematically absent – and therefore null hypotheses are overwhelmingly false – only when the universe of null hypotheses considered is the central themes of published economics articles. This suggests, to us, a publication bias explanation of our findings.” (pp. 1269-1270)
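The truncation De Long and Lang describe is easy to see in a simulation. The sketch below is a minimal illustration (not their actual procedure): it generates many studies in which the null hypothesis is true, "publishes" only those with |t| > 1.96, and shows that very low t-statistics then vanish from the published record.

```python
import numpy as np

rng = np.random.default_rng(0)

n_studies = 10_000
n = 50  # observations per study

# The true effect is zero in every study, so any "significant" result is a fluke.
samples = rng.normal(loc=0.0, scale=1.0, size=(n_studies, n))
t_stats = samples.mean(axis=1) / (samples.std(axis=1, ddof=1) / np.sqrt(n))

# A journal that publishes only |t| > 1.96 discards most of the literature
# and leaves a published record with no small t-statistics at all.
published = t_stats[np.abs(t_stats) > 1.96]

share_published = published.size / n_studies
print(f"share of studies published: {share_published:.3f}")  # roughly 5% under the null
print(f"min |t| among published:    {np.abs(published).min():.2f}")  # >= 1.96 by construction
```

A meta-analysis run on only the "published" values would conclude there is a strong effect, even though every underlying effect is zero.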
While statistically insignificant results are less “sexy”, they are often no less important. Failure to reject the null hypothesis can be interesting in itself, can provide a valuable data point for meta-analyses, or can indicate to future researchers where they are unlikely to find an effect. As McCloskey (2002) famously puts it:
“[…] statistical significance is neither necessary nor sufficient for a result to be scientifically significant.” (p. 54)
This problem is not unique to Economics, but several other disciplines have moved faster than we have to address it; some already have journals dedicated to publishing “insignificant” results.
Is it time for Economics to catch up? I suggest it is and I know that I am not alone in this view. In fact, a number of prominent Economists have endorsed this idea (even if they are not ready to pioneer the initiative). So, imagine… a call for papers along the following lines:
Series of Unsurprising Results in Economics (SURE)
Is the topic of your paper interesting, your analysis carefully done, but your results not “sexy”? If so, please consider submitting your paper to SURE, an e-journal of high-quality research with “unsurprising” findings.
How it works:
— We accept papers from all fields of Economics…
— Which have been rejected at a journal indexed in EconLit…
— With the ONLY important reason being that their results are statistically insignificant or otherwise “unsurprising”.
To document that your paper meets the above eligibility criteria, please send us all referee reports and letters from the editor from the journal where your paper has been rejected. Two independent referees will read these reports along with your paper and evaluate whether they indicate that: 1. the paper is of high quality and 2. the only important reason for rejection was the insignificant/unsurprising nature of the results. Submission implies that you (the authors) give permission to the SURE editor to contact the editor of the rejecting journal regarding your manuscript.
SURE benefits writers by:
— Providing an outlet for interesting, high-quality, but “risky” (in terms of uncertain results) research projects;
— Decreasing incentives to data-mine, to change theories and hypotheses ex post, or to focus exclusively on provocative topics.
SURE benefits readers by:
— Mitigating publication bias, thus complementing other journals in the effort to provide a complete account of the state of affairs;
— Serving as a repository of potential (and tentative) “dead ends” in Economics research.
Even authors of published articles in a top statistics journal are often confused about the meaning of p-values, especially by treating 0.05, or the range 0.05–0.15, as the location of a threshold. The underlying problem seems to be deterministic thinking. To put it another way, applied researchers and also statisticians are in the habit of demanding more certainty than their data can legitimately supply. The problem is not just that 0.05 is an arbitrary convention; rather, even a seemingly wide range of p-values such as 0.01–0.10 cannot serve to classify evidence in the desired way.
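A small simulation illustrates why even a range like 0.01–0.10 cannot classify evidence: repeated studies of the same true effect, with identical designs, produce p-values scattered on both sides of any threshold. This is a minimal sketch; the effect size, sample size, and simple z-test are illustrative assumptions, not taken from the article.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)

def z_test_p(sample, sigma=1.0):
    """Two-sided p-value for H0: mean = 0, with known sigma (normal approximation)."""
    z = sample.mean() / (sigma / np.sqrt(sample.size))
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# One modest true effect, studied repeatedly with exactly the same design.
effect, n, reps = 0.3, 50, 20
p_values = [z_test_p(rng.normal(effect, 1.0, n)) for _ in range(reps)]

print([round(p, 3) for p in sorted(p_values)])
# Identical studies of the same real effect routinely land on both sides of 0.05,
# so the p-value from any single study cannot deliver the certainty being demanded.
```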
In our article, John and I discuss some natural solutions that won’t, on their own, work:
– Listen to the statisticians, or clarity in exposition
– Confidence intervals instead of hypothesis tests
– Bayesian interpretation of one-sided p-values
– Focusing on “practical significance” instead of “statistical significance”
– Bayes factors
You can read our article for the reasons why we think the above proposed solutions won’t work.
From our summary:
We recommend saying No to binary conclusions . . . resist giving clean answers when that is not warranted by the data. . . . It will be difficult to resolve the many problems with p-values and “statistical significance” without addressing the mistaken goal of certainty which such methods have been used to pursue.
P.S. Along similar lines, Stephen Jenkins sends along the similarly-themed article, “‘Sing Me a Song with Social Significance’: The (Mis)Use of Statistical Significance Testing in European Sociological Research,” by Fabrizio Bernardi, Lela Chakhaia, and Liliya Leopold.
In a recent article in PLOS One, Don van Ravenzwaaij and John Ioannidis argue that Bayes factors should be preferred to significance testing (p-values) when assessing the effectiveness of new drugs. At his blog The 20% Statistician, Daniel Lakens argues that Bayes factors suffer from the same problems as p-values: the combination of small effect sizes and small sample sizes leads to inconclusive results whether one uses p-values or Bayes factors. The real challenge facing decision-making from statistical studies comes from publication bias and underpowered studies. Both significance testing and Bayes factors are relatively powerless (pun intended) to overcome these more fundamental problems. To read more, click here.
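Lakens's point can be illustrated with a toy calculation. The sketch below simulates one underpowered study and computes both a normal-approximation p-value and a Bayes factor via the BIC (unit-information prior) shortcut. This is an illustrative choice on my part, not the method used by van Ravenzwaaij and Ioannidis, and the effect and sample sizes are assumptions.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(2)

# An underpowered study: small true effect, small sample.
effect, n = 0.2, 25
x = rng.normal(effect, 1.0, n)

t = x.mean() / (x.std(ddof=1) / sqrt(n))
p = 2 * (1 - 0.5 * (1 + erf(abs(t) / sqrt(2))))  # normal approximation

# BIC approximation to the Bayes factor for H0 (mean = 0) vs H1 (mean free).
# Uses RSS0/RSS1 = 1 + t^2/(n-1), so BF01 = sqrt(n) * (1 + t^2/(n-1))^(-n/2).
bf01 = sqrt(n) * (1 + t**2 / (n - 1)) ** (-n / 2)

print(f"p-value ~ {p:.2f}, BF01 ~ {bf01:.2f}")
# In designs like this, p tends to land above conventional thresholds while
# BF01 hovers near 1: neither tool rescues an underpowered study.
```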