Failed Replications: Crisis? Or Proof that Science is Working?

This item is a twofer.  In a New York Times op-ed piece, LISA FELDMAN BARRETT argues that failed replications are exactly what you should expect when science is doing its job.  Rather than a cause for concern, they are proof that the state of science is healthy.  To read Barrett's article, click here.  In his blog, ANDREW GELMAN responds that this could be the case in a world where researchers draw the proper conclusions from failed replications.  However, that is not the world we are currently living in.  To read Gelman's response, click here.

A Brief History of Replication Efforts

FROM THE ARTICLE: “Worries about irreproducibility – when researchers find it impossible to reproduce the results of an experiment when it is rerun under the same conditions – came to the fore again last week when a landmark effort to reproduce the findings of 100 recent papers in psychology failed in more than half the cases.  But the concerns are not new.”  The article goes on to offer a brief history of replication efforts.  Two takeaways: (i) Interest in replications has accelerated in recent years.  (ii) Economics as a field is not even mentioned in the article.  This despite the fact that many of the obstacles to undertaking replications (cost of lab experiments, etc.) are largely irrelevant in economics. To read more, click here.

But Not as Good as Psychologists?

The website FiveThirtyEight has a thoughtful post reflecting on the recent study in psychology in which 100 published studies were replicated.  At least two points are worth highlighting.  First, despite the concern about p-hacking, it turns out that one of the best predictors of replicability in that study was the original study's p-value.  So while 5 percent may be too generous, there is valuable information in the p-value.  The second point is that psychology is confronting its replication demons head on.  Economics?  Hmmm.  To read more, click here.
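To see why the p-value of the original study carries information, here is a minimal simulation sketch.  It is our own illustration, not the Reproducibility Project's analysis, and its ingredients (half of effects truly null, a true effect size of 0.4, fifty subjects per group in both the original and the replication) are assumptions chosen only to make the point visible.

```python
# A minimal sketch (not the Reproducibility Project's analysis) of why an
# original study's p-value predicts replicability.  All parameters below are
# illustrative assumptions, not estimates from the psychology literature.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, d, n_studies = 50, 0.4, 10000     # per-group n, true effect size, number of studies

def one_study(effect):
    x = rng.normal(0.0, 1.0, n)      # control group
    y = rng.normal(effect, 1.0, n)   # treatment group
    return stats.ttest_ind(y, x).pvalue

effects = rng.choice([0.0, d], size=n_studies)   # half the effects are truly null
orig_p = np.array([one_study(e) for e in effects])

published = orig_p < 0.05                        # only "significant" originals get published
rep_p = np.array([one_study(e) for e in effects[published]])

for lo, hi in [(0.0, 0.005), (0.005, 0.02), (0.02, 0.05)]:
    band = (orig_p[published] >= lo) & (orig_p[published] < hi)
    print(f"original p in [{lo}, {hi}): replication rate = {(rep_p[band] < 0.05).mean():.2f}")
```

Under these made-up assumptions, originals with p-values far below .05 replicate at a much higher rate than those that barely cleared the threshold, which is the sense in which the p-value, imperfect as it is, still carries real information.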

EDWARD LEAMER: On Econometrics in the Basement, the Bayesian Approach, and the Leamer-Rosenthal Prize

(THIS BLOG IS REPOSTED FROM THE BITSS WEBSITE) I became interested in methodological issues as a University of Michigan graduate student from 1967 to 1970, watching the economics faculty build an econometric macro model (the Michigan Model) in the basement of the building, and comparing that with how these same faculty members described what they were doing when they taught econometric theory on the top floor of the building.  Though the faculty in the basement and on the top floor were, to outward appearances, the very same people, ascending or descending the stairs seemed to alter their inner intellectual selves completely.
The words "specification search" in my 1978 book Specification Searches refer to the search for a model to summarize the data in the basement where the dirty work is done, while the theory of pristine inference taught on the top floor presumes the existence of the model before the data are observed. This assumption of a known model may work in an experimental setting in which there are both experimental controls and randomized treatments, but for the non-experimental data that economists routinely study, much of the effort is an exploratory search for a model, not estimation with a known and given model. The very wide model search that usually occurs renders econometric theory suspect at best, and possibly irrelevant.  Things like unbiased estimators, standard errors and t-statistics lose their meaning well before you get to your 100th trial model.
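To make that last sentence concrete, here is a small simulation sketch of our own (not anything from Specification Searches): the outcome is generated with no relation to any of the candidate regressors, yet a search across one- and two-regressor trial models almost always turns up a nominally significant t-statistic somewhere.

```python
# A minimal sketch (our illustration, not Leamer's) of why t-statistics lose
# their meaning under a wide specification search: y is pure noise, unrelated
# to every candidate regressor, yet the search usually finds "significance".
import itertools
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, k, n_datasets = 100, 20, 100      # observations, candidate regressors, simulated datasets
found_significant = 0

for _ in range(n_datasets):
    X = rng.normal(size=(n, k))      # candidate regressors
    y = rng.normal(size=n)           # outcome drawn independently of all of them
    best_t = 0.0
    # search every one- and two-regressor specification (210 trial models)
    for size in (1, 2):
        for cols in itertools.combinations(range(k), size):
            fit = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
            best_t = max(best_t, float(np.abs(fit.tvalues[1:]).max()))
    found_significant += best_t > 1.96

print("share of pure-noise datasets yielding a 'significant' trial model:",
      found_significant / n_datasets)   # far above the nominal 0.05
```

The nominal 5 percent error rate applies to the first model you estimate, not to the best of a few hundred.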
Looking at what was going on, it seemed to me essential to make theory and practice more compatible, by changing both practice and theory.   An essential but fortuitous accident in my intellectual life had me taking courses in Bayesian statistics in the Math Department. The Bayesian philosophy seemed to offer a logic that would explain the specification searches that were occurring in the basement and that were invalidating the econometric theory taught on the top floor, and also a way of bringing the two floors closer together.
The fundamental message of the Bayesian approach is that, when the data are weak, the context matters, or more accurately the analyst’s views about the context matter.  The same data set can allow some to conclude legitimately that executions deter murder and also allow others to conclude that there is no deterrent effect, because they see the context differently.  While it’s not the only kind of specification search, per my book, an “interpretative search” combines the data information with the analyst’s ambiguous and ill-defined understanding of the context.  The Bayesian philosophy offers a perfect hypothetical solution to the problem of pooling the data information with the prior contextual information – one summarizes the contextual information in the form of a previous hypothetical data set.
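The "previous hypothetical data set" idea can be made concrete with the textbook conjugate-normal case.  The sketch below is our own numerical illustration, not an example from the book, and the numbers (25 real observations, a skeptical prior worth 100 hypothetical observations centered at zero) are arbitrary.

```python
# A minimal sketch of the prior as a previous hypothetical data set: for a
# normal mean with known sigma, a conjugate prior worth m hypothetical
# observations centered at mu0 gives exactly the same estimate as literally
# pooling those m observations with the real data.
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.0
x = rng.normal(0.3, sigma, size=25)            # the actual data, n = 25

mu0, m = 0.0, 100                              # skeptical prior: 100 hypothetical obs at 0
n, xbar = len(x), x.mean()

posterior_mean = (m * mu0 + n * xbar) / (m + n)
pooled_mean = np.concatenate([np.full(m, mu0), x]).mean()   # literal pooling

print(posterior_mean, pooled_mean)             # identical by construction
print("posterior sd:", sigma / np.sqrt(m + n))
```

When m is large relative to n, the context dominates the data, which is precisely how two analysts with different hypothetical prior samples can legitimately reach opposite conclusions about deterrence from the same murder data.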
A HUGE hypothetical benefit of a Bayesian approach is real transparency both to oneself and to the audience of readers.  Some people think that transparency can be achieved by requiring researchers to record and to reveal all the model exploration steps they take, but if we don’t have any way to adjust or to discount conclusions from these specification searches, this is transparency without accountability, without consequence.   What is really appealing about the Bayesian approach is that the prior information of the analyst is explicitly introduced into the data analysis and “straightforwardly” revealed both to the analyst and to her audience.   This is transparency with consequence.  We can see why some think executions deter murders and others see no deterrent effect.
The frustratingly naïve view that often meets this proposal is that “science doesn’t make up data.”   When I hear that comment, I just walk away.  It isn’t worth the energy to try to discuss how inferences from observational data are actually made, and for that matter how experiments are interpreted as well.   We all make up the equivalent of previous data sets, in the sense of allowing the context to matter in interpreting the evidence.   It’s a matter of how, not if.  Actually, I like to suggest that the two worst people to study data sets are a statistician who doesn’t understand the context, and a practitioner who doesn’t understand the statistical subtleties.
However, we remain far from a practical solution, Bayesian or otherwise, and current practice is more or less the same as it was when punch cards were fed into computers back in the 1960s.  The difference is that with each advance in technology from counting on fingers to Monroe calculators to paper tapes to punch cards to mainframes to personal computers to personal digital assistants, we have made it less and less costly to compute new estimates from the same data set, and the supply of alternative estimated models has greatly increased, though almost all of these are hidden on personal hard drives or in Rosenthal’s File Drawers.
The classical econometrics that is still taught to almost all economists has no hope of remedying this unfortunate situation, since the assumed knowledge inputs do not come close to approximating the contextual information that is available. But the Bayesian priests who presume the existence of a prior distribution that describes the context are not so different from the econometric theorists who presume the existence of a model.  Both are making assumptions about how the dirty work of data analysis in the basement is done or should be done, but few of either religious persuasion leave their offices and classrooms on the top floor and descend into the basement to analyze data sets.  Because of the impossibility of committing to any particular prior distribution, the Bayesian logic turns the search for a model into a search for a prior distribution. My solution to the prior-ambiguity problem has been to design tools for sensitivity analysis to reveal how much the conclusions change as the prior is altered, some local perturbations (point to point mapping) and some global ones (correspondences between sets of priors and sets of inferences).
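For concreteness, here is a toy version of that kind of sensitivity analysis, reusing the conjugate-normal setup from the sketch above.  It is our own stand-in, not one of Leamer's actual tools: sweep the prior's center and weight over a plausible set and report the interval of posterior estimates that the set induces.

```python
# A toy sensitivity analysis (our own sketch, not Leamer's tools): map a set
# of priors to the set of inferences it produces, and report the resulting range.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0.3, 1.0, size=25)              # same kind of data as the sketch above
n, xbar = len(x), x.mean()

prior_means = np.linspace(-0.5, 0.5, 21)       # analysts disagree about the prior's center...
prior_weights = [5, 25, 100]                   # ...and about how many hypothetical obs it is worth

posteriors = [(m * mu0 + n * xbar) / (m + n)
              for mu0 in prior_means for m in prior_weights]

print(f"data-only estimate: {xbar:.2f}")
print(f"posterior means range from {min(posteriors):.2f} to {max(posteriors):.2f}")
```

If the conclusion of interest, say the sign of the effect, holds over the whole set of priors, it is sturdy; if it flips somewhere inside the range, the data alone have not settled the question and readers deserve to know that.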
As I read what I have just written, I think this is hugely important and highly interesting.  But I am reminded of the philosophical question:  When Leamer speaks and no one listens, did he say anything?   None of the tenured faculty in Economics at Harvard took any interest in this enterprise, and they gave me the Donald Trump treatment: You're fired.   My move to UCLA was to some extent a statement of approval for my book, Specification Searches, but my pursuit of useful sensitivity methods remained a lonely one.  The sincerest form of admiration is copying, but no one pursued my interest in these sensitivity results. I did gain notoriety if not admiration with the publication of a watered-down version of my ideas in "Let's take the con out of econometrics." But not so long after that, finding that I was not much affecting the economists around me, and making less progress producing sensitivity results that I found amusing, I moved on to the study of International Economics, and later I took the professionally disreputable step of forecasting the US macro economy on a quarterly basis, a return to my Michigan days.   I memorialized that effort with the book titled Macroeconomic Patterns and Stories, whose title is an elliptical comment that we don't do science, we do persuasion with patterns and stories.  And more recently, I have tried again to reach my friends by offering context-minimal measures of model ambiguity which I have called s-values (s for sturdy) to go along with t-values and p-values.    This one more attempt illustrates the fundamental problem – we don't have the right tools.
It is my hope that the Leamer-Rosenthal prize will bring added focus to these deep and persistent problems in our profession, stimulating innovations that can produce real transparency, by which I mean ways of studying data and reporting the results that allow both the analyst and the audience to understand the meaning of the data being studied, and how that meaning depends on the contextual assumptions.
This whole thing reminds me of the parable of the Emperor’s New Clothes.  Weavers (of econometric theory) offer the Emperor a new suit of clothes, which are said to be invisible to incompetent economists and visible only to competent ones.  No economist dares to comment until a simple-minded one hollers out “He isn’t wearing any clothes at all.”   The sad consequence is that everyone thinks the speaker both impolite and incompetent, and the Emperor continues to parade proudly in that new suit, which draws repeated compliments from the weavers:  Elegant, very elegant.
OK, it’s delusional.  I know.

We Have Met the Enemy, and He is Us

FROM THE ARTICLE: “We studied publication bias in the social sciences by analyzing a known population of conducted studies—221 in total—in which there is a full accounting of what is published and unpublished. … Strong results are 40 percentage points more likely to be published than are null results and 60 percentage points more likely to be written up. We provide direct evidence of publication bias and identify the stage of research production at which publication bias occurs: Authors do not write up and submit null findings.” To read more, click here.

NOTE:  This article isn’t “new”, but it is newsworthy.

Results from a Massive Study on Replication of Psychology Research

FROM THE ARTICLE: “We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. Replication effects were half the magnitude of original effects, representing a substantial decline. Ninety-seven percent of original studies had statistically significant results. Thirty-six percent of replications had statistically significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result.”  To read more, click here.

At Least We’re Better Than Sociologists?

In a recent blog on orgtheory.net, Cristobal Young reports: “We conducted a small field experiment as part of a graduate course in statistical analysis. Students selected sociological articles that they admired and wanted to learn from, and asked the authors for a replication package. Out of the 53 sociologists contacted, only 15 of the authors (28 percent) provided a replication package.” To read the blog, click here.  This compares to a recent study of economists that found that 44 percent provided data upon request.  To read that study, click here.

Berkeley Initiative for Transparency in the Social Sciences (BITSS) Announces Large Cash Prizes

BITSS announces the Leamer-Rosenthal Prizes for Open Social Science.  There are two prize groups.  The first is for young researchers who either practice transparency in their own research or have done research on the subject of transparency.  The second group is for educators who have furthered the teaching of research transparency in academic coursework.  The prizes are financially substantial.  The deadline is Wednesday, September 30, 2015. To learn more, click here.

On p-Hacking, Retractions, and the Difficult Enterprise of Science

This article on FiveThirtyEight.com is a great read for lots of reasons.  The leitmotiv is that while science has its share of fraudsters and academic scammers, the underlying problem is that the scientific enterprise is inherently very, very difficult.  To prove the point, the article includes an online, interactive data analysis that studies the relationship between political parties and economic performance.  Words cannot do it justice.  Do the "research" and believe.  To read the article, click here.
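For readers who would rather see the mechanic than click through, here is a stripped-down stand-in for that interactive, built on simulated data rather than FiveThirtyEight's: with several ways to define "party in power" and several economic outcomes to choose from, an analyst working with pure noise can usually find at least one combination with p < .05.

```python
# A stripped-down stand-in for the interactive (simulated data, not
# FiveThirtyEight's): many researcher degrees of freedom plus pure noise
# reliably produce a publishable-looking p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_years = 60
party_definitions = 4        # e.g., president, governors, senators, a combined index
economic_outcomes = 5        # e.g., GDP, employment, inflation, stocks, wages

p_values = []
for _ in range(party_definitions):
    party = rng.integers(0, 2, n_years)         # a random "which party is in power" series
    for _ in range(economic_outcomes):
        outcome = rng.normal(size=n_years)      # a random "economic performance" series
        p = stats.ttest_ind(outcome[party == 1], outcome[party == 0]).pvalue
        p_values.append(p)

print(f"smallest p across {len(p_values)} variable combinations: {min(p_values):.3f}")
# With 20 independent tries, the chance of at least one p < .05 under the
# null is 1 - 0.95**20, roughly 64 percent.
```

None of this requires fraud; it only requires flexibility about which variables to use and silence about the combinations that were tried and discarded.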

B. D. MCCULLOUGH: The Reason so Few Replications Get Published Is….

When preparing to give a talk at a conference recently, I decided to update some information I had published a few years ago.  In McCullough (2009), I estimated that 16 economics journals had a mandatory data/code archive (archives that require only data do not support replication — see McCullough, McGeary and Harrison (2008)).  Vlaeminck (2013) counted 26 journals with a mandatory data/code archive. This is a non-trivial increase, since in 2004 only four economics journals had such a policy.  One might think that this increase bodes well for replicability in economics, but such is not the case.  It is all well and good to make data and code available for replications, but if there is no place for researchers to publish these replications, then all the mandatory data/code archives in the world will amount to only so much window dressing.

The problem is that editors do not want to admit that they publish unreplicable research, nor do they want to be bothered ensuring that the research they publish is replicable.  The fact is that very few journals will publish replications and the top-ranked journals only publish an infinitesimal number of replications.  Consequently, any editor is largely immune to the embarrassment that would arise if several of the articles he published were found to be not replicable.  Hence, editors have no incentive either to ensure the replicability of the articles they publish or to publish replications of the articles they do publish.  If researchers can’t get their replication articles published in decent journals, they won’t write the articles in the first place.  And this seems to be the present state of equilibrium, sub-optimal though it may be.  Worse, there seems to be a tacit collusion between the editors, in that one editor will not publish an article that exposes another editor as publishing unreplicable research.

Prima facie evidence of this sad state of affairs is the fact that Liebowitz's failed replication of the JPE paper by Oberholzer-Gee and Strumpf still hasn't been published, not by the JPE and not by any other journal.  Anyone interested in replication should go to SSRN and read the papers by Liebowitz on this topic.  In "How Reliable is the Oberholzer-Gee and Strumpf Paper on File Sharing", Liebowitz capably demonstrates fatal flaws in the data handling and analysis of the Oberholzer-Gee and Strumpf paper.  Actually, time is precious; just take my word for it so that you don't have to read it: Liebowitz demolishes the Oberholzer-Gee/Strumpf paper.  In "Sequel to Liebowitz's Comment on the Oberholzer-Gee and Strumpf Paper on File Sharing", Liebowitz describes his efforts to get his paper published in the JPE.  This is the paper to read. So Kafkaesque was Liebowitz's ordeal that journalist Norbert Haring, writing in the German financial newspaper Handelsblatt (the German equivalent of the Wall Street Journal), said, "Steven Levitt, Editor of the Journal of Political Economy, uses a questionable tactic to block an undesired comment.   The subject of the criticised article was a hot topic.  On closer look, everything about the case was unusual."  One might think that another journal with an interest in file sharing would publish Liebowitz's paper….

No one can read these papers by Liebowitz and think that “truth will out” in the economics journals.  Yet there is cause for hope.

Third party organizations dedicated to replication have emerged in the past few years, such as 3ie (International Initiative for Impact Evaluation) and BITSS (Berkeley Initiative for Transparency in the Social Sciences) and EDAWAX (European Data Watch).  These organizations support replication without a necessary prospect of publication.  If these organizations can demonstrate that top journals are publishing non-replicable research, then the top journals might be embarrassed into admitting that their efforts to ensure replicability are insufficient.  And then Liebowitz’s article might finally get published.

References

Haring, N. (2008). Handelsblatt, June 23, 2008.

McCullough, B. D. (2009). "Open Access Economics Journals and the Market for Reproducible Economic Research," Economic Analysis and Policy, 39(1), 117-126.

McCullough, B. D., McGeary, K. A., and Harrison, T. D. (2008). "Do Economics Journal Archives Promote Replicable Research?" Canadian Journal of Economics, 41(4), 1406-1420.

Vlaeminck, S. (2013). "Data Management in Scholarly Journals and Possible Roles for Libraries – Some Insights from EDaWaX," EconStor Open Access Articles, ZBW – German National Library of Economics, 49-79.