RICHARD PALMER-JONES: Replication Talk Costs Lives: Why are economists so concerned about the reputational effects of replications?
Michael Clemens’ recent working paper “The Meaning of Failed Replications: A Review and Proposal” echoes concerns expressed by some replicatees, and by economists more generally (Ozler, 2014, for example), about the potentially damaging effects of a claim of failed replication on reputations (similar concerns have been voiced in social psychology). Some of these concerns have been expressed in relation to 3ie’s program of funding for replications of prominent works in development economics (Jensen and Oster, 2013; Dercon et al., 2015). The responses of some, not necessarily these, replicatees can be characterised as “belligerent” (Camfield and Palmer-Jones, 2015), adequately matching any giant-slaying potential of the replicators.
Further attention to the desirability of and need for more replications in economics is welcome (Camfield and Palmer-Jones, 2013; Duvendack and Palmer-Jones, 2013; Duvendack, Palmer-Jones and Reed, 2015 forthcoming; see also the earlier work of Bruce McCullough and associates). Clemens does refer to some, but not all, of the main contributions on this theme in economics and in social science more widely.
While there is much in Clemens’ paper that could be debated, I will address here what I consider to be the core argument. Clemens argues that replication should be distinguished from robustness testing in part because a failed replication bears implications of error or wrongdoing or some other deficiency of the replicatees, while robustness testing (and extension) are practices that reflect legitimate disagreements among competent professionals. He attributes the new concerns with replication in large part to the computational turn in economics, wrongly I think (concern with the provision of data sets to allow replication is expressed by Ragnar Frisch in the editorial to the first issue of Econometrica, for example (Frisch, 1933)). In doing so he refers to Collins, 1991, and to Camfield and Palmer-Jones, 2013, to support the claim that a failed replication implies a moral failure among the replicatees. Neither reference supports that claim. Clemens quotes the latter: “replication speaks to ethical professional practice” (1612), and the former that “replication is a matter of establishing ought, not Is [emphasis in original]”. I know the latter were not suggesting that replicatees’ behaviour was morally reprehensible and I don’t believe Collins was either. Rather they refer to the ethics of the profession. Both sources use replication to cover both checking, or pure (Hamermesh, 2007) or exact replication (Duvendack, et al., 2015), and what Clemens terms robustness testing.
What Camfield and Palmer-Jones intend the reader to understand is that replication should be a quotidian practice of economics, promoted by professional institutions (teaching, appointment, promotion, journals, conferences, and so on), and that claims to economic knowledge should be based on replicable and replicated studies. In this context I mean the more extensive understanding of replication, under which findings are not only free from the sorts of error that can be identified by checking, but also stand up to comparison with the results of using alternative relevant datasets, estimation methods, variable constructions, and models, and, as far as is possible, to alternative explanations of the same phenomena drawn from different theoretical frameworks. Collins writes: “[H]ere we see the difference between checking and replication. Checking often involves repetition as a technique, but it is not replication. … Replication is the establishment of a new and contested result by agreement over what counts as a correctly performed series of experiments.” (Collins, 1991: 131-2).
And I think this is consistent with understandings of replication in laboratory and other sciences, and with computational science (Peng, 2011). Thus natural science replicates an experiment with what are supposed to be identical materials, methods and conditions (and hopefully different actors trained in putatively the same methods but at different locations, with different perspectives, or priors if you like). These qualifications on the training of the actors stem from the role of tacit knowledge or practices, and different interests, that characterise the decentralised practices of science and are apparently often crucial. A failure to replicate at the checking stage can indicate unknown, or unreported, differences in any of these factors. Detective work to find out where the unrecognised differences leading to different outcomes lie is part of the work of scientists. These differences need not lie in moral failures but in the realities of life – that complete description is almost always impossible, that to err is human, and that some choices are marginal or perhaps judgement calls.
This is where things elide seamlessly into robustness testing. For a simple example, consider the choice of what constitutes a comparable sample for an experiment or a survey, or even one drawn from an existing secondary data source. Comparability is generally validated by comparing descriptive statistics. These will never be identical between the original and subsequent sample. (Value) judgements are required as to whether the samples are sufficiently similar.
What is interesting is that Clemens sees merit in flogging this dead horse. Hardly any replication practitioner sees merit in restricting the practice to “checking” (and Clemens provides citations in support of this), and most consider that checking hardly provides sufficient motivation for the work involved, given the low status and poor publication prospects (Dewald et al., 1986; Duvendack and Palmer-Jones, 2013). But I disagree with the view that checking has little value (I am reminded of Oscar Wilde’s aphorism about knowing the price of everything but the value of nothing). “Checking” helps understand what the authors have done as a basis for extension, and so on, and has revealed plenty of problems in economics, from Dewald et al., 1986, through McCullough and Vinod, 2003. The work reported by Glandon, 2011, whose results were anodyne and lacking in detail, might perhaps have revealed more if the checking had extended beyond reproducing tables and graphs from estimation data sets with provided code, to include data, variable and sample preparation, and re-writing estimation code from scratch, perhaps in a different computer language, and perhaps if it had been undertaken by people with more status and authority than that of a “graduate student”.
However, there clearly is support for the view that replication should be restricted to checking and perhaps minimal pre-specified robustness testing, in parts of the economics establishment (e.g. Ozler, 2014). I do not here expand on the issue of pre-defined replication plans (see also Kahneman, 2014). Since replicated authors have been able to vary data sets, samples, variable constructions, estimation models and methods and so on, so, within reasonable, similar limits, should replicators be allowed to test the robustness of results. Sauce for the goose should be sauce for the gander.
Is the potential loss of reputation due to unanswerable botched or malign replication, or just to the mere mention of replication? If so, why should reputation hang on such a fragile thread, especially nowadays when social media provide ample opportunity for prompt and extensive response from replicatees who consider themselves maligned?
What is it about the replication word that gets to these economists? Elsewhere, co-authors and I have suggested the answer may lie in the nature of professional economics as a policy science (Duvendack and Palmer-Jones, 2013; Camfield et al., 2014). Following Ioannidis (2005) and related work suggesting the fragility of statistical analyses (see Maniadis et al., 2014), we argue that emphasis on “originality” and statistical significance is associated with data mining, model polishing (p-hacking), and HARKing (hypothesising after results are known), in often underpowered studies. These result in far too many false positives, which proper replication will unveil in a process of both checking and robustness testing. We have also suggested that ideological and career interests are involved in the race to produce significant and surprising policy-relevant results, behaviour promoted by the institutions of economics, even when producing these results requires practices which may contravene well-known principles of statistical estimation and testing. This may result in cognitive dissonance. I don’t elaborate these arguments here; they are touched on in some of the references already mentioned. The response to cognitive dissonance is not usually to change the behaviour producing the dissonance but to ignore the problem, or to justify the practices. We can see these behaviours in the texts protesting against the extension of replication to include robustness testing (or what I prefer to term statistical and scientific testing).
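The link between underpowered studies and false positives can be made concrete with the positive predictive value arithmetic of Ioannidis (2005). The sketch below is only an illustration: the function name and the input values are assumptions chosen for the example, not figures from any study cited here, and it ignores bias, which Ioannidis argues makes matters worse.

```python
def positive_predictive_value(prior_odds, power, alpha=0.05):
    """Share of statistically significant findings that reflect true effects.

    prior_odds: ratio of true to null relationships among those tested (R)
    power:      probability of detecting a true effect (1 - beta)
    alpha:      significance threshold (probability of a false positive)
    Implements PPV = power * R / (power * R + alpha), with no allowance for bias.
    """
    true_positive_mass = power * prior_odds
    false_positive_mass = alpha
    return true_positive_mass / (true_positive_mass + false_positive_mass)


# Hypothetical field in which one tested relationship in five is true (R = 0.25).
well_powered = positive_predictive_value(prior_odds=0.25, power=0.8)
underpowered = positive_predictive_value(prior_odds=0.25, power=0.2)

print(f"power 0.8: {well_powered:.2f} of significant results are true")
print(f"power 0.2: {underpowered:.2f} of significant results are true")
```

Under these assumed inputs, dropping power from 0.8 to 0.2 lowers the share of true findings among significant results from about four in five to about one in two, even before any data mining or p-hacking is taken into account.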
Clemens’ proposal to restrict replication to checking would amount to a “public lie” (Geissler, 2013), a wilful blindness (Heffernan, 2012), or contrived ignorance (Luban, 2007, chapter 6), evidence of a state of denial (Cohen, 2001). Reject this proposal; support proper replication.
Camfield, L., Palmer-Jones, R., 2013. Three “Rs” of Econometrics: Repetition, Reproduction and Replication. Journal of Development Studies 49, 1607–1614.
Camfield, L., Palmer-Jones, R.W., 2015. Ethics of analysis and publication in international development, in: Nakray, N., Alston, M., Wittenbury, K. (Eds.), Social Science Research Ethics for a Globalizing World: Interdisciplinary and Cross-Cultural Perspectives. Routledge, New York.
Camfield, L., Duvendack, M., Palmer-Jones, R., 2014. Things you Wanted to Know about Bias in Evaluations but Never Dared to Think. IDS Bulletin 45, 49–64.
Cohen, S., 2001. States of Denial: Knowing About Atrocities and Suffering. Polity Press, Cambridge.
Collins, H.M., 1991. The Meaning of Replication and the Science of Economics. History of Political Economy 23, 123–142.
Collins, H.M., 1985. Changing Order: Replication and Induction in Scientific Practice. University of Chicago Press, Chicago.
Dercon, S., Gilligan, D.O., Hoddinott, J., Woldehanna, T., 2014. The Impact of Agricultural Extension and Roads on Poverty and Consumption Growth in Fifteen Ethiopian Villages: Response to William Bowser, 3ie, New Delhi, available http://www.3ieimpact.org/media/filer_public/2015/02/06/original_author_response_rps_4.pdf, accessed 14/4/2015
Dewald, W.G., Thursby, J.G., Anderson, R.G., 1986. Replication in Empirical Economics: The Journal of Money, Credit and Banking Project. American Economic Review 76, 587–603.
Duvendack, M., Palmer-Jones, R.W., 2013. Replication of Quantitative work in development studies: Experiences and suggestions. Progress in Development Studies 13, 307–322.
Duvendack, M., Palmer-Jones, R.W., 2014. Replication of quantitative work in development studies: experiences and suggestions, in: Camfield, L., Palmer-Jones, R.W. (Eds.), As Well as the Subject: Additional Dimensions in Development Research Ethics. Palgrave, London.
Duvendack, M., Palmer-Jones, R.W., Reed, W.R., 2015. Replications in Economics: A Progress Report. Econ Journal Watch, forthcoming.
Geissler, P.W., 2013. Public secrets in public health: Knowing not to know while making scientific knowledge. American Ethnologist 40, 13–34.
Glandon, P., 2011. Report on the American Economic Review Data Availability Compliance Project. American Economic Review 101, 695–699.
Hamermesh, D.S., 2007. Viewpoint: Replication in Economics. Canadian Journal of Economics 40, 715–733.
Hamermesh, D.S., 1997. Some Thoughts on Replications and Reviews. Labour Economics 4, 107–109.
Heffernan, M., 2012. Willful Blindness: Why We Ignore the Obvious at Our Peril, Reprint edition. Walker & Company, New York.
Ioannidis, J.P.A., 2005. Why Most Published Research Findings Are False. PLoS Med 2, e124.
Jensen, R., Oster, E., 2009. The Power of TV: Cable Television and Women’s Status in India. The Quarterly Journal of Economics 124 (3), 1057-1094.
Luban, D., 2007. Legal Ethics and Human Dignity. Cambridge University Press, Cambridge.
McCullough, B.D., Vinod, H.D., 2003. Verifying the Solution from a Nonlinear Solver: A Case Study. The American Economic Review 93, 873–892.
Maniadis, Z., Tufano, F., List, J.A., 2014. One Swallow Doesn’t Make a Summer: New Evidence on Anchoring Effects. American Economic Review 104, 277–290.
Peng, R.D., 2011. Reproducible Research in Computational Science. Science, 334, 1226–1227.