Podcast from Science Friday on Replication

This short podcast, just 12 minutes, is worth a listen.  It is an interview on Science Friday with Dan Simons, Professor of Psychology at the University of Illinois, and Barbara Spellman, former editor of Perspectives on Psychological Science, on a new initiative for replications in psychology called “Registered Replication Reports,” and the evolving attitudes of psychology journals towards replication.  And did we mention that it’s only 12 minutes long?  To listen, click here.

 

LAMPACH & MORAWETZ: A Primer on How to Replicate Propensity Score Matching Studies

Propensity Score Matching (PSM) approaches have become increasingly popular in empirical economics. These methods are intuitively appealing, and PSM procedures are available in well-known software packages such as R and Stata.
The fundamental idea behind PSM is that treated observations are compared with untreated control observations only if the two observations are otherwise identical. This frees the researcher from having to specify a functional form explicitly relating outcomes to control variables.  In its place, PSM requires a matching algorithm. The choice of the best matching algorithm is, however, an ongoing debate.
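To fix ideas, here is a minimal sketch of the basic procedure in Python: estimate propensity scores with a logistic regression, match each treated unit to its nearest control(s) on the score, and compute the average treatment effect on the treated (ATT). The data frame and column names ("treated", "y", "x1", "x2") are hypothetical placeholders, not drawn from any of the studies discussed here.

```python
# Minimal propensity score matching sketch: logistic propensity model,
# nearest-neighbor matching on the score, ATT as the output.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def psm_att(df, treatment="treated", outcome="y",
            covariates=("x1", "x2"), n_neighbors=1):
    X = df[list(covariates)].to_numpy()
    t = df[treatment].to_numpy().astype(bool)

    # 1. Estimate the propensity score P(treated | covariates).
    pscore = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]

    # 2. Match each treated unit to its nearest control(s) on the score.
    nn = NearestNeighbors(n_neighbors=n_neighbors)
    nn.fit(pscore[~t].reshape(-1, 1))
    _, idx = nn.kneighbors(pscore[t].reshape(-1, 1))

    # 3. ATT: mean gap between treated outcomes and matched-control means.
    y_treated = df.loc[t, outcome].to_numpy()
    y_controls = df.loc[~t, outcome].to_numpy()
    return float(np.mean(y_treated - y_controls[idx].mean(axis=1)))
```

A real study would add checks this sketch omits, notably covariate balance after matching and restriction to the region of common support.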
A quick tour through the literature is provided by a pair of articles by Dehejia and Wahba (1999, 2002), along with replications by Smith and Todd (2005) and Diamond and Sekhon (2013), which highlight different matching approaches.  As these studies use data from a randomized controlled trial (RCT) studied by LaLonde (1986), they provide an illuminating comparison between different variants of PSM. Other comparisons using data from RCTs can be found in Peikes et al. (2008) and Wilde and Hollister (2007).  Huber et al. (2013) use Monte Carlo experiments to compare the performance of different matching methods.
Two studies that replicate PSM research are Duvendack and Palmer-Jones (2012) and our own recent study, Lampach and Morawetz (2016).  Both reach a similar conclusion: the key issue is identification. Without the appropriate research design, matching will be misleading.
A good replication needs to do more than just check if the results are robust to an alternative matching algorithm. 
How can one determine whether a given research design is appropriate? We find Chapter 1.2 “Cochran’s Basic Advice” in the classic book by Paul Rosenbaum (2010) helpful. He distinguishes between “Better observational studies” and “Poorer observational studies” by stressing the importance of four main points:
— Clearly defined treatments (including the starting point of a treatment), covariates and outcomes
— The treatment should be close to random
— Good comparability of treatment and control observations
— Explicit testing of plausible alternative explanations for the measured effect.
Starting from here, researchers will also find the guidelines by Caliendo and Kopeinig (2008) and Imbens (2015) helpful.
Researchers interested in replicating PSM studies may also find our recent paper in Applied Economics (Lampach and Morawetz, 2016) helpful.  We provide a step-by-step guide for undertaking a PSM study in the context of a replication, following Caliendo and Kopeinig (2008).  PSM studies are particularly rewarding to replicate because they incorporate many decisions during the process of implementing the research (even given an appropriate research design).  Replicating a PSM study is illuminating both because it allows one to better appreciate the many decisions that must be made, and because it allows one to determine the robustness of the results to alternative choices in research design.
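To make that robustness exercise concrete, a replicator might re-run the matching step under several implementation choices and compare the estimates; large swings across specifications are a warning sign. A hedged sketch, reusing the hypothetical psm_att function from the earlier block (df and the squared-term column "x1_sq" are likewise placeholders):

```python
# Vary two of the many implementation choices: which covariates enter the
# propensity model, and how many neighbors each treated unit is matched to.
for covs in [("x1", "x2"), ("x1",), ("x1", "x2", "x1_sq")]:
    for k in (1, 3, 5):
        att = psm_att(df, covariates=covs, n_neighbors=k)
        print(f"covariates={covs}, neighbors={k}: ATT = {att:.3f}")
```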
We learned a lot from our replication experience and are grateful to the authors of the original work for providing us with data and code, to the authors who wrote the useful guidelines, to the journal that made it possible to publish the article, and finally to the organizers of The Replication Network for inviting us to write this blog.
REFERENCES:
Caliendo, M., Kopeinig, S., 2008. Some Practical Guidance for the Implementation of Propensity Score Matching. J. Econ. Surv. 22, 31–72. doi:10.1111/j.1467-6419.2007.00527.x
Chemin, M., 2008. The Benefits and Costs of Microfinance: Evidence from Bangladesh. J. Dev. Stud. 44, 463–484. doi:10.1080/00220380701846735
Dehejia, R.H., Wahba, S., 2002. Propensity Score-Matching Methods for Nonexperimental Causal Studies. Rev. Econ. Stat. 84, 151–161. doi:10.1162/003465302317331982
Dehejia, R.H., Wahba, S., 1999. Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs. J. Am. Stat. Assoc. 94, 1053–1062. doi:10.1080/01621459.1999.10473858
Diamond, A., Sekhon, J.S., 2013. Genetic Matching for Estimating Causal Effects: A General Multivariate Matching Method for Achieving Balance in Observational Studies. Rev. Econ. Stat. 95, 932–945. doi:10.1162/REST_a_00318
Duvendack, M., Palmer-Jones, R., 2012. High Noon for Microfinance Impact Evaluations: Re-investigating the Evidence from Bangladesh. J. Dev. Stud. 48, 1864–1880. doi:10.1080/00220388.2011.646989
Huber, M., Lechner, M., Wunsch, C., 2013. The performance of estimators based on the propensity score. J. Econom. 175, 1–21. doi:10.1016/j.jeconom.2012.11.006
Imbens, G.W., 2015. Matching Methods in Practice: Three Examples. J. Hum. Resour. 50, 373–419. doi:10.3368/jhr.50.2.373
Jena, P.R., Chichaibelu, B.B., Stellmacher, T., Grote, U., 2012. The impact of coffee certification on small-scale producers’ livelihoods: a case study from the Jimma Zone, Ethiopia. Agric. Econ. 43, 429–440. doi:10.1111/j.1574-0862.2012.00594.x
LaLonde, R.J., 1986. Evaluating the Econometric Evaluations of Training Programs with Experimental Data. Am. Econ. Rev. 76, 604–620.
Lampach, N., Morawetz, U.B., 2016. Credibility of propensity score matching estimates. An example from Fair Trade certification of coffee producers. Appl. Econ. 48, 4227–4237. doi:10.1080/00036846.2016.1153795
Peikes, D.N., Moreno, L., Orzol, S.M., 2008. Propensity Score Matching. Am. Stat. 62, 222–231. doi:10.1198/000313008X332016
Rosenbaum, P.R., 2010. Design of observational studies. Springer, New York.
Smith, J.A., Todd, P.E., 2005. Does matching overcome LaLonde’s critique of nonexperimental estimators? J. Econom. 125, 305–353. doi:10.1016/j.jeconom.2004.04.011
Wilde, E.T., Hollister, R., 2007. How close is close enough? Evaluating propensity score matching using data from a class size reduction experiment. J. Policy Anal. Manage. 26, 455–477. doi:10.1002/pam.20262

Netherlands Spending 3 Million Euros to Fund Replications

[From the website of the Netherlands Organisation for Scientific Research (NWO)]: “NWO is making 3 million euros available for a Replication Studies pilot programme. In this programme, scientists will be able to repeat research that has been carried out by others. The pilot focuses on replicating studies that have a large impact on science, government policy or the public debate. This is the first special funding programme in the world for the repetition of such ‘cornerstone research’. With this initiative NWO wants to facilitate innovation in science and encourage researchers to carry out replication research.” To read more, click here.

3ie Wants You To Do a Replication for Them — And They’re Willing to Pay

[From 3ie — International Initiative for Impact Evaluation]: “3ie requests expressions of interest from researchers interested in conducting replication studies under 3ie’s Replication Window 4: Financial Services for the Poor…Funding is available to conduct internal replications of seven highly influential impact evaluations of interventions of financial services for the poor.”  The deadline for submitting expressions of interest is 2 August, with formal proposals to come later.  To learn more, click here.

Tales from the (Psychology) Crypt

This story about academic negligence, if not outright fraud, has many similarities with previous posts about “data mistakes,” though there is enough that is unique in the story to make it interesting in its own right.  To paraphrase Tolstoy, “each unhappy article is unhappy in its own way”:  A PhD student discovers a mistake in a famous researcher’s study.  After years of persistent attempts, the mistake is eventually revealed, the famous researcher retracts his study(ies), and the PhD student is vindicated.  Kind of.  To read more, click here.

PPT Slides for Garret Christensen’s Presentation on Research Transparency in Economics

Recently, GARRET CHRISTENSEN, project scientist at BITSS, reviewed the literature on research transparency in a talk given at the Western Economic Association meetings.  You can access his slides here.

Tales from the (Economics) Crypt

Recently, ANDREW GELMAN blogged about a communication he received from Per Pettersson-Lidbom, an economist at Stockholm University. Pettersson shared three stories of “scientific fraud” in papers published in top economics journals.  Gelman writes, “… I’m sharing Pettersson’s stories, neither endorsing nor disputing their particulars but as an example of how criticisms in scholarly research just hang in the air, unresolved. Scientific journals are set up to promote discoveries, not to handle corrections.”  To read Gelman’s blog, click here.
(WARNING: Conflict of interest ahead!) TRN notes that two economics journals are worth highlighting in this context.  Public Finance Review and Economics: The Open-Access, Open-Assessment E-Journal have replication sections that publish both positive and negative replications of studies.  To read more about their replication policies, click on the immediately preceding links.

BOB REED: Replications and Peer Review

“Weekend Reads”, the weekly summary by IVAN ORANSKY of Retraction Watch, recently listed two articles on peer review.  One, a blog by George Borjas, concerns the recent imbroglio at the American Economic Review involving an editor who oversaw the review of an article by one of her coauthors (read here).  The other, a comment in Nature entitled “Let’s make peer review scientific” (read here), reviews 30 years of progress, and lack of progress, in peer reviewing.  Both articles underscore what is obvious to anybody who has even minimal experience with the reviewing process — peer review is a flawed process.
What does this have to do with replications?  The real problem with peer review is thinking that it is the final arbiter of a paper’s value.  Peer review is but one step in a lengthy process.  It follows the circulation of a working paper for comments and the presentation of one’s research at seminars and conferences.  But the publication of a paper should not be the final stage in a paper’s review.  
If a paper is important and makes a valuable contribution, that research should be examined further.  Were the data handled correctly?  Would alternative formulations of the research question have given similar results? Were the results robust to reasonable perturbations in experimental design?  These are things that are difficult for reviewers to address, because they generally do not have access to a researcher’s data and code.
Even when a journal requires data and code to be made available, reviewers rarely have access to these materials when conducting their review; they typically become available only after a paper has been accepted for publication.  And it is only after researchers have been able to go through a paper’s data and code that they can judge for themselves whether the paper’s conclusions are fragile or robust.
The problem with peer review is not so much a problem with peer review itself.  It is the scientific community’s elevation of peer review to the final word in the review process.  Peer review should be thought of as an intermediate stage in the review of a paper, as one part of the gauntlet that a paper needs to run to establish its scientific worth.  Until it becomes the norm for authors to provide their data and code when submitting their research to journals, it will inevitably be the case that the real “review” will have to be done in the post-publication phase of a paper’s life.  Through replication.

 

Progress Report on Open Science in Psychology

ETIENNE LEBEL, in a blog for BITSS, gives a brief but wide-ranging summary of the status of “open science” in psychology.  Topics include: (i) the use of “badges” to encourage provision of research materials, (ii) pre-registration, (iii) reproducibility, (iv) replications, (v) peer review, and (vi) meta-analysis, among others. To read more, click here.

IN THE NEWS: The Economist (June 18, 2016)

[From the article “Come Again”]: “The GRIM test, short for granularity-related inconsistency of means, is a simple way of checking whether the results of small studies of the sort beloved of psychologists (those with fewer than 100 participants) could be correct, even in principle. …
To understand the GRIM test, consider an experiment in which participants were asked to assess something (someone else’s friendliness, say) on an integer scale of one to seven. The resulting paper says there were 49 participants and the mean of their assessments was 5.93. It might appear that multiplying these numbers should give an integer product—ie, a whole number—since the mean is the result of dividing one integer by another. If the product is not an integer (as in this case, where the answer is 290.57), something looks wrong.”
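In code, the check amounts to a few lines of arithmetic. A minimal sketch in Python, assuming the mean was reported rounded to two decimal places (the published GRIM procedure also handles multi-item scales and alternative rounding conventions, which this sketch omits):

```python
def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
    """Could ANY integer sum of n responses round to the reported mean?"""
    nearest_total = round(reported_mean * n)             # closest integer sum
    achievable_mean = round(nearest_total / n, decimals)
    return achievable_mean == round(reported_mean, decimals)

# The article's example: 49 participants, reported mean 5.93. The nearest
# integer totals give means of 290/49 = 5.92 and 291/49 = 5.94, so a
# reported mean of 5.93 is arithmetically impossible.
print(grim_consistent(5.93, 49))  # False -> fails the GRIM test
```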
When the authors of the GRIM test applied it to 71 papers in three leading psychology journals, they found that over half of the papers failed the test.  To read more, click here.