[From the working paper, “Sound Inference in Complicated Research: A Multi-Strategy Approach” by Sanjay Srivastava, posted at PsyArXiv Preprints]
“Preregistration is effective because it creates decision independence: analytic decisions are the same regardless of the specific and potentially spurious features of the data being analyzed, instead of being overfit to them. But preregistration can be difficult in practice for some complicated research paradigms, including longitudinal studies, statistical modeling, machine learning, and other intensively multivariate designs and analyses where key decisions may be difficult to anticipate or make in advance.”
“When simple preregistration is not practical, other strategies that create decision independence can be used to augment it or in place of it. This manuscript discusses standardization, blind analyses, data partitioning, supporting studies, coordinated analysis, and multiverse analyses as additional strategies that can be used to create adaptive preregistrations or to handle decisions that were not anticipated.”
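As a concrete illustration of one of these strategies, data partitioning, here is a minimal sketch (our own, not from the paper; the file and variable names are hypothetical): the data are split once, exploratory decisions are made on one half, and the other half is touched only once, for the now-fixed confirmatory analysis.

```python
# Minimal sketch of data partitioning (hypothetical file and variable names).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("study_data.csv")  # hypothetical data set

# Split once, up front; the fixed seed makes the split auditable.
explore, confirm = train_test_split(df, test_size=0.5, random_state=20181114)

# 1) Make all analytic decisions (model, covariates, exclusions) using `explore`.
# 2) Run the single, now-fixed analysis on `confirm` exactly once.
```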
[From a post by Andrew Gelman at his blog, Statistical Modeling, Causal Inference, and Social Science]
“If researchers and policy makers continue to view results of impact evaluations as a black box and fail to focus on mechanisms, the movement toward evidence-based policy making will fall far short of its potential for improving people’s lives.”
“I agree with this quote from Bates and Glennerster, and I think the whole push-a-button, take-a-pill, black-box attitude toward causal inference has been a disastrous mistake. I feel particularly bad about this, given that econometrics and statistics textbooks, including my own, have been pushing this view for decades.”
“Stepping back a bit, I agree with Vivalt that, if we want to get a sense of what policies to enact, it can be a mistake to try to make these decisions based on the results of little experiments. There’s nothing wrong with trying to learn from demonstration studies (as here), but generally I think realism is more important than randomization. And, when effects are highly variable and measurements are noisy, you can’t learn much even from clean experiments.”
[From the blog, “Gazing into the Abyss of P-Hacking: HARKing vs. Optional Stopping” by Angelika Stefan and Felix Schönbrodt, posted at Felix Schönbrodt’s website at http://www.nicebread.de]
“Now, what does a researcher do when confronted with messy, non-significant results? According to several much-cited studies (for example John et al., 2012; Simmons et al., 2011), a common reaction is to start sampling again (and again, and again, …) in the hope that a somewhat larger sample size can boost significance. Another reaction is to wildly conduct hypothesis tests on the existing sample until at least one of them becomes significant (see for example: Simmons et al., 2011; Kerr, 1998). These practices, along with some others, are commonly known as p-hacking, because they are designed to drag the famous p-value right below the mark of .05 which usually indicates statistical significance. Undisputedly, p-hacking works (for a demonstration try out the p-hacker app).”
“… P-hacking exploits alpha error accumulation and fosters the publication of false positive results, which is bad for science. However, we want to take a closer look at how bad it really is. In fact, some p-hacking techniques are worse than others (or, if you like the unscrupulous science villain perspective: some p-hacking techniques work better than others).”
“As a showcase, we want to introduce two researchers: The HARKer takes existing data and conducts multiple independent hypothesis tests (based on multiple uncorrelated variables in the data set) with the goal of publishing the ones that become significant.”
“… the Accumulator uses optional stopping. This means that he collects data for a single research question in a sequential manner until either statistical significance or a maximum sample size is reached.”
“… To conclude, we have shown how two p-hacking techniques work and why their application is bad for science. We found out that p-hacking techniques based on multiple testing typically end up with higher rates of false positive results than p-hacking techniques based on optional stopping, if we assume the same number of hypothesis tests.”
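To make the comparison concrete, here is a small simulation sketch of the two villains under a true null effect (our illustration, not the blog’s own code; the five tests, five looks, and sample sizes are arbitrary choices):

```python
# Simulation sketch: false positive rates of the HARKer (multiple testing)
# and the Accumulator (optional stopping) when the true effect is zero.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2019)
ALPHA, N_SIMS = 0.05, 5000

def harker(n_tests=5, n=50):
    # Test several independent null variables; "succeed" if any p < alpha.
    return any(
        stats.ttest_1samp(rng.normal(size=n), 0).pvalue < ALPHA
        for _ in range(n_tests)
    )

def accumulator(n_looks=5, batch=10):
    # Add a batch of observations, re-test, and stop at significance.
    x = np.empty(0)
    for _ in range(n_looks):
        x = np.concatenate([x, rng.normal(size=batch)])
        if stats.ttest_1samp(x, 0).pvalue < ALPHA:
            return True
    return False

print("HARKer false positive rate:     ", np.mean([harker() for _ in range(N_SIMS)]))
print("Accumulator false positive rate:", np.mean([accumulator() for _ in range(N_SIMS)]))
```

With these settings the HARKer’s false positive rate lands near 1 − 0.95⁵ ≈ 0.23, while the Accumulator’s stays noticeably lower, matching the blog’s conclusion for an equal number of tests.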
[From Felix Schönbrodt’s Twitter account, @nicebread303]
“The p-hacker app just UNLOCKED the most requested PRO FEATURE: Subgroup analyses!!!”
“Check if you can find the effect for, say, young males. This is soo theoretically interesting. Now you can get the p-hacking skills of the experts FOR FREE!!!”
To check out the app and have hours of p-hacking fun, click here.
[From the article, “Psychology’s Replication Crisis Is Running Out of Excuses” By Ed Yong, published in The Atlantic]
“The Many Labs 2 project was specifically designed to address these criticisms. With 15,305 participants in total, the new experiments had, on average, 60 times as many volunteers as the studies they were attempting to replicate. The researchers involved worked with the scientists behind the original studies to vet and check every detail of the experiments beforehand. And they repeated those experiments many times over, with volunteers from 36 different countries, to see if the studies would replicate in some cultures and contexts but not others.”
“…Despite the large sample sizes, and the blessings of the original teams, the team failed to replicate half of the studies they focused on.”
“…Likewise, Many Labs 2 ‘was explicitly designed to examine how much effects varied from place to place, from culture to culture,’ says Katie Corker, the chair of the Society for the Improvement of Psychological Science. ‘And here’s the surprising result: The results do not show much variability at all.’ If one of the participating teams successfully replicated a study, others did, too. If a study failed to replicate, it tended to fail everywhere.”
“It’s worth dwelling on this, because it’s a serious blow to one of the most frequently cited criticisms of the ‘reproducibility crisis’ rhetoric. Surely, skeptics argue, it’s a fantasy to expect studies to replicate everywhere. ‘There’s a massive deference to the sample,’ says Nosek. ‘Your replication attempt failed? It must be because you did it in Ohio and I did it in Virginia, and people are different. But these results suggest that we can’t just wave those failures away very easily.’”
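For readers curious how “variability from place to place” gets quantified in projects like this, one standard tool is a heterogeneity statistic such as Cochran’s Q and I² computed over per-site effect estimates. The sketch below is illustrative only (the numbers are invented, and this is not Many Labs 2’s actual analysis pipeline):

```python
# Sketch: quantifying cross-site variability with Cochran's Q and I^2.
# Invented numbers; not Many Labs 2's actual analysis pipeline.
import numpy as np

def heterogeneity(effects, variances):
    """Cochran's Q and I^2 (%) for per-site effect estimates."""
    effects, variances = np.asarray(effects), np.asarray(variances)
    w = 1.0 / variances                        # inverse-variance weights
    pooled = np.sum(w * effects) / np.sum(w)   # fixed-effect pooled estimate
    q = np.sum(w * (effects - pooled) ** 2)    # Cochran's Q
    df = len(effects) - 1
    i2 = 0.0 if q == 0 else max(0.0, (q - df) / q) * 100
    return q, i2

# Hypothetical standardized effects and sampling variances from four sites:
q, i2 = heterogeneity([0.11, 0.08, 0.15, 0.05], [0.010, 0.012, 0.008, 0.011])
print(f"Q = {q:.2f}, I^2 = {i2:.1f}%")  # low I^2 = little site-to-site variability
```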
[From the article “Making economics transparent and reproducible” by Tyler Smith, published on the American Economic Association’s website]
“The AEA spoke with Miguel about the replication problem in economics and how the next generation of researchers is embracing new tools to make the profession more credible.”
…
“AEA: There’s a growing debate about the credibility of the social sciences, what some are calling a ‘replication crisis.’ What kind of challenge do you think the social sciences and economics are facing?”
“Miguel: There’s been increasing evidence really across the social sciences, and economics is no exception, that publication bias is a real concern.”
…
“AEA: What’s the one problem you think we really need to tackle?”
“Miguel: I do think that an absolute, fundamental improvement which has already occurred to a large degree has been sharing of data and statistical code. … The concern around that is that more and more people are using proprietary data. So the trend towards open data and data sharing, which has been so strong, is sort of stalling out a little bit right now. And it’s not clear exactly how to solve that problem…”
To read the article and/or listen to a full-length audio version of the interview, click here.
Two weeks ago, on Halloween, I wrote a post about how to conduct a replication study using an approach that emphasizes which tests might be run in order to avoid the perception of a witch hunt. The post is based on my paper with Benjamin D.K. Wood, which I recently presented at the “Reproducibility and Integrity in Scientific Research” workshop at the University of Canterbury. When Ben and I first submitted the paper to Economics E-journal, we received some great referee comments (all of which are public) including requests by an anonymous referee and Andrew Chang to include in the paper a list of what not to do – a list of don’ts.
We spent some time thinking about this request. We realized that what the referees wanted was a list of statistical and econometric no-nos, especially drawing on the most controversial replication studies funded by the International Initiative for Impact Evaluation (3ie) while we were both there. However, our role at 3ie was to be a neutral third party, at least as much as possible, and we didn’t want to abandon that now.
At the same time, we did learn a lot of lessons about conducting replication research while at 3ie, and we agreed that some of those lessons would be appropriate don’ts. So we added a checklist of don’ts to the paper that was ultimately published. Here I summarize three of these don’ts. I should note that I’m talking here about internal replication studies, in which the replication researcher uses the original data from a publication to check whether the published findings can be exactly reproduced and are robust, particularly those findings supporting conclusions and recommendations.
When conducting a replication study, don’t confuse critiques of the original research with the replication tests or findings. Certainly, critiques of the original research can motivate the choice of replication exercises, and it is fine to present critiques in that context. But often there are critiques that are separate from what can be explored with the data. For example, a replication researcher might be concerned that the fact that treatment and control groups were unblinded may mean the published findings are biased.
This concern about the original research design may be valid, but it is not something that can be tested through replication exercises. Simply identifying this concern is not a replication finding. We saw many examples where replication researchers interspersed their critiques of the motivation or design of the original research with their replication exercises and results. Mixing these two types of analysis contributed to some of the biggest controversies that we witnessed.
Don’t conduct measurement and estimation analysis (which some call robustness testing) before conducting a pure replication. (See here and here for more on terminology and the 3ie replication program.) Often replication researchers begin a study motivated by questions of robustness and may even take for granted that a pure replication (which is applying the published analysis methods to the original data) would reproduce the published results.
While skipping the pure replication may seem like a way to save time, conducting it usually saves time in the end. The pure replication is the best way for the replication researcher to familiarize herself with the data, methods, and findings of the original publication, and missing a problem at the pure replication stage is only going to confuse the measurement and estimation analysis.
Even more to the point, some consider pure replication the only stage of the research that should be called “replication”, and therefore the only results that should be reported as replication results. It is important for a replication researcher to be able to make a clear statement about the results at this stage.
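As a rough sketch of what the pure replication stage can look like in practice (the file name, regression specification, and published coefficient below are all hypothetical, not drawn from any 3ie study): re-run the published specification on the original data and compare the result to the published estimate before any robustness work begins.

```python
# Pure replication sketch: re-run the published specification on the
# original data and compare to the published estimate. Everything here
# (file name, specification, coefficient) is hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("original_study_data.csv")    # the authors' posted data
model = smf.ols("outcome ~ treatment + age + income", data=df).fit()

PUBLISHED_COEF = 0.42                          # hypothetical value from the paper's tables
reproduced = model.params["treatment"]

print(f"published: {PUBLISHED_COEF:.3f}, reproduced: {reproduced:.3f}")
print("pure replication", "succeeds" if abs(reproduced - PUBLISHED_COEF) < 0.01 else "fails")
```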
Don’t present, post or publish replication results without first sharing them with the original authors. Replication research is, unfortunately, often a contentious undertaking. Replication researchers are advised to take the high road and communicate with original authors about their work – ideally from the beginning, even if the data are already publicly available. We saw cases where the replication researchers made mistakes that the original authors caught, so communication can save face on both sides.
There is a real concern about the original authors scooping a replication study by posting a correction without citing the replication researchers. We have seen this happen. Some approaches to addressing it include publicly posting the replication plan in advance. This research transparency approach serves multiple purposes, but one is putting a name and timestamp on the work that might lead to corrections. Another approach is to document the dates and subjects of communications with original authors and include this information, as an acknowledgement or footnote, in the replication study.
Perhaps one of our most important don’ts is this: don’t label the difference between a published result and a replication result an “error” or “mistake” without identifying the source of the error. Just because the second estimate is different from the first does not make the second one right. Ben and I already blogged about this don’t here on the World Bank Development Impact blog.
Recent revelations, such as last month’s report of the retraction of 15 articles by the well-known Cornell food researcher Brian Wansink, remind us that replication research is as important to the advancement of the natural and social sciences as ever. My hope is that more researchers accept the responsibility of conducting replication research as part of their contribution to science. The advice presented in the “which tests” paper and summarized in my last post and this one is intended to help them get started.
Annette N. Brown, PhD is Principal Economist at FHI 360, where she leads efforts to increase and enhance evidence production and use across all sectors and regions. She previously worked at 3ie, where she directed the research transparency programs, including the replication program.
BBC Radio just produced an interesting and balanced program about the replication crisis, with a focus on psychology.
Interviewees include John Bargh, Susan Fiske, John Ioannidis, Brian Nosek, Stephen Reicher, Diederik Stapel and Simine Vazire. One of the highlights is hearing Diederik Stapel say, in his own voice, “I made up the data.”
And I like this closing from the narrator: “Ironically, psychology may now be ahead of the other sciences in putting its house in order.”
The program is about 30 minutes long. To listen, click here.
[From the preprint “A Model-Centric Analysis of Openness, Replication, and Reproducibility”, by Bert Baumgaertner, Berna Devezer, Erkan Buzbas, and Luis Nardin, posted at arXiv.org]
“In order to clearly specify the conditions under which we may or may not obtain reproducible results, we present a formalization of the concepts of replication and reproducibility. Our main insight from this formalization is that there are some impediments to obtaining reproducible results that precede many of the erroneous practices often cited in the literature as causes of the reproducibility crisis.”
“Hitherto, the literature on the reproducibility crisis has lacked a formal analysis of the conditions under which we can expect to have reproducible results. We show that some aspects of the scientific process, particularly those related to openness of critical components of an experiment, cannot prevent irreproducibility, even if other sources of error are absent. Thus, a failure to reproduce a scientific result is not necessarily due to some methodological or cultural practice which deviates from some ideal notion of science.”
To read the article, click here.
Here are some things you might want to be aware of when thinking about doing a replication.
1) You want to do a replication? Great, you are doing a great service to science! As shown in Reed (2018), ‘a single replication, if it confirms the original study, can have a large impact on the probability that the estimated effect is real.’[1] At the workshop, Bob Reed (University of Canterbury Business School) gave an overview of the current state of replication in economics; you can read the blog post based on Bob’s presentation on The Replication Network.
2) That being said, Jeff Miller (University of Otago, Psychology) pointed out that, on purely statistical grounds, the chance of replicating a significant result (the ‘aggregate replication probability’) is only 36%. Hence, failure to replicate should not come as a surprise. Moreover, for purely statistical reasons, ‘If your effect is real, you will get about 60% significant results – if not, you will get about 5% significant results’. Hence, there is a sizeable chance that your failed replication would, in fact, point us away from rather than closer to the truth.
3) As the first step on your replication journey, you might want to try a Push Button Replication (PBR) – Benjamin Wood (Integra) pointed us to 3ie’s protocol for PBRs, which can be found here.
4) But are you sure you want to do a replication? Be aware that replication is just one way of trying to find the truth. Brian Haig (University of Canterbury, Psychology) suggested considering ‘methodological triangulation’ as an alternative: ‘methodological triangulation involves the use of multiple independent methods in order to detect the same property or phenomenon’.
5) With all the focus on statistical significance in replication studies, one might forget that there are alternatives to that too: Philip Schluter (University of Canterbury, Health Sciences) focused on the Bayesian approach, making it clear that subjectivity can be OK, as long as one makes clear what one’s prior is.
6) In a similar vein, Arin Basu (University of Canterbury, Health Sciences) focused on the importance of causality. After all, what one really wants is to check causality; that a correlation is replicable might not be that interesting if it is not clear what causes the correlation. And hey, maybe you’d want to consider doing a meta-analysis?
7) You still want to do a replication? Practical advice can be found in Annette Brown’s (FHI 360) paper – it provides a battery of tests that can be done to check the robustness of the paper you are trying to replicate. In fact, this paper is a great guide for anybody who wants to produce replicable research – indeed, rather than waiting for others to replicate your paper, these tests could/should be done by the authors of the original papers! This is another good reason to do a replication: it will make you a better researcher! A summary of Annette’s paper can be found in her blog post on The Replication Network.
8) But are you really sure you want to do a replication? The work of Thomas Pfeiffer (Massey, Biology) shows that betting markets can help predict which papers are replicable. So you might want to consider a betting market on the paper you want to replicate rather than doing the actual replication.
9) Finally, to avoid letting results drive your research, Eric Vanman (University of Queensland, Psychology) suggested preregistering what one plans to do (you can do this here). This applies to replications too: a replication plan will help you avoid the temptation to keep searching for issues until you can show a paper cannot be replicated.
Tom Coupé is an Associate Professor of Economics at the University of Canterbury, New Zealand.
[1] An example from Reed (2018): ‘The values in the table show the original PSP of 0.20 gets updated depending on whether the replication was unsuccessful or successful. Following an unsuccessful replication, the post‐replication probability that a relationship exists falls from 0.20 to 0.12. However, a successful replication raises the probability from 0.20 to 0.71.’
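For readers who want to see where numbers like these come from: they follow from a Bayes-rule update of the prior probability that the effect is real. Here is a minimal sketch, assuming a prior of 0.20 (from the quote), α = 0.05, and replication power of roughly 0.49 (assumed values that approximately reproduce the figures above; Reed (2018) gives the exact assumptions):

```python
# Bayes-rule sketch behind the quoted post-replication probabilities.
# The prior of 0.20 is from the quote; alpha = 0.05 and power = 0.49 are
# assumptions chosen to roughly reproduce Reed's numbers.
prior, alpha, power = 0.20, 0.05, 0.49

p_sig = prior * power + (1 - prior) * alpha       # P(significant replication)
post_success = prior * power / p_sig              # P(real | significant)
post_failure = prior * (1 - power) / (1 - p_sig)  # P(real | not significant)

print(f"after a successful replication:   {post_success:.2f}")  # ~0.71
print(f"after an unsuccessful replication: {post_failure:.2f}")  # ~0.12
```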