MURPHY: Quantifying the Role of Research Misconduct in the Failure to Replicate
[NOTE: This blog is based on the article “HARKing: How Badly Can Cherry-Picking and Question Trolling Produce Bias in Published Results?” by Kevin Murphy and Herman Aguinis, recently published in the Journal of Business and Psychology.]
The track record for replications in the social sciences is discouraging. Several recent papers have documented and commented on the failure to replicate studies in economics and psychology (Chang & Li, 2015; Open Science Collaboration, 2015; Ortman, 2015; Pashler & Wagenmakers, 2012). This “reproducibility crisis” has stimulated a number of excellent methodological papers documenting the many reasons for the failure to replicate (Braver, Thoemmes & Rosenthal, 2014; Maxwell, 2004). In general, this literature has shown that a combination of low statistical power and a continuing reliance on null hypothesis testing has contributed substantially to the apparent failure of many studies to replicate, but there is a lingering suspicion that research misconduct also plays a role in the widespread failure to replicate.
Out-and-out fraud in research has been reported in a number of fields; Ben-Yehuda and Oliver-Lumerman (2017) have chronicled nearly 750 cases of research fraud between 1880 and 2010 involving fabrication and falsification of data, misrepresentation of research methods and results and plagiarism. Their work has helped to identify the roles of institutional factors in research fraud (e.g., a large percentage of the cases examined involved externally funded research at elite institutions) as well as identifying ways of detecting and responding to fraud. This type of fraud appears to represent only a small proportion of the studies that are published, and since many of the known frauds have been perpetrated by the same individuals, the proportion of genuinely fraudulent researchers may be even smaller.
A more worrisome possibility is that researcher behaviors that fall short of outright fraud may nevertheless bias the outcomes of published research in ways that will make replication less likely. In particular, there is a good deal of evidence that a significant proportion of researchers engage in behaviors such as HARKing (posing “hypotheses” after the results of a study are known) or p-hacking (combing through or accumulating results until you find statistical significance) (Bedeian, Taylor & Miller, 2010; Head, Holman, Lanfear, Kahn & Jennions, 2015; John, Loewenstein & Prelec, 2012). These practices have the potential to bias results because they involve a systematic effort to find and report only the strongest results, which will of course make it less likely that subsequent studies in these same areas will replicate well.
Although it is widely recognized that author misconduct, such as HARKing, can bias the results of published studies (and therefore make replication more difficult), it has proved surprisingly difficult to determine how badly HARKing actually influences research results.
There are two reasons for this. First, HARKing might include a wide range of behaviors, from post-hoc analyses that are clearly labelled as such to unrestricted data mining in search of something significant to publish, and different types of HARKing might have quite different effects. Second, authors usually do not disclose that the results they are submitting for publication are the result of HARKing, and there is rarely a definitive test for HARKing. [O’Boyle, Banks & Gonzalez-Mulé (2017) were able to evaluate HARKing on an individual basis by comparing the hypotheses posed in dissertations with those reported in published articles based on the same work. They suggested that in the majority of the cases they examined, there was considerably more alignment between results and hypotheses in published papers than in dissertations, presumably as a result of post-hoc editing of hypotheses.]
In a recent paper that Herman Aguinis and I published in the Journal of Business and Psychology, we suggested that simulation methods could be useful for assessing the likely impact of HARKing on the cumulative findings of a body of research. In particular, we used simulation methods to try to capture what authors actually do when they HARK. Our review of research on HARKing suggested that two particular types of behavior are both widespread and potentially worrisome. First, some authors decide on a research question, then scan results from several samples, statistical tests, or operationalizations of their key variables, selecting the strongest effects for publication. This type of cherry picking does not invent new hypotheses after the data have been collected, but rather samples the available data to build the best case for a particular hypothesis. Second, other authors scan results from different studies, samples, analyses, etc. that involve some range of variables, decide after looking at the data which relationships look strongest, and then write up their research as if they had hypothesized this relationship all along. This form of question trolling is potentially more worrisome than cherry picking because these researchers allow the data to tell them what their research question should be rather than using the research question to determine what sort of data should be collected and examined.
We wrote simulations that mimicked these two types of author behaviors to determine how much bias these behaviors might introduce. Because both cherry picking and question trolling involve choosing the strongest results for publication, both are likely to introduce some bias (and to lower the likelihood that subsequent studies will replicate the published result). Our results suggest that cherry picking introduces relatively small biases in absolute terms, but because the effects reported in the behavioral and social sciences are often quite small (Bosco, Aguinis, Singh, Field & Pierce, 2015), cherry picking can create a substantial boost in the relative size of effect size estimates. Question trolling has the potential to create biases that are sizable in both an absolute and a relative sense.
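The logic of these two behaviors can be sketched in a few lines of Python. This is only a minimal illustration, not the code or the design used in the article: the function names, the Fisher z approximation for sampling error, and every parameter value (a true correlation of 0.20, samples of 100, five results scanned, a spread of 0.10 in true effects) are assumptions chosen for demonstration.

```python
import math
import random
import statistics


def observed_r(rho, n, rng):
    """Draw one sample correlation using the Fisher z approximation (an assumption)."""
    z = rng.gauss(math.atanh(rho), 1.0 / math.sqrt(n - 3))
    return math.tanh(z)


def cherry_pick(rho, n, k, rng):
    """One research question, k looks at the same true effect; report the largest."""
    return max(observed_r(rho, n, rng) for _ in range(k))


def question_troll(mean_rho, sd_rho, n, k, rng):
    """k different relationships with heterogeneous true effects; report the largest observed."""
    # True effects are clipped to keep atanh finite (illustrative choice).
    rhos = [min(0.95, max(-0.95, rng.gauss(mean_rho, sd_rho))) for _ in range(k)]
    return max(observed_r(rho, n, rng) for rho in rhos)


def mean_bias(draw, baseline, reps=5000):
    """Average reported effect over many simulated studies, minus the baseline effect."""
    return statistics.fmean(draw() for _ in range(reps)) - baseline


if __name__ == "__main__":
    rng = random.Random(2017)
    rho, n, k = 0.20, 100, 5  # hypothetical values, not taken from the article
    honest = mean_bias(lambda: observed_r(rho, n, rng), rho)
    cherry = mean_bias(lambda: cherry_pick(rho, n, k, rng), rho)
    # For question trolling the baseline is the average true effect in the pool.
    troll = mean_bias(lambda: question_troll(rho, 0.10, n, k, rng), rho)
    print(f"no selection:      bias = {honest:+.3f}")
    print(f"cherry picking:    bias = {cherry:+.3f}")
    print(f"question trolling: bias = {troll:+.3f}")
```

In this sketch, question trolling selects on both sampling error and real differences among the effects in the pool, which is why its bias tends to exceed that of cherry picking under otherwise identical settings. The specific numbers it prints depend entirely on the assumed parameters and should not be read as the article's results.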
Our simulations suggest that the effects of HARKing on a cumulative literature can be surprisingly complex. They depend on the prevalence of HARKing, the type of HARKing involved, and the size and homogeneity of the pool of results the researcher consults before deciding what his or her “hypothesis” actually is.
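To make the dependence on prevalence and pool size concrete, here is a small hedged sketch in Python. It is not the authors' simulation: the function names, the Fisher z approximation, and the default values (true correlation 0.20, sample size 100, 4000 studies) are all hypothetical. A fraction of studies reports the largest of several estimates of the same effect, while the rest report a single estimate, and the average reported effect across the simulated literature is compared with the true value.

```python
import math
import random
import statistics


def sample_r(rho, n, rng):
    """One sample correlation via the Fisher z approximation (an assumption)."""
    z = rng.gauss(math.atanh(rho), 1.0 / math.sqrt(n - 3))
    return math.tanh(z)


def literature_mean(prevalence, pool_size, rho=0.20, n=100, studies=4000, seed=1):
    """Average reported effect when a share `prevalence` of studies picks the
    largest of `pool_size` estimates and the rest report one estimate as-is."""
    rng = random.Random(seed)
    reported = []
    for _ in range(studies):
        if rng.random() < prevalence:
            # Selective study: scan the pool and keep the strongest result.
            reported.append(max(sample_r(rho, n, rng) for _ in range(pool_size)))
        else:
            # Non-selective study: report the single estimate obtained.
            reported.append(sample_r(rho, n, rng))
    return statistics.fmean(reported)


if __name__ == "__main__":
    for prevalence in (0.0, 0.25, 0.5, 1.0):
        for pool_size in (3, 10):
            avg = literature_mean(prevalence, pool_size)
            print(f"prevalence={prevalence:.2f} pool={pool_size:2d} mean reported r={avg:.3f}")
```

Under these assumptions the average reported effect drifts upward as more studies select their strongest result and as the pool they select from grows; extending the sketch to pools with heterogeneous true effects would be needed to explore the role of homogeneity.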
Professor Kevin Murphy holds the Kemmy Chair of Work and Employment Studies at the University of Limerick. He can be contacted at Kevin.R.Murphy@ul.ie.
Bedeian, A. G., Taylor, S. G., & Miller, A. N. (2010). Management science on the credibility bubble: Cardinal sins and various misdemeanors. Academy of Management Learning & Education, 9, 715-725.
Ben-Yehuda, N. & Oliver-Lumerman, A. (2017). Fraud and Misconduct in Research: Detection, Investigation and Organizational Response. University of Michigan Press.
Bosco, F. A., Aguinis, H., Singh, K., Field, J. G., & Pierce, C. A. (2015). Correlational effect size benchmarks. Journal of Applied Psychology, 100, 431–449.
Braver, S. L., Thoemmes, F. J., & Rosenthal, R. (2014). Continuously cumulating meta-analysis and replicability. Perspectives on Psychological Science, 9, 333–342. doi:10.1177/1745691614529796
Chang, A. C., & Li, P. (2015). Is Economics Research Replicable? Sixty Published Papers from Thirteen Journals Say “Usually Not”. Finance and Economics Discussion Series 2015-083. Washington: Board of Governors of the Federal Reserve System. doi:10.17016/FEDS.2015.083
Head, M.L., Holman, L., Lanfear, R., Kahn, A.T. & Jennions, M.D. (2015). The Extent and Consequences of P-Hacking in Science. PLOS Biology, https://doi.org/10.1371/journal.pbio.1002106
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth-telling. Psychological Science, 23, 524-532.
Maxwell, S. E. (2004). The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods, 9, 147–163. doi:10.1037/1082-989X.9.2.147
O’Boyle, E. H., Banks, G. C., & Gonzalez-Mulé, E. (2017). The chrysalis effect: How ugly initial results metamorphosize into beautiful articles. Journal of Management, 43, NPi. https://doi.org/10.1177/0149206314527133
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349, aac4716. doi:10.1126/science.aac4716
Pashler, H., & Wagenmakers, E. J. (2012). Editors’ introduction to the special section on replicability in psychological science: A crisis of confidence? Perspectives on Psychological Science, 7, 528–530. doi:10.1177/1745691612465253