The null hypothesis significance testing (NHST) paradigm is the dominant statistical paradigm in the biomedical and social sciences. A key feature of the paradigm is the dichotomization of results into the categories “statistically significant” and “not statistically significant” depending on whether the p-value falls below or above the size alpha of the test, where alpha is conventionally set to 0.05. Although prior research has often criticized this dichotomization for, among other things, having “no ontological basis” (Rosnow and Rosenthal, 1989) and for the arbitrariness of the 0.05 cutoff, the impact of this dichotomization on the judgments and decision making of academic researchers has received relatively little attention.
Our articles examine this question. We find that the dichotomization intrinsic to the NHST paradigm leads expert researchers from a variety of fields (including medicine, epidemiology, cognitive science, psychology, business, economics, and even statistics) to make errors in reasoning. In particular, when presented with a hypothetical study summary in which the p-value was experimentally manipulated to be either above or below the 0.05 threshold for statistical significance, we show:
 Academic researchers interpret evidence dichotomously, based primarily on whether the p-value is below or above 0.05.
 They fixate on whether a p-value reaches the threshold for statistical significance even when p-values are irrelevant (e.g., when asked about descriptive statistics).
 These findings apply to likelihood judgments about what might happen to future subjects as well as to choices made based on the data.
 Researchers’ judgments reflect a tendency to ignore effect size.
We briefly review these findings with a focus, given the audience of this blog, on our results for economists.
Study 1: Descriptive Statements
In our first series of studies, the hypothetical study summary described a clinical trial of two treatments where the outcome of interest was the number of months lived by the patients (an average of 8.2 and 7.5 months for treatments A and B, respectively). Our subjects were asked a multiple-choice question about whether the number of months lived by those who received treatment A was greater than, less than, or no different from the number of months lived by those who received treatment B, or whether it could not be determined.
The correct answer is, of course, that the average number of post-diagnosis months lived by the patients who received treatment A was greater than that lived by the patients who received treatment B (i.e., 8.2 > 7.5) regardless of the p-value. However, as illustrated in Figure 1, subjects were much more likely to answer the question correctly when the p-value in the question was set to 0.01 than to 0.27. Similar results held for researchers in psychology, business, and, to a lesser extent, statistics.
Study 2: Likelihood Judgments and Choices
In our second series of studies, the hypothetical study summary described a clinical trial of two drugs where the outcome of interest was whether or not patients recovered from a disease (e.g., recovery rates of 52% and 44% for Drugs A and B, respectively). Our subjects were asked two multiple-choice questions: first, a likelihood judgment question about whether a hypothetical patient would be more likely, less likely, or equally likely to recover if given Drug A versus Drug B, or whether it could not be determined; and, second, a choice question asking whether, if they were a patient, they would prefer to take Drug A or Drug B, or were indifferent between the two.
The question at stake in both the likelihood judgment question and the choice question is fundamentally a predictive one: both ask about the relative likelihood of a new patient recovering if given Drug A rather than Drug B. This in turn clearly depends on whether or not Drug A is more effective than Drug B. The p-value is of course one measure of the strength of the evidence regarding the likelihood that it is. However, the level of the p-value does not alter the “correct” response option for either question: the correct answer is option A, as Drug A is more likely to be more effective than Drug B (under the non-informative prior encouraged by the question wording, this probability is one minus half the two-sided p-value).
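This relationship can be checked numerically. The sketch below (a Python illustration, not from the original studies; the per-arm sample size of 100 is an assumption, as the post does not state it) computes the two-sided p-value from a two-proportion z-test for the 52% vs. 44% recovery rates and then simulates the posterior probability that Drug A's recovery rate exceeds Drug B's under flat Beta(1, 1) priors. The two quantities approximately satisfy posterior ≈ 1 − p/2.

```python
import math
import random

# Hypothetical sample sizes (assumed for illustration): 100 patients per arm.
n_a, n_b = 100, 100
x_a, x_b = 52, 44  # recoveries, i.e., 52% vs. 44% recovery rates

# Two-sided p-value from a two-proportion z-test (normal approximation).
p_pool = (x_a + x_b) / (n_a + n_b)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (x_a / n_a - x_b / n_b) / se
normal_cdf = lambda v: 0.5 * (1 + math.erf(v / math.sqrt(2)))
p_two_sided = 2 * (1 - normal_cdf(abs(z)))

# Posterior P(rate_A > rate_B) under flat Beta(1, 1) priors, by simulation.
random.seed(0)
draws = 200_000
wins = 0
for _ in range(draws):
    theta_a = random.betavariate(1 + x_a, 1 + n_a - x_a)
    theta_b = random.betavariate(1 + x_b, 1 + n_b - x_b)
    wins += theta_a > theta_b
post_prob = wins / draws

print(f"two-sided p-value:       {p_two_sided:.3f}")
print(f"posterior P(A better):   {post_prob:.3f}")
print(f"1 - p/2:                 {1 - p_two_sided / 2:.3f}")
```

With these assumed sample sizes the p-value lands well above 0.05, yet the posterior probability that Drug A is the better drug remains high, which is precisely the point: the evidence favors A continuously, not dichotomously.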
As illustrated in Figure 2, the proportion of subjects who chose Drug A for either question dropped sharply once the p-value rose above 0.05 but it was relatively stable thereafter and the magnitude of the treatment difference had no substantial impact on the results. However, the effect of statistical significance was attenuated for the choice question, consistent with the notion that making matters more personally consequential shifts the focus away from concerns about statistical significance and towards whether an option is superior. Similar results held for researchers in cognitive science, psychology, and, to a lesser extent, statistics.
We conducted similar studies with economists. As illustrated in Figure 3, similar results held. However, as illustrated in Figure 3e, the effect is attenuated when the researchers were presented not only with a p-value but also with a posterior probability based on a non-informative prior. This is interesting because, objectively, the posterior probability is a redundant piece of information: as noted above, under a non-informative prior it is one minus half the two-sided p-value.
Researchers from a wide variety of fields, including both statistics and economics, interpret p-values dichotomously depending upon whether or not they fall below the hallowed 0.05 threshold. This is in direct contravention of the third principle of the recent American Statistical Association Statement on Statistical Significance and p-values (Wasserstein and Lazar, 2016)—“Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold”—as well as countless other similar warnings.
What can be done? Our suggestions are not particularly new or original. We should emphasize that evidence, particularly that based on p-values and other purely statistical measures, lies on a continuum. We would go further and say that, in many cases, it does not make sense to calibrate scientific evidence as a function of the p-value, given that this statistic is defined relative to the generally uninteresting and implausible null hypothesis of zero effect and zero systematic error (McShane et al., 2017).
We suggest looking beyond purely statistical considerations and taking a more holistic and integrative view of evidence that includes prior and related evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain. Most importantly, perhaps, we should move away from dichotomous or categorical reasoning whether in the form of NHST or otherwise.
Blakeley B. McShane is an associate professor at the Kellogg School of Management, Northwestern University. David Gal is a professor at the University of Illinois at Chicago College of Business Administration. Correspondence regarding this blog post can be directed to either or both at firstname.lastname@example.org and email@example.com respectively.
 McShane, B. B., and Gal, D. (2016), “Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence,” Management Science, 62(6), 1707–1718.
 McShane, B. B., and Gal, D. (2017), “Statistical Significance and the Dichotomization of Evidence,” Journal of the American Statistical Association, 112(519), 885–895.
 McShane, B. B., Gal, D., Gelman, A., Robert, C., and Tackett, J. L. (2017), “Abandon Statistical Significance,” arXiv preprint arXiv:1709.07588.
 Rosnow, R. L., and Rosenthal, R. (1989), “Statistical Procedures and the Justification of Knowledge in Psychological Science,” American Psychologist, 44, 1276–1284.
 Wasserstein, R. L., and Lazar, N. A. (2016), “The ASA’s Statement on p-values: Context, Process, and Purpose,” The American Statistician, 70(2), 129–133.