[NOTE: This post refers to the article “An Economic Approach to Alleviate the Crises of Confidence in Science: With an Application to the Public Goods Game” by Luigi Butera and John List. The article is available as a working paper which can be downloaded here.]
In the process of generating scientific knowledge, scholars sometimes stumble upon new and surprising results. Novel studies typically face a binary fate: either their relevance and validity are dismissed, or their findings are embraced as important and insightful. Such judgements, however, commonly rely on statistical significance as the main criterion for acceptance. This poses two problems, especially when a study is the first of its kind.
The first problem is that novel results may be false positives simply because of the mechanics of statistical inference. Conversely, new and surprising results that suffer from low power, or only marginal statistical significance, may be dismissed even though they point toward an economic association that is ultimately true.
The second problem has to do with how people should update their beliefs based on unanticipated new scientific evidence. Given the mechanics of inference, it is difficult to provide a definite answer when such evidence comes from one single exploration. To fix ideas, suppose that before running an experiment, a Bayesian scholar had a prior of only 1% that a given result is true. After running the experiment and observing a significant result (at, say, the 5% level), the scholar should update his beliefs to 13.9%, a very large increase relative to the initial beliefs. Posterior beliefs can be easily computed, for any given prior, by dividing the probability that a result is both true and declared true by the probability that any result is declared true. Even more dramatically, a second scholar who instead had a prior of 10% would update his posterior beliefs to 64%. The problem is clear: posterior beliefs generated from low priors are extremely volatile when they depend on evidence from a single study. Finding a referee with priors of 10% or 1% can make or break a paper!
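These posteriors are easy to verify numerically. The short Python sketch below (mine, not from the paper) reproduces the figures quoted above under the assumption of 80% statistical power, which is the power level consistent with those numbers at a 5% significance threshold.

```python
def posterior(prior, alpha=0.05, power=0.80):
    """Posterior probability that a result is true, given a significant finding.

    Computed as P(true and declared true) / P(declared true), for a test
    with significance level `alpha` and statistical power `power`.
    (The 80% power default is an assumption; it matches the figures in the text.)
    """
    true_positive = prior * power          # true result, declared true
    false_positive = (1 - prior) * alpha   # false result, declared true
    return true_positive / (true_positive + false_positive)

print(round(posterior(0.01), 3))  # 1% prior  -> 0.139
print(round(posterior(0.10), 2))  # 10% prior -> 0.64
```

Note how sensitive the output is to the prior: moving it from 1% to 10% swings the posterior from roughly 14% to 64%, which is exactly the volatility the paragraph above describes.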
The simple solution to this problem is of course to replicate the study: as evidence accumulates, posterior beliefs converge. Unfortunately, the incentives to replicate existing studies are rarely in place in the social sciences: once a paper is published, the original authors have little incentive to replicate their own work. Similarly, the incentives for other scholars to closely replicate existing work are typically very low.
To address this issue, in our paper we propose a simple incentive-compatible mechanism to promote replications and generate mutually beneficial gains from trade between scholars. Our idea is simple: upon completion of a study that reports novel results, the authors make it available online as a working paper, but commit never to submit it to a peer-reviewed journal for publication. They instead calculate how many replications they need for beliefs to converge to a desired level, and then offer co-authorship of a second, yet to be written, paper to other scholars willing to independently replicate their study. Once the team of coauthors is established, but before replications begin, the first working paper is updated to include the list of coauthors, and the experimental protocol is registered at the AEA RCT registry. This guarantees that all replications, both failed and successful, are accounted for in the second paper. The second paper will then reference the first working paper, include all replications, and be submitted to a peer-reviewed journal for publication.
We put our mechanism to work on our own experiment, where we asked: can cooperation be sustained over time when the quality of a given public good cannot be precisely estimated? From charitable investments to social programs, uncertainty about the exact social returns from these investments is a pervasive characteristic. Yet we know very little about how people coordinate over ambiguous and uncertain social decisions. Surprisingly, we find that the presence of (Knightian) uncertainty about the quality of a public good does not harm, but rather increases, cooperation. We interpret our finding through the lens of conditional cooperation: when the value of a public good is observed with noise, conditional cooperators may be more tolerant of observed reductions in their payoffs, for instance because such reductions may be due, in part, to a lower-than-expected quality of the public good itself rather than solely to the presence of free-riders. However, we will wait until all replications are completed to draw more informed inference about the effect of ambiguity on social decisions.
One final note: while we believe that replications are always desirable, we do not by any means suggest that all experiments, lab or field, necessarily need to follow our methodology. We believe that our approach is best suited for studies that find results that are unanticipated, and in some cases at odds with the current state of knowledge on a topic. This is because in these cases, priors are more likely to be low, and perhaps more sensitive to other factors such as the experience or rank of the investigator. As such, we believe that our approach would be particularly beneficial for scholars at the early stages of their careers, and we hope many will consider joining forces.
Luigi Butera is a Post-Doctoral scholar in the Department of Economics at the University of Chicago. He can be contacted via email at email@example.com.
… Overall, Morris and Fritz argue that the guidelines appear to have had an effect on the reporting in PS journals, but that effect is admittedly small at best. Guidelines likely have only a small, temporary effect compared to reviewers and editors exerting direct pressure on authors.
The journal FinanzArchiv / Public Finance Analysis put out the following call for a special issue:
“There is considerable concern among scholars that empirical papers are facing a drastically smaller chance of being published if the results looking to confirm an established theory turn out to be statistically insignificant. If true, such a publication bias can provide a wrong picture of economic magnitudes and mechanisms.”
“Against this background, FinanzArchiv / Public Finance Analysis is posting a call for papers for a special issue on “Insignificant Results in Public Finance”. The editors are inviting the submission of carefully executed empirical papers that – despite using state of the art empirical methods – fail to find significant effects for important economic effects that have widespread acceptance.”
[From the article, “Point of View: How should novelty be valued in science?” by Barak A. Cohen, published in the journal eLife]
“Scientists are under increasing pressure to do “novel” research. Here I explore whether there are risks to overemphasizing novelty when deciding what constitutes good science. I review studies from the philosophy of science to help understand how important an explicit emphasis on novelty might be for scientific progress. I also review studies from the sociology of science to anticipate how emphasizing novelty might impact the structure and function of the scientific community. I conclude that placing too much value on novelty could have counterproductive effects on both the rate of progress in science and the organization of the scientific community. I finish by recommending that our current emphasis on novelty be replaced by a renewed emphasis on predictive power as a characteristic of good science.”
[From the article “What a nerdy debate about p-values shows about science — and how to fix it” by Brian Resnick at Vox.com]
“There’s a huge debate going on in social science right now. The question is simple, and strikes near the heart of all research: What counts as solid evidence?…One of the thorniest issues with this question is statistical significance. It’s one of the most influential metrics to determine whether a result is published in a scientific journal.”
“Now a group of 72 prominent statisticians, psychologists, economists, sociologists, political scientists, biomedical researchers, and others want to disrupt the status quo. A forthcoming paper in the journal Nature Human Behavior argues that results should only be deemed “statistically significant” if they pass a higher threshold….“We propose a change to P < 0.005,” the authors write. “This simple step would immediately improve the reproducibility of scientific research in many fields.””
In a recent blog post at Simply Statistics, Jeff Leek announced a new R package called tidypvals: “The tidypvals package is an effort to find previous collections of published p-values, synthesize them, and tidy them into one analyzable data set.”
In a preview of coming attractions, Leek posts the following graphic, representing more than 2.5 million p-values across 25 disciplines, and asks, “Notice anything funny?”.
Maybe it’s just that some papers report stars (*=0.10, **=0.05, ***=0.01), and don’t report p-values when estimates are insignificant. Or maybe …
To read more, including how to download and install the tidypvals package, click here.
In a recent working paper, posted on PsyArXiv Preprints, Daniel Benjamin, James Berger, Magnus Johannesson, Brian Nosek, Eric-Jan Wagenmakers, and 67 other authors(!) argue for a stricter standard of statistical significance for studies claiming new discoveries. In their words:
“…we believe that a leading cause of non-reproducibility has not yet been adequately addressed: Statistical standards of evidence for claiming new discoveries in many fields of science are simply too low. Associating “statistically significant” findings with P < 0.05 results in a high rate of false positives even in the absence of other experimental, procedural and reporting problems. For fields where the threshold for defining statistical significance for new discoveries is P < 0.05, we propose a change to P < 0.005. This simple step would immediately improve the reproducibility of scientific research in many fields. Results that would currently be called “significant” but do not meet the new threshold should instead be called ‘suggestive.’”
[This blog is a summary of a longer treatment of the subject that was published in Frontiers in Psychology in June 2017. To read that article, click here.]
Physicists have asked “why is there something rather than nothing?” They have theorized that it had to do with the formation of an asymmetry between matter and antimatter in the fractions of milliseconds after the Big Bang. Psychologists and economists could ask a similar question, “why do psychological and economic phenomena exist?”
A simple answer is that because people exist as psychological and economic entities, so do psychological and economic phenomena. Since “nothingness” is not a phenomenon, there are no logical or philosophical reasons to empirically cast some phenomena into the trash bin of nothingness. Moreover, empirical science does not have the tools to do so, because a negative, such as “God does not exist”, can never be proved. This is the case because:
(1) The presence of evidence for X does not necessarily mean the absence of evidence for Y. X’s existence is not a precondition for Y’s nonexistence unless they are mutually exclusive effects, which is rare.
(2) Empiricists can never test all of the conditions and groups of people on earth; they cannot even think of all conditions that could give rise to a phenomenon. Further, “there is an infinite number of ideas and ways” to test phenomena, and consequently, “no idea ever achieves the status of final truth” (McFall).
(3) Empiricists do not have perfectly reliable and valid measures.
(4) Human events and behaviors are multi-causal in real life, even in lab experiments. This means that the manipulation of a focal independent variable affects other causal factors. Moreover, researchers cannot control for all possible confounds, or even think of all of them, since “everything is correlated with everything else, more or less” (Meehl); thus, the effect of a theoretically established phenomenon under study is never exactly zero.
(5) Humans are fickle and elusive, sometimes unbearably simple and at other times, irreducibly complex in their thinking, and sometimes both at the same time. Thus, the human mind is unreproducible from situation to situation. Unlike those in physics (e.g., speed of light), psychological and economic phenomena are not fixed constants in space and time. There are no cognitive dissonance particles or “Phillips curve” particles that could irrevocably be verified by empirical data and subsequently declared universal constants.
In short, the nonexistence of phenomena is not logically viable. Since there are infinite ways of measuring and studying a phenomenon, logically, it should be possible to devise experiments both to demonstrate that a phenomenon exists and that it does not exist in specific conditions and during specific times, when using specific methods and specific tasks. The former finding means that the phenomenon has been demonstrated to exist, and that demonstration cannot be retracted.
On the one hand, conceptual replications can establish a phenomenon’s boundary conditions (i.e., when it is more or less likely to occur). On the other hand, a finding of nonexistence would not invalidate the phenomenon; it would only show that the phenomenon is not strong enough to register in a specific condition. For example, psychologically, people do not “choke” under all stressful conditions, because they have learned to deal with pressure. Similarly, if the Phillips curve does not explain the inverse relationship between unemployment and inflation particularly well in the present low-interest economic environment, it may do so under different economic circumstances.
Does the conditional nature of these phenomena mean they should disappear into a “black hole” of nonexistent phenomena? Of course not. The best that can be done is to empirically test and conceptually replicate well-developed theories (their tenets) with the best tools available, but never to claim that the phenomena they describe and explain do not exist.
The present emphasis on reproducibility in psychological and economic science, unfortunately, stems from the application of the physics model of independent verification of precise numeric values for phenomena, such as three recent independent confirmations of “gravity waves” predicted by Einstein’s theory. This type of replication is possible only if there are universal constants to be verified. However, there are none in psychology and economics, and not even in biology.
Physics seeks to discover the laws of nature, whereas in other sciences, both nature and nurture have to be taken into account. This Person-Environment interaction in human cognition, performance and behavior makes direct and precise replications impossible. Human conditions vary, for one thing, because individuals (investors, policy-makers and politicians) are not invariant and rational in their decisions and judgments. “Behavioral economists” (e.g., Thaler and Kahneman) have shown that the rational-agent model is a poor explanation for financial judgments and decisions, or economic growth more generally. People may consider all the information provided but they are also influenced by their self-generated and environmentally-induced emotions in their judgments and behaviors. Individual investors (e.g., prospective homeowners) can also be led to make boneheaded decisions, resulting in “collective blindness” (Kahneman) that can in turn create national and international financial crises, as was seen in the 2008 financial calamity.
All of this means that there will be deviations from overall patterns of individual financial behaviors, and therefore direct and precise replications are impossible. However, conceptual or constructive replications are helpful in elucidating the boundary conditions for the overall pattern: the conditions under which a phenomenon is strong and weak. But if we insist on precise replications, then no psychological or economic phenomena exist, because it is impossible to create conditions identical to those of the original testing.
There are no universal constants to be precisely replicated outside the laws of nature and physics. If the conditions are not the same at the individual level, they are not the same at the macro level either. History does not exactly repeat itself; it only rhymes (Twain). The conditions that led to a recession at one time will not be the causes of the next recession. At the macro level, researchers can build theoretical models to try to predict the next recession, but they cannot conceivably consider all relevant variables, especially exogenous ones, and thus precise predictions (replications) are not possible. Nevertheless, this does not prevent pundits from arguing that it is “different this time”, that it is “a new normal”.
A replication’s success is typically determined by statistical means (traditionally the p-value, and now effect size). But psychological and economic phenomena cannot be reduced to statistical phenomena, and theoretical and methodological deficiencies cannot be saved by statistical analyses. Science mainly advances by theory building and model construction, not by empirical testing and replication of the statistical null hypothesis. Psychological and economic phenomena are largely theoretical constructs, not unlike those in physics. Just think where physics would be today without Einstein’s theories. The Higgs boson was theorized to exist in 1964 but not verified until 2012. Did the particle not exist in the meantime?
Thus, empirical studies are mainly evaluated for their theoretical relevance and importance, and less for their success or failure in exactly reproducing original findings. It is not empirical data but theory that has generally made scientific progress possible, which is as true of physics as it is of psychology and economics. Along the way, empirical data have complemented and contributed to the expansion of theoretical models, and theories have made data more useful. Of course, theories are eventually abandoned in Kuhnian-like paradigm shifts. In the meantime, there are only “temporary winners” as scientific knowledge is “provisional” and “propositional” in nature.
Seppo Iso-Ahola is Professor of Psychology in Kinesiology at the School of Public Health, University of Maryland. He can be contacted at firstname.lastname@example.org.
*The author thanks Roger C. Mannell for his helpful comments and suggestions.
[From the website of the American Economic Association.]
“The American Economic Association seeks nominations for a new Data Editor to design and oversee its journals’ strategy for curating research data and promoting reproducible research.” …
“The duties of the Data Editor will be to:
— Design, in collaboration with the AEA journal editors and the AEA Executive Committee, a comprehensive strategy for archiving and promoting the curation of data and code that guarantees to the extent possible reproducibility of research and addresses the challenges above.
— Determine the staff and computing resources necessary to implement the strategy.
— Oversee the hiring of staff and implementation of the new policy.”
“The Data Editor must be an established leader in quantitative research, with a PhD in economics, data science or a related field and extensive experience relevant to the duties above. Ideally, the editor would have both editorial experience and research experience with government administrative, commercial or other proprietary data. We expect that the Data Editor will retain their current affiliation, and that the time required for this position will be broadly similar to that of other Editor and Co-Editor positions at AEA journals. The Editor will report to the AEA Executive Committee.”
Many, probably most, empirical scientists use frequentist statistics to decide whether a hypothesis should be rejected or accepted, in particular null hypothesis significance testing (NHST).
NHST works when we have access to all statistical tests that are being conducted. That way, we should, at least in theory, be able to see the 19 null results accompanying every statistical fluke (assuming an alpha level of 5%) and decide that effect X probably does not exist. But publication bias throws this off-kilter: when only or mainly significant results end up being published, while null results get p-hacked, file-drawered, or rejected, it becomes very difficult to tell false positive from true positive findings.
The number of true findings in the published literature depends on something significance tests can’t tell us: the base rate of true hypotheses we’re testing. If only a very small fraction of our hypotheses are true, we could always end up with more false positives than true positives (this is one of the main points of Ioannidis’ seminal 2005 paper).
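A quick back-of-the-envelope illustration (my own hypothetical numbers, not Ioannidis’): suppose only 5% of 1,000 tested hypotheses are true, with 80% power and a 5% alpha level. False positives then outnumber true positives even though every individual test is conducted correctly.

```python
# Hypothetical counts per 1,000 tested hypotheses.
# prior_true = 0.05 is an assumed base rate for illustration only.
n = 1000
prior_true = 0.05
alpha, power = 0.05, 0.80

true_positives = n * prior_true * power         # 1000 * 0.05 * 0.80 = 40
false_positives = n * (1 - prior_true) * alpha  # 1000 * 0.95 * 0.05 = 47.5

# With a low base rate, the flukes outnumber the real discoveries.
print(true_positives, false_positives)
```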
When Felix Schönbrodt and Michael Zehetleitner released this great Shiny app a while ago, I remember having some vivid discussions with Felix about what the rate of true hypotheses in psychology may be. In his very nice accompanying blog post, Felix included a flowchart assuming that 30% of all tested hypotheses are true. At the time I found this grossly pessimistic: Surely our ability to develop hypotheses can’t be worse than a coin flip? We spent years studying our subject! We have theories! We are really smart! I honestly believed that the rate of true hypotheses we study should be at least 60%.
A few months ago, this interesting paper by Johnson, Payne, Wang, Asher, & Mandal came out. They re-analysed 73 effects from the Reproducibility Project: Psychology and tried to model publication bias. I have to admit that I’m not maths-savvy enough to understand their model and judge their method, but they estimate that over 700 hypothesis tests were run to produce these 73 significant results. They assume that the statistical power for tests of true hypotheses was 75%, and that 7% of the tested hypotheses were true. Seven percent.
Er, ok, so not 60% then. To be fair to my naive 2015 self: this number refers to all hypothesis tests that were conducted, including p-hacking. That includes the one ANOVA main effect, the other main effect, the interaction effect, the same three tests without outliers, the same six tests with age as covariate, … and so on.
Let’s see what these numbers mean for the rates of true and false findings. For this we will need the positive predictive value (PPV) and the negative predictive value (NPV). I tend to forget what exactly they and their two siblings, the false discovery rate (FDR) and the false omission rate (FOR), stand for and how they are calculated, so here is a cheat sheet: the PPV is the share of significant results that reflect true effects (the FDR is its complement, 1 − PPV), and the NPV is the share of non-significant results that correctly reflect null effects (the FOR is its complement, 1 − NPV).
Ok, now that we’ve got that out of the way, let’s stick the numbers estimated by Johnson et al. into a flowchart. You see that the positive predictive value is shockingly low: of all significant results, only 53% are true. Wow. I must admit that even after reading Ioannidis (2005) several times, this hadn’t quite sunk in. If the 7% estimate is anywhere near the true rate, it basically means that we can flip a coin any time we see a significant result to estimate whether it reflects a true effect.
But I want to draw your attention to the negative predictive value. The chance that a non-significant finding is true is 98%! Isn’t that amazing and heartening? In this scenario, null results are vastly more informative than significant results.
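For readers who want to check these numbers, here is a minimal Python sketch (mine, not the adaptable R code linked below) that plugs the Johnson et al. estimates of a 7% true-hypothesis rate and 75% power, with alpha at 5%, into the standard PPV and NPV formulas.

```python
def ppv_npv(prior_true, alpha=0.05, power=0.75):
    """Positive and negative predictive values for a given base rate of
    true hypotheses, significance level alpha, and statistical power."""
    tp = prior_true * power              # true effect, significant
    fp = (1 - prior_true) * alpha        # null effect, significant
    fn = prior_true * (1 - power)        # true effect, non-significant
    tn = (1 - prior_true) * (1 - alpha)  # null effect, non-significant
    return tp / (tp + fp), tn / (tn + fn)

ppv, npv = ppv_npv(0.07)
print(round(ppv, 2), round(npv, 2))  # 0.53 0.98
```

The coin-flip PPV and the 98% NPV from the flowchart both drop straight out of these two ratios.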
I know what you’re thinking: 7% is ridiculously low. Who knows what those statisticians put into their Club Mate when they calculated this? For those of you who are more like 2015 me and think psychologists are really smart, I plotted the PPV and NPV for different levels of power across the whole range of the true hypothesis rate, so you can pick your favourite one. I chose five levels of power: 21% (estimate for neuroscience by Button et al., 2013), 75% (Johnson et al. estimate), 80% and 95% (common conventions), and 99% (upper bound of what we can reach).
The not very pretty but adaptable code (you can choose different values for alpha and power) is available here.
The plot shows two vertical dashed lines: The left one marks 7% true hypotheses, as estimated by Johnson et al. The right one marks the intersection of PPV and NPV for 75% power: This is the point at which significant results become more informative than negative results. That happens when more than 33% of the studied hypotheses are true. So if Johnson et al. are right, we would need to up our game from 7% of true hypotheses to a whopping 33% to get to a point where significant results are as informative as null results!
There are a few things to keep in mind: First, 7% true hypotheses and 75% power are simply an estimate, based on data from one replication project. I can certainly imagine that this isn’t far from the truth in psychology, but even if the estimate is accurate, it will vary at least slightly across different fields and probably across time.
Second, we have to be clear about what “hypothesis” means in this context: it refers to any statistical test that is conducted. A researcher could have one “hypothesis” in mind, yet perform twenty different hypothesis tests on their data to test it, all of which would count towards the denominator when calculating the rate of true hypotheses. I personally believe that the estimate by Johnson et al. is so low because psychologists tend to heavily exploit so-called “researcher degrees of freedom” and test many more hypotheses than they themselves are aware of.

Third, statistical power will vary from study to study, and the plot above shows that this affects our conclusions. It is also important to bear in mind that power refers to a specific effect size: a specific study has different levels of power for large, medium, and small effects.
We can be fairly certain that most of our hypotheses are false (otherwise we would waste a lot of money researching trivial questions). The exact percentage of true hypotheses remains unknown, but if there is something to the estimate of Johnson et al., the fact that an effect is significant doesn’t tell us much about whether or not it is real. Non-significant findings, on the other hand, are likely correct most of the time in this scenario – maybe even 98% of the time! Perhaps we should start to take them more seriously.
Anne Scheel is a PhD student in psychology at Ludwig-Maximilians-Universität, Munich (LMU). She is co-moderator of the Twitter site @realsci_DE and co-blogger at The 100% CI. She can be contacted at email@example.com.
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14, 365–376.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLOS Medicine, 2(8), e124. doi: 10.1371/journal.pmed.0020124
Johnson, V. E., Payne, R. D., Wang, T., Asher, A., & Mandal, S. (2017). On the reproducibility of psychological science. Journal of the American Statistical Association, 112(517), 1-10. doi: 10.1080/01621459.2016.1240079
Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.