Does Psychology Have a Publication Bias Problem? Yes and No

[From the article, “The Meaningfulness of Effect Sizes in Psychological Research: Differences Between Sub-Disciplines and the Impact of Potential Biases” by Thomas Schäfer and Marcus Schwarz, published April 11, 2019 in Frontiers in Psychology]
“From past publications without preregistration, 900 effects were randomly drawn and compared with 93 effects from publications with pre-registration, revealing a large difference: Effects from the former (median r = 0.36) were much larger than effects from the latter (median r = 0.16). That is, certain biases, such as publication bias or questionable research practices, have caused a dramatic inflation in published effects…”
“As we have argued throughout this article, biases in analyzing, reporting, and publishing empirical data (i.e., questionable research practices and publication bias) are most likely responsible for the differences between the effect sizes from studies with and without pre-registration.”
To read the article, click here.
[From the article, “Publication bias examined in meta-analyses from psychology and medicine: A meta-meta-analysis” by Robbie van Aert, Jelte Wicherts, and Marcel van Assen, published April 12, 2019 in PLOS ONE]
“A large-scale data set was created with meta-analyses published between 2004 and 2014 in Psychological Bulletin and in the Cochrane Library to study the extent and prevalence of publication bias in psychology and medicine…”
“Results of p-uniform suggest that possible overestimation because of publication bias was at most minimal for subsets from Psychological Bulletin.”
“The results of our paper are not in line with previous research…Only weak evidence for the prevalence of publication bias was observed in our large-scale data set of homogeneous subsets of primary studies. No evidence of bias was obtained using the publication bias tests. Overestimation was minimal but statistically significant…”
“Based on these findings in combination with the small percentages of statistically significant effect sizes in psychology and medicine, we conclude that evidence for publication bias in the studied homogeneous subsets is weak, but suggestive of mild publication bias in both disciplines.”
To read the article, click here.

Disagreeing With Disagreeing About Abandoning Statistical Significance

[From the preprint “Abandoning statistical significance is both sensible and practical” by Valentin Amrhein, Andrew Gelman, Sander Greenland, and Blakely McShane, available at PeerJ Preprints]
“Dr Ioannidis writes against our proposals to abandon statistical significance…”
“…we disagree that a statistical significance-based “filtering process is useful to avoid drowning in noise” in science and instead view such filtering as harmful.”
“First, the implicit rule to not publish nonsignificant results biases the literature with overestimated effect sizes and encourages “hacking” to get significance.”
“Second, nonsignificant results are often wrongly treated as zero.”
“Third, significant results are often wrongly treated as truth rather than as the noisy estimates they are, thereby creating unrealistic expectations of replicability.”
“Fourth, filtering on statistical significance provides no guarantee against noise. Instead, it amplifies noise because the quantity on which the filtering is based (the p-value) is itself extremely noisy and is made more so by dichotomizing it.”
“We also disagree that abandoning statistical significance will reduce science to “a state of statistical anarchy.” Indeed, the journal Epidemiology banned statistical significance in 1990 and is today recognized as a leader in the field.”
“The replication crisis in science is not the product of the publication of unreliable findings. … Rather, the replication crisis has arisen because unreliable findings are presented as reliable.”
To read more, click here.
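The first point above, that publishing only significant results inflates effect sizes, can be illustrated with a small simulation. This is a minimal sketch and not taken from the preprint: the true effect (0.2), the group size (20), the known unit variance, and the function name `simulate_significance_filter` are all assumptions chosen for illustration.

```python
import random
import statistics


def simulate_significance_filter(true_effect=0.2, n_per_group=20,
                                 n_studies=20000, seed=1):
    """Compare all simulated estimates with those that pass p < .05."""
    rng = random.Random(seed)
    # Standard error of a standardized mean difference with known unit variance.
    se = (2.0 / n_per_group) ** 0.5
    all_estimates = []
    published = []
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, se)  # one study's noisy estimate
        all_estimates.append(estimate)
        if abs(estimate / se) > 1.96:  # two-sided z-test at the 5% level
            published.append(estimate)
    return {
        "mean_all": statistics.fmean(all_estimates),
        "mean_published": statistics.fmean(published),
        "share_published": len(published) / n_studies,
    }


result = simulate_significance_filter()
print(result)
```

Under these assumptions the average over all studies sits close to the true effect, while the average over the "published" subset is several times larger, because only estimates far from zero clear the threshold.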

Calling All Graduate Students, Post-Docs, Researchers, and Faculty: Would You Fill Out a Short Survey on Preprints?

[From the Center for Open Science]
The Center for Open Science is seeking graduate students, postdocs, researchers, and academic faculty to participate in a survey investigating the factors that affect the perceived credibility and use of preprints.
Participation in this study involves:
– A time commitment of up to 20 minutes.
– Survey involving your qualitative, subjective evaluation of factors related to the credibility of preprints.
Currently using preprints (either uploading preprints or reading preprints) is not a prerequisite to participating in the study.
If you are interested in enrolling in this study, please click this link which will lead you directly to the survey. You must be 18 or older to participate.
If you would like more information, please contact this project’s research lead, Courtney Soderberg, by email ( 
Thank you!

Thinking About Using Instrumental Variables? Think Again

[From the paper “Consistency without Inference: Instrumental Variables in Practical Application” by Alwyn Young, posted on his university webpage at London School of Economics]
“I use Monte Carlo simulations, the jackknife and multiple forms of the bootstrap to study a comprehensive sample of 1359 instrumental variables regressions in 31 papers published in the journals of the American Economic Association.”
“I maintain, throughout, the exact specification used by authors and their identifying assumption that the excluded instruments are orthogonal to the second stage residuals. When bootstrapping, jackknifing or generating artificial residuals for Monte Carlos, I draw samples in a fashion consistent with the error dependence within groups of observations and independence across observations implied by authors’ standard error calculations.”
“Non-iid errors weaken 1st stage relations, raising the relative bias of 2SLS and generating mean squared error that is larger than biased OLS in almost all published papers.”
“Monte Carlo simulations based upon published regressions show that non-iid error processes adversely affect the size and power of IV estimates, while increasing the bias of IV relative to OLS, producing a very low ratio of power to size and mean squared error that is almost always larger than biased OLS.”
“In the top third most highly leveraged papers in my sample, the ratio of power to size approaches one, i.e. 2SLS is scarcely able to distinguish between a null of zero and the alternative of the mean effects found in published tables.”
“Monte Carlos show, however, that the jackknife and (particularly) the bootstrap allow for 2SLS and OLS inference with accurate size and a much higher ratio of power to size than achieved using clustered/robust covariance estimates. Thus, while the bootstrap does not undo the increased bias of 2SLS brought on by non-iid errors, it nevertheless allows for improved inference under these circumstances.”
“I find that avoiding the finite sample 2SLS standard estimate altogether and focusing on the bootstrap resampling of the coefficient distribution alone provides the best performance, with tail rejection probabilities on IV coefficients that are very close to nominal size in iid, non-iid, low and high leverage settings.”
“In sum, whatever the biases of OLS may be, in practical application with non-iid error processes and highly leveraged regression design, the performance of 2SLS methods deteriorates so much that it is rarely able to identify parameters of interest more accurately or substantively differently than is achieved by OLS.”
To read the paper, click here.
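The idea of resampling the coefficient distribution directly, rather than relying on an estimated standard error, can be sketched in a few lines. This is not Young's procedure: it assumes a single instrument, iid observations, simple pairs resampling, and a percentile interval, whereas the paper works with clustered, non-iid designs. The data are simulated and every name (`iv_estimate`, `ols_estimate`, `bootstrap_percentile_interval`, `TRUE_BETA`) is hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def iv_estimate(rows):
    """Just-identified IV estimate: cov(z, y) / cov(z, x)."""
    z, x, y = zip(*rows)
    return covariance(z, y) / covariance(z, x)


def ols_estimate(rows):
    """Simple OLS slope of y on x: cov(x, y) / var(x)."""
    _, x, y = zip(*rows)
    return covariance(x, y) / covariance(x, x)


def bootstrap_percentile_interval(rows, estimator, n_boot=500, seed=0):
    """Resample (z, x, y) rows with replacement; return a 95% percentile interval."""
    rng = random.Random(seed)
    n = len(rows)
    draws = sorted(
        estimator([rows[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    return draws[int(0.025 * n_boot)], draws[int(0.975 * n_boot) - 1]


# Simulated data (an assumption): u confounds x and y, z is a valid instrument.
rng = random.Random(42)
TRUE_BETA = 2.0
rows = []
for _ in range(500):
    z = rng.gauss(0, 1)
    u = rng.gauss(0, 1)
    x = z + u
    y = TRUE_BETA * x + 2.0 * u + rng.gauss(0, 1)
    rows.append((z, x, y))

iv = iv_estimate(rows)
ols = ols_estimate(rows)
low, high = bootstrap_percentile_interval(rows, iv_estimate)
print(f"OLS = {ols:.2f}, IV = {iv:.2f}, bootstrap 95% interval = ({low:.2f}, {high:.2f})")
```

This is a deliberately benign setting with a strong instrument and iid errors, so IV behaves well here. It shows only the mechanics of the resampling, not the deterioration the paper documents under non-iid errors and high leverage.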



Don’t Abandon It! Learn (and Teach) to Use It Correctly

[From the paper “The practical alternative to the p-value is the correctly used p-value” by Daniël Lakens, posted at PsyArXiv Preprints]
“I do not think it is useful to tell researchers what they want to know. Instead, we should teach them the possible questions they can ask (Hand, 1994). One of these questions is how surprising observed data is under the assumption of some model, to which a p-value provides an answer.”
“The accusation that p-values are a cause of the problems with replicability across scientific disciplines lacks empirical support. Hanson (1958) examined the replicability of research findings published in anthropology, psychology, and sociology. One of the hypotheses examined was whether propositions advanced with explicit confirmation criteria, such as the rejection of a hypothesis at a 5% significance level, were more replicable than propositions made without such an explicit confirmation criterium. He found that ‘over 70 per cent of the original propositions advanced with explicit confirmation criteria were later confirmed in independent tests, while less than 46 per cent of the propositions advanced without explicit confirmation criteria were later confirmed.’”
“There is also no empirical evidence to support the idea that replacing hypothesis testing with estimation, or p-values with for example Bayes factors, will matter in practice. …If alternative approaches largely lead to the same decisions as a p-value when used with care, why exactly is the p-value the problem?”
“Most problems attributed to p-values are problems with the practice of null-hypothesis significance testing. Many misinterpretations of single p-values have to do with either concluding a meaningful effect is absent after a non-significant result, or misinterpreting a significant result as an important effect.”
“I personally believe substantial improvements can be made by teaching researchers how to calculate p-values for minimal-effects tests and equivalence tests. Minimal-effects tests and equivalence tests require the same understanding of statistics as null-hypothesis tests, but provide an easy way to ask different questions from your data, such as how to provide support for the absence of a meaningful effect.”
“Teaching students that testing a range prediction is just as easy as testing against an effect size of 0 has almost no cost but might solve some of the most common misunderstandings of p-values.”
To read the paper, click here.
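To make the range-prediction idea concrete, here is a minimal sketch of an equivalence test (two one-sided tests) and a minimal-effects test. It is not code from the paper: it assumes a normally distributed estimate with a known standard error, and the function names and the example bounds of ±0.30 are illustrative assumptions.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def equivalence_test_p(estimate, se, low, high):
    """Equivalence test: H0 says the effect lies outside (low, high).

    Both one-sided tests must reject, so the larger p-value is reported.
    """
    p_lower = 1.0 - _STD_NORMAL.cdf((estimate - low) / se)  # H0: effect <= low
    p_upper = _STD_NORMAL.cdf((estimate - high) / se)       # H0: effect >= high
    return max(p_lower, p_upper)


def minimal_effect_test_p(estimate, se, low, high):
    """Minimal-effects test: H0 says the effect lies inside [low, high].

    Returns the smaller of the two one-sided p-values, without adjustment.
    """
    p_below = _STD_NORMAL.cdf((estimate - low) / se)         # H0: effect >= low
    p_above = 1.0 - _STD_NORMAL.cdf((estimate - high) / se)  # H0: effect <= high
    return min(p_below, p_above)


# Hypothetical numbers: a small estimate tested against bounds of -0.30 and 0.30.
p_equivalence = equivalence_test_p(0.05, 0.10, -0.30, 0.30)
# Hypothetical numbers: a large estimate tested against the same bounds.
p_minimal_effect = minimal_effect_test_p(0.60, 0.10, -0.30, 0.30)
print(p_equivalence, p_minimal_effect)
```

A small `p_equivalence` supports the claim that the effect is too small to matter, which a non-significant null-hypothesis test cannot do on its own. With real data a t-distribution would normally replace the normal approximation used here.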

It’s Not A Problem, It’s an Opportunity

[From the blog “The replication crisis is good for science” by Eric Loken, published at The Conversation]
“Science is in the midst of a crisis: A surprising fraction of published studies fail to replicate when the procedures are repeated.”
“Is this bad for science? It’s certainly uncomfortable for many scientists whose work gets undercut, and the rate of failures may currently be unacceptably high. But, as a psychologist and a statistician, I believe confronting the replication crisis is good for science as a whole.”
“Awareness about the replication crisis appears to be promoting better behavior among scientists. Today, the stakes have been raised for researchers. They know that there’s the possibility that their study might be reviewed by thousands of opinionated commenters on the internet or by a high-profile group like the Reproducibility Project.”
“While there are signs that scientists are indeed reforming their ways, there is still a long way to go. Out of the 1,500 accepted presentations at the annual meeting for the Society for Behavioral Medicine in March, only 1 in 4 of the authors reported using these open science techniques in the work they presented.”
“Finally, the replication crisis is helping improve scientists’ intuitions about statistical inference.”
“Researchers now better understand how weak designs with high uncertainty – in combination with choosing to publish only when results are statistically significant – produce exaggerated results.”
“The breathtaking possibility that a large fraction of published research findings might just be serendipitous is exactly why people speak of the replication crisis. But it’s not really a scientific crisis, because the awareness is bringing improvements in research practice, new understandings about statistical inference and an appreciation that isolated findings must be interpreted as part of a larger pattern.”
To read more, click here.

What Causes a Person to Become an “Open Science Convert”?

[From the blog “Reflections of an open science convert. 1: Why I changed my research practices” (Part 1 of a 3-part series) by Ineke Wessel, posted at Mindwise]
“Five years after Stapel’s fraud first became known, I came across Brian Wansink’s blog posts about how exploring data in every possible way can get you publications. The scientific community responded with outrage. On the one hand, Wansink’s data-dredging seemed far more extreme than the post-hoc analyses I used to do. On the other hand, I wondered what exactly, apart from the scale (huge) and intent (find a positive result no matter what), the differences were between me and him.”
“As I browsed the internet, an online lecture by Zoltan Dienes caught my attention. Dienes described the problem that Gelman & Loken (2014) refer to as the garden of forking paths: the idea that every potential, often seemingly arbitrary decision in data analysis (e.g., how to construct a score; what to do with outliers) contributes to a different end-result. Indeed, it is like hiking: choosing either left or right at the first fork in the path (and the fork after that, and the one after that, etc.) will determine where you will have lunch ultimately.”
“Dienes used the example of one particular published study that implicitly harboured 120 plausible combinations of decisions … A plot of the 120 possible difference scores for one particular variable (i.e., a multiverse) showed that their confidence intervals could contain exclusively positive as well as exclusively negative values, and mostly hovered around zero (i.e., no difference). Thus, despite what seemed a convincing effect in the paper, considering the full array of outcomes for that one variable should lead to the conclusion that really nothing can be said about it.”
“I was stunned. So many possibilities, and precisely one of those rare statistically significant occurrences had made it into the literature! Perhaps by coincidence, perhaps because certain routes fit better with the authors’ hypothesis than other routes? But regardless of why this particular result ended up in the paper, how can readers even know about those other 119?”
“So, now I am working on changing my research practices.”
To read more, click here.

Now Tell Me What You Really Think

[From the article “Assessing citizen adoption of e-government initiatives in Gambia: A validation of the technology acceptance model in information systems success. A critical article review, with questions to its publishers” by Daniel Jung, published in Government Information Quarterly]
“The article is on Elsevier’s list of most cited articles from the Government Information Quarterly journal, and has become a key reference in the field of study, with nearly 250 citations (in Google scholar). However, it completely fails when it comes to overall linguistic expression, literature review, grounding in the field, citation practice, questionnaire design, data collection, rendering and interpreting others’ and own data, calculation, claims of user-centeredness and accounting for cultural differences, and the final assertion that all this leads to Gambia benefiting from TAM.”
“Its premise and findings are blatantly wrong: they are not valid, reliable, verifiable or reproducible in any way. No single part could be changed to achieve integrity, and for proper results, everything would have to be redone, starting with the questionnaire design, and ending with the conclusion.”
“This article is as close to a scientific hoax as one can possibly come, but I believe that it is just an unfortunate case of poor science, not a deliberate fraud as such, even if the Zambia/Gambia quote is hard to excuse as unintentional. In any case, this is an article that should obviously never have been published, and raises serious questions about the editorial rigor, the quality of the peer reviewing, and the revision process in the journal.”
To read the article, click here.

Using Z-Curve to Estimate Mean Power for Studies Published in Psychology Journals

[From the blog “Estimating the Replicability of Psychological Science” by Ulrich Schimmack, posted at Replicability-Index]
“Over the past years, I have been working on an … approach to estimate the replicability of psychological science. This approach starts with the simple fact that replicability is tightly connected to the statistical power of a study because statistical power determines the long-run probability of producing significant results (Cohen, 1988). Thus, estimating statistical power provides valuable information about replicability.”
“In collaboration with Jerry Brunner, I have developed a new method that can estimate mean power for a set of studies that are selected for significance and that vary in effect sizes and sample sizes, which produces heterogeneity in power (Brunner & Schimmack, 2018).”
“The input for this method are the actual test statistics of significance tests (e.g., t-tests, F-tests). These test-statistics are first converted into two-tailed p-values and then converted into absolute z-scores. …The histogram of these z-scores, called a z-curve, is then used to fit a finite mixture model to the data that estimates mean power, while taking selection for significance into account.”
“For this blog post, I am reporting results based on preliminary results from a large project that extracts focal hypothesis from a broad range of journals that cover all areas of psychology for the years 2010 to 2017.”
“The figure below shows the output of the latest version of z-curve. The first finding is that the replicability estimate for all 1,671 focal tests is 56% with a relatively tight confidence interval ranging from 45% to 56%.”
“The next finding is that the discovery rate or success rate is 92%, using p < .05 as the criterion. This confirms that psychology journals continue to publish results that are selected for significance (Sterling, 1959).”
“Z-Curve.19.1 also provides an estimate of the size of the file drawer. … The file drawer ratio shows that for every published result, we would expect roughly two unpublished studies with non-significant results.”
“Z-Curve.19.1 also provides an estimate of the false positive rate (FDR). … Z-Curve 19.1 … provides an estimate of the FDR that treats studies with very low power as false positives. This broader definition of false positives raises the FDR estimate slightly, but 15% is still a low percentage. Thus, the modest replicability of results in psychological science is mostly due to low statistical power to detect true effects rather than a high number of false positive discoveries.”
“This blog post provided the most comprehensive assessment of the replicability of psychological science so far. … replicability is estimated to be slightly above 50%. However, replicability varies across discipline and the replicability of social psychology is below 50%. The fear that most published results are false positives is not supported by the data.”
To read more, click here.
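The input step described above, converting two-tailed p-values into absolute z-scores, and the link between power and the probability of a significant result can be sketched as follows. This is not the z-curve software and does not fit the mixture model: the p-values are made up, and the function names `p_to_abs_z` and `power_from_noncentrality` are assumptions.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()
Z_CRIT = _STD_NORMAL.inv_cdf(0.975)  # about 1.96, the two-sided 5% criterion


def p_to_abs_z(p_two_tailed):
    """Convert a two-tailed p-value into an absolute z-score."""
    return _STD_NORMAL.inv_cdf(1.0 - p_two_tailed / 2.0)


def power_from_noncentrality(delta):
    """Two-sided power at alpha = .05 when the z statistic is N(delta, 1)."""
    upper_tail = 1.0 - _STD_NORMAL.cdf(Z_CRIT - delta)
    lower_tail = _STD_NORMAL.cdf(-Z_CRIT - delta)
    return upper_tail + lower_tail


# Hypothetical two-tailed p-values standing in for extracted test statistics.
p_values = [0.049, 0.03, 0.01, 0.20, 0.001]
z_scores = [p_to_abs_z(p) for p in p_values]

# Share of results that are significant at p < .05 (the observed discovery rate).
observed_discovery_rate = sum(z > Z_CRIT for z in z_scores) / len(z_scores)
print([round(z, 2) for z in z_scores], observed_discovery_rate)
```

The power function shows why mean power is informative about replicability: an exact replication of a study with noncentrality `delta` produces a significant result with probability `power_from_noncentrality(delta)`, which is 5% when there is no effect and about 80% when `delta` is 2.8.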

Do Not Abandon Statistical Significance

[From the article “The Importance of Predefined Rules and Prespecified Statistical Analyses: Do Not Abandon Significance” by John Ioannidis, published in JAMA]
“A recent proposal to ban statistical significance gained campaign-level momentum in a commentary with 854 recruited signatories. The petition proposes retaining P values but abandoning dichotomous statements (significant/nonsignificant), suggests discussing “compatible” effect sizes, denounces “proofs of the null,” and points out that “crucial effects” are dismissed on discovery or refuted on replication because of nonsignificance.”
“Changing the approach to defining statistical and clinical significance has some merits; for example, embracing uncertainty, avoiding hyped claims with weak statistical support, and recognizing that “statistical significance” is often poorly understood. However…The statistical data analysis is often the only piece of evidence processing that has a chance of being objectively assessed before experts, professional societies, and governmental agencies begin to review the data and make recommendations.”
“The proposal to entirely remove the barrier does not mean that scientists will not often still wish to interpret their results as showing important signals and fit preconceived notions and biases. With the gatekeeper of statistical significance, eager investigators whose analyses yield, for example, P = .09 have to either manipulate their statistics to get to P < .05 or add spin to their interpretation to suggest that results point to an important signal through an observed “trend.” When that gatekeeper is removed, any result may be directly claimed to reflect an important signal or fit to a preexisting narrative.”
“…there is an advantage in having some agreement about default statistical analysis and interpretation. Deviations from the default would then be easier to spot and questioned as to their appropriateness. For most research questions, post hoc analytical manipulation is unlikely to lead closer to the truth than a default analysis with a basic set of rules.”
“Banning statistical significance while retaining P values (or confidence intervals) will not improve numeracy and may foster statistical confusion and create problematic issues with study interpretation, a state of statistical anarchy. Uniformity in statistical rules and processes makes it easier to compare like with like and avoid having some associations and effects be more privileged than others in unwarranted ways. Without clear rules for the analyses, science and policy may rely less on data and evidence and more on subjective opinions and interpretations.”
To read more, click here.