The (Honest) Truth About Dishonesty: A Personal Example From the Authors?

[From the blog entitled “Oh, I hate it when work is criticized (or, in this case, fails in attempted replications) and then the original researchers don’t even consider the possibility that maybe in their original work they were inadvertently just finding patterns in noise”, posted by Andrew Gelman at Statistical Modeling, Causal Inference, and Social Science]
“I promised you a sad story. But, so far, this is just one more story of a hyped claim that didn’t stand up to the rigors of science. And I can’t hold it against the researchers that they hyped it: if the claim had held up, it would’ve been an interesting and perhaps important finding, well worth hyping.”
“No, the sad part comes next. Collins reports:”
“Multi-lab experiments like this are fantastic. There’s little ambiguity about the result. That said, there is a response by Amir, Mazar and Ariely. Lots of fluff about context. No suggestion of “maybe there’s nothing here”.”
“You can read the response and judge for yourself. I think Collins’s report is accurate, and that’s what made me sad. These people care enough about this topic to conduct a study, write it up in a research article and then in a book—but they don’t seem to care enough to seriously entertain the possibility they were mistaken. It saddens me. Really, what’s the point of doing all this work if you’re not going to be open to learning?”
To read the full blog, click here.

All Roads Lead to Rome?

[From the working paper, “Multiple Perspectives on Inference for Two Simple Statistical Scenarios” by van Dongen et al., posted at PsyArXiv Preprints]
“When analyzing a specific data set, statisticians usually operate within the confines of their preferred inferential paradigm. For instance, frequentist statisticians interested in hypothesis testing may report p-values, whereas those interested in estimation may seek to draw conclusions from confidence intervals. In the Bayesian realm, those who wish to test hypotheses may use Bayes factors and those who wish to estimate parameters may report credible intervals. And then there are likelihoodists, information-theorists, and machine-learners — there exists a diverse collection of statistical approaches, many of which are philosophically incompatible.”
“… We invited four groups of statisticians to analyze two real data sets, report and interpret their results in about 300 words, and discuss these results and interpretations in a round-table discussion.”
“… Despite substantial variation in the statistical approaches employed, all teams agreed that it would be premature to draw strong conclusions from either of the data sets.”
“… each analysis team added valuable insights and ideas. This reinforces the idea that a careful statistical analysis, even for the simplest of scenarios, requires more than a mechanical application of a set of rules; a careful analysis is a process that involves both skepticism and creativity.”
“… despite employing widely different approaches, all teams nevertheless arrived at a similar conclusion. This tentatively supports the Fisher-Jeffreys conjecture that, regardless of the statistical framework in which they operate, careful analysts will often come to similar conclusions.”
To read the article, click here.

IN THE NEWS: FiveThirtyEight (December 6, 2018)

[From the article “Psychology’s Replication Crisis Has Made The Field Better” by Christie Aschwanden, published at FiveThirtyEight]
“The replication crisis arose from a series of events that began around 2011, the year that social scientists Uri Simonsohn, Leif Nelson and Joseph Simmons published a paper, “False-Positive Psychology,” that used then-standard methods to show that simply listening to the Beatles song “When I’m Sixty-Four” could make someone younger. It was an absurd finding, and that was the point. The paper highlighted the dangers of p-hacking — adjusting the parameters of an analysis until you get a statistically significant p-value (a difficult-to-understand number often misused to imply a finding couldn’t have happened by chance) — and other subtle or not-so-subtle ways that researchers could tip the scales to produce a favorable result. Around the same time, other researchers were reporting that some of psychology’s most famous findings, such as the idea that “priming” people by presenting them with stereotypes about elderly people made them walk at a slower pace, were not reproducible.”
“A lot has happened since then. I’ve been covering psychology’s replication problem for FiveThirtyEight since 2015, and in that time, I’ve seen a culture change. “If a team of research psychologists were to emerge today from a 7-year hibernation, they would not recognize their field,” Nelson and his colleagues wrote in the journal Annual Reviews last year. What has changed?”
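As a rough illustration of the p-hacking mechanism described above, here is a minimal simulation sketch (the sample size, the number of outcomes, and the function name are illustrative choices of ours, not anything from the article): when a researcher tests several outcomes and reports only the smallest p-value, the false-positive rate climbs well above the nominal 5%.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def p_hacked_study(n=20, n_outcomes=5):
    # Simulate one null study: there is no true effect, but the "researcher"
    # tests several outcomes and keeps only the smallest p-value.
    control = rng.normal(size=(n, n_outcomes))
    treatment = rng.normal(size=(n, n_outcomes))
    p_values = [stats.ttest_ind(treatment[:, j], control[:, j]).pvalue
                for j in range(n_outcomes)]
    return min(p_values)

# Across many null studies, the share of "significant" results far exceeds 5%.
false_positive_rate = np.mean([p_hacked_study() < 0.05 for _ in range(2000)])
print(f"False-positive rate with outcome switching: {false_positive_rate:.2f}")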
To read more, click here.

Registered Reports Are Not Optimal?

[From the working paper, “Which findings should be published?” by Alexander Frankel and Maximilian Kasy]
“There have been calls for reforms in the direction of non-selective publication. One proposal is to promote statistical practices that de-emphasize statistical significance … Another proposal is for journals to adopt Registered Reports, in which pre-registered analysis plans are reviewed and accepted prior to data collection … Registered Reports guarantee that publication will not select at all on findings…”
“…In this paper we seek the optimal rule for determining whether a study should be published … In this framework, we will show that non-selective publication is not in fact optimal. Some findings are more valuable to publish than others. Put differently, we will find a trade-off between policy relevance and statistical credibility.”
“… The optimal publication rule defined in this manner selects on a study’s findings.”
To read the article, click here.

How Should One Statistically Analyse a Replication? It Depends.

[From the preprint, “Statistical Analyses for Studying Replication: Meta-Analytic Perspectives” by Larry Hedges and Jacob Schauer, forthcoming in Psychological Methods]
“Formal empirical assessments of replication have recently become more prominent in several areas of science, including psychology. These assessments have used different statistical approaches to determine if a finding has been replicated. The purpose of this article is to provide several alternative conceptual frameworks that lead to different statistical analyses to test hypotheses about replication.”
“…The differences among the methods described involve whether the burden of proof is placed on replication or nonreplication, whether replication is exact or allows for a small amount of “negligible heterogeneity,” and whether the studies observed are assumed to be fixed (constituting the entire body of relevant evidence) or are a sample from a universe of possibly relevant studies.”
“…All of them are valid statistical approaches … Because they use different conceptual definitions of “replication” and place the burden of proof differently, these tests vary in their sensitivity.”
“The example illustrates that the same data might reject replication (if exact replication is required), fail to confirm approximate replication (if the burden of proof is placed on nonreplication), or fail to reject approximate nonreplication (if the burden of proof is on replication).”
“… studies of replication cannot be unambiguous unless they are clear about how they frame their statistical analyses and clearly define the hypotheses they actually test. Researchers should also recognize that different frameworks for evaluating replication could lead to different conclusions from the same data.”
“…The power computations offered in this article illustrate that it is likely to be difficult to obtain strong empirical tests for replication.”
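To make the distinctions above concrete, here is a minimal sketch of just one of these framings, not the authors' specific procedure: testing "exact replication" by asking whether a set of estimates is more heterogeneous than sampling error allows (Cochran's Q test, which places the burden of proof on nonreplication). The effect estimates and standard errors below are hypothetical.

import numpy as np
from scipy.stats import chi2

def q_test(estimates, std_errors):
    # Cochran's Q test for heterogeneity: under the null hypothesis that all
    # studies estimate the same underlying effect (exact replication),
    # Q follows a chi-square distribution with k - 1 degrees of freedom.
    estimates = np.asarray(estimates, dtype=float)
    weights = 1.0 / np.asarray(std_errors, dtype=float) ** 2
    pooled = np.sum(weights * estimates) / np.sum(weights)
    q = float(np.sum(weights * (estimates - pooled) ** 2))
    df = len(estimates) - 1
    return q, df, chi2.sf(q, df)

# A hypothetical original study plus three replications (estimate, standard error).
q, df, p = q_test([0.40, 0.15, 0.22, 0.05], [0.10, 0.12, 0.11, 0.13])
print(f"Q = {q:.2f}, df = {df}, p = {p:.3f}")  # a small p counts against exact replication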
To read the article, click here (NOTE: article is behind a paywall).

JOB AD: Center for Open Science Looking for Someone With Economics Background

[From a Twitter post by Center for Open Science]
“COS [Center for Open Science] has been awarded a 3-year grant for an upcoming replication project, and we are seeking someone with an economics background for our Project Coordinator position. Perfect for recent BA or MA!”
“… COS is undertaking a project (to be announced) to automate and validate methods for assessing credibility of research claims in the social-behavioral sciences.  COS will (1) create a large, enriched dataset of claims and evidence, (2) advance the efficiency and scalability of gathering that data, and (3) conduct replications and reproductions of a sample of the claims to test the accuracy of confidence scores generated by partners.  The project is expected to run from January 2019 to December 2021.”
“Two Project Coordinators will support the Project Managers in selecting eligible social-behavioral science studies from the sampling frame, vetting and coding claims for eligibility for reproduction or replication, coding datasets, gathering materials for studies, developing study preregistrations, determining study awards for replication teams, managing a portfolio of partner individuals and teams conducting replication (new data) or reproduction (same data) studies, and rigorously documenting the process to maximize reproducibility. This work requires effective communication, attention to detail, effective documentation skills, and comfort with research methods, social-behavioral science literature, and managing many research practices on a variety of topics simultaneously.”
To read more about the position, click here.

Redefining RSS

[From the blog “Justify Your Alpha by Decreasing Alpha Levels as a Function of the Sample Size” by Daniël Lakens, posted at The 20% Statistician]
“Testing whether observed data should surprise us, under the assumption that some model of the data is true, is a widely used procedure in psychological science. Tests against a null model, or against the smallest effect size of interest for an equivalence test, can guide your decisions to continue or abandon research lines. Seeing whether a p-value is smaller than an alpha level is rarely the only thing you want to do, but especially early on in experimental research lines where you can randomly assign participants to conditions, it can be a useful thing. Regrettably, this procedure is performed rather mindlessly.”
“…Here I want to discuss one of the least known, but easiest suggestions on how to justify alpha levels in the literature, proposed by Good. The idea is simple, and has been supported by many statisticians in the last 80 years: Lower the alpha level as a function of your sample size.”
“Leamer (you can download his book for free) correctly notes that this behavior, an alpha level that is a decreasing function of the sample size, makes sense from both a Bayesian and a Neyman-Pearson perspective.”
“So instead of an alpha level of 0.05, we can think of a standardized alpha level:”
αstan = α / √(N/100)
“…with 100 participants α and αstan are the same, but as the sample size increases above 100, the alpha level becomes smaller. For example, an α = .05 observed in a sample size of 500 would have an αstan of 0.02236.”
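For readers who want to check the arithmetic, here is a minimal sketch of the standardized alpha calculation (the function name and the default reference sample size of 100 are our own choices):

import math

def alpha_standardized(alpha, n, reference_n=100):
    # Good's proposal as described above: shrink the alpha level as the
    # sample size grows beyond the reference sample size of 100.
    return alpha / math.sqrt(n / reference_n)

print(alpha_standardized(0.05, 100))  # 0.05: unchanged at the reference sample size
print(alpha_standardized(0.05, 500))  # roughly 0.02236, matching the example above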
To read more, click here.

REED: The Devil, the Deep Blue Sea, and Replication

In a recent article (“Between the Devil and the Deep Blue Sea: Tensions Between Scientific Judgement and Statistical Model Selection” published in Computational Brain & Behavior), Danielle Navarro identifies the blurry edges around the subject of model selection. The article is a tour de force of broad thinking about statistical model selection. She writes,
“What goal does model selection serve when all models are known to be systematically wrong? How might “toy problems” tell a misleading story? How does the scientific goal of explanation align with (or differ from) traditional statistical concerns? I do not offer answers to these questions, but hope to highlight the reasons why psychological researchers cannot avoid asking them.”
She goes on to say that researchers often see model selection as…
“…a perilous dilemma in which one is caught between two beasts from classical mythology, the Scylla of overfitting and the Charybdis of underfitting. I find myself often on the horns of a quite different dilemma, namely the tension between the devil of statistical decision making and the deep blue sea of addressing scientific questions. If I have any strong opinion at all on this topic, it is that much of the model selection literature places too much emphasis on the statistical issues of model choice and too little on the scientific questions to which they attach.”
The article never mentions nor alludes to replication, but it seems to me the issue of model selection is conceptually related to the issue of “replication success” in economics and other social sciences. Numerous attempts have been made to quantitatively define “replication success” (for a recent effort, see here). But just as the issue of model selection demands more than a goodness-of-fit number can supply, so the issue of “replication success” requires more than constructing a confidence interval for “the true effect” or calculating a p-value for some hypothesis about “the true effect”.
For starters, it’s not clear there is a single “true effect.” But let’s suppose there is. Maybe the original study was content to demonstrate the existence of “an effect.” So replication success should be content with this as well. Alternatively, maybe the goal of the original study was to demonstrate that “the effect” was equal to a specific numerical value. This is a common situation in economics. For example, in the evaluation of public policies, it is not sufficient to show that a policy has a desirable outcome; one must also show that the benefit it produces is greater than the cost. The numbers matter, not just the sign of the effect. Accordingly, the definition of replication success will be different.
This is exactly the conclusion from the recent special issue in the journal Economics on The Practice of Replication (see here for “takeaways” from that issue). There is no single measure of replication success because scientific studies do not all have the same purpose. It may be that the purposes of studies can be categorized, and that replication success can be defined within specific categories, though this has yet to be demonstrated. What is certain is that there is no single scientific purpose, and thus no single measure of replication success.
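A toy numerical sketch of that point (all numbers hypothetical, and the one-sided tests are purely for illustration): the same replication estimate can count as a success when the criterion is “an effect exists” and as a failure when the criterion is “the benefit exceeds the cost.”

from scipy import stats

# Hypothetical numbers: a replication estimates a policy benefit, and the
# policy is worthwhile only if the benefit exceeds its cost.
replication_est, replication_se = 45.0, 20.0
cost = 100.0

# Criterion 1: "an effect exists" -- is the estimate significantly above zero?
p_exists = stats.norm.sf(replication_est / replication_se)

# Criterion 2: "the effect is large enough to matter" -- does the benefit
# significantly exceed the cost?
p_exceeds_cost = stats.norm.sf((replication_est - cost) / replication_se)

print(f"Effect exists? {p_exists < 0.05}; benefit exceeds cost? {p_exceeds_cost < 0.05}")
# The same replication "succeeds" under the first criterion and "fails" under the second.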
Speaking of her own field of human cognition, Navarro writes, 
“To my way of thinking, understanding how the qualitative patterns in the empirical data emerge naturally from a computational model of a psychological process is often more scientifically useful than presenting a quantified measure of its performance, but it is the latter that we focus on in the “model selection” literature. Given how little psychologists understand about the varied ways in which human cognition works, and given the artificiality of most experimental studies, I often wonder what purpose is served by quantifying a model’s ability to make precise predictions about every detail in the data.”
People are complex and complicated. Aggregating them to markets and economies does not make them easier to understand. Thus the points that Navarro makes apply a fortiori to replication and economics.
Bob Reed is a professor of economics at the University of Canterbury in New Zealand. He is also co-organizer of the blogsite The Replication Network. He can be contacted at bob.reed@canterbury.ac.nz. 

IN THE NEWS: New York Times (November 19, 2018)

[From the article, “Essay: The Experiments Are Fascinating. But Nobody Can Repeat Them” by Andrew Gelman, published in The New York Times]
“At this point, it is hardly a surprise to learn that even top scientific journals publish a lot of low-quality work — not just solid experiments that happen, by bad luck, to have yielded conclusions that don’t stand up to replication, but poorly designed studies that had no real chance of succeeding before they were ever conducted.”
“…We see it all the time. Remember the claims that subliminal smiley faces on a computer screen can cause big changes in attitudes toward immigration? That elections are decided by college football games and shark attacks? These studies were published in serious journals or promoted in serious news outlets.”
“Scientists know this is a problem. In a recent paper in the journal Nature Human Behaviour, a team of respected economists and psychologists released the results of 21 replications of high-profile experiments.”
“…Here’s where it gets really weird. The lack of replication was predicted ahead of time by a panel of experts using a ‘prediction market,’ in which experts were allowed to bet on which experiments were more or less likely to — well, be real.”
“… One potential solution is preregistration, in which researchers beginning a study publish their analysis plan before collecting their data. Preregistration can be seen as a sort of time-reversed replication, a firewall against “data dredging,” the inclination to go looking for results when your first idea doesn’t pan out. But it won’t fix the problem on its own.”
To read more, click here.

Your One Stop Shop for the Pre-registration Debate

[From the blog ““Don’t Interfere with my Art”: On the Disputed Role of Preregistration in Exploratory Model Building” by Eric-Jan Wagenmakers and Nathan Evans, posted at Bayesian Spectacles]
“Recently the 59th annual meeting of the Psychonomic Society in New Orleans played host to an interesting series of talks on how statistical methods should interact with the practice of science. Some speakers discussed exploratory model building, suggesting that this activity may not benefit much, if at all, from preregistration.”
“On the Twitterverse, reports of these talks provoked an interesting discussion between supporters and detractors of preregistration for the purpose of model building. Below we describe the most relevant presentations, point to some interesting threads on Twitter, and then provide our own perspective.”
To read the blog, click here.