IN THE NEWS: Mother Jones (September 25, 2018)

[From the article, “This Cornell Food Researcher Has Had 13 Papers Retracted. How Were They Published in the First Place?” by Kiera Butler, published in Mother Jones]
“In 2015, I wrote a profile of Brian Wansink, a Cornell University behavioral science researcher who seemed to have it all: a high-profile lab at an elite university, more than 200 scientific studies to his name, a high-up government appointment, and a best-selling book.”
“…In January 2017, a team of researchers reviewed four of [Wansink’s] published papers and turned up 150 inconsistencies. Since then, in a slowly unfolding scandal, Wansink’s data, methods, and integrity have been publicly called into question. Last week, the Journal of the American Medical Association (JAMA) retracted six articles he co-authored. To date, a whopping 13 Wansink studies have been retracted.”
“… when I first learned of the criticisms of his work, I chalked it up to academic infighting and expected the storm to blow over. But as the scandal snowballed, the seriousness of the problems grew impossible to ignore. I began to feel foolish for having called attention to science that, however fun and interesting, has turned out to be so thin. Were there warning signs I missed? Maybe. But I wasn’t alone. Wansink’s work has been featured in countless major news outlets—the New York Times has called it “brilliantly mischievous.” And when Wansink was named head of the USDA in 2007, the popular nutrition writer Marion Nestle deemed it a “brilliant appointment.””
“Scientists bought it as well. Wansink’s studies made it through peer review hundreds of times—often at journals that are considered some of the most prestigious and rigorous in their fields. The federal government didn’t look too closely, either: The USDA based its 2010 dietary guidelines, in part, on Wansink’s work. So how did this happen?”
To read more, click here.

GOODMAN: Systematic Replication May Make Many Mistakes

Replication seems a sensible way to assess whether a scientific result is right. The intuition is clear: if a result is right, you should get a significant result when repeating the work; if it’s wrong, the result should be non-significant. I test this intuition across a range of conditions using simulation. For exact replications, the intuition is dead on, but when replicas diverge from the original studies, error rates increase rapidly. Even for the exact case, false negative rates are high for small effects unless the samples are large. These results bode ill for large, systematic replication efforts, which typically prioritize uniformity over fidelity and limit sample sizes to run lots of studies at reasonable cost.
INTRODUCTION
The basic replication rationale goes something like this: (1) many published papers are wrong; (2) this is a serious problem the community must fix; and (3) systematic replication is an effective solution. (In recent months, I’ve seen an uptick in pre-registration as another solution. That’s a topic for another day.) In this post, I focus on the third point and ask: viewed as a statistical test, how well does systematic replication work; how well does it tell the difference between valid and invalid results?
By “systematic replication” I mean projects like Many Labs, Reproducibility Project: Psychology (RPP), the Experimental Economics Replication Project (EERP), and the Social Sciences Replication Project (SSRP) that systematically select studies in a particular field and repeat them in a uniform fashion. The main publications for these projects are Many Labs, RPP, EERP, SSRP.
I consider a basic replication scheme in which each original study is repeated once. This is like RPP and EERP, but unlike Many Labs as published, which repeated each study 36 times, and SSRP, which used a two-stage replication strategy. I imagine that the replicators are trying to closely match the original study (direct replication) while doing the replications in a uniform fashion for cost and logistical reasons.
My test for replication success is the same as SSRP (what they call the statistical significance criterion): a replication succeeds if the replica has a significant effect in the same direction as the original.
A replication is exact if the two studies are sampling the same population. This is an obvious replication scenario. You have a study you think may be wrong; to check it out, you repeat the study, taking care to ensure that the replica closely matches the original. Think cold fusion. A replication is near-exact if the populations differ slightly. This is probably what systematic replication achieves, since the need for uniformity reduces precision.
Significance testing of the replica (more precisely, the statistical significance criterion) works as expected for exact replications, but error rates increase rapidly as the populations diverge. This isn’t surprising when you think about it: we’re using the replica to draw inferences about the original study; it stands to reason this will only work if the two studies are very similar.
Under conditions that may be typical in systematic replication projects, the rate of false positive mistakes calculated in this post ranges from 1% to 71% and the rate of false negative mistakes from 0% to 85%. This enormous range results from the cumulative effect of multiple unknown, hard-to-estimate parameters.
My results suggest that we should adjust our expectations for systematic replication projects. These projects may make a lot of mistakes; we should take their replication failure rates with a grain of salt.
The software supporting this post is open source and freely available on GitHub.
SCENARIO
The software simulates studies across a range of conditions, combines pairs of studies into pairwise replications, calculates which replications pass the test, and finally computes false positive and false negative rates for conditions of interest.
The studies are simple two-group comparisons parameterized by sample size n and population effect size dpop (dpop ≥ 0). For each study, I generate two groups of random numbers. One group comes from a standard normal distribution with mean 0; the other is standard normal with mean dpop. I then calculate the p-value from a t-test. When I need to be pedantic, I use the term study set for the ensemble of studies for a given combination of n and dpop.
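As a sketch, each simulated study might look like the following (illustrative Python; the function and variable names are mine, and the post’s actual software, linked on GitHub, may differ in details):

```python
# Hypothetical sketch of one simulated study: a two-group comparison
# scored with a two-sample t-test.
import numpy as np
from scipy import stats

def simulate_study(n, d_pop, rng):
    """Draw two groups of size n and return (p-value, observed effect)."""
    group1 = rng.normal(loc=0.0, scale=1.0, size=n)    # standard normal, mean 0
    group2 = rng.normal(loc=d_pop, scale=1.0, size=n)  # standard normal, mean d_pop
    t_stat, p_value = stats.ttest_ind(group2, group1)
    d_obs = group2.mean() - group1.mean()              # observed effect (both sds are 1)
    return p_value, d_obs

rng = np.random.default_rng(0)
p_value, d_obs = simulate_study(20, 0.5, rng)
```

Because both groups have unit standard deviation, the difference in group means is directly comparable to dpop.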
The program varies n from 20 to 500 and dpop from 0 to 1, with 11 discrete values each (a total of 11² = 121 combinations). It simulates 10⁴ studies for each combination, yielding about 1.2 million simulated studies. An important limitation is that all population effect sizes are equally likely within the range studied. I don’t consider publication bias, which may make smaller effect sizes more likely, or any prior knowledge of expected effect sizes.
To generate pairwise replications, I consider all (ordered) pairs of study sets. For each pair, the software permutes the studies of each set, then combines the studies row-by-row. This multiplies out to 121² = 14,641 pairs of study sets and almost 150 million simulated replications. The first study of each pair is the original and the second the replica. I consistently use the suffixes 1 and 2 to denote the original and replica, respectively.
Four variables parameterize each pairwise replication: n1, n2, d1pop, and d2pop. These are the sample and population effect sizes for the two studies.
After forming the pairwise replications, the program discards replications for which the original study isn’t significant. This reflects the standard practice that non-significant findings aren’t published and thus aren’t candidates for systematic replication.
Next the program determines which replications should pass the replication test and which do pass the test. The ones that should pass are ones where the original study is a true positive, i.e., d1pop ≠ 0. The ones that do pass are ones where the replica has a significant p-value and effect size in the same direction as the original.
A false positive replication is one where the original study is a false positive (d1pop = 0) yet the replication passes the test. A false negative replication is one where the original study is a true positive (d1pop ≠ 0), yet the replication fails the test. The program calculates false positive and false negative rates (abbr. FPR and FNR) relative to the number of replications in which the original study is significant.
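The bookkeeping above can be sketched as a small Monte Carlo (my own simplified version in Python; unlike the post’s software, it re-simulates each pair directly rather than permuting precomputed study sets):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
ALPHA = 0.05

def study(n, d_pop):
    """One two-group study: return (p-value, observed effect)."""
    g1 = rng.normal(0.0, 1.0, n)
    g2 = rng.normal(d_pop, 1.0, n)
    _, p = stats.ttest_ind(g2, g1)
    return p, g2.mean() - g1.mean()

def pass_rate(n1, n2, d1_pop, d2_pop, n_sim=10_000):
    """Fraction of replications that pass the test, counted relative to
    pairs whose original study is significant (the rest are discarded)."""
    passes = kept = 0
    for _ in range(n_sim):
        p1, d1 = study(n1, d1_pop)
        if p1 >= ALPHA:            # non-significant originals aren't
            continue               # candidates for replication
        kept += 1
        p2, d2 = study(n2, d2_pop)
        if p2 < ALPHA and np.sign(d1) == np.sign(d2):
            passes += 1            # significant, same direction
    return passes / kept

fpr = pass_rate(20, 100, 0.0, 0.0)        # original is a false positive
fnr = 1 - pass_rate(20, 100, 0.5, 0.5)    # original is a true positive
```

For the exact null case, the pass rate should hover near ALPHA/2, since the replica must be significant and, by chance alone, in the same direction as the original.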
My definition of which replications should pass depends only on the original study. A replication in which the original study is a false positive and the replica study a true positive counts as a false positive replication. This makes sense if the overarching goal is to validate the original study. If the goal were to test the result of the original study rather than the study itself, it would make sense to count this case as correct.
To get “mistake rates” I need one more parameter: prop.true, the proportion of replications that are true. This is the issue raised in Ioannidis’s famous paper, “Why most published research findings are false”, and many other papers and blog posts, including one by me. The terminology for “mistake rates” varies by author. I use terminology adapted from Jager and Leek. The replication-wise false positive rate (RWFPR) is the fraction of positive results that are false positives; the replication-wise false negative rate (RWFNR) is the fraction of negative results that are false negatives.
RESULTS
Exact replications
A replication is exact if the two studies are sampling the same population; this means d1pop = d2pop.
Figure 1 shows FPR for n1 = 20 and n2 varying from 50 to 500. The x-axis shows all four parameters, using d1, d2 as shorthand for d1pop, d2pop. d1pop = d2pop = 0 throughout because this is the only way to get false positives with exact replications. Figure 2 shows FNR for the same values of n1 and n2, but with d1pop = d2pop ranging from 0.1 to 1.
I mark the conventionally accepted thresholds for false positive and negative error rates (0.05 and 0.2, resp.) as known landmarks to help interpret the results. I do not claim these are the right thresholds for replications.
[Figure 1: FPR for exact replications]

[Figure 2: FNR for exact replications]

For this ideal case, replication works exactly as intuition predicts. FPR is the significance level divided by 2 (the factor of 2 arises because the effect sizes must have the same direction). Theory tells us that FNR = 1 − power, and though not obvious from the graph, the simulated data agrees well.
As one would expect, if the population effect size is small, n2 must be large to reliably yield a positive result. For d = 0.2, n2 must be almost 400 in theory and 442 in the simulation to achieve FNR = 0.2; to hit FNR = 0.05, n2 must be more than 650 (in theory). These seem like big numbers for a systematic replication project that needs to run many studies.
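The quoted theoretical sample sizes can be double-checked with the standard normal-approximation power formula for a two-sample comparison, n per group ≈ 2(z_{α/2} + z_β)²/d² (an independent back-of-the-envelope check of mine, not the post’s code):

```python
# Approximate per-group sample size needed for a two-sample test
# at two-sided significance level alpha and the given power.
from scipy.stats import norm

def n_per_group(d, power, alpha=0.05):
    z_a = norm.ppf(1 - alpha / 2)   # critical value for two-sided alpha
    z_b = norm.ppf(power)           # quantile for the desired power
    return 2 * (z_a + z_b) ** 2 / d ** 2

n_for_fnr_20 = n_per_group(0.2, 0.80)  # FNR = 0.2: roughly 390 per group
n_for_fnr_05 = n_per_group(0.2, 0.95)  # FNR = 0.05: roughly 650 per group
```

These agree with the “almost 400” and “more than 650” figures quoted above.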
Near exact replications
A replication is near-exact if the populations differ slightly, which means d1pop and d2pop differ by a small amount, near; technically, abs(d1pop − d2pop) ≤ near.
I don’t know what value of near is reasonable for a systematic replication project. I imagine it varies by research area depending on the technical difficulty of the experiments and the variability of the phenomena. The range 0.1-0.3 feels reasonable. I extend the range by 0.1 on each end just to be safe.
Figure 3 uses the same values of n1, n2, and d1pop as Figure 1, namely n1 = 20, n2 varies from 50 to 500, and d1pop = 0. Figure 4 uses the same values of n1 and n2 as Figure 2 but fixes d1pop = 0.5, a medium effect size. In both figures, d2pop ranges from d1pop − near to d1pop + near with values less than 0 or greater than 1 discarded. I restrict values to the interval [0,1] because that’s the range of d in the simulation.
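One way to enumerate the replica effect sizes for the near-exact case, as just described (the step size of 0.1 and the function name are my choices, not the post’s):

```python
import numpy as np

def near_exact_d2(d1_pop, near, step=0.1):
    """Candidate d2pop values within `near` of d1_pop, clipped to [0, 1]."""
    d2 = np.arange(d1_pop - near, d1_pop + near + step / 2, step)
    return d2[(d2 >= 0) & (d2 <= 1)]
```

For example, with d1pop = 0.5 and near = 0.3 this yields 0.2, 0.3, …, 0.8; values outside [0, 1] are dropped, as in the figures.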
[Figure 3: FPR for near-exact replications]
[Figure 4: FNR for near-exact replications]
FPR is fine when n2 is small, especially when near is also small, but gets worse as n2 (and near) increase. It may seem odd that the error rate increases as the sample size increases. What’s going on is a consequence of power. More power is usually good, but in this setting every positive is a false positive, so more power is bad. This odd result is a consequence of how I define correctness. When the original study is a false positive (d1pop = 0) and the replica a true positive (d2pop ≠ 0), I consider the replication to be a false positive. This makes sense if we’re trying to validate the original study. If instead we’re testing the result of the original study, it would make sense to count this case as correct.
FNR behaves in the opposite direction: bad when n2 is small and better as n2 increases.
To show the tradeoff between FPR and FNR, Figure 5 plots both error rates for near = 0.1 and near = 0.3.
[Figure 5: FPR and FNR for near = 0.1 and near = 0.3]
For near = 0.1, n2 = 150 is a sweet spot with both error rates about 0.05. For near = 0.3, the crossover point is n2 = 137 with error rates of about 0.15.
FNR also depends on d1pop for “true” cases, i.e., when the original study is a true positive, getting worse when d1pop is smaller and better when d1pop is bigger. The table below shows the error rates for a few values of n2, near, and d1pop. Note that FPR only depends on n2 and near, while FNR depends on all three parameters. The FNR columns are for different values of d1pop in true cases.
[Table 1: FPR and FNR for selected values of n2, near, and d1pop]
FNR is great for d1pop = 0.8, mostly fine for d1pop = 0.5, and bad for d1pop = 0.2. Pushing up n2 helps but even when n2 = 450, FNR is probably unacceptable for d1pop = 0.2. Increasing n2 worsens FPR. It seems the crossover point above, n2 = 137, is about right. Rounding up to 150 seems a reasonable rule-of-thumb.
Replication-wise error rates
The error rates reported so far depend on whether the original study is a false or true positive: FPR assumes the original study is a false positive, FNR assumes it’s a true positive. The next step is to convert these into replication-wise error rates: RWFPR and RWFNR. To do so, we need one more parameter: prop.true, the proportion of replications that are true.
Of course, we don’t know the value of prop.true; arguably it’s the most important parameter that systematic replication is trying to estimate. Like near, it probably varies by research field and may also depend on the quality of the investigator. Some authors assume prop.true = 0.5, but I see little evidence to support any particular value. It’s easy enough to run a range of values and see how prop.true affects the error rates.
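Given FPR, FNR, and prop.true, the replication-wise rates follow from simple accounting over the four outcome categories. Here is one plausible formalization (my notation; the post computes these quantities from simulated counts rather than from a closed-form formula):

```python
def rw_rates(fpr, fnr, prop_true):
    """Replication-wise error rates, given the conditional error rates
    and prop_true, the proportion of replications that are true."""
    pos_false = (1 - prop_true) * fpr        # false originals that pass
    pos_true = prop_true * (1 - fnr)         # true originals that pass
    neg_false = prop_true * fnr              # true originals that fail
    neg_true = (1 - prop_true) * (1 - fpr)   # false originals that fail
    rwfpr = pos_false / (pos_false + pos_true)  # positives that are false
    rwfnr = neg_false / (neg_false + neg_true)  # negatives that are false
    return rwfpr, rwfnr
```

For instance, with FPR = 0.025, FNR = 0.2, and prop.true = 0.5, about 3% of positives are false but about 17% of negatives are; as prop.true falls, the positive side deteriorates rapidly.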
The table below shows the results for near = 0.1, 0.3 as above, and prop.true ranging from 0.1 to 0.9. The RWFPR and RWFNR columns are for different values of d1pop in “true” cases, i.e., when the original study is a true positive.
[Table 2: RWFPR and RWFNR for near = 0.1, 0.3 and prop.true from 0.1 to 0.9]
Check out the top and bottom rows. The top row depicts a scenario where most replications are false (prop.true = 0.1) and the replicas closely match the original studies (near = 0.1); for this case, most positives are mistakes and most negatives are accurate. The bottom row is a case where most replications are true (prop.true = 0.9) and the replicas diverge from the originals (near = 0.3); here most positives are correct and, unless d1pop is large, most negatives are mistakes.
Which scenario is realistic? There are plenty of opinions but scant evidence. Your guess is as good as mine.
DISCUSSION
Systematic replication is a poor statistical test when used to validate published studies. Replication works well when care is taken to ensure the replica closely matches the original study. This is the norm in focused, one-off replication studies aiming to confirm or refute a single finding. It seems unrealistic in systematic replication projects, which typically prioritize uniformity over fidelity to run lots of studies at reasonable cost. If the studies differ, as they almost certainly must in systematic projects, mistake rates grow and may be unacceptably high under many conditions.
My conclusions depend on the definition of replication correctness, i.e., which replications should pass. The definition I use in this post depends only on the original study: a replication should pass if the original study is a true positive; the replica study is just a proxy for the original one. This makes sense if the goal is to validate the original study. If the goal were to test the result of the original study rather than the study itself, it would make sense to let true positive replicas count as true positive replications. That would greatly reduce the false positive rates I report.
My conclusions also depend on details of the simulation. An important caveat is that population effect sizes are uniformly distributed across the range studied. I don’t consider publication bias which may make smaller effect sizes more likely, or any prior knowledge of expected effect sizes. Also, in the near exact case, I assume that replica effect sizes can be smaller or larger than the original effect sizes; many investigators believe that replica effect sizes are usually smaller.
My results suggest that systematic replication is unsuitable for validating existing studies. An alternative is to switch gears and focus on generalizability. This would change the mindset of replication researchers more than the actual work. Instead of trying to refute a study, you would assume the study is correct within the limited setting of the original investigation and try to extend it to other settings. The scientific challenge would become defining good “other settings” (presumably there are many sensible choices) and selecting studies that are a good fit for each. This seems a worthy problem in its own right that would move the field forward no matter how many original studies successfully generalize.
I’ve seen plenty of bad science up close and personal, but in my experience statistics isn’t the main culprit. The big problem I see is faulty research methods. Every scientific field has accepted standard research methods. If the methods are bad, even “good” results are likely to be wrong; the results may be highly replicable but wrong nonetheless.
The quest to root out bad science is noble but ultimately futile. “Quixotic” comes to mind. Powerful economic forces shape the size and make-up of research areas. Inevitably some scientists are better researchers than others. But “Publish or Perish” demands that all scientists publish research papers. Those who can, publish good science; those who can’t, do the best they can.
We will do more good by helping good scientists do good science than by trying to slow down the bad ones. The truly noble quest is to develop tools and techniques that make good scientists more productive. That’s the best way to get more good science into the literature.
Nat Goodman is a retired computer scientist living in Seattle, Washington. His working years were split between mainstream CS and bioinformatics and orthogonally between academia and industry. As a retiree, he’s working on whatever interests him, stopping from time to time to write papers and posts on aspects that might interest others. He can be contacted at natg@shore.net.

How Many Biases? Let Us Count the Ways

[From the article “Congratulations. Your Study Went Nowhere” by Aaron Carroll, published at http://www.nytimes.com]
“When we think of biases in research, the one that most often makes the news is a researcher’s financial conflict of interest. But another bias, one possibly even more pernicious, is how research is published and used in supporting future work.”
“A recent study in Psychological Medicine examined how four of these types of biases came into play in research on antidepressants.” 
“… Publication bias refers to the decision on whether to publish results based on the outcomes found. “
“… Outcome reporting bias refers to writing up only the results in a trial that appear positive, while failing to report those that appear negative.” 
“… Spin refers to using language, often in the abstract or summary of the study, to make negative results appear positive.”
“… Research becomes amplified by citation in future papers. The more it’s discussed, the more it’s disseminated both in future work and in practice. Positive studies were cited three times more than negative studies. This is citation bias.”
To read more, click here.

18,000 Retractions?

[From the video, “The Retraction Watch Database” by Ivan Oransky, posted at YouTube]
Ivan Oransky, MD, co-founder of the website Retraction Watch, gave a talk at the Joint Roadmap for Open Science Tools Workshop at Berkeley in August. In this short video, Oransky talks about how retracted papers continue to be cited after they have been retracted. In 40% of cases where a retracted paper is cited, there is no acknowledgment that the paper has been retracted (likely because the citing author did not know). And then there’s this:
[Screenshot from the video]
To watch the full video (it’s only about 8 and a half minutes long), click here.

M Is For Pizza

[From the blog post “‘Tweeking’: The big problem is not where you think it is” by Andrew Gelman, posted at Statistical Modeling, Causal Inference, and Social Science]
“In her recent article about pizzagate, Stephanie Lee included this hilarious email from Brian Wansink, the self-styled “world-renowned eating behavior expert for over 25 years”:
[Screenshot of the email]
“OK, what grabs your attention is that last bit about “tweeking” the data to manipulate the p-value, where Wansink is proposing research misconduct (from NIH: “Falsification: Manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record”).”
“But I want to focus on a different bit: “. . . although the stickers increase apple selection by 71% . . .””
“This is the type M (magnitude) error problem—familiar now to us, but not so familiar a few years ago to Brian Wansink, James Heckman, and other prolific researchers.”
To read more, click here.

CSI: Tilburg

[From the article “The truth squad” by Erik Stokstad, published in Science]
“In August 2011, Diederik Stapel, a prominent psychologist and a dean at Tilburg University here, confessed to faking data for dozens of papers over 15 years. As part of an internal investigation, Marcel van Assen, a psychologist in the university’s Department of Methodology and Statistics, spent months looking into Stapel’s data, methods, and results. The scope of the fraud was staggering, but just as alarming as the fabricated data, Van Assen says, were the flawed analyses, rife with statistical problems, that Stapel had performed. The fact that all his papers had been approved by co-authors and published in respectable journals meant psychology had a larger problem, Van Assen says. “I thought, holy shit, this is not a characteristic just of Stapel or Tilburg.””
To read more, click here.

An Economist’s Journey Into the Replication Crisis

[From the blog “Why We Cannot Trust the Published Empirical Record in Economics and How to Make Things Better” by Sylvain Chabé-Ferret, posted at the blogsite An Economist’s Journey]
“A strain of recent results is casting doubt on the soundness of the published empirical results in economics. Economics is now undergoing a replication crisis similar to the one psychology and cancer research have undergone in the last ten years. This crisis is so broad that it concerns all of the published empirical results and it is so severe that it might mean that most of them are wrong. The mere fact that we cannot trust any of them and do not know which ones hold and which ones do not casts doubt on all of the empirical results in our field. This is very serious business.”

“In this blog post, I want to briefly explain what the replication crisis in economics is and what are its most likely causes. I’ll then provide some evidence along with personal stories and anecdotes that illustrate the bad behaviors that generate the replication crisis. Then, I’m going to detail the set of solutions that I think we need in order to overcome the crisis. Finally, I will end with a teaser about a project that I am preparing with some colleagues, the Social Science Knowledge Accumulation Initiative, or SKY (yeah, I pronounce SSKAI SKY, can I do that?), that we hope is going to provide the infrastructure necessary in order to implement the required changes.”

To read more, click here.

IN THE NEWS: Bloomberg (September 18, 2018)

[From the article “Why Economics Is Having a Replication Crisis” by Noah Smith, published at http://www.bloomberg.com]
“By now, most people have heard of the replication crisis in psychology. When researchers try to recreate the experiments that led to published findings, only slightly more than half of the results tend to turn out the same as before. Biology and medicine are probably riddled with similar issues.”
“But what about economics? Experimental econ is akin to psychology, and has similar issues. But most of the economics research you read about doesn’t involve experiments — it’s empirical, meaning it relies on gathering data from the real world and analyzing it statistically. Statistical calculations suggest that there are probably a lot of unreliable empirical results getting published and publicized.”
“…That doesn’t mean that single results aren’t worth reporting or taking into account, but a single finding shouldn’t be enough to generate certainty about how the world works. In a universe filled with uncertainty, social science can’t progress by leaps and bounds — it must crawl forward, feeling its way inch by inch toward a little more truth.”
To read more, click here.


Are Meta-Analyses Overrated?

[From the article, “Meta-analyses were supposed to end scientific debates. Often, they only cause more controversy” by Jop de Vrieze, published at http://www.sciencemag.org]
“Meta-analyses were thought to be debate enders, but now we know they rarely are,” Ferguson says. “They should be regarded as an argument, not a fact.” It’s a paradox, says Jacob Stegenga, a philosopher of science at University of Cambridge in the United Kingdom: “When the evidence points clearly in one direction, there is little need for a meta-analysis. When it doesn’t, a meta-analysis is unlikely to give the final answer.”
To read more, click here.

ALL INVITED: Workshop on Reproducibility and Integrity in Scientific Research

DATE: Friday 26 October.
PLACE: University of Canterbury, Business School, Meremere, Room 236, Christchurch, NEW ZEALAND
REGISTRATION (important for catering purposes): email to tom.coupe@canterbury.ac.nz
COST: Nada ($0)
Supported by the University of Canterbury Business School Research Committee.
OVERVIEW: There is more and more evidence that findings from many scientific studies cannot be reproduced, casting doubt on the reliability of these studies. At the workshop, we will discuss the extent of this replication crisis, explore various methods that can be used to check whether a study can be replicated, and present tools that can be used to make one’s own research more reproducible and trustworthy.
SPEAKERS AND PRESENTATION TITLES:
– Anton Angelo (University of Canterbury/Library): Transparency and Reproducibility – It’s All About Layers
– Arin Basu (University of Canterbury/Health Sciences): What about Why?
– Annette N. Brown (FHI 360): Which Tests Not Witch Hunts: A Diagnostic Approach to Conducting Replication Research
– Brian Haig (University of Canterbury/Psychology): Understanding Replication in A Way That Is True To Science
– Jeff Miller (University of Otago/Psychology): The Statistical Fundamentals of (Non)-Replicability
– Thomas Pfeiffer (Massey University/Computational Biology/Biochemistry): Betting On Your Peers’ Results: A Tale of Three Markets
– W. Robert Reed (University of Canterbury/Business School): An Update on the Progress of Replications in Economics
– Philip Schluter (University of Canterbury/Health Sciences): A Bayesian Alternative to Hypothesis Testing
– Eric Vanman (University of Queensland/Psychology). How Pre-Registrations Can Improve Science: Tales from the Front-Line
– Ben Wood (Integra LLC): Lessons Learned From Running a Social Science Replication Program
PROGRAMME
9.00-9.30: Registration
9.30-9.35: Introduction
9.35-11.05: SESSION: Replication – Theory and Current Status
11.05-11.30: Coffee Break
11.30-12.30: SESSION: How to Detect the Truth
12.30-13.30: Lunch
13.30-15.00: SESSION: Making Research More Reproducible
15.00-15.30: Coffee Break
15.30-16.30: SESSION: Observations from the Front Lines
16.30-17.00: Closing
ABSTRACTS OF TALKS
SESSION: Replication – Theory and Current Status
– W. Robert Reed (University of Canterbury/Business School): An Update on the Progress of Replications in Economics
Abstract. The last two decades have seen increasing doubt about the credibility of empirical research in science. This has come to be known as the “Replication Crisis,” with the name derived from the fact that many reported empirical findings cannot be reproduced/replicated. Relative to their peers in psychology and political science, economists have been slow to recognize the problem and consider solutions. In 2015, Duvendack et al. published “Replications in Economics: A Progress Report”. Among other things, that study reported (i) the number of replications published in economics over time, (ii) journals that state they publish replications on their websites, (iii) journals that actually publish replications, and (iv) journals that regularly publish data and code along with their empirical papers. This presentation will update those numbers and identify recent trends.
– Jeff Miller (University of Otago/Psychology): The Statistical Fundamentals of (Non)-Replicability
Abstract. A popular conception of science holds that real phenomena should always be replicable when the appropriate conditions are met. Unfortunately, this conception does not hold in scientific fields with inherently probabilistic measurements; in such fields, real phenomena do sometimes fail to replicate. Simple statistical models can illuminate why such errors occur and how their probabilities can be computed, while also clarifying two distinct views of “replication probability”. The models also reveal what quantities are needed to determine the probability of a successful replication. The difficulty of getting good estimates of these quantities makes it hard to determine whether recently observed—and much publicized—replication failures are “as expected” or should instead be regarded as signs of flawed scientific practices.
– Brian Haig (University of Canterbury/Psychology): Understanding Replication in A Way That Is True To Science
Abstract. Successful replication of original research is widely held to be the gold standard by which scientific claims are evaluated. However replication research is said to be rare in the behavioural and biomedical sciences, and recent attempts to replicate published findings in these sciences have reported discouraging results. These two facts have led many to conclude that large tracts of science are in the midst of a replication crisis. As a consequence, it has variously been claimed that the affected disciplines constitute poor science, that their knowledge claims are of doubtful validity, and that much needs to be done to improve their research practices. In this presentation, I challenge a number of widely held assertions that have been made about the nature of replication and its place in science. These challenges are based on underappreciated understandings of good scientific practice.
SESSION: How to Detect the Truth
– Philip Schluter (University of Canterbury/Health Sciences): A Bayesian Alternative to Hypothesis Testing
Abstract. The vast majority of quantitative health research is predicated on Positivism and the Frequentist statistical inference machinery. However, it might be argued that this approach and these methods do not always answer the questions that researcher think are (or want to be) answered. A Bayesian alternative exists – which might have appeal. This talk provides a glimpse into this alternative approach.
– Arin Basu (University of Canterbury/Health Sciences): What about Why?
Abstract. This talk will aim to provide a journey through the world of causal thinking, focusing on the health sciences. In the course of this presentation, I’d like to traverse the different perspectives that have shaped our thinking on what constitutes a cause. Starting with Mill’s canons and Sewall Wright’s path analysis, we will examine how Hill’s criteria and Ken Rothman’s notions of necessary and sufficient causal linkages have shaped evidence processes in health and related sciences.
SESSION: Making Research More Reproducible
– Annette N. Brown (FHI 360): Which Tests Not Witch Hunts: A Diagnostic Approach to Conducting Replication Research
Abstract. Replication research can be used to explore original study results that researchers consider questionable, but it should also be a tool for reinforcing the credibility of results that are important to policies and programs. The challenge is to design a replication plan open to both supporting the original findings and uncovering potential problems. Ben Wood and I compiled the examples and lessons from several replication studies to provide researchers with an objective list of checks or tests to consider when planning a replication study. We present tips for diagnostic exercises in four groups: validity of assumptions, data transformations, estimation methods, and heterogeneous impacts. We also provide a list of don’ts for how to conduct and report replication research. In this presentation, I will summarize these tips and suggestions using some examples from replication research.
– Anton Angelo (University of Canterbury/Library): Transparency and Reproducibility – It’s All About Layers
Abstract. In order for research to be easily replicated and verified, some things need to be ‘baked in’ from the start. This talk looks at the layers required for data and analysis – how to describe data meaningfully, the licences required for others to be able to reuse your data (even if it’s just to verify it), and the mind-set required to make it successful.
– Eric Vanman (University of Queensland/Psychology): How Pre-Registrations Can Improve Science: Tales from the Front-Line
Abstract. Pre-registration is a declaration of the researchers’ plans prior to the start of a study. It includes stating a priori hypotheses, rules about how many participants there will be and who they will be, the procedures to be used, and how the data will be analysed. I will talk about some options for completing a pre-registration of any study, from an undergraduate research thesis to a grant-funded project. I will also review some of the ways that we have encouraged pre-registration in the UQ School of Psychology, which includes promoting open science practices to students and staff.
SESSION: Observations from the Front Lines
– Thomas Pfeiffer (Massey University/Computational Biology/Biochemistry): Betting On Your Peers’ Results: A Tale of Three Markets
Abstract. Prediction markets are popular mechanisms for the forecasting of future events. They have been used in a variety of domains, including sports and politics. In a recent series of studies we used prediction markets in science to have scientists bet on the reliability of their peers’ publications. The markets predicted the outcomes of the replications well, suggesting that there is knowledge about the reliability of studies in the research community that can be elicited through market mechanisms. This knowledge is valuable for following the dynamics of hypothesis testing, and can also be employed more broadly for optimizing decision-making in science.
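The abstract does not describe the market mechanism used; as a hedged illustration only, the sketch below implements Hanson’s logarithmic market scoring rule (LMSR), a mechanism commonly used in prediction markets (an assumption here, not a claim about these particular studies). Traders buy shares in “replicates” versus “fails to replicate”, and the instantaneous price of a YES share serves as the market’s implied replication probability.

```python
import math

def lmsr_cost(q_yes, q_no, b=10.0):
    """LMSR cost function over outstanding share quantities; b sets liquidity."""
    return b * math.log(math.exp(q_yes / b) + math.exp(q_no / b))

def lmsr_price(q_yes, q_no, b=10.0):
    """Instantaneous YES price = implied probability that the study replicates."""
    e_yes = math.exp(q_yes / b)
    e_no = math.exp(q_no / b)
    return e_yes / (e_yes + e_no)

# Market opens at 50/50. A trader who believes the study will replicate buys
# 5 YES shares, paying the change in the cost function.
q_yes, q_no = 0.0, 0.0
charge = lmsr_cost(q_yes + 5, q_no) - lmsr_cost(q_yes, q_no)
q_yes += 5

print(f"trade cost: {charge:.3f}")
print(f"implied P(replicates): {lmsr_price(q_yes, q_no):.3f}")
```

Because each purchase moves the price, the final prices aggregate the bets of many traders into a single probability forecast, which is the quantity compared against actual replication outcomes.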
– Ben Wood (Integra LLC): Lessons Learned From Running a Social Science Replication Program
Abstract. Over six years, Ben co-designed and co-managed the replication program for the International Initiative for Impact Evaluation (3ie). He will provide an overview of the program and present lessons learned around terminology standardization, incentivizing processes, refereeing replication research, and requiring push button replication. The talk will intersperse anecdotes from the replication program with results from recent replication-related research. He will also dispel the myth that replication research is unpublishable by highlighting recently published replication studies. The lessons learned will provide valuable feedback to the producers, users, and funders of evidence.