Open Invitation to the Webinar Launch of the InSPiR2eS International Research Alliance

InSPiR2eS is a new global research network primarily aimed at research training and capacity building, resting on a foundation theme of responsible science (for some more details, please refer to the 2-pager outline here).

Whether you are a current network member or not, you are warmly invited to the 1-hour webinar launch of the network, taking place during the window of 22-24 June.

For your convenience, the Zoom launch is offered in 3 separate repeat events summarised below (please see here for a doc that gives more details confirming equivalent dates/times for your part of the world):

#1: Tuesday 22nd June at 18:00 Australian Eastern Standard Time (AEST)

Topic: Robert Faff’s Zoom Meeting #1 launching InSPiR2eS

Join from a PC, Mac, iOS or Android: https://bond.zoom.us/j/91798395671

#2: Wednesday 23rd June at 15:00 AEST

Topic: Robert Faff’s Zoom Meeting #2 launching InSPiR2eS

Join from a PC, Mac, iOS or Android: https://bond.zoom.us/j/99592757933

#3: Thursday 24th June at 06:00 AEST

Topic: Robert Faff’s Zoom Meeting #3 launching InSPiR2eS

Join from a PC, Mac, iOS or Android: https://bond.zoom.us/j/92796833442

If you are interested in joining the Zoom launch of InSPiR2eS, please register ASAP at the Google Docs link here.

Finally, please share this open invitation with whomever you think might be interested. Thank you!

repliCATS is Back!

About the repliCATS project

Based at the University of Melbourne, the repliCATS project team are part of a wider program called SCORE funded by DARPA. We are excited about reimagining peer review as a structured deliberation process. We’re testing this by crowdsourcing expert judgements about the credibility of published research in the social sciences – criminology, education, economics, marketing, management, political science, psychology, public administration, and sociology.

repliCATS workshops – get paid to do post-publication peer review!

In workshops you will be asked to evaluate the credibility of two to three published papers within your domain expertise. We use a structured group deliberation approach called the IDEA protocol on a custom-built web-based platform, which means you can participate from anywhere in the world! Working in small groups, you will be asked to evaluate papers across a number of credibility signals – from their comprehensibility to their replicability, robustness and transparency.

Participants will first make private judgements about the paper, and then get to review and discuss their group’s responses before submitting their final judgements. For these workshops, we’ll group you by compatible time zones, and each group will have its own facilitator who will guide you through a short virtual discussion.

All participants are eligible for a US$200 assessment grant.

Your time commitment

Workshops will run over a period of 7 days, but you get to make your judgements when it suits you over that period. Here is how it works:

1) Consent & demographics, and create repliCATS platform account before the workshop: 20-25 minutes.

2) Workshop introduction: 45 mins, a number of live webinars run by the repliCATS team. Also recorded and can be watched on demand.

3) Assessment period: 45-70 minutes/paper at your own pace over the 7-day assessment period.

4) Group discussion: 30-45 minutes during the 7-day assessment period, dedicated facilitator will guide discussion.

5) Assessment grant administration: 5-10 minutes.

Sign-ups now open for July, August & September workshops

20-27 July – economics, marketing & management (35-40 participants)

24-31 August – criminology, political science & sociology (35-40 participants)

21-28 September – criminology, political science, public administration (government and law) & sociology (35-40 participants)

Sign-up here:  https://replicats.research.unimelb.edu.au/2021/06/02/express-interest-now-open-jul-sept-bushel-workshops/

Have questions? E-mail us at repliCATS-project@unimelb.edu.au  

DUAN & REED: How Are Meta-Analyses Different Across Disciplines?

INTRODUCTION

Recently, one of us gave a workshop on how to conduct meta-analyses. The workshop was attended by participants from a number of different disciplines, including economics, finance, psychology, management, and health sciences. During the course of the workshop, it became apparent that different disciplines conduct meta-analyses differently. While there is a vague awareness that this is the case, we are unaware of any attempts to quantify those differences. That is the motivation for this blog.

We collected recent meta-analyses across a number of different disciplines and recorded information on the following characteristics:

– Size of meta-analysis sample, measured both by number of studies and number of estimated effects included in the meta-analysis

– Type of effect size

– Software package used

– Procedure(s) used to estimate effect size

– Type of tests for publication bias

– Frequency that meta-analyses report (i) funnel plots, (ii) quantitative tests for publication bias, and (iii) meta-regressions.

Unfortunately, given the large number of meta-analyses, and large number of disciplines that do meta-analyses, we were unable to do an exhaustive analysis. Instead, we chose to identify the disciplines that publish the most meta-analyses, and then analyse the 20 most recent meta-analyses published in those disciplines.

LITERATURE SEARCH

To conduct our search, we utilized the library search engine at our university, the University of Canterbury. This search engine, while proprietary to our university, allowed us to simultaneously search multiple databases by discipline (see below).

We conducted our search in January 2021. We used the keyword “meta-analysis”, filtering on “Peer-reviewed” and “Journal article”, and restricted our search depending on publication date. A total of 58 disciplines were individually searchable, including Agriculture, Biology, Business, Economics, Education, Engineering, Forestry, Medicine, Nursing, Physics, Political Science, Psychology, Public Health, Sociology, Social Welfare & Social Work, and Zoology.

Of the 58 disciplines we could search on, 18 stood out as publishing substantially more meta-analyses than the others. These are listed below. For each discipline, we then searched for all meta-analyses/”Peer-reviewed”/”Journal article” that were published in January 2021, sorted by relevance. We read through titles and abstracts until we found 20 meta-analyses. If January 2021 produced fewer than 20 meta-analyses for a given discipline, we extended the search back to December 2020. In this manner, we constructed a final sample of 360 meta-analyses. The results are reported below.

NUMBER OF STUDIES

TABLE 1 below reports mean, median, and minimum number of studies for each sample of 20 meta-analyses corresponding to the 18 disciplines. Maximum values are indicated by green shading. Minimum values are indicated by blue.

The numbers indicate wide differences across disciplines in the number of studies included in a “typical” meta-analysis. Business meta-analyses tend to have the largest number of studies, with mean and median values of 87.6 and 88 studies, respectively. Ecology and Economics also typically include large numbers of studies.

At the other end, disciplines in the health sciences (Dentistry, Diet & Clinical Nutrition, Medicine, Nursing, and Pharmacy, Therapeutics & Pharma) include relatively few studies. The mean and median number of studies are 13.9 and 11 in Diet & Clinical Nutrition, and 14.8 and 10 in Nursing. We even found a meta-analysis in Dentistry that included only 2 studies.

NUMBER OF EFFECTS

Meta-analyses differ not only in the number of studies they include, but also in the total number of observations/estimated effects. In some fields, it is common to include a single representative effect, or the average effect, from each study. Other disciplines include extensive robustness checks, where the same effect is estimated multiple times using different estimation procedures, variable specifications, and subsamples. Similarly, there may be multiple measures of the same effect, sometimes included in the same equation, and these produce multiple estimates.

Measured by number of estimated effects, Agriculture has the largest meta-analyses with mean and median sample sizes of 934 and 283. Not too far behind are Economics and Business. These three disciplines are characterized by substantially larger samples than other disciplines. As with number of studies, the disciplines with the smallest number of effects per study are health-related fields such as Dentistry, Diet & Clinical Nutrition, Medicine, Nursing, Pharmacy, Therapeutics & Pharma, and Public Health.

MEASURES OF EFFECT SIZE

Disciplines also differ in the effects they measure. We identified six main types of effects: (i) Mean Differences, including standardized mean differences, Cohen’s d, and Hedges’ g; (ii) Odds Ratios; (iii) Risk Ratios, including Relative Risk, Response Ratios, and Hazard Ratios; (iv) Correlations, including Fisher’s z; (v) Partial Correlations; and (vi) Estimated Effects.

We differentiate correlations from partial correlations because the latter primarily appear in Economics. Likewise, Economics is somewhat unique because the range of estimated effects varies widely across primary studies, with studies focusing on elasticities, various treatment effects, and other effects such as fiscal multipliers or model parameters. The table below lists the most common and second most common effect sizes investigated by meta-analyses across the different disciplines.

Why does it matter that meta-analyses differ in their sizes and estimated effects? In a recent study, Hong and Reed (2021) present evidence that the performance of various estimators depends on the size of the meta-analyst’s sample. They provide an interactive ShinyApp that allows one to filter performance measures by various study characteristics in order to identify the best estimator for a specific research situation. Performance may also depend on the type of effect being estimated (see here for some tentative experimental evidence on partial correlations).

ESTIMATION – Estimators

One way in which disciplines are very similar is in their reliance on the same estimators to estimate effect sizes. TABLE 4 reports the two most common estimators by discipline. Far and away the most common is the Random Effects estimator, which allows for heterogeneous effects across studies.

The second most common estimator is the Fixed Effects estimator, which is built on the assumption of a single population effect, whereby studies produce different estimated effects due only to sampling error. A close relative of the Fixed Effects estimator, common in Economics, is the Weighted Least Squares estimator of Stanley and Doucouliagos. This estimator produces coefficient estimates identical to the Fixed Effects estimator, but with different standard errors. Although Random Effects is the most common estimator, Hong and Reed (2021) show that it frequently underperforms relative to other meta-analytic estimators.
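
For readers who want to see the mechanics, here is a minimal Python sketch (not taken from any study in our sample) of the three estimators just discussed, applied to hypothetical estimated effects and standard errors. It assumes the DerSimonian-Laird version of Random Effects and one common formulation of the Stanley-Doucouliagos WLS standard error.

```python
# A hedged sketch of fixed-effect, WLS, and random-effects estimation on hypothetical data.
import numpy as np

def meta_estimates(y, se):
    """Fixed-effect, WLS (Stanley-Doucouliagos), and DerSimonian-Laird random-effects estimates."""
    y, se = np.asarray(y, float), np.asarray(se, float)
    k = len(y)
    w = 1 / se**2                                   # inverse-variance weights

    # Fixed effect: inverse-variance weighted mean
    fe = np.sum(w * y) / np.sum(w)
    fe_se = np.sqrt(1 / np.sum(w))

    # WLS: same point estimate, standard error scaled by the regression's root MSE
    q = np.sum(w * (y - fe)**2)                     # Cochran's Q
    wls_se = fe_se * np.sqrt(q / (k - 1))

    # DerSimonian-Laird random effects
    tau2 = max(0.0, (q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    w_re = 1 / (se**2 + tau2)
    re = np.sum(w_re * y) / np.sum(w_re)
    re_se = np.sqrt(1 / np.sum(w_re))

    return {"FE": (fe, fe_se), "WLS": (fe, wls_se), "RE": (re, re_se)}

# Hypothetical data: five estimated effects with their standard errors
print(meta_estimates([0.12, 0.25, 0.05, 0.40, 0.18], [0.05, 0.10, 0.08, 0.15, 0.07]))
```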

SOFTWARE PACKAGES

Another way in which disciplines differ is with respect to the software packages they use. These include a number of standalone packages such as MetaWin, RevMan (for Review Manager), and CMA (for Comprehensive Meta-Analysis); as well as packages designed to be used in conjunction with comprehensive software programs such as R and Stata.  

A frequently used R package is metafor. Stata has a built-in meta-analysis suite called meta. In addition to these packages, many researchers have customized their own programs to work with R or Stata. As an example, in economics, Tomas Havránek has published a wide variety of meta-analyses using customized Stata programs. These can be viewed here.

TABLE 5 reports the most common software packages used by the studies in our sample. It is clear that R and Stata are the packages of choice for most researchers when estimating effect sizes.

ESTIMATION – Tests for Publication Bias

Another area where there is much commonality among disciplines is statistical testing for publication bias. While disciplines differ in how frequently they report such tests (see below), when they do, they usually rely on some measure of the relationship between the estimated effect size and its standard error or variance.

Egger’s test is the most common statistical test for publication bias. It consists of a regression of the effect size on the standard error of the effect size. Closely related is the FAT-PET (or its extension, FAT-PET-PEESE). FAT-PET stands for Funnel Asymmetry Test – Precision Effect Test. This is essentially the same as an Egger regression except that the regression is also used to obtain a publication-bias adjusted estimate of the effect size (“PET”, since this effect is commonly estimated in a specification where the mean effect size is measured by the coefficient on the effect size precision variable).
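
As an illustration, the following hedged Python sketch runs the FAT-PET regression in the precision form described above on hypothetical data: the study t-statistics are regressed on a constant and on precision (1/SE), so the intercept gives the FAT test for funnel asymmetry and the coefficient on precision gives the PET estimate of the bias-adjusted effect.

```python
# A minimal FAT-PET sketch (precision form) on hypothetical data.
import numpy as np

def fat_pet(y, se):
    y, se = np.asarray(y, float), np.asarray(se, float)
    t = y / se                                        # study t-statistics
    X = np.column_stack([np.ones_like(se), 1 / se])   # [constant, precision]
    coef, *_ = np.linalg.lstsq(X, t, rcond=None)
    resid = t - X @ coef
    sigma2 = resid @ resid / (len(y) - 2)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    se_coef = np.sqrt(np.diag(cov))
    return {"FAT (bias) coefficient": (coef[0], se_coef[0]),
            "PET (effect) coefficient": (coef[1], se_coef[1])}

print(fat_pet([0.12, 0.25, 0.05, 0.40, 0.18], [0.05, 0.10, 0.08, 0.15, 0.07]))
```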

The rank correlation test, also known as Begg’s test or the Begg and Mazumdar rank correlation test, works very similarly except rather than a regression, it rank correlates the estimated effect size with its variance. Other tests, such as Trim and fill, Fail-safe N, and tests based on selection models, are less common.
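
A minimal sketch of the Begg and Mazumdar rank correlation test is given below, assuming the usual standardization of effects around the fixed-effect pooled estimate; the data are again hypothetical.

```python
# A hedged sketch of Begg's rank correlation test on hypothetical effects and variances.
import numpy as np
from scipy.stats import kendalltau

def begg_test(y, var):
    y, var = np.asarray(y, float), np.asarray(var, float)
    w = 1 / var
    pooled = np.sum(w * y) / np.sum(w)          # fixed-effect pooled estimate
    v_star = var - 1 / np.sum(w)                # variance of (y_i - pooled)
    t_star = (y - pooled) / np.sqrt(v_star)     # standardized deviations
    return kendalltau(t_star, var)              # rank-correlate with the variances

print(begg_test([0.12, 0.25, 0.05, 0.40, 0.18],
                [0.05**2, 0.10**2, 0.08**2, 0.15**2, 0.07**2]))
```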

OTHER META-ANALYSIS FEATURES

In addition to the characteristics identified above, disciplines also differ by how commonly they report information in addition to estimates of the effect size. Three common features are funnel plots, publication bias tests, and meta-regressions.

Funnel plots can be thought of as a qualitative Egger’s test. Rather than a regression relating the estimated effect size to its standard error, a funnel plot displays that relationship graphically, providing a visual impression of potential publication bias. As is apparent from TABLE 6, not all meta-analyses report funnel plots. They appear to be particularly scarce in Agriculture, where only 15% of our sampled meta-analyses reported a funnel plot. For most disciplines, roughly half of the meta-analyses reported funnel plots. Funnel plots were most frequent in Medicine, with approximately 4 out of 5 meta-analyses showing one.

TABLE 6 reports the most common statistical tests for publication bias conditional on such tests being carried out. While not all meta-analyses test for publication bias, most do. 15 of the 18 disciplines had a reporting rate of at least 50% when it comes to statistical tests of publication bias. Anatomy & Physiology and Diet & Clinical Nutrition had the highest rates, with 85% of meta-analyses reporting tests for publication bias. Agriculture had the lowest at 30%.

The last feature we focus on is meta-regression. A meta-regression is a regression where the dependent variable is the estimated effect size and the explanatory variables consist of various study, data, and estimation characteristics that the researcher believes may influence the estimated effect size. Technically speaking, an Egger regression is a meta-regression. However, here we restrict it to studies that attempt to explain differences in estimated effects across studies by relating them to characteristics of those studies beyond the standard error of the effect size.
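
As a toy illustration of the definition above, the sketch below runs a weighted least squares meta-regression of hypothetical effect sizes on two made-up moderators (publication year and a dummy for randomized designs), using inverse-variance weights.

```python
# A hedged meta-regression sketch: WLS of effect sizes on hypothetical study characteristics.
import numpy as np

def meta_regression(y, se, X):
    """WLS of effect sizes y on moderators X (with constant), weights 1/se^2."""
    y, se, X = np.asarray(y, float), np.asarray(se, float), np.asarray(X, float)
    X = np.column_stack([np.ones(len(y)), X])
    sw = 1 / se                                  # square root of inverse-variance weights
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef

y  = [0.12, 0.25, 0.05, 0.40, 0.18]
se = [0.05, 0.10, 0.08, 0.15, 0.07]
X  = [[2015, 1], [2017, 0], [2018, 1], [2019, 0], [2020, 1]]   # year, randomized dummy
print(meta_regression(y, se, X))    # constant, year, and design coefficients
```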

Meta-regressions are very common in Economics, with almost 9 out of 10 meta-analyses including them. They are less common in other disciplines, with most disciplines having a reporting rate less than 50%. None of the 20 Agriculture meta-analyses in our sample reported a meta-regression.

Nevertheless, there are other ways that meta-analyses can explore systematic differences in effect sizes. Many studies perform subgroup analyses. For example, a study of the effect of a certain reading program may break out the full sample according to the predominant racial or ethnic characteristics of the school jurisdiction to determine whether these characteristics are related to the effectiveness of the program.

CONCLUSION

While our results are based on a limited sampling of meta-analyses, they indicate that there are important differences in meta-analytic research practices across disciplines. Researchers can benefit from this knowledge by adapting their work accordingly if they are considering submitting it to interdisciplinary journals. Likewise, being familiar with another discipline’s norms enables one to provide a fairer, more objective review when called to referee meta-analyses from journals outside one’s discipline.

As noted above, estimator performance may also be impacted by study and data characteristics. While some research has explored this topic, this is largely unexplored territory. Recognizing that meta-analyses from different disciplines have different characteristics should make one sensitive that estimators and practices that are optimal in one field may not be well suited in others. We hope this study encourages more research in this area.

Jianhua (Jane) Duan is a post-doctoral fellow in the Department of Economics at the University of Canterbury. She is being supported by a grant from the Center for Open Science. Bob Reed is Professor of Economics and the Director of UCMeta at the University of Canterbury. They can be contacted at jianhua.duan@pg.canterbury.ac.nz and bob.reed@canterbury.ac.nz, respectively.

FAFF: International Society of Pitching Research for Responsible Science

What is the International Society of Pitching Research for Responsible Science (InSPiR2eS) research network?

InSPiR2eS is a globally-facing research network primarily aimed at research training and capacity building, resting on a foundation theme of responsible science.

The network succeeds if (beyond what we would otherwise have achieved) it inspires:

– responsible research – i.e., research that produces new knowledge that is credible, useful & independent.

– productive research collaboration & partnerships – locally, regionally & globally.

– a collective sense of purpose and achievement towards the whole research process.

Importantly, the network aims to inclusively embrace like-minded university researchers centred on the multi-faceted utility provided by the “Pitching Research” framework, as a natural enabler of responsible science. 

The network alliance is a fully “opt in” organisational structure. Through the very act of joining the network, each member will abide by an appropriate Code of Conduct (under development), including: privacy, confidentiality, communication and notional IP relating to research ideas.

Why create InSPiR2eS?

The underlying premise for creating InSPiR2eS is to facilitate an efficient co-ordinated sharing of relevant research information and resources for the mutual benefit of all participants – whether this occurs through the inputs, processes or outputs linked to our research endeavours. More generally, while the enabling focus is on the Pitching Research framework, the network can offer its members a global outreach for their research efforts – in new and novel ways.

For example, actively engaging the network could spawn new research teams and projects, or other international initiatives and alliances. While such positive outcomes could “just happen” anyway, even without a new network, the network could, for example, experiment with a “shark tank”-style webinar event in which members pitch for new project collaborators. Exploiting the power of a strong alliance, the network can deliver highly leveraged outcomes compared to what is possible when we act alone – as isolated “sole traders”.

Who are the members of InSPiR2eS?

Professor Robert Faff (Bond University), as the network initiator, is the network convenor and President of InSPiR2eS. Currently, the network has more than 500 founding Ambassadors, Members and Associate Members already signed up, representing 73 countries/jurisdictions: Australia, Pakistan, China, Canada, New Zealand, Vietnam, Brazil, Nigeria, Germany, Indonesia, the Netherlands, England, Kenya, Romania, Poland, Mauritius, Sri Lanka, Bangladesh, Italy, Spain, India, Scotland, Singapore, Japan, Norway, Ireland, the US, Malaysia, Chile, Turkey, Wales, Serbia, Belgium, Thailand, France, South Africa, Switzerland, Croatia, Czech Republic, Hong Kong, Taiwan, Macau, South Korea, Greece, Ukraine, Ghana, Slovenia, Austria, Cyprus, Uganda, Namibia, Portugal, Tanzania, Fiji, Saudi Arabia, Estonia, Iceland, Egypt, Mongolia, Lithuania, Slovakia, Finland, Sweden, Ecuador, Israel, Hungary, UAE, North Cyprus, Mozambique, Philippines, Nepal, Argentina, Malta.

How will InSPiR2eS operate?

Phase 1: Network setup and initial information exchange.

To begin with, we will rely (mostly) on email communication. We will establish an e-newsletter – to provide engaging and organised information exchange. Dr Searat Ali (University of Wollongong) has agreed to be the inaugural Editor of the InSPiR2eS e-Newsletter (in his role as VP – Communications). 

Phase 2: Establishing interactive network engagement.

Live webinar Zoom sessions will be offered on topics linked to the network Mission. These sessions will be recorded and freely accessible from an InSPiR2eS “Resource Library”. Initially, they will be presented by the network leader, but over time others in the network will be welcome to offer sessions – especially if the topics are of a general nature aimed at research training/capacity building (rather than a research seminar on their latest paper). These webinars will be open to all, irrespective of network membership – including members, their students, their research collaborators and any other invited researchers in their networks.

The inaugural network webinar will broadly address the core theme of “responsible science”, and this material will serve as a beacon guiding all network activities. Subsequent webinar topics might include the following modules:

– A Basic Primer on Pitching Research.

– Using Pitching Research as a Reverse Engineering Tool.

– Advanced Guidelines on applying the Pitching Research Framework.

– Pitching Research for Engagement & Impact.

– Pitching Research as a Tool for Responsible Science.

– Pitching Research as a Tool for Replications.

– Pitching Research as a Tool for Pre-registration.

– Pitching Research for Diagnostic use in Writing.

– Roundtable Panels – e.g., discussing issues related to “responsible science”, etc.

Phase 3: Longer-term, post-COVID network initiatives.

Downstream network initiatives will include the creation of a “one-stop shop” network website. And, once COVID is behind us, we will explore some in-person events like:

– a conference or symposium.

– “shark tank” event(s), either themed on “pitching research for finding collaborators” or “pitching research to journal editors”.

– initiatives/ special projects/ network events suggested and/or co-ordinated by network members.

When will InSPiR2eS content activity begin?

Release of the inaugural edition of the e-Newsletter will be a signature activity, and we are aiming for this to be ready later in May 2021. Zoom webinars will also start soon – we are aiming for a network opening event in June 2021. Please keep an eye out for publicity on this event.

How do I join the InSPiR2eS research network?

If you are interested in joining the InSPiR2eS research network and engaging in its upcoming rich program of webinars, workshops and research resources, then register at the following Google Docs link (click here). In locations where Google is problematic, click here.

Robert Faff is Professor of Finance at Bond University. He is Network Convenor & President of InSPiR2eS. Professor Faff can be contacted at rfaff@bond.edu.au.

—————————————————————-

[1] In part, the idea of the network itself is inspired by the community for Responsible Research in Business and Management that released a position paper in 2017, in which they outline a vision for the year 2030 “… of a future in which business schools and scholars worldwide have successfully transformed their research toward responsible science, producing useful and credible knowledge that addresses problems important to business and society.”


COS wants YOU to collaborate with them!

The Center for Open Science (COS), as part of Phase 2 of the SCORE project, is looking for researchers interested in collaborating on replication and reproduction projects. In a nutshell, COS has identified a number of articles across a wide variety of disciplines: management, economics, finance, psychology, political science, sociology, and many more. To see the kinds of replication/reproducibility projects COS is willing to fund, see here.

For the time being, COS is only recruiting for projects that require local IRB/ethics review and approval. Human Subjects Research (HSR) projects can be funded up to $10,000. However, your institution is required to have a Federalwide Assurance (FWA) in order for you to receive funding for HSR projects. More information about the FWA is available here, and you can check whether your institution has an active FWA here. All projects need to be completed by November 2021. If you’re interested in participating, fill out the Interest Form here. COS will be in touch with more information regarding projects that are available for collaboration.

If you have any questions, please reach out to Nick and Olivia at scorecoordinator@cos.io.

On the Past and Present of Reproducibility and Replicability in Economics

[Excerpts are taken from the article “Reproducibility and Replicability in Economics” by Lars Vilhuber, published in Harvard Data Science Review]

“In this overview, I provide a summary description of the history and state of reproducibility and replicability in the academic field of economics.”

“The purpose of the overview is not to propose specific solutions, but rather to provide the context for the multiplicity of innovations and approaches that are currently being implemented and developed, both in economics and elsewhere.”

“In this text, we adopt the definitions of reproducibility and replicability articulated, inter alia, by Bollen et al. (2015) and in the report by NASEM (2019).”

“At the most basic level, reproducibility refers to “the ability […] to duplicate the results of a prior study using the same materials and procedures as were used by the original investigator.”

“Replicability, on the other hand, refers to “the ability of a researcher to duplicate the results of a prior study if the same procedures are followed but new data are collected”…, and generalizability refers to the extension of the scientific findings to other populations, contexts, and time frames.”

“Much of economics was premised on the use of statistics generated by national statistical agencies as they emerged in the late 19th and early 20th century…Economists were requesting access for research purposes to government microdata through various committees at least as far back as 1959 (Kraus, 2013).”

“Whether using private-sector data, school-district data, or government administrative records, from the United States and other countries, the use of these data for innovative research has been increasing in recent years. In 1960, 76% of empirical AER articles used public-use data. By 2010, 60% used administrative data, presumably none of which is public use.”

“In economics, complaints about the inability to properly conduct reproducibility studies, or about the absence of any attempt to do so by editors, referees, and authors, can be traced back to comments and replies in the 1970s.”

“In the early 2000s, as in other sciences (National Research Council, 2003), journals started to implement ‘data’ or ‘data availability’ policies. Typically, they required that data and code be submitted to the journal, for publication as ‘supplementary materials.’”

“Journals in economics that have introduced data deposit policies tend to be higher ranked…None of the journals…request that the data be provided before or during the refereeing process, nor does a review of the data or code enter the editorial decision, in contrast to other domains (Stodden et al., 2013). All make provision of data and code a condition of publication, unless an exemption for data provision is requested.”

“More recently, economics journals have increased the intensity of enforcement of their policies. Historically being mainly focused on basic compliance, associations that publish journals …have appointed staff dedicated to enforcing various aspects of their data and code availability policies…The enforcement varies across journals, and may include editorial monitoring of the contents of the supplementary materials, reexecution of computer code (verification of computational reproducibility), and improved archiving of data.”

“If the announcement and implementation of data deposit policies improve the availability of researchers’ code and data…, what has the impact been on overall reproducibility? Table 2B shows the reproduction rates both conditional on data availability as well as unconditionally, for a number of reproducibility studies.”

“Data that is not provided due to licensing, privacy, or commercial reasons (often incorrectly collectively referred to as ‘proprietary’ data) can still be useful in attempts at reproduction, as long as others can reasonably expect to access the data…Providers will differ in the presence of formal access policies, and this is quite important for reproducibility: only if researchers other than the original author can access the non-public data can an attempt at reproducibility even be made, if at some cost.”

“We made a best effort to classify the access to the confidential data, and the commitment by the author or third parties to provide the data if requested. For instance, a data curator with a well-defined, nonpreferential data access policy would be classified under ‘formal commitment.’…We could identify a formal commitment or process to access the data only for 35% of all nonpublic data sets.”

“One of the more difficult topics to empirically assess is the extent to which reproducibility is taught in economics, and to what extent in turn economic education is helped by reproducible data analyses. The extent of the use of replication exercises in economics classes is anecdotally high, but I am not aware of any study or survey demonstrating this.”

“More recently, explicit training in reproducible methods (Ball & Medeiros, 2012; Berkeley Initiative for Transparency in the Social Sciences, 2015), and participation of economists in data science programs with reproducible methods has increased substantially, but again, no formal and systematic survey has been conducted.”

“Because most reproducibility studies of individual articles ‘only’ confirm existing results, they fail the ‘novelty test’ that most editors apply to submitted articles (Galiani et al., 2017). Berry and coauthors (2017) analyzed all papers in Volume 100 of the AER, identifying how many were referenced as part of replication or cited in follow-on work.”

“While partially confirming earlier findings that strongly cited articles will also be replicated (Hamermesh, 2007), the authors found that 60% of the original articles were referenced in replication or extension work, but only 20% appeared in explicit replications. Of the roughly 1,500 papers that cite the papers in the volume, only about 50 (3.5%) are replications, and of those, only 8 (0.5%) focused explicitly on replicating one paper.”

“Even rarer are studies that conduct replications prior to their publication, of their own volition. Antenucci et al. (2014) predict the unemployment rate from Twitter data. After having written the paper, they continued to update the statistics on their website (“Prediction of Initial Claims for Unemployment Insurance,” 2017), thus effectively replicating their paper’s results on an ongoing basis. Shortly after release of the working paper, the model started to fail. The authors posted a warning on their website in 2015, but continued to publish new data and predictions until 2017, in effect, demonstrating themselves that the originally published model did not generalize.”

“Reproducibility has certainly gained more visibility and traction since Dewald et al.’s (1986) wake-up call…Still, after 30 years, the results of reproducibility studies consistently show problems with about a third of reproduction attempts, and the increasing share of restricted access data in economic research requires new tools, procedures, and methods to enable greater visibility into the reproducibility of such studies. Incorporating consistent training in reproducibility into graduate curricula remains one of the challenges for the (near) future.”

To read the article, click here.

REED: The State of Replications in Economics – A 2020 Review (Part 3)

This final instalment on the state of replications in economics, 2020 version, continues the discussion of how to define “replication success” (see here and here for earlier instalments). It then delves further into interpreting the results of a replication. I conclude with an assessment of the potential for replications to contribute to our understanding of economic phenomena.  

How should one define “replication success”?

In their seminal article assessing the rate of replication in psychology, Open Science Collaboration (2015) employed a variety of definitions of replication success. One of their measures has come to dominate all others: obtaining a statistically significant estimate with the same sign as the original study (“SS-SS”). For example, this is the definition of replication success employed by the massive SCORE project currently being undertaken by the Center for Open Science.

The reason for the “SS-SS” definition of replication success is obvious. It can easily be applied across a wide variety of circumstances, allowing a one-size-fits-all measure of success. It melds two aspects of parameter estimation – effect size and statistical significance – into a binary measure of success. However, studies differ in the nature of their contributions. For some studies, statistical significance may be all that matters, say, when establishing the prediction of a given theory. For others, the size of the effect may be what’s important, say, when one is concerned about the effect of a tax cut on government revenues.

The following example illustrates the problem. Suppose a study reports that a 10% increase in unemployment benefits is estimated to increase unemployment duration by 5%, with a 95% confidence interval of [4%, 6%]. Consider two replication studies. Replication #1 estimates a mean effect of 2% with corresponding confidence interval of [1%, 3%]. Replication #2 estimates a mean effect of 5%, but the effect is insignificant with a corresponding confidence interval of [0%, 10%].

Did either of the two replications “successfully replicate” the original? Did both? Did neither? The answer largely depends on the motivation behind the original analysis. Was the main contribution of the original study to demonstrate that unemployment benefits affect unemployment durations? Or was the motivation primarily budgetary, so that the size of the effect was the important empirical contribution?

There is no general right or wrong answer to these questions. It is study-specific. Maybe even researcher-specific. For this reason, while I understand the desire to develop one-size-fits-all measures of success, it is not clear how to interpret these “success rates”. This is especially true when one recognizes — and as I discussed in the previous instalment to this blog — that “success rates” below 100%, even well below 100%, are totally compatible with well-functioning science.

How should we interpret the results of a replication?

The preceding discussion might give the impression that replications are not very useful. While measures of the overall “success rate” of replications may not tell us much, they can be very insightful in individual cases.

In a blog I wrote for TRN entitled “The Replication Crisis – A Single Replication Can Make a Big Difference”, I showed how a single replication can substantially impact one’s assessment of a previously published study.

Define “Prior Odds” as the Prob(Treatment is effective):Prob(Treatment is ineffective). Define the “False Positive Rate” (FPR) as the percent of statistically significant estimates in published studies for which the true underlying effect is zero; i.e., the treatment has no effect. If the prior odds of a treatment being effective are relatively low, Type I error will generate a large number of “false” significant estimates that can overwhelm the significant estimates associated with effective treatments, causing the FPR to be high. TABLE 1 below illustrates this.

The FPR values in the table range from 0.24 to 0.91. For example, given 1:10 odds that a randomly chosen treatment is effective, and assuming studies have Power equal to 0.50, the probability that a statistically significant estimate is a false positive is 50%. Alternatively, if we take a Power value of 0.20, which is approximately equal to the value that Ioannidis et al. (2017) report as the median value for empirical research in economics, the FPR rises to 71%.
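
The arithmetic behind these figures can be checked directly. The sketch below assumes a 5% significance level, which reproduces the two FPR values quoted above.

```python
# A small check of the false positive rate (FPR) arithmetic, assuming a 5% significance level.
# Prior odds are expressed as effective:ineffective.
def fpr(prior_odds_effective, power, alpha=0.05):
    p_eff = prior_odds_effective / (1 + prior_odds_effective)
    p_null = 1 - p_eff
    significant_false = alpha * p_null          # false positives among all treatments
    significant_true = power * p_eff            # true positives among all treatments
    return significant_false / (significant_false + significant_true)

print(round(fpr(1/10, power=0.50), 2))   # 0.50, as in the text
print(round(fpr(1/10, power=0.20), 2))   # 0.71, as in the text
```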

It needs to be emphasized that these high FPRs have nothing to do with publication bias or file drawer effects. They are the natural outcomes of a world of discovery in which Type I error is combined with a situation where most studied phenomena are non-existent or economically negligible.

TABLE 2 reports what happens when a researcher in this environment replicates a randomly selected significant estimate. The left column reports the researcher’s initial assessment that the finding is a false positive (as per TABLE 1). The table shows how that probability changes as a result of a successful replication.

For example, suppose the researcher thinks there is a 50% chance that a given empirical claim is a false positive (Initial FPR = 50%). The researcher then performs a replication and obtains a significant estimate. If the replication study had 50% Power, the updated FPR would fall from 50% to 9%.
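
The update itself is a straightforward application of Bayes’ rule. The sketch below, again assuming a 5% significance level, reproduces the 50%-to-9% example.

```python
# Updated probability that a finding is a false positive after a significant replication,
# assuming a 5% significance level as above.
def updated_fpr(initial_fpr, replication_power, alpha=0.05):
    p_sig_if_false = alpha * initial_fpr
    p_sig_if_true = replication_power * (1 - initial_fpr)
    return p_sig_if_false / (p_sig_if_false + p_sig_if_true)

print(round(updated_fpr(0.50, 0.50), 2))   # 0.09: the 50% -> 9% example in the text
```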

TABLE 2 demonstrates that successful replications produce substantial decreases in false positive rates across a wide range of initial FPRs and Power values. In other words, while discipline-wide measures of “success rates” may not be very informative, replications can have a powerful impact on the confidence that researchers attach to individual estimates in the literature.

Do replications have a unique role to play in contributing to our understanding of economic phenomena?

To date, replications have not had much of an effect on how economists do their business. The discipline has made great strides in encouraging transparency by requiring authors to make their data and code available. However, this greater transparency has not resulted in a meaningful increase in published replications. While there are no doubt many reasons for this, one reason may be that economists do not appreciate the unique role that replications can play in contributing to our understanding of economic phenomena.

The potential for empirical analysis to inform our understanding of the world is conditioned on the confidence researchers have in the published literature. While economists may differ in their assessment of the severity of false positives, the message of TABLE 2 is that, for virtually all values of FPRs, replications substantially impact that assessment. A successful replication lowers, often dramatically lowers, the probability that a given empirical finding is a false positive.

It is worth emphasizing that replications are uniquely positioned to make this contribution. New studies fall under the cloud of uncertainty that hangs over all original findings; namely, the rational suspicion that reported results are merely a statistical artefact. Replications, because of their focus on individual findings, are able to break through that cloud. It is hoped that economists will start to recognize the unique role that replications can play in the process of scientific discovery, and that publishing opportunities for well-done replications – and appropriate professional rewards for the researchers who do them – will follow.

Bob Reed is a professor of economics at the University of Canterbury in New Zealand. He is also co-organizer of the blogsite The Replication Network and Principal Investigator at UCMeta. He can be contacted at bob.reed@canterbury.ac.nz.

REED: The State of Replications in Economics – A 2020 Review (Part 2)

This instalment follows on yesterday’s post where I addressed two questions: Are there more replications in economics than there used to be? And, Which journals publish replications? These questions deal with the descriptive aspect of replications. We saw that replications seemingly constitute a relatively small – arguably negligible – component of the empirical output of economists. And while that component appears to be growing, it is growing at a rate that is, for all practical purposes, inconsequential. I would like to move on to more prescriptive/normative subjects.

Before I can get there, however, I need to acknowledge that the assessment above relies on a very specific definition of a replication, and that the sample of replications on which it is based is primarily drawn from one data source: Replication Wiki. Is it possible that there are a lot more replications “out there” that are not being counted? More generally, is it even physically possible to know how many replications there are?

Is it possible to know how many replications there are?

One of the most comprehensive assessments of the number of replications in economics was done in a study by Frank Mueller-Langer, Benedikt Fecher, Dietmar Harhoff, and Gert Wagner, published in Research Policy in 2019 and blogged about here. ML et al. reviewed all articles published in the top 50 economics journals between 1974 and 2014. They calculated a “replication rate” of 0.1%. That is, 0.1% of all the articles in the top 50 economics journals during this time period were replication studies.

0.1% is likely an understatement of the overall replication rate in economics, as replications are likely to be underrepresented in the top journals. With 400 mainline economics journals, each publishing an average of approximately 100 articles a year, it is a daunting task to assess the replication rate for the whole discipline.

One possibility is to scrape the internet for economics articles and use machine learning algorithms to identify replications. In unpublished work, colleagues of mine at the University of Canterbury used “convolutional neural networks” to perform this task. They compared the texts of the replication studies listed at The Replication Network (TRN) with a random sample of economics articles from RePEc.

Their final analysis produced a false negative error rate (the rate at which replications are mistakenly classified as non-replications) of 17%. The false positive rate (the rate at which non-replications are mistakenly classified as replications) was 5%.

To give a better feel for what these numbers mean, consider a scenario where the replication rate is 1%. Suppose we have a sample of 10,000 papers, of which 100 are replications. Applying the false negative and positive rates above produces the numbers in TABLE 1.

Given this sample, a researcher would identify 578 replications, of which 83 would be true replications and 495 would be “false replications”, that is, non-replication studies falsely categorized as replication studies. One would have to get the false positive rate below 1% before even half of the identified “replications” were true replications. Given a relatively low replication rate (here 1%), it is highly unlikely that machine learning will ever be accurate enough to produce reliable estimates of the overall replication rate in the discipline.
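
The counts in this scenario follow directly from the error rates; the short sketch below makes the arithmetic of the 10,000-paper example explicit.

```python
# The arithmetic behind the scenario: 10,000 papers, a 1% replication rate,
# a 17% false negative rate and a 5% false positive rate.
papers, rep_rate = 10_000, 0.01
fnr, fp_rate = 0.17, 0.05

true_reps = papers * rep_rate                        # 100 true replications
non_reps = papers - true_reps                        # 9,900 non-replications
flagged_true = true_reps * (1 - fnr)                 # 83 correctly flagged
flagged_false = non_reps * fp_rate                   # 495 falsely flagged
flagged = flagged_true + flagged_false               # 578 flagged in total
print(flagged, flagged_true / flagged)               # precision is only about 14%
```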

A final alternative is to follow the procedure of ML et al., but choose a set of 50 journals outside the top economics journals. However, as reported in yesterday’s blog, replications tend to be clustered in a relatively small number of journals. Results of replication rates would likely depend greatly on the particular sample of journals that was used.

Putting the above together, the answer to the question “Is it possible to know how many replications there are” appears to be no.

I now move on to assessing what we have learned from the replications that have been done to date. Specifically, have replications uncovered a reproducibility problem in economics?

Is there a replication crisis in economics?

The last decade has seen increasing concern that science has a reproducibility problem. So it is fair to ask, is there a replication crisis in economics? Probably the most famous study of replication rates is the study by Brian Nosek and the Open Science Collaboration (Science, 2015) that assessed the replication rate of 100 experiments in psychology. They reported an overall “successful replication rate” of 39%. Similar studies focused more on economics report higher rates (see TABLE 2).

The next section will delve a little more into the meaning of “replication success”. For now, let’s first ask, what rate of success should we expect to see if science is performing as it is supposed to? In a blog for TRN (“The Statistical Fundamentals of (Non-)Replicability”), Jeff Miller considers the case where a replication is defined to be “successful” when it reproduces a statistically significant estimate reported in a previous study (see FIGURE 1 below).

FIGURE 1 assumes 1000 studies each assess a different treatment. 10% of the treatments are effective. 90% have no effect. Statistical significance is set at 5% and all studies have statistical power of 60%. The latter implies that 60 of the 100 studies with effective treatments produce significant estimates.  The Type I error rate implies that 45 of the remaining 900 studies with ineffectual treatments also generate significant estimates. As a result, 105 significant estimates are produced from the initial set of 1000 studies.

If these 105 studies are replicated, one would expect to see approximately 38 significant estimates, leading to a replication “success rate” of 36% (see bottom right of FIGURE 1). Note that there is no publication bias here. No “file drawer effect”. Even when science works as it is supposed to, we should not expect a replication “success rate” of 100%. “Success rates” far less than 100% are perfectly consistent with well-functioning science.
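
The FIGURE 1 arithmetic can be reproduced in a few lines; the sketch below simply uses the numbers given in the text.

```python
# Expected replication "success rate" when science works as intended:
# 10% of 1,000 treatments are effective, power is 60%, significance level is 5%.
studies, p_effective, power, alpha = 1000, 0.10, 0.60, 0.05

sig_true = studies * p_effective * power             # 60 true positives
sig_false = studies * (1 - p_effective) * alpha      # 45 false positives
total_sig = sig_true + sig_false                     # 105 significant originals

expected_rep = sig_true * power + sig_false * alpha  # ~38 significant replications
print(total_sig, expected_rep, expected_rep / total_sig)   # success rate ~36%
```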

Conclusion

Replications come in many sizes, shapes, and flavors. Even if we could agree on a common definition of a replication, it would be very challenging to make discipline-level conclusions about the number of replications that get published. Given the limitations of machine learning algorithms, there is no substitute for personally assessing each article individually. With approximately 400 mainline economics journals, each publishing approximately 100 articles a year, that is a monumental, seemingly insurmountable, challenge.

Beyond the problem of defining a replication, beyond the problem of defining “replication success”, there is the further problem of interpreting “success rates”. One might think that a 36% replication success rate was an indicator that science was failing miserably. Not necessarily so.

The final instalment of this series will explore these topics further. The goal is to arrive at an overall assessment of the potential for replications to make a substantial contribution to our understanding of economic phenomena (to read the next instalment, click here).

Bob Reed is a professor of economics at the University of Canterbury in New Zealand. He is also co-organizer of the blogsite The Replication Network and Principal Investigator at UCMeta. He can be contacted at bob.reed@canterbury.ac.nz.

REED: The State of Replications in Economics – A 2020 Review (Part 1)

This post is based on a keynote presentation I gave at the Editor’s Meeting of the International Journal for Re-Views of Empirical Economics in June 2020. It loosely follows up two previous attempts to summarize the state of replications in economics: (i) An initial paper by Maren Duvendack, Richard Palmer-Jones, and myself entitled “Replications in Economics: A Progress Report”, published in Econ Journal Watch in 2015; and (ii) a blog I wrote for The Replication Network (TRN) entitled “An Update on the Progress of Replications in Economics”, posted in October 2018.

In this instalment, I address two issues:

– Are there more replications in economics than there used to be?

– Which journals publish replications?

Are there more replications in economics than there used to be?

Before we count replications, we need to know what we are counting. Researchers use different definitions of replications, which produce different numbers. For example, at the time of this writing, Replication Wiki reports 670 replications at their website. In contrast, TRN, which relies heavily on Replication Wiki, lists 491 replications.

Why the difference? TRN employs a narrower definition of a replication. Specifically, it defines a replication as “any study published in a peer-reviewed journal whose main purpose is to determine the validity of one or more empirical results from a previously published study.”

Replications come in many sizes and shapes. For example, sometimes a researcher will develop a new estimator and want to see how it compares with another estimator. Accordingly, they replicate a previous study using the new estimator. An example is De Chaisemartin & d’Haultfoeuille’s “Fuzzy differences-in-differences” (Review of Economic Studies, 2018). D&H develop a DID estimator that accounts for heterogeneous treatment effects when the rate of treatment changes over time. To see the difference it makes, they replicate Duflo (2001) which uses a standard DID estimator.

Replication Wiki counts D&H as a replication. TRN does not, because the main purpose of D&H is not to determine whether Duflo (2001) is correct, but to illustrate the difference their estimator makes. This highlights the grey area that separates replications from other studies.

Reasonable people can disagree about the “best” definition of replication. I like TRN’s definition because it restricts attention to studies whose main goal is to determine “the truth” of a claim by a previous study. Studies that meet this criterion tend to be more intensive in their analysis of the original study and give it a more thorough empirical treatment. A further benefit is that TRN has consistently applied the same definition of replication over time, facilitating time series comparisons.

FIGURE 1 shows the growth in replications in economics over time. The graph is somewhat misleading because 2019 was an exceptional year, driven by special replication issues at the Journal of Development Studies, the Journal of Development Effectiveness, and, especially, Energy Economics. In contrast, 2020 will likely end up having closer to 20 replications. Even ignoring the big blip in 2019, it is clear that there has been a general upwards creep in the number of replications published in economics over time. It is, however, a creep, and not a leap. Given that there are approximately 40,000 articles published annually in Web of Science economics journals, the increase over time does not indicate a major shift in how the economics discipline values replications.

Which journals publish replications?

TABLE 1 reports the top 10 economics journals in terms of total number of replications published over their lifetimes. Over the years, a consistent leader in the publishing of replications has been the Journal of Applied Econometrics. In second place is the American Economic Review. An important distinction between these two journals is that the JAE publishes both positive and negative replications – that is, replications that confirm the original studies as well as ones that refute them. In contrast, the AER only very rarely publishes a positive replication.

There have been several new initiatives by journals to publish replications. Notably, the International Journal for Re-Views of Empirical Economics (IREE) was started in 2017 and is solely dedicated to the publishing of replications. It is an open access journal with no author processing charges (APCs), supported by a consortium of private and public funders. As of January 2021, it had published 10 replication studies.

To place the numbers in TABLE 1 in context, there are approximately 400 mainline economics journals. About one fourth (96) have ever published a replication. Two journals account for approximately 25% of all replications ever published, and nine journals account for over half of all replication studies. Only 25 journals (about 6% of all journals) have ever published more than 5 replications in their lifetimes.

Conclusion

While a little late to the party, economists have recently made noises about the importance of replication in their discipline. Notably, the 2017 Papers and Proceedings issue of the American Economic Review prominently featured 8 articles addressing various aspects of replications in economics. And indeed, there has been an increase in the number of replications over time. However, the growth in replications is best described as an upwards creep rather than a bold leap.

Perhaps the reason replications have not really caught on is because fundamental questions about replications have not been addressed. Is there a replication crisis in economics? How should “replication success” be measured? What is the “success rate” of replications in economics? How should the results of replications be interpreted? Do replications have a unique role to play in contributing to our understanding of economic phenomena? I take these up in subsequent instalments of this blog (to read the next instalment, click here).

Bob Reed is a professor of economics at the University of Canterbury in New Zealand. He is also co-organizer of the blogsite The Replication Network and Principal Investigator at UCMeta. He can be contacted at bob.reed@canterbury.ac.nz.

TER SCHURE: Accumulation Bias – How to handle it ALL-IN

An estimated 85% of global health research investment is wasted (Chalmers and Glasziou, 2009); a total of one hundred billion US dollars in 2009, the year the estimate was made. The movement to reduce this research waste recommends that previous study results be taken into account when prioritising, designing and interpreting new research (Chalmers et al., 2014; Lund et al., 2016). Yet any recommendation to increase efficiency this way requires that researchers evaluate whether the studies already available are sufficient to complete the research effort; whether a new study is necessary or wasteful. These decisions are essentially stopping rules – or rather noisy accumulation processes, when no rules are enforced – and they are unaccounted for in standard meta-analysis. Hence reducing waste invalidates the assumptions underlying many typical statistical procedures.

Ter Schure and Grünwald (2019) detail all the possible ways in which the size of a study series up for meta-analysis, or the timing of the meta-analysis, might be driven by the results within those studies. Any such dependency introduces accumulation bias. Unfortunately, it is often impossible to fully characterize the processes at play in retrospective meta-analysis, so the bias cannot be accounted for. In this blog we revisit an example accumulation bias process – one of many that can influence a single meta-analysis – and use it to illustrate the following key points:

– Standard meta-analysis does not take into account that researchers decide on new studies based on other study results already available. These decisions introduce accumulation bias because the analysis assumes that the size of the study series is unrelated to the studies within; it essentially conditions on the number of studies available.

– Accumulation bias does not result from questionable research practices, such as publication bias from file-drawering a selection of results. The decision to replicate only some studies instead of all of them biases the sampling distribution of study series, but can be a very efficient approach to set priorities in research and reduce research waste.

– ALL-IN meta-analysis stands for Anytime, Live and Leading INterim meta-analysis. It can handle accumulation bias because it does not require a set number of studies, but performs analysis on a growing series – starting from a single study and accumulating as many studies as needed.

– ALL-IN meta-analysis also allows for continuous monitoring of the evidence as new studies arrive, even as new interim results arrive. Any decision to start, stop or expand studies is possible, while keeping valid inference and type-I error control intact. Such decisions can be strategic: increasing the value of new studies, and reducing research waste.

Our example: extreme Gold Rush accumulation bias

We imagine a world in which a series of studies is meta-analyzed as soon as three studies become available. Many topics receive an initial study, but the research field is very selective with its replications: only a significant result in the right direction warrants a replication. We call this the Gold Rush scenario, because after each finding of a positive significant result – the gold in science – some research group rushes into a replication, but as soon as a study disappoints, the research effort is terminated and no-one bothers to ever try again. This scenario was first proposed by Ellis and Stewart (2009) and formulated in detail and under this name by Ter Schure and Grünwald (2019). Here we consider the most extreme version of the Gold Rush, in which finding a significant positive result not only makes a replication more probable but inevitable: the dependency of replications on their predecessor’s result is deterministic.

Biased Gold Rush sampling

We denote the number of studies available on a certain topic by t. This number t can also indicate the timing of a meta-analysis, such that a meta-analysis can occur at any number of studies t = 1, 2, 3, . . . up to some maximum number of studies T. This notation follows Ter Schure and Grünwald (2019); the Technical Details at the end of this post make the notation more explicit.

We summarize the results of individual studies into a single per-study Z-score (z1 for the first study, z2 for the second, etc.), such that we have the following information on a series of size t: z1, z2, . . . , zt. We distinguish between Z-scores that are significant and in the right direction, and Z-scores that are not. A first significant positive study is indicated by z1 = z1* (z1* > zα, with zα = 1.96 for α = 2.5%); a first nonsignificant or negative study is indicated by z1 = z̄1 (z̄1 ≤ zα). We use the same notation for the second and third study and limit our world to three studies (our maximum T = 3). After all, in this world we only meta-analyze those topics that have spurred a series of three studies. Our Gold Rush world consists of the following possible study series:

Gold Rush world

Here A(t) denotes whether we accumulate and analyze the t studies: It can be that A(2) = 0 and A(3) = 0 because we are stuck at one study, but also A(1) = 0 because we don’t “meta-analyze” that single study. It can only be that A(2) = 1 if we accumulate and meta-analyze a two-study series and A(3) = 1 if we accumulate and meta-analyze a three-study series. In our Gold Rush world a very specific subset of studies accumulate into a three-study series such that they are meta-analyzed (A(3) = 1).

z(3) denotes the Z-score of a fixed-effects meta-analysis. This meta-analysis Z-score is simply a re-normalized average and can, assuming equal sample size and variances in all studies, be obtained from the individual study Z-scores as follows: z(3) = (1/sqrt(3)) × (z1 + z2 + z3). The effects of accumulation bias are not limited to fixed-effects meta-analysis (see for example Kulinskaya et al. (2016)), but fixed-effects meta-analysis does provide us with a simple illustration for the purposes of this blog.
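The R code figures from the original post are not reproduced in this text. As a minimal sketch, a combination function consistent with the formula above could look as follows; the name calcZmeta matches the function referenced later in the text, but this particular implementation is an assumption:

# Fixed-effects meta-analysis Z-score for a vector of per-study Z-scores,
# assuming equal sample sizes and variances across studies
calcZmeta <- function(z) sum(z) / sqrt(length(z))

calcZmeta(c(2.1, 0.4, 1.3))  # combined z(3)-score of a hypothetical three-study series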

We observe in our Gold Rush world above that the study series that are eventually meta-analyzed into a Z-score z(3) are a very biased subset of all possible study series. So we expect these z(3) scores to be biased as well. In the next section, we simulate the sampling distribution of these z(3) scores to illustrate this bias.

The conditional sampling distribution under extreme Gold Rush accumulation bias

Assume that we are in the scenario that only true null effects are studied in our Gold Rush world, such that any new study builds on a false-positive result. How large would the bias be if the three-study series are simply analyzed by standard meta-analysis? We illustrate this by simulating this Gold Rush world using the R code below.

Theoretical sampling process: A fixed-effects meta-analysis assumes that if three studies z1, z2, z3 are each sampled under the null hypothesis, each has a standard normal distribution with mean zero, and that this standard normal sampling distribution also applies to the combined z(3) score. The R code in Figure 1 illustrates this sampling process: First, a large population is simulated of possible first (Z1), second (Z2) and third (Z3) studies from a standard normal distribution. Then in Zmeta3 each index i represents a possible study series, such that c(Z1[i], Z2[i], Z3[i]) samples an unbiased study series and calcZmeta calculates its fixed-effects meta-analysis Z-score z(3). So the large number of Z-scores in Zmeta3 captures the unbiased sampling distribution that is assumed for fixed-effects meta-analysis z(3)-scores.

Gold Rush sampling process: In contrast, the code resulting in A3 selects only those study series for which A(3) = 1 under extreme Gold Rush accumulation bias. So the large number of Z-scores in Zmeta3.A3 captures a biased sampling distribution for the fixed-effects meta-analysis z(3)-scores.

Meta-analysis under Gold Rush accumulation bias: The final lines of code in Figure 1 plot two histograms of z(3) samples, one with and one without the Gold Rush A(t) accumulation bias process, based on Zmeta3.A3 and Zmeta3 respectively. Figure 2 gives the result.
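Figure 1 itself is not included in this text version of the post; a minimal sketch of such a simulation, using the variable names mentioned above but with the exact implementation being an assumption, could look like this:

set.seed(1)
numSim <- 10^6             # number of simulated study series
zalpha <- qnorm(0.975)     # 1.96, for alpha = 2.5% one-sided

# possible first (Z1), second (Z2) and third (Z3) studies, all under the null
Z1 <- rnorm(numSim)
Z2 <- rnorm(numSim)
Z3 <- rnorm(numSim)

# unbiased sampling distribution of the three-study meta-analysis Z-score
# (equivalent to applying calcZmeta to each series c(Z1[i], Z2[i], Z3[i]))
Zmeta3 <- (Z1 + Z2 + Z3) / sqrt(3)

# extreme Gold Rush: a series only accumulates three studies (A(3) = 1)
# if both the first and the second study were significant and positive
A3 <- Z1 > zalpha & Z2 > zalpha
Zmeta3.A3 <- Zmeta3[A3]

# histograms with and without the Gold Rush accumulation bias process
hist(Zmeta3, freq = FALSE, col = rgb(1, 0, 0.5, 0.4), ylim = c(0, 0.7),
     xlab = "z(3)", main = "Three-study meta-analysis Z-scores")
hist(Zmeta3.A3, freq = FALSE, col = rgb(0, 0, 1, 0.4), add = TRUE)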

We observe in Figure 2 that the theoretical sampling process, resulting in the pink histogram, gives a distribution for the three-study meta-analysis z(3)-scores that is centered around zero. Under the Gold Rush sampling process, however, our three-study z(3)-scores do not behave like this theoretical distribution at all. The blue histogram has a smaller variance and is shifted to the right – representing the bias.

We conclude that we should not use conventional meta-analysis techniques to analyze our study series under Gold Rush accumulation bias: Conventional fixed-effects meta-analysis assumes that any three-study summary statistic Z(3) is sampled from the pink distribution in Figure 2 under the null hypothesis, such that the meta-analysis is significant for Z(3)-scores larger than zα = 1.96 for a right-sided test with type-I error control α = 2.5%. Yet the actual blue sampling distribution under this accumulation bias process shows that a much larger fraction of series that accumulate three studies will have Z(3)-scores larger than 1.96 than is assumed by the theory of random sampling. This (extremely) inflated proportion of type-I errors is 88% instead of 2.5% in our extreme Gold Rush, and can be obtained from our simulation by the code in Figure 3.
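Figure 3 is not reproduced here either; under the sketch above, this inflated proportion could be obtained with something like:

mean(Zmeta3.A3 > zalpha)   # about 0.88 under extreme Gold Rush accumulation bias
mean(Zmeta3 > zalpha)      # about 0.025 under unbiased sampling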

Accumulation bias can be efficient

The steps in the code from Figure 1 that arrive at the biased distribution in Figure 2 illustrate that accumulation bias is in fact a selection bias. Nevertheless, accumulation bias does not result from questionable research practices, such as publication bias from file-drawering a selection of results. The selection to replicate only some studies instead of all of them biases the sampling distribution of study series, but can be a very efficient approach to set priorities in research and reduce research waste.

By inspecting our Gold Rush world a bit closer, we observe that a fixed-effects meta-analysis of three studies actually conditions on this number of studies (it requires A(3) = 1), and that this conditional nature is what is driving the accumulation bias; in technical details subsection A.3 we show this explicitly. In the next section we take the unconditional view.

The unconditional sampling distribution under extreme Gold Rush accumulation bias

We first adapt our Gold Rush accumulation bias world a bit, and not only meta-analyze three-study series but one-study “series” and two-study series as well. All possible scenarios for study series in this “all-series-size” Gold Rush world are illustrated below. We assume that we only meta-analyze series in a terminated state, and therefore first await a replication for significant studies before performing the meta-analysis. So a single-study “meta-analysis” can only consist of a negative or nonsignificant initial study (z̄1); only in that case are we in a terminated state with A(1) = 1, and the series does not grow to two (A(2) = 0). In a two-study meta-analysis the series starts with a significant positive initial study and is replicated by a nonsignificant or negative one; only in that case A(2) = 1, and the series does not grow to three (A(3) = 0). And only three-study series that start with two significant positive studies are meta-analyzed in a three-study synthesis; only in that case A(3) = 1.

Gold Rush world; all-series-size

The R code in Figure 4 calculates the fixed-effects meta-analysis z(1), z(2) and z(3) scores, conditional on meta-analyzing a one-study, two-study, or three-study series in this adjusted Gold Rush accumulation bias scenario. The histograms of these conditional z(t) scores are shown in Figure 5, including the theoretical unbiased z(3) histogram that was also shown in Figure 2 and largely overlaps with the “A(1) = 1, A(2) = 0” scenario. The difference between these two sampling distributions is only visible in their right tail, with the green histogram excluding values larger than zα = 1.96 and redistributing their mass over other values.
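As before, the code of Figure 4 is not included here. A sketch of the conditional z(t)-scores in this all-series-size world, reusing Z1, Z2, Z3 and zalpha from the earlier sketch (names and details are assumptions), could be:

A1 <- Z1 <= zalpha                 # terminated after one study: A(1) = 1
A2 <- Z1 > zalpha & Z2 <= zalpha   # terminated after two studies: A(2) = 1
A3 <- Z1 > zalpha & Z2 > zalpha    # accumulates a three-study series: A(3) = 1

Zmeta1.A1 <- Z1[A1]
Zmeta2.A2 <- ((Z1 + Z2) / sqrt(2))[A2]
Zmeta3.A3 <- ((Z1 + Z2 + Z3) / sqrt(3))[A3]
# Figure 5-style histograms can then be drawn from these three conditional samples.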

Figure 5 clarifies that single studies are hardly biased in this extreme Gold Rush scenario, that the bias is problematic for two-study series, and that it is most extreme for three-study series.

However, what this plot does not show us is how often we are in the one-study, two-study and three-study case.

To illustrate the relative frequencies of one-study, two-study and three-study meta-analyses, the code in Figure 6 samples the series in their respective numbers, instead of in equal numbers (which is what the size = numSim.3series statement in Figure 4 does when creating the data frame). Plotting the total number of sampled Z-scores is risky for the single-study z(1)-scores, however, since there are so many of them (it can crash your RStudio session). So before plotting the histogram, a smaller sample (of size = 3*numSim.3series in total) is drawn that keeps the ratios between z(1)s, z(2)s and z(3)s intact.
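A sketch of such an unconditional plot, pooling the conditional samples from the previous sketch in their actual frequencies and then subsampling to keep the plot manageable (again an assumption about the original code):

numSim.3series <- length(Zmeta3.A3)
allZ <- c(Zmeta1.A1, Zmeta2.A2, Zmeta3.A3)     # z(t)-scores in their raw counts
# a smaller subsample that keeps the ratios between z(1)s, z(2)s and z(3)s intact
smallSample <- sample(allZ, size = 3 * numSim.3series)
hist(smallSample, freq = FALSE, xlab = "z(t)",
     main = "Unconditional sampling distribution of meta-analysis Z-scores")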

The histogram in Figure 7 illustrates an unconditional distribution by the raw counts of the z(t)-scores: many result from a single study, very few from a two-study series and almost none from a three-study series. In fact, this unconditional sampling distribution is hardly biased, as we will illustrate with our table further below.

We first introduce an example of an ALL-IN meta-analysis to argue that such an unconditional approach can in fact be very efficient.

ALL-IN meta-analysis

Figure 8 shows an example of an ALL-IN meta-analysis. Each of the red/orange/yellow lines represents one of the ten separate studies in as many different countries. The blue line indicates the meta-analysis synthesis of the evidence; a live account of the evidence so far in the underlying studies. In fact, ALL-IN meta-analysis stands for Anytime, Live and Leading INterim meta-analysis, in which the Anytime Live property assures valid inference under continuous monitoring and the Leading property allows the meta-analysis results to inform whether individual studies should be stopped or expanded. It is important to note that such data-driven decisions would invalidate conventional meta-analysis by introducing accumulation bias.

To interpret Figure 8, we observe that initially only the Dutch (NL) study contributes to the meta-analysis and the blue line completely overlaps with the light yellow one. Very quickly, the Australian (AU) study also starts contributing and the blue meta-analysis line captures a synthesis of the evidence in two studies. Later on, the studies in the US, France (FR) and Uruguay (UY) also start contributing and the meta-analysis becomes a three-study, four-study and five-study meta-analysis. How many studies contribute to the analysis, however, does not matter for its evidential value.

Some studies (like the Australian one) are much larger than others, such that under a lucky scenario this study could reach the evidential threshold even before other studies start observing data. This threshold (indicated at 400) controls type-I errors at a rate of α = 1/400 = 0.0025 (details in the final section). So in repeated sampling under the null, the combined studies will only have a probability smaller than 0.25% of crossing this threshold. In this repeated sampling the size of the study series is essentially random: we can be lucky and observe very convincing data in the early studies, making more studies superfluous, or we can be unlucky and in need of more studies. The threshold can be reached with a single study, with a two-study meta-analysis, with a three-study meta-analysis, and so on, and the repeated sampling properties, like type-I error control, hold on average over all those sampling scenarios (so unconditional on the series size).

ALL-IN meta-analysis allows for meta-analyses with type-I error control, while completely avoiding the effects of accumulation bias and multiple testing. This is possible for two reasons: (1) we do not just perform meta-analyses on study series that have reached a certain size, but continuously monitor study series irrespective of the current number of studies in the series; (2) we use likelihood ratios (and their cousins, e-values (Grünwald et al., 2019)) instead of raw Z-scores and p-values; we say more on likelihood ratios further below.

Accumulation bias from ALL-IN meta-analysis vs Gold Rush

The ALL-IN meta-analysis in Figure 8 illustrates an improved efficiency obtained by not setting the number of studies in advance, but letting it depend on the data and be – just like the data itself – essentially random before the start of the research effort. This introduces dependencies between study results and series size that can be expressed in similar ways as Gold Rush accumulation bias. Yet this field of studies might make decisions differently from our Gold Rush: a positive nonsignificant result might not terminate the research effort, but encourage extra studies. And instead of always encouraging extra studies, a very convincing series of significant studies might conclude the research effort. If a series of studies depends on any such data-driven decisions, the use of conventional statistical methods is inappropriate. These dependencies do not have to be extreme at all: many fields of research might be a bit like the Gold Rush scenario in their response to finding significant negative results of harm. A widely known study result that indicated significant harm might make it very unlikely that the series will continue to grow. So large study series will very rarely have a completely symmetric sampling distribution, since initial studies that observe results of significant harm do not grow into large series. Hence this small aspect of accumulation bias already invalidates conventional meta-analysis, which assumes such symmetric distributions under the null hypothesis with equal mass on significant effects of harm and benefit.

Properties averaged over time

Accumulation bias can already result from simply excluding results of significant harm from replication. This exclusion also takes place under extreme Gold Rush accumulation bias, since results of significant harm as well as all nonsignificant results are not replicated. Fortunately, any such scenarios can be handled by taking an unconditional approach to meta-analysis. We will now give an intuition for why this is true in the case of our extreme Gold Rush scenario: initial studies have a bias that balances the bias in larger study series, when averaged over series size and analyzed in a certain way.

Table 1 is inspired by Senn (2014) (different question, similar answer) and represents our extreme Gold Rush world of study series. It takes the same approach as Figure 7 and indicates the probability of meta-analyzing a one-study, two-study or three-study series of each possible form under the null hypothesis. The three-study series are very biased, with two or even three out of three studies showing a positive significant effect. But the P0 column shows that the probability of being in this scenario is very small under the null hypothesis, as was also apparent from Figure 7. In fact, most analyses will be of the one-study kind, which hardly have any bias and are even shifted slightly to the left of the theoretical standard null distribution. Exactly this phenomenon balances the biased samples of series of larger size.

A Z-score is marked by a * and colored orange (e.g. z1*) in case the individual study result is significant and positive (z1 ≥ zα, one-sided test), and by a bar (e.g. z̄1) otherwise. The column t indicates the number of studies and the column * counts the number of significant studies. The fifth and sixth columns multiply P0 with the * column and the t column to arrive at the expected values E0[*] and E0[t] respectively in the bottom row.

The bottom row of Table 1 gives the expected value for the number of significant studies per series in the *·P0 column, and the expected value for the total number of studies per series in the t·P0 column. If we use these expressions to obtain the ratio of the expected number of significant studies to the expected total number of studies, we get the following:
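The expression referred to here is shown as an image in the original post. Assuming the row probabilities of Table 1 under the null are (1 − α), α(1 − α), α²(1 − α) and α³ (which follow from the extreme Gold Rush description), the calculation works out as:

E0[*] = 0 × (1 − α) + 1 × α(1 − α) + 2 × α²(1 − α) + 3 × α³ = α + α² + α³
E0[t] = 1 × (1 − α) + 2 × α(1 − α) + 3 × α²(1 − α) + 3 × α³ = 1 + α + α²
E0[*] / E0[t] = α(1 + α + α²) / (1 + α + α²) = α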

The proportion of expected significant effects to expected series size is still α in Table 1 under extreme Gold Rush accumulation bias, as it would also be without accumulation bias.

This result is driven by the fact that there is a martingale process underlying this table. If a statistic is a martingale process and it has a certain value after t studies, the conditional expected value of the statistic after t + 1 studies, given all the past data, is equal to its value after t studies. So if our proportion of significant positive studies is exactly α for the first study (t = 1), we expect to also observe a proportion α if we grow our series with an additional study (t = 1 + 1 = 2). Accumulation bias does not affect such statistics when they are averaged over time if martingales are involved (Doob’s optional stopping theorem for martingales). You can verify this aspect by deleting the last row for z1*, z2*, z3* from our table and adding two rows for t = 4 in its place, with z1*, z2*, z3* followed by either a fourth significant or a nonsignificant study. If you calculate the ratio of expected significant effects to expected series size, you will again arrive at α.
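A small R check of this claim, using the row probabilities assumed above (the code is illustrative, not from the original post):

alpha <- 0.025
# columns: number of significant studies, series size, probability under the null
sig  <- c(0, 1, 2, 3)
size <- c(1, 2, 3, 3)
P0   <- c(1 - alpha, alpha * (1 - alpha), alpha^2 * (1 - alpha), alpha^3)
sum(sig * P0) / sum(size * P0)      # equals alpha

# extend the table: replace the last row by two rows of series size t = 4
sig4  <- c(0, 1, 2, 3, 4)
size4 <- c(1, 2, 3, 4, 4)
P04   <- c(1 - alpha, alpha * (1 - alpha), alpha^2 * (1 - alpha),
           alpha^3 * (1 - alpha), alpha^4)
sum(sig4 * P04) / sum(size4 * P04)  # again equals alpha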

Martingale properties drive many approaches to sequential analysis, including the Sequential Probability Ratio Test (SPRT), group-sequential analysis and alpha spending. When applied to meta-analysis, any such inferences essentially average over series size, just like ALL-IN meta-analysis.

Multiple testing over time

Just having the expectation of some statistic unaffected by stopping rules is not enough to monitor data continuously, as in ALL-IN meta-analysis. We need to account for the multiple testing as well. In that respect, the approaches to sequential analysis differ: they either restrict inference to a strict stopping rule (SPRT) or set a maximum sample size (group-sequential analysis and alpha spending).

ALL-IN meta-analysis takes an approach that is different from its predecessors and is part of an upcoming field of sequential analysis for continuous monitoring with an unlimited horizon. These approaches are called Safe for optional stopping and/or continuation (Grünwald et al., 2019) or anytime-valid (Ramdas et al., 2020). The methods rely on nonnegative martingales (Ramdas et al., 2020), with their most well-known and useful member being the likelihood ratio. For a meta-analysis Z-score, a martingale process of likelihood ratios could look as follows:
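The formula itself appears as an image in the original post; a plausible form, consistent with the description that follows (a product of per-study normal likelihoods, mean zero in the denominator and some alternative mean in the numerator), would be:

LR10(t) = [φ1(z1) × φ1(z2) × ... × φ1(zt)] / [φ0(z1) × φ0(z2) × ... × φ0(zt)],

where φμ denotes the normal density with mean μ and standard deviation 1, and the alternative mean is here taken to be 1.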

The subscript 10 indicates that the denominator of the likelihood ratio is the likelihood of the Z-scores under the null hypothesis of mean zero, while the numerator is the likelihood under an alternative normal distribution with nonzero mean. The likelihood ratio becomes smaller when the data are more likely under the null hypothesis, but it can never become smaller than 0 (hence the “nonnegative” martingale). This is crucial, because a nonnegative martingale allows us to use Ville’s inequality (Ville, 1939), also called the universal bound by Royall (1997). For likelihood ratios, this means that we can set a threshold that guarantees type-I error control under any accumulation bias process and at any time, as follows:
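The threshold rule, again reconstructed here since the original display is an image, is an application of Ville’s inequality to the nonnegative martingale LR10(t):

P0( LR10(t) ≥ 1/α for some t ) ≤ α,

so rejecting the null as soon as the likelihood ratio exceeds 1/α controls the type-I error at level α, no matter when or how often we look.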

The ALL-IN meta-analysis in Figure 8 in fact is based on likelihood ratios like this, and controls the type-I error by the threshold 400 at level 1/400 = 0.25%.

The code below illustrates that likelihood ratios can also control type-I error rates under continuous monitoring when extreme Gold Rush accumulation bias is at play. Within our previous simulation, we again assume a Gold Rush world with only true null studies and very biased two-study and three-study series. The code in Figure 11 calculates likelihood ratios for the growing study series under accumulation bias. Figure 11 illustrates that still very few likelihood ratios ever grow very large.

If we set our type-I error rate α to 5% and compare our likelihood ratios to 1/α = 20, we observe that fewer than 1/20 = 5% of the study series ever achieve a value of LR10 larger than 20 (Figure 12). The simulated type-I error is in fact much smaller than 5%, since in our Gold Rush world series stop growing at three studies, yet the procedure controls type-I error even in the case that none of these series stops at three studies but all continue to grow forever.
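The code of Figures 10–12 is not reproduced here; a sketch of the same check, reusing Z1, Z2, Z3 and zalpha from the earlier simulation (the calcLR shown is an assumed implementation matching the description of a mean-1 alternative), could be:

# likelihood ratio of a normal(mean, 1) alternative vs the normal(0, 1) null
calcLR <- function(z, mean = 1) prod(dnorm(z, mean = mean) / dnorm(z, mean = 0))

# per-series likelihood ratios after one, two and three studies
# (these vectorized lines apply calcLR to every simulated series at once)
LR1 <- dnorm(Z1, 1) / dnorm(Z1, 0)
LR2 <- LR1 * dnorm(Z2, 1) / dnorm(Z2, 0)
LR3 <- LR2 * dnorm(Z3, 1) / dnorm(Z3, 0)

alpha <- 0.05
# under extreme Gold Rush bias a series only grows after a significant positive study
everAboveThreshold <- (LR1 >= 1 / alpha) |
  (Z1 > zalpha & LR2 >= 1 / alpha) |
  (Z1 > zalpha & Z2 > zalpha & LR3 >= 1 / alpha)
mean(everAboveThreshold)   # well below alpha = 5%, as the text notes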

The type-I error control is thus conservative, and we pay a small price in terms of power. That price is quite manageable, however, and can be tuned by setting the mean value of the alternative likelihood (arbitrarily set to mean = 1 in the code for calcLR of Figure 10). More on that in Grünwald et al. (2019) and the forthcoming preprint paper on ALL-IN meta-analysis that will appear on https://projects.cwi.nl/safestats/.

It is this small conservatism in controlling type-I error that allows for full flexibility: There isn’t a single accumulation bias process that could invalidate the inference. Any data-driven decision is allowed. And data-driven decisions can increase the value of new studies and reduce research waste.

Conclusion

In our imaginary world of extreme Gold Rush accumulation bias, the sampling distribution of the meta-analysis Z-score behaves very differently from the sampling distribution assumed to calculate p-values and confidence intervals. A meta-analysis p-value conditions on the available sample size – on the sample size of the studies and on the number of studies available – and represents the tail area of this conditional sampling distribution under the null based on the observed Z-statistic. Analogously, a meta-analysis confidence interval provides coverage under repeated sampling from this conditional distribution. So if this sample size is driven by the data, as in any accumulation bias process, there is a mismatch between the assumed sampling distribution of the meta-analysis Z-statistic and the actual sampling distribution.

We believe that some accumulation bias is at play in almost any retrospective meta-analysis, such that p-values and confidence intervals generally do not have their promised type-I error control and coverage. ALL-IN meta-analysis based on likelihood ratios can handle accumulation bias, even if the exact process is unknown. It also allows for continuous monitoring; multiple testing is no problem. Hence taking the ALL-IN perspective on meta-analysis will reduce research waste by allowing efficient data-driven decisions – not letting them invalidate the inference – and incorporating single studies and small study series into meta-analysis inference.

Postscript

ALL-IN meta-analysis has been applied during the corona pandemic to analyze an accumulating series of studies while they were still ongoing. Each study investigated the ability of the BCG vaccine to prevent covid-19, but data on covid cases came in only slowly (fortunately). Meta-analyzing interim results and data-driven decisions improved the possibility of finding efficacy earlier in the pandemic. A webinar on the methodology underlying this meta-analysis – the specific likelihood ratios – is available on https://projects.cwi.nl/safestats/ under the name ALL-IN-META-BCG-CORONA.

Judith ter Schure is a PhD student in the Department of Machine Learning at Centrum Wiskunde & Informatica in the Netherlands. She can be contacted at Judith.ter.Schure@cwi.nl.

Acknowledgements

My thanks go to Professor Bob Reed for inviting this contribution to his website and his patience with its publication. I also want to acknowledge Professor Peter Grünwald for checking the details. Daniel Lakens provided me with great advice to write this text more blog-like. Muriel Pérez helped me with the details of the martingale underlying the table.

References

Iain Chalmers and Paul Glasziou. Avoidable waste in the production and reporting of research evidence. The Lancet, 114(6):1341–1345, 2009.

Iain Chalmers, Michael B Bracken, Ben Djulbegovic, Silvio Garattini, Jonathan Grant, A Metin Gülmezoglu, David W Howells, John PA Ioannidis, and Sandy Oliver. How to increase value and reduce waste when research priorities are set. The Lancet, 383(9912):156–165, 2014.

Hans Lund, Klara Brunnhuber, Carsten Juhl, Karen Robinson, Marlies Leenaars, Bertil F Dorch, Gro Jamtvedt, Monica W Nortvedt, Robin Christensen, and Iain Chalmers. Towards evidence based research. BMJ, 355:i5440, 2016.

Judith ter Schure and Peter Grünwald. Accumulation Bias in meta-analysis: the need to consider time in error control [version 1; peer review: 2 approved]. F1000Research, 8:962, June 2019. ISSN 2046-1402. doi: 10.12688/f1000research.19375.1. URL https://f1000research.com/articles/8-962/v1.

Steven P Ellis and Jonathan W Stewart. Temporal dependence and bias in meta-analysis. Communications in Statistics—Theory and Methods, 38(15):2453–2462, 2009.

Elena Kulinskaya, Richard Huggins, and Samson Henry Dogo. Sequential biases in accumulating evidence. Research synthesis methods, 7(3):294–305, 2016.

Peter Grünwald, Rianne de Heide, and Wouter Koolen. Safe testing. arXiv preprint arXiv:1906.07801, 2019.

Stephen Senn. A note regarding meta-analysis of sequential trials with stopping for efficacy. Pharmaceutical Statistics, 13(6):371–375, 2014.

Aaditya Ramdas, Johannes Ruf, Martin Larsson, and Wouter Koolen. Admissible anytime-valid sequential inference must rely on nonnegative martingales. arXiv preprint arXiv:2009.03167, 2020.

Jean Ville. Etude critique de la notion de collectif. Bull. Amer. Math. Soc, 45(11):824, 1939.

Richard Royall. Statistical evidence: a likelihood paradigm, volume 71. CRC press, 1997.

Judith ter Schure, Alexander Ly, Muriel F. Pérez-Ortiz, and Peter Grünwald. Safestats and all-in meta-analysis project page. https://projects.cwi.nl/safestats/, 2020.

This blog post discusses approaches to meta-analysis that control type-I error averaged over study series size. This is called error control surviving over time in Ter Schure and Grünwald (2019), as will become more clear in the technical details.

You can find a link to these four pages of technical details here. A link to the file of R code used in this blog can be found here.