Economists are Biased. Who Would Have Guessed?!

[Excerpts taken from the blog “Ideology is Dead! Long Live Ideology!” by Mohsen Javdani and Ha-Joon Chang, posted at the website of the Institute for New Economic Thinking]
“Mainstream (neoclassical) economics has always put a strong emphasis on the positivist conception of the discipline, characterizing economists and their views as objective, unbiased, and non-ideological…the matter has never been directly subjected to empirical scrutiny. In a recent study, we do just that.”
“Using a well-known experimental “deception” technique embedded in an online survey that involves just over 2400 economists from 19 countries, we fictitiously attribute the source of 15 quotations to famous economists of different leanings. In other words, all participants received identical statements to agree or disagree with, but source attribution was randomly changed without the participants’ knowledge.”
“For example, when a statement criticizing “symbolic pseudo-mathematical methods of formalizing a system of economic analysis” is attributed to its real source, John Maynard Keynes, instead of its fictitious source, Kenneth Arrow, the agreement level among economists drops by 11.6%.”
“Similarly, when a statement criticizing intellectual monopoly (i.e. patent, copyright) is attributed to Richard Wolff, the American Marxian economist at the University of Massachusetts, Amherst, instead of its real source, David Levine, professor of economics at Washington University in St. Louis, the agreement level drops by 6.6%.”
“We believe that recognizing their own biases, especially when there exists evidence suggesting that they could operate through implicit or unconscious modes, is the first step for economists who strive to be objective and ideology-free. This is also consistent with the standard to which most economists in our study hold themselves.”
“To echo the words of Alice Rivlin in her 1987 American Economic Association presidential address, ‘economists need to be more careful to sort out, for ourselves and others, what we really know from our ideological biases.'”
To read the article, click here.

Bad News and Good News for Meta-Analyses in Economics

[Excerpts taken from the working paper “Practical Significance, Meta-Analysis and the Credibility of Economics” by Tom Stanley and Chris Doucouliagos, posted at SSRN]
“…we find that large biases and high rates of false positives will often be found by conventional meta-analysis methods. Nonetheless, the routine application of meta-regression analysis and considerations of practical significance largely restore research credibility.”
“In this study, we employ Monte Carlo simulations to investigate whether typical levels of statistical power, selective reporting, and heterogeneity found in economics research will cause meta-analysis to have notable biases and high rates of false positives; that is, claiming the presence of economic effects or phenomena that may not exist.”
“Our simulations evaluate the performance of four methods: random-effects (RE), unrestricted weighted least squares (WLS), the weighted average of the adequately powered (WAAP), and the PET-PEESE.”
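To make two of these estimators concrete, here is a minimal Python sketch (toy data, not the paper's simulations): the unrestricted WLS average is an inverse-variance-weighted mean of the reported estimates, and PET regresses the estimates on their standard errors with the same weights, reading off the intercept as the selection-corrected effect.

```python
# Sketch of two simple meta-analysis estimators (illustrative data only).

def wls_average(effects, ses):
    """Unrestricted WLS: inverse-variance-weighted mean of the estimates."""
    weights = [1.0 / se**2 for se in ses]
    return sum(w * e for w, e in zip(weights, effects)) / sum(weights)

def pet_intercept(effects, ses):
    """PET: weighted regression effect_i = b0 + b1*SE_i with weights 1/SE_i^2.
    The intercept b0 estimates the effect corrected for small-study bias."""
    w = [1.0 / se**2 for se in ses]
    sw = sum(w)
    mx = sum(wi * se for wi, se in zip(w, ses)) / sw      # weighted mean SE
    my = sum(wi * e for wi, e in zip(w, effects)) / sw    # weighted mean effect
    num = sum(wi * (se - mx) * (e - my) for wi, se, e in zip(w, ses, effects))
    den = sum(wi * (se - mx) * (se - mx) for wi, se in zip(w, ses))
    b1 = num / den
    return my - b1 * mx  # intercept b0

# Toy example: the true effect is zero, but selective reporting makes each
# reported effect proportional to its standard error.
effects = [0.02, 0.05, 0.15, 0.30, 0.45]
ses     = [0.02, 0.05, 0.15, 0.30, 0.45]
print(round(wls_average(effects, ses), 3))   # ~0.028: biased away from zero
print(round(pet_intercept(effects, ses), 3)) # 0.0: recovers the true null
```

In this contrived case the weighted average is still biased upward, while PET, by modelling the effect-SE relationship that selective reporting induces, recovers the true null.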
“Recently, Ioannidis et al. (2017) conducted a large survey of bias and statistical power among more than 64,000 reported economic effects from nearly 6,700 research papers. The average number of estimated effects reported per meta-analysis is just over 400…the typical relative heterogeneity (I2) is 93%, and the median exaggeration of reported effects is 100%…we focus on a 50% incidence of selective reporting…”
“To calibrate our simulations, we focus on the 35 meta-analyses of elasticities from Ioannidis et al. (2017) and force the distribution of SEs in the simulations to reproduce closely the distribution of SE found in these 35 reference meta-analyses.”
“…when there is the typical amount of heterogeneity…but no overall effect, the average study reports an elasticity just over 0.18…As the true elasticity gets larger, this bias decreases…but notable bias remains even when the true elasticity is 0.3. These biases are especially large at the highest levels of heterogeneity (I2 = 98%).”
“…All meta-analysis methods fail to distinguish a genuine effect from the artefact of publication bias reliably under common conditions found in economics research. The rate of false positives revealed in our simulations is a serious problem that threatens the scientific credibility and practical utility of simple meta-analysis.”
“To investigate likely departures from the random-effects constant-variance, additive heterogeneity model, we conduct alternative simulation experiments where random heterogeneity is roughly proportional to the random sampling error variance…”
“When heterogeneity is roughly proportional to SE, the simple mean and RE have even larger biases, but the biases of WLS, WAAP and PET-PEESE are much smaller and practically insignificant…”
“…When meta-analysts test for practical significance and heterogeneity is proportional to sampling errors, then false positives are no longer an issue for WLS, WAAP and PET. Unfortunately, random-effects can still have unacceptable rates of false positives even when testing for practical significance.”
“…if meta-analysts use cluster-robust standard errors when they test for practical significance (even with additive, constant-variance heterogeneity), PET has acceptable type I error rates…Note further that WLS, WAAP and PET maintain high levels of power to detect even small elasticities for areas of research which have the typical number of estimates…”
“We take the issue of false positives seriously and, therefore, recommend that systematic reviews and meta-analyses test against practical significance. Doing so largely reduces PET’s type I error rate to acceptable levels for common research conditions in economics.”
To read the paper, click here.

FIDLER & MODY: The repliCATS Bus – Where It’s Been, Where It’s Going

For several years now scientists—in at least some disciplines—have been concerned about low rates of replicability. As scientists in those fields, we worry about the development of cumulative knowledge, and about wasted research effort. An additional challenge is to consider end-users (decision and policy makers) and other consumers of our work: what level of trust should they place in the published literature[1]? How might they judge the reliability of the evidence base?
With the latter questions in mind, our research group recently launched ‘The repliCATS project’ (Collaborative Assessments for Trustworthy Science). In the first phase, we’re focussing on eliciting predictions about the likely outcomes of direct replications of 3,000 (empirical, quantitative) research claims[2] in the social and behavioural sciences[3]. A subset of these 3,000 research claims will be replicated by an independent team of researchers, to serve as an evaluation of elicited forecasts.
The repliCATS project forms part of a broader program of replication research which will cover 8 disciplines: business, criminology, economics, education, political science, psychology, public administration, and sociology. The broader program—Systematizing Confidence in Open Research and Evidence, or SCORE—is funded by the US Department of Defense (DARPA).
It is an example of end user investment in understanding and improving the reliability of our scientific evidence base. It is unique in its cross-disciplinary nature, in scale, as well as its ultimate ambition: to explore the ability to apply machine learning to rapidly assess the reliability of a published study. And as far as we know, it is the first investment (at least of its size) in replication studies and related research to come from end users of research.
The repliCATS project uses a structured group discussion—rather than a prediction market or a survey—called the IDEA protocol to elicit predictions about replicability.
Working in diverse groups of 5-10, participants first Investigate a research claim, answering three questions: (i) how comprehensible the claim is; (ii) whether the underlying effect described in the research claim seems real or robust; and (iii) how likely a direct replication is to succeed, recorded as a private estimate.
They then join their group, either in a face-to-face meeting or in a remote, online group, to Discuss. Discussions start with the sharing of private estimates, as well as the information and reasoning that went into forming those estimates. The purpose of the discussion phase is for the group to share and cross-examine one another's judgements; it is not to form a consensus.
After discussion has run its course, researchers are then invited to update their original Estimates, if they wish, providing what we refer to as a ‘round 2 forecast’. These round 2 forecasts are made privately, and not shared with other group members.
Finally, we will mathematically Aggregate these forecasts. For this project, we are trialling a variety of aggregation methods, ranging from unweighted linear averages to aggregating log odds transformed estimates (see figure below).
[Figure: aggregation methods trialled by the repliCATS project]
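The two endpoints of that aggregation range can be sketched in a few lines (illustrative only, not the project's actual code): an unweighted linear average of the probability forecasts, versus averaging on the log-odds scale and transforming back.

```python
import math

def linear_pool(probs):
    """Unweighted linear average of probability forecasts."""
    return sum(probs) / len(probs)

def log_odds_pool(probs):
    """Average forecasts on the log-odds scale, then map back to [0, 1]."""
    logits = [math.log(p / (1 - p)) for p in probs]
    mean_logit = sum(logits) / len(logits)
    return 1 / (1 + math.exp(-mean_logit))

# Round-2 forecasts from one hypothetical IDEA group
forecasts = [0.2, 0.3, 0.6, 0.8]
print(round(linear_pool(forecasts), 3))   # 0.475
print(round(log_odds_pool(forecasts), 3))
```

The log-odds pool gives relatively more weight to forecasts near 0 or 1, which is one reason such transformations are worth trialling against the simple average.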
Some previous replication projects have run prediction markets and/or surveys alongside. Over time, these have become more accurate, particularly in the case of the social science replication project (of Science and Nature papers). Our project departs from these previous efforts, not only by using a very different method of elicitation, but also in the qualitative information we gather about reasoning, information sharing, and the process of updating beliefs (following discussion).
575 claims assessed in our first local IDEA workshop 
Earlier this month, we ran our first set of large face-to-face IDEA groups, prior to the Society for Improving Psychological Science (SIPS) conference. 156 researchers joined one of 30 groups, each with a dedicated group facilitator. Over two days, those groups evaluated 575 published research claims (20-25 per group) in business, economics and psychology, making a huge contribution to our understanding of:
– those published claims themselves,
– how participants reason about replication, what information cues and heuristics they use to make such predictions, including what counter points make them change their minds, and
– the research community’s overall beliefs about the state of our published evidence base.
We’ve also started to learn about how researchers evaluate claims within their direct field of expertise versus slightly outside that scope. We don’t yet know, or necessarily expect, that there will be differences in accuracy, but there do seem to be differences in approach and subjective levels of confidence.
What happens to those predictions? How accurate were they?
The short answer is that we wait. As discussed above, the repliCATS project is part of a larger program. What happens next is that a subset of those 3,000 claims will be fully replicated by an independent research team, serving as evaluation criteria for the accuracy of our elicited predictions.
In about a year’s time, we’ll know how accurate those predictions are. (We’re hoping for at least 80% accuracy.) Our 3,000 predictions will also be used to benchmark machine learning algorithms, developed by other (again, independently funded by DARPA) research teams.
Following our first workshop, our repliCATS team now has a few thousand probabilistic predictions, and associated qualitative reasoning data to get stuck into. It’s an overwhelming amount of information, and barely one fifth of what we’ll have this time next year!
Feedback on SIPS workshop
As you’ve probably gathered, the success of our project relies heavily on attracting large numbers of participants interested in assessing research claims. So it was hugely heartening for us that the researchers who joined our SIPS workshop gave us very positive overall feedback about the experience.
They particularly enjoyed the core task of thinking about what factors or features of a study contribute to likely replicability (or not).
Early career researchers in particular also appreciated the chance to see and discuss others’ ratings and reasoning, and told us that the workshop has helped build their confidence about writing peer reviews in the future. (In fact, several of us came to the opinion that something like our IDEA protocol would make a good substitute for the current peer review process in some places!)
To hear what others thought, check out Twitter @repliCATS (click on the image below)
[Image: link to Twitter @repliCATS]
What’s next
Our next large scale face-to-face workshop will take place at the Association for Interdisciplinary Metaresearch and Open Science (AIMOS) conference in Melbourne, Australia, this November.
In the meantime, we’re ready to deploy our “repliCATS bus” (we’ll come to you, or help you run smaller-scale workshops at your institution), and we’re offering you the opportunity to join ‘remote IDEA groups’ online.
Get in touch
Here is how:
repliCATS-project@unimelb.edu.au | https://replicats.research.unimelb.edu.au | @repliCATS
Fiona Fidler is an Associate Professor at the University of Melbourne with joint appointments in the School of BioSciences and the School of Historical and Philosophical Studies (SHAPS). Fallon Mody is a postdoctoral research fellow in the Department of History and Philosophy of Science at the University of Melbourne. 
[1] Note that here we are specifically concerned with trust in ‘the published literature’ and not trust in science more broadly, or in scientists themselves. The published literature is as much created by the publishing industry as it is by scientists and other scholars.
[2] In this project, a “research claim” has a very specific meaning: it is used to describe a single major finding from a published study – for example, a journal article – as well as details of the methods and results that support this finding.
[3] In subsequent phases, we’ll be thinking about conceptual replications, generalisability, and other factors that build confidence for end users of research.

Assessing the Peer Reviewers’ Openness (PRO) Initiative from the Perspective of PRO Signatories

[Excerpts taken from the preprint “’Because it is the Right Thing to Do’: Taking Stock of the Peer Reviewers’ Openness Initiative” by Maike Dahrendorf et al., posted at PsyArXiv Preprints]
“Although the practice of publicly sharing data and code appears relatively straightforward, it is still not the norm…In order to change the status quo and accelerate the adoption of data sharing practices, Morey et al. (2016) introduced the Peer Reviewers’ Openness Initiative (PRO); researchers who sign PRO agree to provide a full review only for manuscripts that publicly share data and code, or else provide a clear reason why sharing is not possible.”
“Almost two years after PRO was launched we sought to take stock of the initiative by surveying signatories on their subjective experiences and opinions.”
“At the time of data collection, 449 researchers had signed the PRO initiative. We successfully retrieved the email addresses of 340 signatories…Compared to surveys on similar topics, the response rate of 37.65%…was relatively high…the final sample size was N = 127.”
“…about 40% of the respondents…indicated that data had been made available, about 30% reported to have received praise from colleagues…and about 25% indicated to have been able to provide a higher-quality review…In contrast, only a small fraction of respondents reported negative experiences…Commensurate with these experiences, 117 respondents indicated that they would sign the initiative again, whereas only 8 indicated that they would not…”
“…it is important to note that this survey only concerns the experiences of a relatively small and highly selective sample. Therefore, one cannot draw general conclusions about the effectiveness and reception of the PRO initiative. Such conclusions necessitate the involvement of researchers on the receiving end of PRO, namely editors and authors.”
To read more, click here.
For previous posts about PRO at TRN, check out here, here, here, and here.

Lesson Learned from Registered Reports From Somebody Who Was There at the Beginning

[Excerpts taken from “The registered reports revolution: Lessons in cultural reform” by Chris Chambers, published in Significance, a publication of the Royal Statistical Society]
“On 12 November 2012, as my train sped towards London, I received one of the most important emails of my life. The message cordially informed me that the publisher of Cortex, a scientific journal that I had recently joined as an editor, had approved our request to launch a new type of article called a registered report.”
“Having just one journal offer registered reports was never going to be enough, and so, in the months that followed, we lobbied hard for others to follow suit. In June 2013, we brought together over 80 journal editors and wrote a joint article in The Guardian. ‘[A]s a group of scientists with positions on more than 100 journal editorial boards,’ we wrote, ‘we are calling for all empirical journals in the life sciences – including those journals that we serve – to offer pre-registered articles at the earliest opportunity.'”
“Regardless of the storm it created (or perhaps because of it), the letter appeared to work as intended. Beyond the heat of the debate – and the numerous misconceptions voiced by opponents – other scientists were quietly deciding that registered reports made sense. The idea had turned a corner and the number of adopting journals steadily grew. By the end of 2013 we had three adopting journals. A year later we had seven. Through 2015, 2016, 2017 and 2018, the number of adopters rose to 21, 41, 88 and 154.”
“Today, registered reports are offered by 202 journals and rising, and across a wide range of sciences. Nearly 200 completed articles have been published, with hundreds more in the pipeline.”
“…we are seeing the first evidence of impacts, and the signs are promising: registered reports are more likely to reveal evidence that is inconsistent with the authors’ pre-specified hypotheses (a possible indicator of reduced confirmation bias); they also have more reproducible code and data than regular articles; and they are cited, on average, at or above the impact factors of the journals in which they are published.”
“Reforming science is a hideously difficult task…If someone then comes along and says, “Whoa, there! You should archive your data in a public repository. You should preregister your protocols to control bias. You should replicate that study before submitting it to Nature”, the response of the player isn’t even “No”. The response is silence.”
“The working scientist sweeps past in a blur of desperation, racing towards the next publication, or tenure, or the next grant or fellowship with a tiny success rate. Every so often, the weary explorer looks up and catches a fragment of the reform argument. ‘You’re telling me I should do X, Y, Z. But why would I? Unless someone is going to make everyone else do it too, I’m just going to become less competitive. I’m a scientist, not a sacrificial lamb.’” 
“‘Should’ arguments, like those above, fail because they offer only judgement, not solutions. If should arguments were sufficient to change behaviour then behaviour would have changed decades ago.”
“Breaking this impasse requires aligning incentives so that any reform works for the community and the individual. Registered reports achieve this by turning the pursuit of high quality science into a virtuous transaction: submit your protocol to our journal, receive a positive assessment (most likely after performing some revision based on expert reviews), and we will guarantee to publish your paper regardless of whether or not your results support your hypothesis. Work hard at designing rigorous, careful, important science and we will de-stress your life by making the results a dead currency in quality evaluation.”
“The key lesson of registered reports here is clear: do not tell scientists what they should do. Instead, tell them why what you are proposing is better than the status quo and why the new practice is both in their career interests as well as in the interests of the scientific community. Give them no reason to say, ‘This will harm my career’ or ‘My peers will disapprove’. Give them every reason to say, ‘This works for me and helps me leave a lasting positive legacy on my field’.”
To read more, click here.

Registering Your RCT with the AEA? There’s a DOI for That.

[Excerpts taken from the blog “Improving research transparency through easier, faster access to studies in the AEA RCT Registry” by Keesler Welch and James Turitto, posted at the BITSS blogsite].
“J-PAL and the American Economic Association (AEA) have been working together over the past year to create digital object identifiers, or DOIs, for each trial registered in the AEA Registry for Randomized Controlled Trials….as of August 13, 2019, the Registry has officially launched this new feature!”
“DOIs provide persistent links to web content to ensure that the content is discoverable at all times—even if its URL or location within a site changes.”
“Using DOIs for trial registration in the social sciences is still something new, and we see it as an important move forward for research transparency.”
“This will make it easier for researchers to link their study designs and their (optional) pre-analysis plans directly to their publications and published research data. This also ensures that research study designs that have been registered will remain findable in perpetuity.”
To read more, click here.

Pre-Registration: It’s a Journey

[Excerpts taken from the preprint “Preregistration Is Hard, And Worthwhile” by Brian Nosek and others, posted at PsyArXiv Preprints]
“Preregistration of studies serves at least three aims for improving the credibility and reproducibility of research findings.”
“First, preregistration of analysis plans makes clear which analyses were planned a priori and which were conducted post hoc.”
“Second, preregistration enables detection of questionable research practices such as selective outcome reporting…or Hypothesizing After the Results are Known…”
“Third, preregistration of studies can reduce the impact of publication bias—particularly the prioritization of publishing positive over negative results—by making all planned studies discoverable whether or not they are ultimately published.”
“However, preregistration is a skill that requires experience to hone…Preregistration requires research planning and it is hard, especially contingency planning. It takes practice to make design and analysis decisions in the abstract, and it takes experience to learn what contingencies are most important to anticipate.”
“This might lead researchers to shy away from preregistration for worries about imperfection. Embrace incrementalism…Having some plans is better than having no plans, and sharing those plans in advance is better than not sharing them. With experience, planning will improve and the benefits will increase for oneself and for consumers of the research.”
“There are opportunities to accelerate that skill building. Study registries such as OSF and SREE, and decision tools such as Declare Design provide structured workflows to help researchers anticipate common decisions and provide guidance for articulating those decisions.”
“Other strategies for developing these skills include: (1) refining analysis plans by simulating data to practice making the decisions; (2) splitting the “real” data into exploratory and confirmatory subsamples and preregistering the analysis after viewing the exploratory subset; (3) drafting the study methods and results section in advance to help anticipate what should be done and how you will report it; and (4) submitting the plan as a Registered Report for peer review to get expert feedback on the plan.”
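Strategy (2) can be sketched very simply; a minimal, hypothetical illustration of holding out a confirmatory subsample until the analysis plan is preregistered:

```python
import random

def split_sample(records, exploratory_fraction=0.5, seed=2019):
    """Randomly partition a dataset into an exploratory subsample (used to
    refine the analysis plan) and a confirmatory subsample (held out until
    the plan is preregistered). Seeded so the split is reproducible."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * exploratory_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical participant IDs
participants = list(range(100))
explore, confirm = split_sample(participants)
print(len(explore), len(confirm))  # 50 50
```

Fixing the random seed and archiving the split alongside the preregistration makes it verifiable that the confirmatory half was not touched while the plan was being refined.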
“Researchers can embrace the common back-and-forth between planned and unplanned (confirmatory and exploratory) research activities…The key role of preregistration is to clarify which is which.”
“Preregistration is a plan, not a prison…When deviations from the plan will improve the quality of the research, deviate from the plan.”
“Reporting deviations from the plan can be challenging…If possible, report what occurs following the original plan alongside what occurs with the deviations, and share the materials, data, and code so that others can evaluate the reported outcomes and what would have occurred with alternative approaches.”
To read more, click here.

Anybody Else Ever Make a Mistake? Asking for a Friend.

[Excerpts taken from the blog “Corrigendum: a word you may hope never to encounter” by Dorothy Bishop, published at BishopBlog]
“I have this week submitted a ‘corrigendum’ to a journal for an article published in the American Journal of Medical Genetics B (Bishop et al, 2006). It’s just a fancy word for ‘correction’, and journals use it contrastively with ‘erratum’. Basically, if the journal messes up and prints something wrong, it’s an erratum. If the author is responsible for the mistake, it’s a corrigendum.”
“I discovered the error when someone asked for the data for a meta-analysis….although I could recreate most of what was published, I had the chilling realisation that there was a problem with Table II.”
“I had data on siblings of children with autism, and in some cases there were two or three siblings in the family. … I decided to take a mean value for each family. So if there was one child, I used their score, but if there were 2 or 3, then I averaged them.”
“And here, dear Reader, is where I made a fatal mistake. I thought the simplest way to do this would be by creating a new column in my Excel spreadsheet which had the mean for each family, computing this by manually entering a formula based on the row numbers for the siblings in that family.”
“…I noticed when I opened the file that I had pasted a comment in red on the top row that said ‘DO NOT SORT THIS FILE!’. … Despite my warning message to myself, somewhere along the line, it seems that a change was made to the numbering, and this meant that a few children had been assigned to the wrong family. And that’s why table II had gremlins in it and needed correcting.”
“I thought it worth blogging about this to show how much easier my life would have been if I had been using the practices of data management and analysis that I now am starting to adopt. I also felt it does no harm to write about making mistakes, which is usually a taboo subject. I’ve argued previously that we should be open about errors, to encourage others to report them, and to demonstrate how everyone makes mistakes, even when trying hard to be accurate…”
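The failure mode Bishop describes, row-position formulas breaking when a spreadsheet is sorted, disappears if family means are computed by key rather than by row. A minimal sketch (hypothetical column names, not her actual data) of the sort-proof approach:

```python
from collections import defaultdict

def family_means(rows):
    """Average sibling scores within each family, keyed by family ID.
    Because the grouping uses IDs rather than row positions, re-sorting
    the data cannot scramble the assignment of children to families."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["family_id"]].append(row["score"])
    return {fam: sum(scores) / len(scores) for fam, scores in groups.items()}

# Hypothetical data: one singleton family, one family with two siblings
rows = [
    {"family_id": "F1", "score": 10.0},
    {"family_id": "F2", "score": 8.0},
    {"family_id": "F2", "score": 12.0},
]
print(family_means(rows))  # {'F1': 10.0, 'F2': 10.0}
```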
To read the full blog, click here.

Could Bayes Have Saved Us From the Replication Crisis?

[Excerpts are taken from the article “The Flawed Reasoning Behind the Replication Crisis” by Aubrey Clayton, published at nautil.us]
“Suppose an otherwise healthy woman in her forties notices a suspicious lump in her breast and goes in for a mammogram. The report comes back that the lump is malignant.”
“She wants to know the chance of the diagnosis being wrong. Her doctor answers that, as diagnostic tools go, these scans are very accurate. Such a scan would find nearly 100 percent of true cancers and would only misidentify a benign lump as cancer about 5 percent of the time. Therefore, the probability of this being a false positive is very low, about 1 in 20.”
“…the missing ingredient …is the prior probability for the various hypotheses.”
“For the breast cancer example, the doctor would need to consider the overall incidence rate of cancer among similar women with similar symptoms, not including the result of the mammogram. Maybe a physician would say from experience that about 99 percent of the time a similar patient finds a lump it turns out to be benign. So the low prior chance of a malignant tumor would balance the low chance of getting a false positive scan result. Here we would weigh the numbers: (0.05) * (0.99)  vs. (1) * (0.01).”
“We’d find there was about an 83 percent chance the patient doesn’t have cancer.”
“The problem, though, is the dominant mode of statistical analysis these days isn’t Bayesian. Since the 1920s, the standard approach to judging scientific theories has been significance testing, made popular by the statistician Ronald Fisher.”
“To understand what’s wrong, consider the following completely true, Fisherian summary of the facts in the breast cancer example (no false negatives, 5 percent false positive rate):”
“Suppose we scan 1 million similar women, and we tell everyone who tests positive that they have cancer. Then, among those who actually have cancer, we will be correct every single time. And among those who don’t have it, we will only be incorrect 5 percent of the time. So, overall our procedure will be incorrect less than 5 percent of the time.”
“Sounds persuasive, right? But here’s another summary of the facts, including the base rate of 1 percent:”
“Suppose we scan 1 million similar women, and we tell everyone who tests positive that they have cancer. Then we will have correctly told all 10,000 women with cancer that they have it. Of the remaining 990,000 women whose lumps were benign, we will incorrectly tell 49,500 women that they have cancer. Therefore, of the women we identify as having cancer, about 83 percent will have been incorrectly diagnosed.”
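The two summaries reconcile in a few lines of arithmetic; a sketch of Bayes' rule using the article's numbers (1 percent base rate, no false negatives, 5 percent false positive rate):

```python
def posterior_benign(base_rate_cancer, sensitivity, false_positive_rate):
    """P(benign | positive test) via Bayes' rule."""
    p_pos_and_cancer = sensitivity * base_rate_cancer
    p_pos_and_benign = false_positive_rate * (1 - base_rate_cancer)
    return p_pos_and_benign / (p_pos_and_cancer + p_pos_and_benign)

p = posterior_benign(base_rate_cancer=0.01, sensitivity=1.0,
                     false_positive_rate=0.05)
print(round(p, 3))  # 0.832: ~83% of positive diagnoses are wrong

# Same result as the frequency count: of 1,000,000 women, 10,000 have
# cancer (all flagged) and 49,500 of the 990,000 benign cases are flagged.
print(49_500 / (10_000 + 49_500))  # also ~0.832
```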
“Suppose the women who received positive test results and a presumptive diagnosis of cancer in our example were tested again by having biopsies. We would see the majority of the initial results fail to repeat, a “crisis of replication” in cancer diagnoses. That’s exactly what’s happening in science today.”
“We Bayesians have seen this coming for years. … Now, a consensus is finally beginning to emerge: Something is wrong with science that’s causing established results to fail. One proposed and long overdue remedy has been an overhaul of the use of statistics.”
“In 2015, the journal Basic and Applied Social Psychology took the drastic measure of banning the use of significance testing in all its submissions, and this March, an editorial in Nature co-signed by more than 800 authors argued for abolishing the use of statistical significance altogether.”
“Similar proposals have been tried in the past, but every time the resistance has been beaten back and significance testing has remained the standard. Maybe this time the fear of having a career’s worth of results exposed as irreproducible will provide scientists with the extra motivation they need.”
To read the article, click here.

Sad News for Happiness Studies

[Excerpts taken from the article “The Sad Truth about Happiness Scales” by Timothy Bond and Kevin Lang, forthcoming in the Journal of Political Economy]
“A large literature has attempted to establish the determinants of happiness using ordered response data from questions such as ‘Taking all things together, how would you say things are these days—would you say that you are very happy, pretty happy, or not too happy?'”
“We…reach the striking conclusion that the results from the literature are essentially uninformative about how various factors affect average happiness.”
“The basic argument is as follows. There are a large (possibly infinite) number of states of happiness that are strictly ranked. In order to calculate a group’s “mean” happiness, these states must be cardinalized, but there are an infinite number of arbitrary cardinalizations, each producing a different set of means. The ranking of the means remains the same for all cardinalizations only if the distribution of happiness states for one group first-order stochastically dominates that for the other.”
“…we do not observe the actual distribution of states. We instead observe their distribution in a small number of discrete categories…Without additional assumptions we cannot rank the average happiness of two groups if each has responses in the highest and lowest category. Using observed covariates to achieve full nonparametric identification of the latent happiness distributions would require making assumptions that happiness researchers generally claim to reject.”
“We are therefore forced to follow the standard approach and assume the latent distributions are from a common unbounded location-scale family (e.g., an ordered probit). If we do, it is (almost) impossible to get stochastic dominance, and the conclusion is therefore not robust to simple monotonic transformations of the scale.”
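The core argument can be made concrete with a toy example; a sketch (invented distributions, not the paper's data) showing that when neither group first-order stochastically dominates the other, two strictly increasing cardinalizations of the same three response categories reverse the ranking of mean happiness:

```python
def mean_happiness(dist, values):
    """Mean under a chosen cardinalization of the ordered categories
    (not too happy, pretty happy, very happy)."""
    return sum(p * v for p, v in zip(dist, values))

# Invented response distributions for two groups; neither CDF dominates:
# A's CDF is (0.2, 0.8, 1.0), B's is (0.3, 0.6, 1.0).
group_a = [0.2, 0.6, 0.2]
group_b = [0.3, 0.3, 0.4]

scale_1 = [1, 2, 3]      # the usual equal-spaced coding
scale_2 = [1, 2.9, 3]    # another strictly increasing coding

# Under scale_1, B looks happier on average; under scale_2, A does.
print(mean_happiness(group_a, scale_1), mean_happiness(group_b, scale_1))
print(mean_happiness(group_a, scale_2), mean_happiness(group_b, scale_2))
```

Because both codings respect the ordering of the categories, nothing in the data privileges one over the other, which is exactly why the authors argue the mean comparisons are uninformative without further assumptions.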
“…we outlined the conditions under which the rank order of happiness for groups can be identified using categorical data on subjective well-being. We now put these into practice for nine key results from the happiness literature: the Easterlin (1973, 1974) paradox for the United States, whether happiness is U-shaped in age, the optimal policy trade-off between inflation and unemployment, rankings of countries by happiness, whether the Moving to Opportunity program increased happiness, whether marriage increases happiness, whether children decrease happiness, the relative decline of female happiness in the United States, and whether disabilities decrease happiness.”
“Table 1 summarizes the results.”
[Table 1: identification results for the nine key findings from the happiness literature]
“None of these results are identified nonparametrically. Moreover, in the eight cases for which we can test for equality of variances under a parametric normal assumption, we reject equality. Thus we never have rank-order identification and can always reverse the standard conclusion by instead assuming a left-skewed or right-skewed lognormal.”
“…if researchers wanted to draw any conclusions from these data…they would have to argue that it is appropriate to inform policy based on one arbitrary cardinalization of happiness but not on another…”
“Researchers who wish to continue to interpret such questions more broadly need to…justify their particular cardinalization or parametric assumption…At a bare minimum, we would require a functional form assumption that survived the joint test of the parametric functional form and common reporting function across groups. Certainly calls to replace GDP with measures of national happiness are premature.”
To read the full article, click here. (NOTE: Article is behind a paywall.)