What is REECAP? And Why You Should Watch Their Webinar (19 October 2020, 10.30-12.30 CET)

The Research network on Economic Experiments for the Common Agricultural Policy (REECAP) is an informal EU-wide consortium that aims to promote economic experimental designs and behavioural analysis in the evaluation of European agricultural policies.

The network previously organised a roundtable discussion entitled “Reproducibility in Experimental Economics: Crisis or Opportunity?” (see this blog for a nice summary) as part of their annual event in Osnabrueck, Germany. The discussion acknowledged that reproducibility is a central tenet of experimental research that informs policy-relevant decisions in the environmental and agricultural spheres. However, recent studies (e.g. Camerer et al. 2018) cast doubt on the replicability of social science experiments, going as far as to say that the social sciences may be experiencing a ‘replication crisis’.

This is the backdrop to the webinar REECAP is offering on October 19, 2020 from 10.30 to 12.30 CET; details of the programme can be found here. The webinar aims to discuss replications in agricultural economics, and it is of interest to researchers who wish to learn about the ‘replication crisis’, where it comes from and what has been done to tackle it so far. This event will also provide a forum for those interested in exploring participation in REECAP’s replication project, which aims to coordinate the replication of experiments relevant for shaping agricultural policies.

These replications may then be submitted to a Special Issue in Applied Economic Perspectives and Policy – a call is forthcoming. However, this webinar is not just targeting researchers but also practitioners who may have an interest in research methodologies and who are curious to learn more about how best to improve the robustness of research findings, especially when these findings are used to inform policy relevant decisions in the context of agriculture.   

FERRARO & SHUKLA: Is a Replicability Crisis on the Horizon for Environmental and Resource Economics?

Across scientific disciplines, researchers are increasingly questioning the credibility of empirical research. This research, they argue, is rife with unobserved decisions that aim to produce publishable results rather than accurate results. In fields where the results of empirical research are used to design policies and programs, these critiques are particularly concerning as they undermine the credibility of the science on which policies and programs are based. In a paper published in the Review of Environmental Economics and Policy, we assess the prevalence of empirical research practices that could lead to a credibility crisis in the field of environmental and resource economics.

We looked at empirical environmental economics papers published between 2015 and 2018 in four top journals: The American Economic Review (AER), Environmental and Resource Economics (ERE), The Journal of the Association of Environmental and Resource Economists (JAERE), and The Journal of Environmental Economics and Management (JEEM). From 307 publications, we collected more than 21,000 test statistics to construct our dataset. We reported four key findings:

1. Underpowered Study Designs and Exaggerated Effect Sizes

As has been observed in other fields, the empirical designs used by environmental and resource economists are statistically underpowered, which implies that the magnitude and sign of the effects reported in their publications are unreliable. The conventional target for adequate statistical power in many fields of science is 80%. We estimated that, in environmental and resource economics, the median power of study designs is 33%, with power less than 80% for nearly two out of three estimated parameters. When studies are underpowered and when scientific journals are more likely to publish results that pass conventional tests of statistical significance – tests that can only be passed in underpowered designs when the estimated effect is much larger than the true effect size – these journals will tend to publish exaggerated effect sizes. We estimated that 56% of the reported effect sizes in the environmental and resource economics literature are exaggerated by a factor of two or more; 35% are exaggerated by a factor of four or more.
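
The link between low power and exaggerated estimates can be illustrated with a short calculation. The sketch below is not the procedure used in the paper; it assumes a single normally distributed estimate with a known standard error and a two-sided test at the 5% level, and the function names are made up for illustration. It computes the power of such a test and the expected size of a statistically significant estimate relative to the true effect.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def power(effect_in_se, alpha=0.05):
    """Probability that a two-sided z-test rejects when the true effect
    lies `effect_in_se` standard errors away from zero."""
    crit = STD_NORMAL.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    upper = 1 - STD_NORMAL.cdf(crit - effect_in_se)  # P(estimate > crit)
    lower = STD_NORMAL.cdf(-crit - effect_in_se)     # P(estimate < -crit)
    return upper + lower


def exaggeration_ratio(effect_in_se, alpha=0.05):
    """Expected |estimate| among statistically significant estimates,
    divided by the true effect (both measured in standard errors)."""
    if effect_in_se <= 0:
        raise ValueError("effect_in_se must be positive")
    crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    d = effect_in_se
    upper = 1 - STD_NORMAL.cdf(crit - d)
    lower = STD_NORMAL.cdf(-crit - d)
    # E[|estimate|; significant] for estimate ~ N(d, 1)
    expected_abs = (d * (upper - lower)
                    + STD_NORMAL.pdf(crit - d)
                    + STD_NORMAL.pdf(crit + d))
    return expected_abs / (upper + lower) / d


# Illustrative true effects, in standard-error units (hypothetical values).
for d in (0.5, 1.0, 1.5, 2.0, 2.8):
    print(f"effect = {d:.1f} SE  power = {power(d):.2f}  "
          f"exaggeration = {exaggeration_ratio(d):.2f}")
```

Under these assumptions, the smaller the true effect relative to its standard error, the lower the power and the larger the average overstatement among the estimates that clear the significance bar.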

2. Selective Reporting of Statistical Significance or “p-hacking”

Researchers face strong professional incentives to report statistically significant results, which may lead them to selectively report results from their analyses. One indicator of selective reporting is an unusual pattern in the distribution of test statistics; specifically, a double-humped distribution around conventionally accepted values of statistical significance. In the figure below, we present the distribution of test statistics for the estimates in our sample, where 1.96 is the conventional value for statistical significance (p<0.05). The unusual dip just before 1.96 is consistent with selective reporting of results that are above the conventionally accepted level of statistical significance.
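
One simple way to look for such a pattern is a "caliper" comparison: count the test statistics that fall in a narrow band just below 1.96 and in an equally narrow band just above it, then ask how surprising the imbalance would be if a statistic were equally likely to land on either side. This is a minimal sketch of that idea and not necessarily the analysis behind the figure; the band width of 0.2, the function names, and the sample values are all assumptions chosen for illustration.

```python
from math import comb


def caliper_counts(z_stats, threshold=1.96, width=0.2):
    """Count |z| values in the bands just below and just above the threshold."""
    below = sum(1 for z in z_stats if threshold - width <= abs(z) < threshold)
    above = sum(1 for z in z_stats if threshold <= abs(z) < threshold + width)
    return below, above


def binomial_upper_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, k) * p ** k * (1 - p) ** (trials - k)
               for k in range(successes, trials + 1))


# Hypothetical test statistics, for illustration only.
z_values = [1.80, 1.85, 1.97, 1.99, 2.05, 2.10, 2.14, 0.50, 3.00]
n_below, n_above = caliper_counts(z_values)
p_value = binomial_upper_tail(n_above, n_below + n_above)
print(n_below, n_above, round(p_value, 3))
```

With a real dataset of thousands of test statistics, a large excess just above the threshold would yield a small tail probability; the toy sample here is far too small to show anything.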

3. Multiple Comparisons and False Discoveries

Repeatedly testing the same data set in multiple ways increases the probability of making false (spurious) discoveries, a statistical issue that is often called the “multiple comparisons problem.” To mitigate the probability of false discoveries when testing more than one related hypothesis, researchers can adopt a range of approaches. For example, they can ensure the false discovery rate is no larger than a pre-specified level. These approaches, however, are rare in the environmental and resource economics literature: 63% of the studies in our sample conducted multiple hypothesis tests, but less than 2% of them used an accepted approach to mitigate the multiple comparisons problem.
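
As an illustration of one accepted approach, the sketch below implements the Benjamini-Hochberg step-up procedure, which keeps the expected false discovery rate at or below a chosen level q when the tests are independent. The p-values are hypothetical and the function name is ours; this is an example of the general technique, not code from the paper.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking which hypotheses are rejected
    by the Benjamini-Hochberg procedure at false discovery rate q."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * q.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    # Reject every hypothesis whose rank is at most k_max.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected


# Ten hypothetical p-values from tests run on the same dataset.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
naive = sum(p < 0.05 for p in p_vals)
adjusted = sum(benjamini_hochberg(p_vals, q=0.05))
print(f"significant at p < 0.05: {naive}; after BH adjustment: {adjusted}")
```

In this toy example, five results pass the unadjusted 5% cut-off but only two survive the adjustment, which is the kind of gap that goes unreported when no correction is applied.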

4. Questionable Research Practices (QRPs)

To better understand empirical research practices in the field of environmental and resource economics, we also conducted a survey of members of the Association of Environmental and Resource Economists (AERE) and the European Association of Environmental and Resource Economists (EAERE). In the survey, we asked respondents to self-report whether they had engaged in research practices that other scholars have labeled “questionable”. These QRPs include selectively reporting only a subset of dependent variables or analyses conducted, hypothesizing after results are known (also called HARKing), and choosing regressors or re-categorizing data after looking at the results. Although one might assume that respondents would be unlikely to self-report engaging in such practices, 92% admitted to engaging in at least one QRP.

Recommendations for Averting a Replication Crisis

To help improve the credibility of the environmental and resource economics literature, we recommended changes to the current incentive structures for researchers.

– Editors, funders, and peer reviewers should emphasize the designs and research questions more than results, abolish conventional statistical significance cut-offs, and encourage the reporting of statistical power for different effect sizes.

– Authors should distinguish between exploratory and confirmatory analyses, and reviewers should avoid punishing authors for exploratory analyses that yield hypotheses that cannot be tested with the available data.

– Authors should be required to be transparent by uploading to publicly-accessible, online repositories the datasets and code files that reproduce the manuscript’s results, as well as results that may have been generated but not reported in the manuscript because of space constraints or other reasons. Authors should be encouraged to report everything, and reviewers should avoid punishing them for transparency.

– To ensure their discipline is self-correcting, environmental and resource economists should foster a culture of open, constructive criticism and commentary. For example, journals should encourage the publication of comments on recent papers. In a flagship field journal, JAERE, we could find no published comments in the last five years.

– Journals should encourage and reward pre-registration of hypotheses and methodology, not just for experiments, but also for observational studies for which pre-registrations are rare. We acknowledge in our article that pre-registration is no panacea for eliminating QRPs, but we also note that, in other fields, it has been shown to greatly reduce the frequency of large, statistically significant effect estimates in the “predicted” direction.

– Journals should also encourage and reward replications of influential, innovative, or controversial empirical studies. To incentivize such replications, we recommend that editors agree to review a replication proposal as a pre-registered report and, if satisfactory, agree to publish the final article regardless of whether it confirms, qualifies, or contradicts the original study.

Ultimately, however, we will continue to rely on researchers to self-monitor their decisions concerning data preparation, analysis, and reporting. To make that self-monitoring more effective, greater awareness of good and bad research practices is critical. We hope that our publication contributes to that greater awareness.

Paul J. Ferraro is the Bloomberg Distinguished Professor of Human Behavior and Public Policy at Johns Hopkins University. Pallavi Shukla is a Postdoctoral Research Fellow at the Department of Environmental Health and Engineering at Johns Hopkins University. Correspondence regarding this blog can be sent to Dr. Shukla at pshukla4@jhu.edu.  

The Social Science Prediction Platform Makes Its Debut

[Excerpts taken from the blog “Announcing the launch of the Social Science Prediction Platform!” by Aleksandar Bogdanoski and Katie Hoeberling, posted at the BITSS blogsite]
“Collecting and recording predictions systematically can help us understand how results relate to our prior beliefs, as well as how we can improve their accuracy. They can also reveal how surprising results really are, potentially protecting against publication bias or the mistaken discounting of results as “uninteresting” after the fact. Finally, tracking prediction accuracy over time makes it possible to identify super-forecasters—individuals who make consistently accurate predictions—who can help prioritize research questions, as well as design strategies and policy options in the absence of rigorous evidence.”
“Recognizing the potential of a more systematic approach to forecasting, BITSS has been working with Stefano DellaVigna and Eva Vivalt to build the first platform of its kind that will allow social scientists to systematically collect predictions about the results of research. Today we are excited to announce the official launch of the Social Science Prediction Platform!”
“The Social Science Prediction Platform, or SSPP, allows researchers to standardize how they source and record predictions, similar to how study registries have systematized hypothesis and design pre-registration. Perhaps most importantly, the SSPP streamlines the development and distribution of forecast collection surveys.”
“To encourage predictions, we will offer the first 200 graduate students $25 for their first 10 predictions, re-evaluating as more projects and predictions are added to the SSPP.”
To read more, click here.

Nature Human Behaviour: “Replications Do Not Fail”

[Excerpts taken from the editorial, “Replications do not fail” published by Nature Human Behaviour]
“For a very long time, replication studies had a lowly status in the hierarchy of publishable research. The equation of scientific advance with novelty and discovery relegated replication studies to second-class status in the scientific world. Over the past few years, this has fortunately been fast changing.”
“Since the launch of this journal, we have wanted to redefine what constitutes a significant scientific advance (https://www.nature.com/articles/s41562-016-0033). Science moves forward not only through discovery, but also through confirmation or disconfirmation of existing findings that have shaped a field. We placed replication studies on the same level as studies that report new findings and adopted the following principle: if the original study was highly influential, then a study that replicates its methodology is of equal value.”
“Still, replication studies have been hard to find in our pages—they constitute a very small proportion of the submissions we receive. Nonetheless, the current issue, by design, features four replication studies, all of high value in terms of their contribution to the scientific record.”
“This issue is a celebration of replication, and we hope that, by featuring these contributions together and writing this editorial, other scientists will also be inspired to devote their efforts in replication projects and more funders will be inclined to prioritize the funding of replication research. We are certainly looking forward to receiving and publishing many more replications in our pages.”
To read the full editorial, click here.

Pre-Analysis Plans in Economics: A Help or a Hindrance for Publication Success and Impact?

[Excerpts taken from the article “Do Pre-analysis Plans Hamper Publication?” by George Ofosu and Daniel Posner, published in the AER: Papers and Proceedings]
“Pre-analysis plans (PAPs) have been criticized… that PAPs generate dull, lab-report-style papers that are disfavored by reviewers and journal editors and thus hampered in the publication process.”
“To the extent that scholars who register and adhere to PAPs are disadvantaged in publishing their papers, researchers may be disincentivized from preregistration. This risks undermining the benefits for research credibility that the broader adoption of PAPs is thought to offer.”
I. Publication Outcomes of NBER Working Papers with and without PAPs
“We analyze papers issued as NBER working papers between 2011 and 2018, the period corresponding with the rise of preregistration in the economics discipline.”
“During this time span, NBER issued 8,706 working papers, of which 973 (11 percent) were experimental and thus were plausible candidates for preregistration. Fifty-three percent of these experimental working papers were subsequently published in peer-reviewed journals, with 13 percent landing in top-five outlets.”
“To assess whether PAPs affect the likelihood of publication, we coded whether each of these papers mentioned a PAP. We then calculated the publication rates of papers with and without PAPs.”
“Papers reporting the results of studies that followed PAPs were 10 percentage points less likely to be published by December 2019 than papers that did not mention a PAP (44 percent versus 54 percent; p < 0.1). However, conditional on being published, papers with PAPs were 39 percentage points more likely to land in a top-five journal (61 percent versus 22 percent; p < 0.01).”
II. Do Studies with PAPs Generate More Citations?
“…we collected data from Google Scholar on the number of citations to the 82 experimental NBER working papers that mention a PAP and a sample of 100 of the 477 published and 100 of the 414 unpublished experimental NBER working papers that do not mention a PAP.”
“Controlling for the number of years since being issued as an NBER working paper, whether the paper was published, and whether it was published in a top-five outlet…we estimate that having a PAP is associated with 14 additional citations…This represents more than a 40 percent increase over the 32 citations achieved by the median NBER working paper in our sample.”
III. Conclusion
“In keeping with the concerns of some PAP critics, who worry that fidelity to a PAP will lead to an uninteresting, mechanical paper that will be disadvantaged in the review process, we find that papers with PAPs are in fact slightly less likely to be published. However, we also find that, conditional on being published, papers with PAPs are more likely to land in top-five journals and are more likely to be cited.”
To read the article, click here.

Predicting Reproducibility: Rise of the Machines

[Excerpts are taken from the article “Estimating the deep replicability of scientific findings using human and artificial intelligence” by Yang Yang, Wu Youyou, and Brian Uzzi, published in PNAS]
“…we trained an artificial intelligence model to estimate a paper’s replicability using ground truth data on studies that had passed or failed manual replication tests, and then tested the model’s generalizability on an extensive set of out-of-sample studies.”
“Three methods to estimate a study’s risk of passing or failing replication have been assessed: the statistics reported in the original study (e.g., P values and effect size, also known as “reviewer metrics”), prediction markets, and surveys.”
“Our model estimates differences in predictive power according to the information used to construct the model—the paper’s narrative, paper’s reported statistics, or both.”
“After calibrating the model with the training data, we test it on hundreds of out-of-sample, manually replicated studies. Table 1 presents the study’s data sources, which include 2 million abstracts from scientific papers…and 413 manually replicated studies from 80 journals.”
“Our methodology involves three stages of development, and in stage 1 of this analysis, we obtained the manuscripts of all 96 [RPP] studies and stripped each paper of all nontext content…In stage 2, a neural-network-based method—word2vec (29) with standard settings—was used to quantitatively represent a paper’s narrative content by defining the quantitative relationship (co-occurrence) of each word with every other word in the corpus of words in the training set.”
“In stage 3, we predicted a study’s manually replicated outcome (pass or fail) from its paper-level vector using a simple ensemble model of bagging with random forests and bagging with logistic… This simple ensemble model generates predictions of a study’s likelihood of replicating [0.0, 1.0] using threefold repeated cross-validation. It is trained on a random draw of 67% of the papers to predict outcome for the remaining 33%, and the process is repeated 100 times with different random splits of training sets vs. test sets.”
“A typical standard for evaluating accuracy, which is assessed relative to a threshold selected according to the evaluator’s interest, is the base rate of failure in the ground-truth data. At the base rate threshold we chose, 59 of 96 studies failed manual replication (61%). We found that the average accuracy of the machine learning narrative-only model was 0.69…In other words, on average, our model correctly predicted the true pass or fail outcome for 69% of the studies, a 13% increase over the performance of a dummy predictor with 0.61 accuracy…”
“A second standard approach to assessing predictive accuracy is top-k precision, which measures the number of actual failures among the k lowest-ranked studies based on a study’s average prediction. When k is equal to the true failure base rate, the machine learning model’s top-k precision is 0.74…”
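
The two metrics described in these excerpts can be made concrete with a short sketch. The function names and toy data below are hypothetical, and two readings are assumed: that top-k precision is reported as a fraction of k, and that k is set to the number of studies that actually failed. The "dummy predictor" is taken to be one that always predicts the more common outcome, which with 59 failures out of 96 gives the 0.61 baseline quoted above.

```python
def dummy_accuracy(failed):
    """Accuracy of always predicting the more common outcome."""
    fail_rate = sum(failed) / len(failed)
    return max(fail_rate, 1 - fail_rate)


def top_k_precision(pred_replicate, failed, k=None):
    """Share of true failures among the k studies with the lowest
    predicted probability of replicating.

    If k is None, it defaults to the number of actual failures."""
    if k is None:
        k = sum(failed)
    if k <= 0:
        raise ValueError("k must be positive")
    # Rank studies from least to most likely to replicate.
    ranked = sorted(range(len(pred_replicate)), key=lambda i: pred_replicate[i])
    lowest_k = ranked[:k]
    return sum(failed[i] for i in lowest_k) / k


# Toy data: predicted replication probabilities and true outcomes.
preds = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
failed = [True, True, False, True, False, False, True, False]
print(top_k_precision(preds, failed))

# Baseline implied by 59 failures among 96 studies.
print(round(dummy_accuracy([True] * 59 + [False] * 37), 2))
```

Read this way, the reported accuracy of 0.69 against a baseline of 0.61 corresponds to the roughly 13% relative improvement mentioned in the excerpt (0.69 / 0.61 is about 1.13).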
“To compare the accuracy of the narrative-only model with conventionally used reviewer metrics, we designed a statistics-only model using the same procedure used to design the narrative only model…The reviewer metrics-only model achieved an average accuracy and top-k precision of 0.66…and 0.72…, respectively.”
“To investigate whether combining the narrative-only and reviewer metrics-only models provides more explanatory power than either model alone, we trained a model on a paper’s narrative and reviewer metrics. The combined narrative and reviewer-metrics model achieved an average accuracy of 0.71…and top-k precision of 0.76…The combined model performed significantly better in terms of accuracy and top-k precision than either the narrative or reviewer metrics model alone…”
“We ran robustness tests of our machine learning model on five currently available out-of-sample datasets that report pass or fail outcomes…Fig. 3 summarizes the out-of-sample testing results for narrative (blue), reviewer metrics (green), and narrative-plus-reviewer metrics (orange) models.”
“Test set I…consists of eight similarly conducted published psychology replication datasets (n = 117). The machine learning model generated an out-of-sample accuracy and top-k precision of 0.69 and 0.76, respectively.”
“Test set II…consists of one set of 57 psychology replications done primarily by students as class projects, suggesting more noise in the “ground truth” data. Under these conditions of relatively high noise in the data, the machine learning model yielded an out-of-sample accuracy and top-k precision of 0.65 and 0.69, respectively.”
“Test sets III and IV are notable because they represent out-of-sample tests in the discipline of economics, a discipline that uses different jargon and studies different behavioral topics than does psychology—the discipline on which the model was trained…Test set III includes 18 economics experiments…Test set IV includes 122 economics studies compiled by the Economics Replication Wiki. The accuracy scores were 0.78 and 0.66, and the top-k precision scores were 0.71 and 0.73, respectively.”
“In another type of out-of-sample test, we compared the machine learning model with prediction markets…To construct our test, we collected the subsample of 100 papers from test sets I to IV that were included in prediction markets and ranked papers from least to most likely to replicate per the reported results of each prediction market and each associated survey. We then ranked the machine learning model’s predictions of the same papers from least to most likely to replicate.”
“In comparing prediction markets, survey, and our machine learning model, we operated under the assumption that the most important papers to correctly identify for manual replication tests are the papers predicted to be least and most likely to replicate.”
“With respect to the 10 most confident predictions of passing, the machine learning model predicts 90% of the studies correctly; the market or survey methods correctly classify 90% of the studies. Among the 10 most confident predictions of failing, the market or survey methods correctly classify 100% of the studies, and the machine learning model correctly classifies 90% of the studies.”
“Machine learning appears to have the potential to aid the science of replication. Used alone, it offers accurate predictions at levels similar to prediction markets or surveys.”
“Though the findings should be taken as preliminary given the necessarily limited datasets with ground-truth data, our out-of-sample tests offer initial results that machine learning produces consistent predictions across studies having diverse methods, topics, disciplines, journals, replication procedures, and periods.”
To read the article, click here.

Assess Business and Economics Claims. Help Science! Make Money!

The repliCATS project is part of a research program called SCORE, funded by DARPA, that eventually aims to build automated tools that can rapidly and reliably assign confidence scores to social science research claims.
Our method – the IDEA protocol – harnesses the power of structured group discussion in predicting likely replicability. We have built a custom web-based platform to do this.
What we ask you to do is to evaluate the replicability of a claim. That is, if one were to follow the methods of the original study with a high degree of similarity, do you think one would find results similar to those of the original study?
We are now in Round 6, which runs from 1 June to 30 June. Round 6 focuses on business and economics* claims.
Prizes? Yes there are prizes!
– US$500: for the person who completes the most business or economics claims**
– US$250: for the people who complete the 2nd and 3rd most business or economics claims**
Claims must be submitted by midnight, AEST.
Interested in participating or learning more? Click here.
* Claims are considered Business or Economics claims if they were from the journals specified in the table in the FAQs under “From which journals are the 3000 claims being chosen?”
** A completed claim counts as submitting a final assessment.


How Much Power Does the Average Social Science Study Have? Less Than You Think

[Excerpts taken from the blog, “No, average statistical power is not as high as you think: Tracing a statistical error as it spreads through the literature”, by Andrew Gelman, posted at Statistical Modelling]
“I was reading this recently published article by Sakaluk et al. and came across a striking claim:”
“Despite recommendations that studies be conducted with 80% power for the expected effect size, recent reviews have found that the average social science study possesses only a 44% chance of detecting an existing medium-sized true effect (Szucs & Ioannidis, 2017).”
“I noticed this not because the claimed 44% was so low but because it was so high! I strongly doubt that the average social science study possesses a power of anything close to 44%. Why? Because 44% is close to 50%, and a study will have power of 50% if the true effect is 2 standard errors away from zero. I doubt that typical studies have such large effects.”
“…if researchers come into a study with the seemingly humble expectation of 44% power, then they’ll expect that they’ll get “p less than 0.05” about half the time, and if they don’t they’ll think that something went wrong. Actually, though, the only way that researchers have been having such a high apparent success rate in the past is from forking paths. The expectation of 44% power has bad consequences.”
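
The "50% power at two standard errors" rule of thumb in the excerpt can be checked with a short calculation. This is a sketch under assumptions not spelled out in the blog: the estimate is normally distributed, the test is two-sided at the 5% level, and δ denotes the true effect divided by its standard error.

```latex
\[
\text{power}(\delta) \;=\; \Pr\bigl(|Z + \delta| > 1.96\bigr)
\;=\; \Phi(\delta - 1.96) + \Phi(-\delta - 1.96), \qquad Z \sim N(0, 1).
\]
\[
\text{power}(2) \approx \Phi(0.04) \approx 0.52, \qquad
\text{power}(\delta) = 0.44 \;\Rightarrow\; \delta \approx 1.96 + \Phi^{-1}(0.44) \approx 1.81 .
\]
```

On this reading, 44% average power would require the typical true effect to sit about 1.8 standard errors from zero, which is the magnitude the author finds implausible.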
To read the full blog, click here.

When Peer Review is Too Slow – How About a Red Team?

[Excerpts taken from the article, “Pandemic researchers — recruit your own best critics” by Daniël Lakens, published in Nature]
“As researchers rush to find the best ways to quell the COVID-19 crisis, they want to get results out ultra-fast. Preprints — public but unvetted studies — are getting lots of attention…To keep up the speed of research and reduce sloppiness, scientists must find ways to build criticism into the process. …scientific ideal, but it is rarely scientific practice.”
“An initial version of a recent preprint by researchers at Stanford University in California estimated that COVID-19’s fatality rate was 0.12–0.2% (E. Bendavid et al. Preprint at medRxiv http://doi.org/dskd; 2020). This low estimate was removed from a subsequent version, but it had already received widespread attention and news coverage. Many immediately pointed out flaws in how the sample was obtained and the statistics were calculated. Everyone would have benefited if the team had received this criticism before the data were collected and the results were shared.”
“It is time to adopt a ‘red team’ approach in science that integrates criticism into each step of the research process. A red team is a designated ‘devil’s advocate’ charged to find holes and errors in ongoing work and to challenge dominant assumptions, with the goal of improving project quality.”
“With research moving faster than ever, scientists should invest in reducing their own bias and allowing others to transparently evaluate how much pushback their ideas have been subjected to. A scientific claim is as reliable as only the most severe criticism it has been able to withstand.”
To read the article, click here.

Debunking Three Common Claims of Scientific Reform

[Excerpts are taken from the article, “The case for formal methodology in scientific reform” by Berna Devezer, Danielle Navarro, Joachim Vandekerckhove, and Erkan Buzbas, posted at bioRxiv]
“Methodologists have criticized empirical scientists for: (a) prematurely presenting unverified research results as facts; (b) overgeneralizing results to populations beyond the studied population; (c) misusing or abusing statistics; and (d) lack of rigor in the research endeavor that is exacerbated by incentives to publish fast, early, and often. Regrettably, the methodological reform literature is affected by similar practices.”
“In this paper we advocate for the necessity of statistically rigorous and scientifically nuanced arguments to make proper methodological claims in the reform literature. Toward this aim, we evaluate three examples of methodological claims that have been advanced and well-accepted (as implied by the large number of citations) in the reform literature:”
1. “Reproducibility is the cornerstone of, or a demarcation criterion for, science.”
2. “Using data more than once invalidates statistical inference.”
3. “Exploratory research uses “wonky” statistics.”
“Each of these claims suffers from some of the problems outlined earlier and as a result, has contributed to methodological half-truths (or untruths). We evaluate each claim using statistical theory against a broad philosophical and scientific background.”
Claim 1: Reproducibility is the cornerstone of, or a demarcation criterion for, science.
“A common assertion in the methodological reform literature is that reproducibility is a core scientific virtue and should be used as a standard to evaluate the value of research findings…This view implies that if we cannot reproduce findings…we are not practicing science.”
“The focus on reproducibility of empirical findings has been traced back to the influence of falsificationism and the hypothetico-deductive model of science. Philosophical critiques highlight limitations of this model. For example, there can be true results that are by definition not reproducible…science does—rather often, in fact—make claims about non-reproducible phenomena and deems such claims to be true in spite of the non-reproducibility.”
“We argue that even in scientific fields that possess the ability to reproduce their findings in principle, reproducibility cannot be reliably used as a demarcation criterion for science because it is not necessarily a good proxy for the discovery of true regularities. To illustrate this, consider the following two unconditional propositions: (1) reproducible results are true results and (2) non-reproducible results are false results.”
True results are not necessarily reproducible
“Our first proposition is that true results are not always reproducible…for finite sample studies involving uncertainty, the true reproducibility rate must necessarily be smaller than one for any result. This point seems trivial and intuitive. However, it also implies that if the uncertainty in the system is large, true results can have reproducibility close to 0.”
False results might be reproducible
“In well-cited articles in methodological reform literature, high reproducibility of a result is often interpreted as evidence that the result is true…The rationale is that if a result is independently reproduced many times, it must be a true result. This claim is not always true. To see this, it is sufficient to note that the true reproducibility rate of any result depends on the true model and the methods used to investigate the claim.”
“We consider model misspecification under a measurement error model in simple linear regression…The blue belt in Figure 2 shows that as measurement error variability grows with respect to sampling error variability, effects farther away from the true effect size become perfectly reproducible. At point F in Figure 2, the measurement error variability is ten times as large as the sampling error variability, and we have perfect reproducibility of a null effect when the true underlying effect size is in fact large.”
Claim 2: Using data more than once invalidates statistical inference
“A well-known claim in the methodological reform literature regards the (in)validity of using data more than once, which is sometimes colloquially referred to as double-dipping or data peeking…This rationale has been used in reform literature to establish the necessity of preregistration for “confirmatory” statistical inference. In this section, we provide examples to show that it is incorrect to make these claims in overly general terms.”
“The key to validity is not how many times the data are used, but appropriate application of the correct conditioning…”
“Figure 4 provides an example of how conditioning can be used to ensure that nominal error rates are achieved. We aim to test whether the mean of Population 1 is greater than the mean of Population 2, where both populations are normally distributed with known variances. An appropriate test is an upper-tail two-sample z-test. For a desired level of test, we fix the critical value at z, and the test is performed without performing any prior analysis on the data. The sum of the dark green and dark red areas under the black curve is the nominal Type I error rate for this test.”
“Now, imagine that we perform some prior analysis on the data and use it only if it obeys an exogenous criterion: We do not perform our test unless “the mean of the sample from Population 1 is larger than the mean of the sample from Population 2.” This is an example of us deriving our alternative hypothesis from the data. The test can still be made valid, but proper conditioning is required.”
“If we do not condition on the information given within double quotes and we still use z as the critical value, we have inflated the observed Type I error rate by the sum of the light green and light red areas because the distribution of the test statistic is now given by the red curve. We can, however, adjust the critical value from z to z* such that the sum of the light and dark red areas is equal to the nominal Type I error rate, and the conditional test will be valid.”
“We have shown that using data multiple times per se does not present a statistical problem. The problem arises if proper conditioning on prior information or decisions is skipped.”
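
The conditioning argument can be checked numerically in a simplified setting. The sketch below is not the authors' Figure 4; it assumes the null hypothesis that the two population means are equal, so that the standardized difference in sample means Z is standard normal, and it assumes the test is only carried out when the first sample mean exceeds the second (Z > 0). The function names are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def conditional_type1_rate(critical_value):
    """P(Z > c | Z > 0) under the null, where Z ~ N(0, 1)."""
    c = max(critical_value, 0.0)
    return (1.0 - STD_NORMAL.cdf(c)) / 0.5


def adjusted_critical_value(alpha):
    """Critical value z* with P(Z > z* | Z > 0) = alpha under the null."""
    return STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)


def simulate_conditional_rate(critical_value, n_draws=100_000, seed=0):
    """Monte Carlo check: rejection rate among draws that pass the filter."""
    rng = random.Random(seed)
    tested = rejected = 0
    for _ in range(n_draws):
        z = rng.gauss(0.0, 1.0)
        if z > 0.0:  # the test is run only when sample mean 1 > sample mean 2
            tested += 1
            if z > critical_value:
                rejected += 1
    return rejected / tested


alpha = 0.05
z_naive = STD_NORMAL.inv_cdf(1.0 - alpha)  # usual upper-tail critical value
z_star = adjusted_critical_value(alpha)    # critical value after conditioning
print(f"naive z = {z_naive:.3f}: conditional error rate "
      f"{conditional_type1_rate(z_naive):.3f}")
print(f"adjusted z* = {z_star:.3f}: conditional error rate "
      f"{conditional_type1_rate(z_star):.3f}")
```

In this simplified case, keeping the unadjusted critical value doubles the error rate among the tests that are actually run, while moving to the larger critical value restores the nominal rate, which matches the point that the data can be used twice as long as the second use conditions on the first.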
Claim 3: Exploratory Research Uses “Wonky” Statistics
“A large body of reform literature advances the exploratory-confirmatory research dichotomy from an exclusively statistical perspective. Wagenmakers et al. (2012) argue that purely exploratory research is one that finds hypotheses in the data by post-hoc theorizing and using inferential statistics in a “wonky” manner where p-values and error rates lose their meaning: ‘In the grey area of exploration, data are tortured to some extent, and the corresponding statistics is somewhat wonky.’”
“Whichever method is selected for EDA [exploratory data analysis]; …it needs to be implemented rigorously to maximize the probability of true discoveries while minimizing the probability of false discoveries….repeatedly misusing statistical methods, it is possible to generate an infinite number of patterns from the same data set but most of them will be what Good (1983, p.290) calls a kinkus—‘a pattern that has an extremely small prior probability of being potentially explicable, given the particular context’.”
“The above discussion should make two points clear, regarding Claim 3: First, exploratory research cannot be reduced to exploratory data analysis and thereby to the absence of a preregistered data analysis plan, and second, when exploratory data analysis is used for scientific exploration, it needs rigor. Describing exploratory research as though it were synonymous with or accepting of “wonky” procedures that misuse or abuse statistical inference not only undermines the importance of systematic exploration in the scientific process but also severely handicaps the process of discovery.”
“Simple fixes to complex scientific problems rarely exist. Simple fixes motivated by speculative arguments, lacking rigor and proper scientific support might appear to be legitimate and satisfactory in the short run, but may prove to be counter-productive in the long run.”
To read the article, click here.