REED: The Devil, the Deep Blue Sea, and Replication

Posted on 1st December 2018 by replicationnetwork


In a recent article (“Between the Devil and the Deep Blue Sea: Tensions Between Scientific Judgement and Statistical Model Selection” published in Computational Brain & Behavior), Danielle Navarro identifies blurry edges around the subject of model selection. The article is a tour de force in thinking broadly about statistical model selection. She writes,

“What goal does model selection serve when all models are known to be systematically wrong? How might “toy problems” tell a misleading story? How does the scientific goal of explanation align with (or differ from) traditional statistical concerns? I do not offer answers to these questions, but hope to highlight the reasons why psychological researchers cannot avoid asking them.”

She goes on to say that researchers often see model selection as…

“…a perilous dilemma in which one is caught between two beasts from classical mythology, the Scylla of overfitting and the Charybdis of underfitting. I find myself often on the horns of a quite different dilemma, namely the tension between the devil of statistical decision making and the deep blue sea of addressing scientific questions. If I have any strong opinion at all on this topic, it is that much of the model selection literature places too much emphasis on the statistical issues of model choice and too little on the scientific questions to which they attach.”

The article neither mentions nor alludes to replication, but it seems to me the issue of model selection is conceptually related to the issue of “replication success” in economics and other social sciences. Numerous attempts have been made to define “replication success” quantitatively (for a recent effort, see here). But just as the issue of model selection demands more than a goodness-of-fit number can supply, so the issue of “replication success” requires more than constructing a confidence interval for “the true effect” or calculating a p-value for some hypothesis about “the true effect”.

For starters, it’s not clear there is a single, “true effect.” Let’s suppose there is. Maybe the original study was content to demonstrate the existence of “an effect.” In that case, replication success should be content with this as well. Alternatively, maybe the goal of the original study was to demonstrate that “the effect” was equal to a specific numerical value. This is a common situation in economics. For example, in the evaluation of public policies, it is not sufficient to show that a policy will have a desirable outcome; one must show that the benefit it produces is greater than the cost. The numbers matter, not just the sign of the effect. Accordingly, the definition of replication success will be different.
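To make the contrast concrete, here is a minimal sketch in Python. It is not a proposed definition of replication success. The two decision rules, the function names, and all the numbers are hypothetical and chosen only for illustration. Both rules assume a normally distributed estimate with a known standard error. The point is simply that one and the same replication can "succeed" under one purpose and "fail" under another.

```python
from statistics import NormalDist


def confidence_interval(estimate, std_error, level=0.95):
    """Two-sided normal-approximation confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * std_error, estimate + z * std_error


def replicates_sign(original_estimate, rep_estimate, rep_se, level=0.95):
    """Illustrative rule 1: the replication's interval excludes zero
    on the same side as the original estimate."""
    low, high = confidence_interval(rep_estimate, rep_se, level)
    return low > 0 if original_estimate > 0 else high < 0


def replicates_benefit_exceeds_cost(rep_estimate, rep_se, cost, level=0.95):
    """Illustrative rule 2: the whole interval for the benefit lies
    above the policy's cost."""
    low, _ = confidence_interval(rep_estimate, rep_se, level)
    return low > cost


# Hypothetical numbers, in the same (arbitrary) units.
original_benefit = 1.5          # benefit estimated by the original study
rep_benefit, rep_se = 0.8, 0.3  # replication estimate and standard error
cost = 1.0                      # cost of the policy

sign_ok = replicates_sign(original_benefit, rep_benefit, rep_se)
cost_ok = replicates_benefit_exceeds_cost(rep_benefit, rep_se, cost)
print(f"same-sign effect replicated: {sign_ok}")
print(f"benefit shown to exceed cost: {cost_ok}")
```

With these made-up numbers the replication confirms a positive effect but does not show that the benefit exceeds the cost, so whether it counts as a "success" depends on what the original study set out to establish.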

This is exactly the conclusion from the recent issue in Economics on The Practice of Replication (see here for “takeaways” from that issue). There is no single measure of replication success because scientific studies do not all have the same purpose. It may be possible to categorize the purposes of studies and to define replication success within each category, though this is yet to be demonstrated. What is certain is that there is no single scientific purpose, and thus no single measure of replication success.

Speaking of her own field of human cognition, Navarro writes,

“To my way of thinking, understanding how the qualitative patterns in the empirical data emerge naturally from a computational model of a psychological process is often more scientifically useful than presenting a quantified measure of its performance, but it is the latter that we focus on in the “model selection” literature. Given how little psychologists understand about the varied ways in which human cognition works, and given the artificiality of most experimental studies, I often wonder what purpose is served by quantifying a model’s ability to make precise predictions about every detail in the data.”

People are complex and complicated. Aggregating them to markets and economies does not make them easier to understand. Thus the points that Navarro makes apply a fortiori to replication and economics.

Bob Reed is a professor of economics at the University of Canterbury in New Zealand. He is also co-organizer of the blogsite The Replication Network. He can be contacted at bob.reed@canterbury.ac.nz.

Category: GUEST BLOGS Tags: Danielle Navarro, Model selection, Overfitting, Replication success, Underfitting
