ROBERT GELFOND and RYAN MURPHY: Out-of-Sample Tests and Macroeconomics
The replication crisis has elicited a number of recommendations, from betting on beliefs, to open data, to improved norms in academic journals regarding replication studies. In our recent working paper, “A Call for Out-of-Sample Testing in Macroeconomics” (available at SSRN), we argue that a renewed focus on out-of-sample tests will significantly mitigate the issues with replication, and we document the fact that out-of-sample tests are absent from entire literatures in economics.
Our starting point is the observation that the literature on the government spending multiplier contains almost no results supported by an out-of-sample test. For this we use a fairly forgiving definition of “out-of-sample test”: so long as a model is parameterized in one period and then applied to new data, we count it. We review 87 empirical papers estimating the multiplier. With only a few exceptions, out-of-sample tests do not make an appearance. Given that this question is perhaps the most important in macroeconomics, with quite literally trillions of dollars on the line, this result is jarring.
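To make the forgiving definition concrete, here is a minimal sketch in Python of what such a test involves: fit a model on an early period, then score its forecasts on later data it never saw, against a naive baseline. Everything in it is an assumption for illustration only. The data are synthetic, the "true" multiplier of 1.5 is made up, and the helper names (`fit_ols`, `rmse`) are hypothetical; none of this comes from the working paper or from the papers it reviews.

```python
import random
from statistics import mean


def fit_ols(x, y):
    """Return (intercept, slope) from a one-regressor least-squares fit."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def rmse(actual, predicted):
    """Root mean squared forecast error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return (sum(squared) / len(squared)) ** 0.5


# Synthetic, purely illustrative data: 80 periods of spending shocks and
# output responses generated with an assumed multiplier of 1.5 plus noise.
rng = random.Random(0)
spending = [rng.gauss(0, 1) for _ in range(80)]
output = [1.5 * g + rng.gauss(0, 0.5) for g in spending]

# Parameterize the model on the first 60 periods only.
split = 60
train_x, test_x = spending[:split], spending[split:]
train_y, test_y = output[:split], output[split:]
intercept, slope = fit_ols(train_x, train_y)

# Apply the fitted model, unchanged, to the 20 held-out periods.
model_forecast = [intercept + slope * g for g in test_x]

# Naive baseline: always forecast the in-sample average of output.
baseline_forecast = [mean(train_y)] * len(test_y)

model_rmse = rmse(test_y, model_forecast)
baseline_rmse = rmse(test_y, baseline_forecast)
print(f"estimated multiplier (in-sample): {slope:.2f}")
print(f"out-of-sample RMSE, model:    {model_rmse:.2f}")
print(f"out-of-sample RMSE, baseline: {baseline_rmse:.2f}")
```

The point of the design is that the coefficients are frozen before the holdout data are touched, so the comparison of forecast errors cannot be improved by searching over specifications after the fact.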
It was in 1953 that Milton Friedman published “The Methodology of Positive Economics,” urging economists to use prediction as their criterion for comparing the worthiness of competing theories. Clearly, philosophy of science and practical econometrics have moved beyond this simplistic dictum, but does it make sense to cast aside out-of-sample predictions altogether when comparing theories and models? Are we really that confident that the results found using the methods which claim the throne of the “credibility revolution in empirical economics” will withstand the scrutiny of truly out-of-sample tests?
The primary exception to our result is a 2007 paper by Frank Smets and Rafael Wouters, who ably perform an out-of-sample test against a series of baseline models. However, in the absence of other papers performing such tests, it is difficult to say how strong a result it is. Even more laudable is the lengthy attempt in 2012 by Volker Wieland and his colleagues to compare the performance of a number of macroeconomic models, although it is difficult to parse this study to answer the narrower question regarding the government spending multiplier. Another example is a 2016 paper by Jordà and Taylor, which creates a “counterfactual forecast” that is similar to, but not quite, an out-of-sample test.
The other two examples we were able to identify were published in 1964 and 1967.
There are clearly additional criteria that economists can and should use for evaluating theories. Nonetheless, the paucity of examples of these tests points to p-hacking, specification searches, and the whole slew of problems associated with the replication crisis of social science. And perhaps macroeconomics is “hard” and things like recessions cannot be reasonably forecast. Fine. Meteorologists cannot forecast more than a week or so ahead, but they still forecast what they can forecast. Which models work best is still an extremely pertinent question, even if all models fail miserably when a recession hits.
Doing away with out-of-sample tests and other similar tests, by contrast, does away with the scientific ideal of Conjectures and Refutations, in which scientific knowledge evolves as bold ideas, starkly stated, compete for the title of least wrong.
Bob Gelfond is the CEO of MQS Management LLC and the chairman and founder of MagiQ Technologies. Ryan Murphy is a research assistant professor at the O’Neil Center for Global Markets and Freedom at SMU Cox School of Business.