Sound and fury: McCloskey and significance testing in economics
2008; Taylor & Francis; Volume: 15; Issue: 1; Language: English
DOI: 10.1080/13501780801913298
ISSN: 1469-9427
Authors: Kevin D. Hoover, Mark V. Siegler
Topic(s): Complex Systems and Time Series Analysis
Abstract: For more than 20 years, Deirdre McCloskey has campaigned to convince the economics profession that it is hopelessly confused about statistical significance. She argues that many practices associated with significance testing are bad science and that most economists routinely employ these bad practices: 'Though to a child they look like science, with all that really hard math, no science is being done in these and 96 percent of the best empirical economics …' (McCloskey 1999). McCloskey's charges are analyzed and rejected. That statistical significance is not economic significance is a jejune and uncontroversial claim, and there is no convincing evidence that economists systematically mistake the two. Other elements of McCloskey's analysis of statistical significance are shown to be ill-founded, and her criticisms of the practices of economists are found to be based in inaccurate readings and tendentious interpretations of those economists' work. Properly used, significance tests are a valuable tool for assessing signal strength, for assisting in model specification, and for determining causal structure.

Keywords: Deirdre McCloskey, Stephen Ziliak, statistical significance, economic significance, significance tests, R.A. Fisher, Neyman-Pearson testing, specification search, C10, C12, B41

Acknowledgements: We thank Deirdre McCloskey for email discussions and, with Stephen Ziliak, for providing us with the individual scores from their 1980s survey and the individual scores broken down by question for the 1990s survey. We thank Ryan Brady for research assistance, and Paul Teller, A. Colin Cameron, Thomas Mayer, Clinton Greene, John Henry, Stephen Perez, Roger Backhouse, and four anonymous referees for valuable comments on earlier drafts.
We also appreciate the comments received when the paper was presented to the University of California, Davis Macroeconomics Workshop, the 2006 conference of the International Network for Economic Method, and a departmental seminar at the University of Kansas.

Notes

1. Other writings on this topic include the second edition of The Rhetoric (McCloskey 1998) as well as McCloskey (1985b, 1992, 1994, 1995a, b, 1997, 1999, 2002, 2005), McCloskey and Zecher (1984), McCloskey and Ziliak (1996), and Ziliak and McCloskey (2004a, b).

2. McCloskey and Ziliak (1996, pp. 99–101) offer an analysis of econometrics textbooks as additional evidence that the answer to this question is yes. Since their analysis concentrates heavily on a reading of Johnston's (1972) econometrics textbook, its flaws can be demonstrated only through a very detailed examination of the original text as well as of McCloskey and Ziliak's claims about it. To save space, we have omitted it. Our analysis is, however, available in section 3 of the earliest working-paper version of the current article (dated 20 August 2005) at www.econ.duke.edu/~kdh9/research.html.

3. A complete list of the omitted papers, with annotations to the criteria of inclusion, can be found at www.econ.duke.edu/~kdh9/research.html. We do not mean to imply that their failure to include all of the relevant papers would necessarily change their results (or our arguments or conclusions) in any meaningful way. As we discuss below, their survey design and implementation are so critically flawed that any conclusions they reach are meaningless, regardless of the sample. The omission of these papers is only emblematic.

4. It is unclear why Ziliak and McCloskey chose to group authors in ranges rather than to report individual scores.

5. For the 1980s, there are 182 articles.
Woodbury and Spiegelman (1987) scored 7 out of 19 and ranked in the 41st percentile of papers in the 1980s (i.e. 41% scored 7 or less). Darby (1984) scored either 2 or 3 and is in the 8th or 4th percentile (the latter a seven-way tie for last place). The ambiguity arises because McCloskey and Ziliak's score sheet reports 3 yes and 17 no for Darby's article, but there are only 19 questions. For the 1990s, there are 137 articles. Bernanke and Blinder (1992) score 8 (58th percentile); Becker, Grossman, and Murphy (1994) score 6 (30th percentile); Bernheim and Wantz (1995) score 1 (last place and the 1st percentile).

6. An additional reference to statistical significance (Darby 1984, p. 315) essentially states that a change in specification did not move the estimate of the coefficient on CDt outside its confidence interval in Equation (12) and therefore did not trigger any reassessment of the economic interpretation of that equation.

7. Ziliak and McCloskey's position is not fully consistent. We agree that 'a poorly fit correlation with the expected sign would say nothing' (Ziliak and McCloskey 2004a, p. 539). Yet the conclusion that McCloskey draws from her inaccurate recounting of a study of the effect of aspirin on heart attacks is that the size of the measured effect matters even when estimates are statistically insignificant (see section 3.2 below). And if the 'oomph' matters, so does its direction. McCloskey would surely not advocate aspirin as a prophylactic against heart attacks if the correlation between aspirin and heart attacks were positive, though statistically insignificant.

8. Ziliak and McCloskey (2004a, p. 541) continue: 'But their way of finding the elasticities is erroneous.' The charge is never substantiated by pointing out the particular errors. Be that as it may, whether Becker et al.
are right or wrong on this point has no bearing on the central question of whether they confuse economic and statistical significance.

9. A similar discussion of a different specification, couched in economic terms, is found on p. 408. And, despite Ziliak and McCloskey's having scored Becker et al. as not conducting any simulations (see question 17 of Table 1), a small simulation is reported on p. 409.

10. Ziliak and McCloskey (2004a, pp. 530–531) completely misread Edgeworth when they attempt to enlist him as an ally. It is Jevons who takes the position that 'size matters,' and Edgeworth who argues that we must attend to statistical significance. Yes, Edgeworth distinguishes between economic and statistical significance, but Ziliak and McCloskey miss his point when they assert that he corrects Jevons for ignoring an economically, as opposed to a statistically, significant difference of 3% or 4% in the volume of commercial bills in different quarters. Edgeworth (1885, p. 208) writes: 'Professor Jevons must be understood to mean that such a difference may for practical purposes be neglected. But for the purposes of science, the discovery of a difference in condition, a difference of 3 per cent. and much less may well be important.' Edgeworth did not dispute Jevons's judgment that 3% to 4% is economically small or criticize him for it. Rather, he argued that science may care about differences that are not of practical (i.e. economic) importance. To underwrite that view, he conducted a test of statistical significance: after correcting for a secular increase in the volume of bills, he found that the means of the first and second quarters differed by an amount equal to about 0.8 times the modulus (√2 times the standard deviation) or, in modern terminology, by 1.1 standard deviations.
Edgeworth concluded: 'There is, therefore, "no great difference," as Professor Jevons says; still a slight indication of a real law – enough to require the continuation of the inquiry, if the subject repaid the trouble.' Far from contradicting Jevons on the matter of economic importance, Edgeworth found that the differences were not statistically significant by his usual standard of two to three times the modulus; nevertheless, the result, he believed, was significant enough that it might encourage economic scientists to investigate further, even if it would not repay the trouble of a practical man. His caveat about a slight indication of a real law reflects the intuition that, as we might now put it, a p-value of 0.21 could fall into the acceptance region for a researcher who placed a high value on detecting faint signals. (Edgeworth (1885, p. 201) provides another example of an economically small difference that, because it is statistically significant, should not be neglected scientifically.)

11. Ronald Aylmer Fisher, who is an object of McCloskey's (1998, p. 112; see also Ziliak and McCloskey 2004a, pp. 530–531, 542–544) special scorn, clearly understands the conventional nature of the customary 5% size: 'it is convenient to take [a size of 5% or critical value of 1.96] as a limit in judging whether a deviation is to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant' (Fisher 1946, p. 44). Ziliak and McCloskey (2004a, p. 531) imply that Fisher has contradicted himself or, at least, shifted ground in moving from convenient in one sentence to formally regarded in the next. We view Fisher as declaring a pragmatic convention, using formally in the sense of '"[A]s a rule"; under normal circumstances' (Oxford English Dictionary, definition 4.b).
Our view is reinforced by the fact that Fisher discusses the implications of a number of test sizes other than 5% in the same passage.

12. According to the study in question, 'the external Data Monitoring Board of the Physicians' Health Study took the unusual step of recommending the early termination of the randomized aspirin component of the trial, primarily because a statistically extreme beneficial effect on nonfatal and fatal myocardial infarction had been found.' The Board cited the difference in total myocardial infarction between the aspirin and the placebo groups as having a statistical significance measured by p<0.00001 – i.e., extremely significant (Hennekens et al. 1988, p. 262, Table 1, emphasis added). McCloskey based her interpretation on secondary sources rather than the original study (email, McCloskey to Kevin D. Hoover, July 31, 2003). McCloskey was apparently misled by the statements in a letter to the editor of the New England Journal of Medicine stating that 'researchers found the results so positive that they ethically did not feel they should withhold aspirin's benefits from the control placebo group' and, in a news report (FDA Consumer, January–February 1994), that '[t]here was, however, no significant difference between the aspirin and placebo groups in number of strokes … or in overall deaths from cardiovascular disease' (both cited in the previously cited email). But 'positive' need not mean, as McCloskey takes it, large but statistically insignificant; nor are heart attacks the same thing as strokes or overall deaths from cardiovascular disease.

13. McCloskey (1985b, p. 202) was more moderate and more correct when she noted that this claim does not hold when the true hypothesis is 'literally zero' – see also section 3.5 below.

14.
Information criteria, such as the Bayesian Information Criterion (BIC), can be used to avoid such idiosyncratic preferences, since in a situation in which tests are nested, the BIC essentially acts to lower the significance level as the sample size increases.

15. Savage used this term to explain the limits on his own preferred sort of personalist Bayesian statistics, which he regarded as 'a natural late development of the Neyman-Pearson ideas' (see Keuzenkamp 2000, pp. 84–86 for the citation and discussion).

16. Newton did use statistical ideas in assessing historical evidence, but seems to have forgone applying them to a problem of practical personal gain in assessing the quality of the coinage while Master of the Mint (Stigler 1999, pp. 394–397).

17. Elliott and Granger (2004, p. 549) make the point that the degree of bending of starlight in Arthur Eddington's famous observations of the eclipse of 1919 was too small to matter practically, yet nevertheless served to distinguish Einstein's mechanics from Newton's. Ziliak and McCloskey's (2004b, pp. 668–669) riposte that Eddington did not use statistics misses the point: Elliott and Granger never said that he did, nor need they even have believed it implicitly, since the argument about loss functions is more general than statistics. Kruskal (1968b, p. 218) points out that many objections to statistical inference apply 'equally to any mode of analysis – formal, informal, or intuitive.'

18. McCloskey (2002, pp. 37–38) certainly claims that economics is more worldly than mathematics, which she regards not as a science but as 'a kind of abstract art,' though not to be disdained for that.

19. Feynman (1985, p. 7) cites, as one of many examples, measurements of Dirac's number, which are accurate to the 11th decimal place (more precisely, to 4×10⁻⁹%), the equivalent of measuring the distance between New York and Los Angeles to the width of a human hair.
Feynman cites the accuracy of this result, about five times more accurate than the predictions of the relevant theory, not for any gain or loss that it implies, but for the beauty of the conformity of theory and observation.

20. On the economy of research, see Wible (1994, 1998); on Peirce as a statistician, see Stigler (1999, chap. 10).

21. The search terms were: 'confidence interval(s),' 'error band(s),' '2 (two) standard error(s),' and '2 (two) standard deviation(s).' The numbers for the American Economic Review are not strictly comparable to McCloskey and Ziliak's surveys, which exclude articles from the Papers and Proceedings (May) numbers and shorter articles.

22. She is inconsistent. In other moods, McCloskey (2002, pp. 30, 32) berates economists for dismissing metaphysics.

23. Data mining is discussed in detail in Hoover (1995) and Hoover and Perez (2000). On the one hand, McCloskey condemns data mining; on the other hand, McCloskey (1985a, pp. 139–140; 1985b, p. 201) cites favourably Leamer's (1978) analysis of specification search and his (1983) 'extreme-bounds analysis.' Both involve data mining. Oddly, Ziliak and McCloskey (score sheets for 2004a, personal communication) omit both Leamer's (1983) article and McAleer, Pagan, and Volker's (1985) rebuttal, both of which appeared in the American Economic Review and meet the criteria for their survey of articles from the 1980s. Similarly, Cooley and LeRoy's (1981) application of extreme-bounds analysis to money demand, which is itself favorably cited by McCloskey (1985a, p. 140), is omitted from the survey of the 1980s.
More oddly still, in light of the favorable evaluation of extreme-bounds analysis, Levine and Renelt's (1992) article, which is a straightforward application of extreme-bounds analysis to cross-country growth regressions, scores only 3 out of 19 – the third-worst performance in their survey of the 1990s.

24. The key words were 'statistically significant,' 'statistical significance,' 'significance test,' 'test of significance,' 'significance tests,' 'tests of significance,' 't-test,' 'F-test,' and 'chi squared.'

25. Additional evidence is found in Staley (2004), who discusses the use of significance tests in high-energy physics in considerable detail.
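The modulus arithmetic in note 10 can be checked directly: Edgeworth's modulus is √2 times the standard deviation, so a difference of 0.8 moduli is roughly 1.1 standard deviations, and his usual standard of two to three moduli is far stricter than the modern 1.96 convention. A minimal sketch (the helper function name is ours):

```python
import math

def moduli_to_sd(moduli: float) -> float:
    """Convert a distance measured in Edgeworth moduli to standard deviations.

    Edgeworth's modulus equals sqrt(2) times the standard deviation.
    """
    return moduli * math.sqrt(2)

# Edgeworth's 0.8 moduli: 0.8 * sqrt(2) ≈ 1.13, i.e. the "1.1 standard
# deviations" of note 10.
print(round(moduli_to_sd(0.8), 2))

# His usual standard of "two to three times the modulus" corresponds to
# roughly 2.8 to 4.2 standard deviations.
print(round(moduli_to_sd(2), 2), round(moduli_to_sd(3), 2))
```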
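The claim in note 14 that the BIC effectively lowers the significance level as the sample grows can be made concrete. For two nested models differing by one parameter, the BIC prefers the larger model only when twice the gain in log-likelihood exceeds log n, i.e. only when the usual z statistic exceeds √(log n) in absolute value; the implied two-sided test size therefore shrinks with n. A sketch under those assumptions (the function name is ours):

```python
import math

def bic_implied_alpha(n: int) -> float:
    """Two-sided significance level implied by BIC for one extra parameter.

    BIC prefers the larger of two nested models differing by one parameter
    iff 2 * (gain in log-likelihood) > log(n), i.e. iff |z| > sqrt(log(n)).
    """
    z_crit = math.sqrt(math.log(n))
    # Two-sided tail probability of a standard normal beyond z_crit:
    # erfc(z / sqrt(2)) = 2 * (1 - Phi(z)).
    return math.erfc(z_crit / math.sqrt(2))

# The implied significance level falls monotonically with sample size.
for n in (50, 1_000, 1_000_000):
    print(n, round(bic_implied_alpha(n), 4))
```

For n = 50 the implied size is close to the conventional 5%, but it falls below 1% by n = 1,000, which is the sense in which the BIC tightens the test as the sample grows.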