Article (open access, peer reviewed)

Never Mind the Bollocks: Chance, Noise, Skepticism, and Statistics

2011; Elsevier BV; Volume: 59; Issue: 1; Language: English

10.1016/j.annemergmed.2011.11.003

ISSN

1097-6760

Author(s)

William B. Millard

Topic(s)

Healthcare cost, quality, practices

Abstract

State lotteries—to take just 1 phenomenon in which probabilistic thinking intersects with everyday life—are described aphoristically as a tax on people who are bad at mathematics. It would be more precise to call them a tax on people who are bad at math but act on it anyway. In the space between an understanding of conditional probability and the need (or desire) for consequential action, much depends on whether one perceives a connection between these realms, forces a connection, or accepts with equanimity that one may not exist.

There is often an uneasy tension, suggest specialists in statistics, epidemiology, and emergency medicine, between quantitative thinking and the broader forms of cognition that add up to sound clinical practice. Quantifying risk is a difficult process for most people to comprehend, journalists in particular and physicians not excepted. Proponents of evidence-based medicine, although working to separate comprehensible information from the workings of chance and the influences of wishful thinking or material interest, acknowledge that epidemiologic correlations, clinical courses, and treatment outcomes are all too complex for reduction to a P value or a binary choice.

Yet procedures that substitute those deceptively simple, easily biased statistical values for more nuanced modes of knowledge and more meaningful criteria permeate much of the research literature—even a majority of it, according to the analyses of John P. A. Ioannidis, MD, PhD, professor of medicine, epidemiology, and statistics at Stanford School of Medicine. By the time research is translated into headline language, a rough scientific equivalent of the economic observation termed Gresham's Law (“bad money drives out good if exchanged at the same price”1 [Mundell R. Uses and abuses of Gresham's Law in the history of money. Zagreb J Economics. 1998;2:3-38; http://robertmundell.net/ebooks/free-downloads/; accessed October 13, 2011]) may be in operation.

Just as methodological errors migrate from professional journals into the lay press, so also do methodological critiques. A New York Times op-ed recently explored the general question of the reliability of research reports, particularly when left uncontested by replication studies, citing the “de-discovery” efforts of Columbia University virologist W. Ian Lipkin, MD.2 [Zimmer C. It's science, but not necessarily right. New York Times. June 26, 2011:SR12; http://www.nytimes.com/2011/06/26/opinion/sunday/26ideas.html] A profile in The Atlantic3 [Freedman D.H. Lies, damned lies, and medical science. Atlantic. November 2010:76-86; http://www.theatlantic.com/magazine/archive/2010/11/lies-damned-lies-and-medical-science/8269/?single_page=true] last year brought Dr. Ioannidis's research to a lay readership's attention. In the wake of Dr. Ioannidis's work, the popular campaign by Yale statistician and information designer Edward Tufte, PhD, against junk statistics and “chartjunk”4 [Tufte E.R. The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT; 1983], and a spate of prominent reversals of professional consensus about risk factors, cancer-screening outcomes, and once-trusted treatments, physicians may face increasing public skepticism, shading off into cynicism, about whether clinical procedures and decisions rest on solid ground.

Positivism's Mixed Blessings

Of the forms of evidence that clinicians consider in assessing how research results might affect their daily practice, methodological rigor increases as one moves from anecdotal case reports to observational studies to randomized controlled trials. Statistical techniques designed to reduce the likelihood that results are due to chance or bias are indispensable but imperfect.

“The value of quantitative reasoning is that you can average out the noise and find the signal that's hidden in it,” observed Robert L. Wears, MD, MS, professor of emergency medicine and medical informatics at the University of Florida, Jacksonville. “The problem is that people get enamored with that. I've been there myself; you feel like ‘Ah, I've got the key to the truth,' and what you've got is the key to part of it, because sometimes that noise is the signal … . Sometimes the signal lies in the relation between 2 parts and not in the parts themselves.”

The overvaluing of pure numeracy is “part of the legacy of the Enlightenment that we still haven't fought off,” commented Dr. Wears. “It's the legacy of positivism, and it's very powerful. There's no question that a lot of the good things in modern life have come from that approach, but there's a problem when you treat [it] as the only possible approach that can be useful.” Moreover, even among specialists in statistical analysis, there is no consensus on how much meaning is discernible from the fuzzy complexities of measured results.

At least 1 leader in the effort to improve quality control in the clinical research enterprise—and to upgrade the presentation of results in the scholarly literature as well as the popular media—locates himself, with appealing frankness, squarely outside the highly numerate segment of the population.

“I'm not a natural mathematician, as I believe some people are,” said British methodological researcher and obstetrician Sir Iain Chalmers, former director of the UK Cochrane Centre and founding editor at the James Lind Library, which complements the Cochrane Collaboration's systematic reviews with an archive of writings on their history and purpose. “I was told by a very reassuring statistician once that mathematical ability is a congenital abnormality. And most people don't have it.”

Some who undoubtedly have it are unsure that everyone needs it. David Spiegelhalter, PhD, FRS, Winton Professor of the Public Understanding of Risk at the Statistical Laboratory, Centre for Mathematical Sciences, Cambridge, England, commented that “although I'm a statistician and like numbers, I have a lot of sympathy with a non-numerate approach. People get by very well with their gut feelings and their rules of thumb without putting things into numbers; people manage very well indeed! And they have done throughout history.”

Dr. Spiegelhalter argues for clearer communications between high- and low-numeracy segments of the population. He favors reporting absolute risks and expressing them in whole numbers (eg, “2 out of 100 people”), which are more natural for most people to understand than ratios or percentages, even if those formats convey identical information.5 [Spiegelhalter D.J. Why risk is a risky business. New Scientist. 2009;203:20-21; http://www.newscientist.com/article/mg20327215.400-why-managing-risk-is-a-risky-business.html; accessed October 14, 2011],6 [Spiegelhalter D.J. Understanding uncertainty. Ann Fam Med. 2008;6:196-197; http://www.annfammed.org/cgi/content/full/6/3/196; accessed October 14, 2011] Relative risks and benefits expressed in vague, numerator-free phrases such as “30% reduction in heart attacks,” he said, are “grossly manipulative. That information gives you no basis whatsoever to be able to choose whether the treatment is beneficial or not.” He was encouraged by the new guidelines of the Association of the British Pharmaceutical Industry, which state that relative risks must not appear without absolute risks.7 [Prescription Medicines Code of Practice Authority. ABPI Code of Practice for the Pharmaceutical Industry 2011. Association of the British Pharmaceutical Industry, London, England; 2011; http://www.pmcpa.org.uk/files/sitecontent/ABPI_Code_2011.pdf]
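A short, hypothetical calculation illustrates his point; the baseline risk and the 30% figure below are invented for the sketch rather than drawn from any study cited here.

```python
# Hypothetical example: what a "30% reduction in heart attacks" can mean in absolute terms.
# The baseline risk is invented for illustration; it is not from any study cited in this article.
baseline_risk = 2 / 100              # assume 2 out of 100 untreated people have a heart attack
relative_reduction = 0.30            # the headline claim: "30% reduction in heart attacks"

treated_risk = baseline_risk * (1 - relative_reduction)   # 1.4 out of 100
absolute_reduction = baseline_risk - treated_risk          # 0.6 out of 100
number_needed_to_treat = 1 / absolute_reduction            # about 167 people treated per event avoided

print(f"untreated: {baseline_risk * 100:.1f} in 100, treated: {treated_risk * 100:.1f} in 100")
print(f"absolute reduction: {absolute_reduction * 100:.1f} in 100 "
      f"(roughly 1 event avoided per {number_needed_to_treat:.0f} people treated)")
```

The same “30% reduction” would equally describe a drop from 20 in 100 to 14 in 100, a tenfold larger absolute benefit, which is why the numerator-free phrase by itself gives no basis for choice.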
Another highly numerate methodological scholar, David L. Schriger, MD, MPH, professor of emergency medicine at UCLA, observed that certain discoveries have struck him and his colleagues as so intuitively advantageous that randomized trials, though valuable for confirmation and communication, would be nearly unnecessary as a determinant of daily practice. Propofol for procedural sedation, for example, earned such popularity after its appearance a little more than a decade ago, Dr. Schriger reported, that between direct experience and word-of-mouth peer communications, he and his colleagues found themselves asking “Why would I use anything else?”

Methods Clash Inside Octopus Paul's Tank

Laypersons and clinicians alike look to specialists in the application of Fisher's exact test, Student's t test, Bayesian analyses of prior and posterior probability, and other related procedures for guidance in evaluating hypotheses; they are an aspect of research that many readers are happy to delegate. This can lead to curious inferences, or worse, when the statistical procedures themselves are (pardon the pun) at odds.

Dr. Spiegelhalter pointed out that classic frequentist statistics apply probability theory to data without accounting for what he termed “epistemic uncertainty … uncertainty about the underlying state of the world,” whereas Bayesian statistics include prior probabilities in the calculations. “The crucial thing is, it's not what this study says on its own; it's what it adds to the currently available evidence … . Evidence does not exist in a vacuum. The value of evidence is how it changes your opinion, and Bayesian methods quantify that.”

The methods have different applications and can yield dramatically disparate conclusions. “If you've got a strong prior probability against something, and you read a study,” Dr. Spiegelhalter continued, “it's not going to change your mind very much. [As] a classic example of the danger of frequentist statistical methods, the one I always use is Paul the ‘psychic octopus,' who was predicting the World Cup.8 [Batty D. Paul the “psychic” octopus wins again in World Cup final. Guardian. July 11, 2010; http://www.guardian.co.uk/football/2010/jul/12/paul-psychic-octopus-wins-world-cup; accessed October 10, 2011] He got 8 right in a row! Well, I'm sorry, but my prior probability is so low that an octopus can predict football results—it's so close to zero, if not zero—that it hasn't been shifted at all.”

Frequentist calculations, as Dr. Spiegelhalter observed, would confer the much-sought-after P value below .05, a formal statistical significance all too easily equated with causal significance, on Paul: “The chance of getting 8 right in a row is 1 over 256. That's P less than .05 in a 2-sided test.” Bayesian procedures, incorporating contextual probabilities and measuring how new data alter them to yield posterior probabilities, are more resistant to chance conclusions such as an inference of precognition on the part of a cephalopod. “From a Bayesian perspective,” he said, “I haven't shifted one little bit. It's all bollocks!”
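The arithmetic on both sides of that disagreement is brief enough to sketch. In the snippet below, the prior probability and the "never misses" likelihood for a genuinely psychic octopus are invented placeholders standing in for Dr. Spiegelhalter's "so close to zero, if not zero"; nothing here comes from a published analysis.

```python
# Paul the octopus: 8 correct World Cup predictions in a row.
# Frequentist side: how surprising are the data if each pick is a 50/50 guess?
p_one_sided = (1 / 2) ** 8       # 1/256, about 0.0039
p_two_sided = 2 * p_one_sided    # about 0.0078, still comfortably below .05

# Bayesian side: combine the data with a prior for "octopuses can predict football."
# Both numbers below are invented placeholders for the sake of the sketch.
prior_psychic = 1e-9             # stand-in for "so close to zero, if not zero"
likelihood_if_psychic = 1.0      # generously assume a truly psychic octopus never misses
likelihood_if_guessing = p_one_sided

posterior_psychic = (likelihood_if_psychic * prior_psychic) / (
    likelihood_if_psychic * prior_psychic
    + likelihood_if_guessing * (1 - prior_psychic)
)

print(f"two-sided P value: {p_two_sided:.4f}")           # nominally "statistically significant"
print(f"posterior P(psychic): {posterior_psychic:.1e}")  # still vanishingly small (~2.6e-07)
```

The P value answers "how unlikely are 8 hits if Paul is guessing?"; the posterior answers the question people actually care about, "how likely is it that Paul is psychic, given 8 hits?", and with any remotely realistic prior the answer barely moves.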
Different Histories, Different Applications

The processes by which findings reach the lay public, Dr. Spiegelhalter said, justify a highly skeptical stance, at least initially. “The most crucial thing when reading a medical study is to think, ‘Why am I reading this? Why has this come to my attention?’ It's gone through this huge number of filters. They decided to do the study; they did the study; they decided to actually write it up; a journal decided to publish it; their institution decided to do a press release; the journalist decided to pick up the press release; the editor decided to stick it into a newspaper. It's gone through all those processes. By the time it gets through all that, I would almost guarantee it's of no value whatsoever. I take a general principle that the very fact that I'm hearing about a story on the radio is a reason to ignore the story on the radio.”

“Statistics are not facts about the world,” he continued. “They're not God-given facts. Someone has decided what to measure, how to measure it, what to report; all these filters have happened. They're socially constructed things.” He was careful to dissociate this sense of social constructionism from a cruder meaning of the term: “I'm not going to take a full relativist position and say that means they're just someone's opinion—no, not at all. They're of great value. However, it does have to be taken into account: statistics need deconstructing. You need to be able to take them apart, and that's tricky, particularly if you're not given all the information.”

Given that forms of statistical information are social constructions, attention to their history seems appropriate. As Dr. Schriger observed, classic statistics arose from the need for industrial quality control: specifically, wartime needs for reliable ammunition.9 [Schriger D.L. Problems with current methods of data analysis and reporting, and suggestions for moving beyond incorrect ritual. Eur J Emerg Med. 2002;9:203-207] Under a daily decision requirement—Does each batch of bullets contain too many duds? Should it be shipped to the front or melted down?—random sampling of batches and testing for quality proved a good match for statistical methods involving hypothesis comparisons, with Fisherian P values as endpoints. Given an assumption that a proportion of x bad bullets would be acceptable, frequentist procedures based on the Neyman-Pearson lemma10 [Neyman J., Pearson E. On the problem of the most efficient tests of statistical hypotheses. Phil Trans R Soc Lond A. 1933;231:289-337; https://doi.org/10.1098/rsta.1933.0009; accessed October 14, 2011] can generate a range of comparable probabilities for the hypothesis that factory testers would find y bad bullets by chance alone. This method, Dr. Schriger commented, reflects mechanistic behavior-rule assumptions and performs admirably in gate-keeping contexts: binary A/B tests of a hypothesis against the null. It is less useful in clinical situations, in which simple binary choices are rare, unique multivariate conditions are inescapable, and decisions must consider actual observations (posterior probabilities, in Bayesian language), not conjectures and nulls.
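A toy version of that gate-keeping calculation might look like the following; the sample size, acceptable dud rate, and counts are invented for illustration and are not meant to reproduce any historical procedure.

```python
# Toy acceptance-sampling check in the spirit of the wartime example:
# sample a batch, count the duds, and ask how probable a count that high would be
# if the batch were running exactly at the acceptable dud rate (the null hypothesis).
# All numbers are invented for illustration.
from math import comb

def p_at_least(y_duds: int, n_sampled: int, acceptable_rate: float) -> float:
    """P(y or more duds in n sampled bullets) when the true dud rate equals acceptable_rate."""
    return sum(
        comb(n_sampled, k) * acceptable_rate**k * (1 - acceptable_rate) ** (n_sampled - k)
        for k in range(y_duds, n_sampled + 1)
    )

n_sampled, acceptable_rate = 100, 0.02   # sample 100 bullets; 2 duds per 100 is tolerable
for y_found in (3, 5, 8):
    p = p_at_least(y_found, n_sampled, acceptable_rate)
    verdict = "melt it down" if p < 0.05 else "ship it"
    print(f"{y_found} duds in {n_sampled}: P = {p:.3f} -> {verdict}")
```

The decision rule is binary and repeatable, exactly the setting Dr. Schriger describes; what it does not do is tell the factory the probability that this particular batch is bad, which is the clinical-style question.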
“The null hypothesis that there's absolutely no difference whatsoever between 2 treatments is nonsensical,” Dr. Spiegelhalter said—not only unrealistic but also easily manipulated in trials in which an interested party chooses a control group. “It does seem strange to posit the whole of medical research on hypotheses that you know are wrong. Two treatments are not exactly the same to the 15th decimal place … . I can't think of any other area in life where you condition your activity on something you know to be false. It's quite bizarre, really, but it's a useful fiction.”

“I think Bayesian statistics have considerable advantage, more because they tend to reflect the way people actually treat information in practice,” Dr. Wears commented. “People in their normal working act as if they were Bayesians.” Epidemiologist Sander Greenland, DrPH, of the UCLA School of Public Health, elaborated on this distinction. “The way people misinterpret statistics is they misinterpret the conditional probabilities,” he said. “They take the conditioning in the wrong direction, like with P values and statistical significance: that's a probability of data given a hypothesis, and people turn it around into a probability of a hypothesis given data, which is very tricky, and they get it wrong all the time.”

“Frequentist statistics are always misinterpreted as if they're Bayesian,” Dr. Greenland continued, “and the only cure available for that, if there is any, is to teach Bayesian statistics alongside of frequentist statistics from day 1 … . A Bayesian statistic is what people naturally tend to want, and be asking for, and interpreting things as. And then [researchers] give them these frequentist statistics, which are backwards; they're conditional probabilities in a reverse direction, and nobody gets it right. There are ways of Bayesianly interpreting correctly the frequentist statistics, but they're almost never taught, and so people are left to their own devices, and when they are, they inevitably misinterpret them. They give them the wrong Bayesian interpretation.”

More Than a Specialists' Debate

Mathematically naive observers of Paul the mediagenic mollusk aren't the only ones being suckered. Misuse of frequentist methods underlies a great deal of data mining, to mention just 1 category of bias. The call for wider application of Bayesian methods, and for study designs that specifically and purposefully incorporate them, has gone out for years among emergency physicians11 [Wears R.L. Reaching first Bayes. Ann Emerg Med. 2004;43:447-448; http://www.annemergmed.com/article/S0196-0644%2803%2901204-6/fulltext; accessed October 10, 2011] and others; the proliferation of computing power capable of these more complex calculations removes a long-standing barrier against their use. Yet even within the more context-sensitive Bayesian understanding of clinical reality, it is possible to support claims that unsettle what the majority of clinicians regard as common sense.
Dr. Ioannidis, in a 2005 Public Library of Science (PLoS) Medicine article12 [Ioannidis J.P.A. Why most published research findings are false. PLoS Med. 2005;2:e124; https://doi.org/10.1371/journal.pmed.0020124; accessed October 8, 2011] and extensive research published since then, threw down a memorable methodological challenge to the clinical research community. According to his assessment using Bayesian principles (though never naming Bayes directly in the PLoS article), “most published research findings are false.” The positive predictive value of the majority of studies that use the common P<.05 criterion for significance, Dr. Ioannidis argued, is unreliably low, even in the criterion standard model of randomized controlled trials. A positive predictive value above 50%, he calculated, is extremely hard to achieve in most study designs.

Bias of various forms (post hoc subgroup analyses, inclusion or exclusion of subjects, selective reporting, and other vices) exacerbates the problem. In a series of corollaries ranging from intuitive to provocative, Dr. Ioannidis also posited that numerous conditions render findings less likely to be true: smaller sample sizes in a field, smaller effect sizes, greater numbers and lesser selection of tested relationships, greater “flexibility in designs, definitions, outcomes, and analytical modes,” greater financial interests and prejudices, and, perhaps counterintuitively, the “hotness” of a field as measured by larger numbers of teams involved. Although suggesting that larger studies with more statistical power (and, if carefully designed, large meta-analyses) have better chances of testing hypotheses meaningfully, Dr. Ioannidis charged that researchers' claims often reflect “prevailing bias” in a field rather than any hypotheses actually corresponding with the truth.
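The core of that argument rests on a positive-predictive-value formula from the PLoS article relating pre-study odds, power, and the significance threshold. The sketch below applies it to a few illustrative scenarios; the odds and power figures are chosen for illustration rather than taken from the article, and the bias term Dr. Ioannidis adds (which pushes every value lower) is omitted.

```python
# Positive predictive value (PPV) of a "statistically significant" finding,
# following the formula in the 2005 PLoS Medicine article:
#     PPV = (1 - beta) * R / (R - beta * R + alpha)
# where R is the pre-study odds that the tested relationship is true,
# (1 - beta) is the study's power, and alpha is the significance threshold.
# Scenarios are illustrative; the article's bias term is omitted here.
def ppv(pre_study_odds: float, power: float, alpha: float = 0.05) -> float:
    beta = 1 - power
    return (power * pre_study_odds) / (pre_study_odds - beta * pre_study_odds + alpha)

scenarios = [
    ("adequately powered RCT, 1:1 pre-study odds", 1.0, 0.80),
    ("adequately powered study, 1:10 pre-study odds", 0.10, 0.80),
    ("small exploratory study, 1:10 pre-study odds", 0.10, 0.20),
    ("discovery-oriented screen, 1:1000 pre-study odds", 0.001, 0.20),
]
for label, odds, power in scenarios:
    print(f"{label}: PPV = {ppv(odds, power):.2f}")
```

In these illustrative numbers the first two scenarios stay above 50%, but the value falls quickly as power drops or the pre-study odds lengthen, and the bias term in the article pushes every figure lower still; that is the substance of the claim that Dr. Goodman and Dr. Greenland go on to contest.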
Dr. Greenland and Steven Goodman, MD, PhD, MHS, associate dean of clinical and translational research and professor of general internal medicine at Stanford and methodology editor of the Annals of Internal Medicine, responded to the argument by Dr. Ioannidis with what might be called vigorous ambivalence. They acknowledge its value in highlighting prevalent forms of bias and fostering more caution toward claims (particularly those based on P values), but they severely critique its assumptions and methods.13 [Goodman S., Greenland S. Why most published research findings are false: problems in the analysis. PLoS Med. 2007;4:e168; https://doi.org/10.1371/journal.pmed.0040168; accessed October 8, 2011],14 [Goodman S., Greenland S. Assessing the unreliability of the medical literature: a response to “Why Most Published Research Findings Are False.” Johns Hopkins University, Dept of Biostatistics Working Papers, Working Paper 135; February 2007; http://www.bepress.com/jhubiostat/paper135] A great deal of research is unreliable, Dr. Goodman and Dr. Greenland agreed, but the conclusion invalidating the majority of it is unwarranted.

“The study wasn't wrong,” Dr. Greenland commented; “it just has severe limitations. He's using a very sensational way of expressing something that all of us who are up on things in the field would agree with as a general rubric.” He credited Dr. Ioannidis with bringing sobriety to the interpretation of overheated new claims but believed that he missed half the story. “Even though what he said in general might be reasonably taken and part correct—half of the truth, [in that there are] a lot of false positives,” he commented, “he doesn't talk about false negatives. He basically cooked the math results there by setting up his model so that he had to get the results that he reported.” Steps guaranteeing an overly skeptical result, according to the Goodman-Greenland analysis, included assuming a prior probability below 50% for most medical hypotheses, dichotomizing statistical significance by using the P<.05 cutoff point rather than the broader range of small P values actually reported in many studies, and using Bayes's theorem to show that weak evidence cannot shift a hypothesis with a low prior probability to a posterior probability above 50%.

The problem with both Dr. Ioannidis's calculations and other well-publicized attacks on epidemiologic research, Dr. Greenland continued, is a map/territory confusion between mathematical models and reality in all its messy, poorly understood complexity. Dr. Greenland has had particularly lively debates with S. Stanley Young, PhD, assistant director of the National Institute of Statistical Sciences in Research Triangle Park, NC, whom he described as “a rabid antiepidemiology person [who] bases all his criticisms on a purely chance model where everything's done randomly.” Claims that attribute observed epidemiologic associations dismissively to randomness or artifact, Dr. Greenland said, ignore the body of work that has brought those associations repeatedly to attention—fires that may exist and merit scrutiny even if their smoke hasn't quite produced an unambiguously significant correlation coefficient.

“To say ‘oversimplified’—it'd be like saying that some kind of Mr. Machine robot with an electric motor accurately models the human anatomy and physiology … . To say it's a grotesque oversimplification doesn't begin to capture it. It's a toy.” Attributing the insight to his Berkeley mentor David A. Freedman, PhD, a sophisticated critic of statistical models in multiple fields, Dr. Greenland said, “You can learn a lot from a model; you can learn a lot about balance, dynamics, what must be going on in a human body from trying to build a robot that can walk. But just because you've built a robot that can walk doesn't mean you've figured out how the human balance system works.”

More detailed and useful assessments of biasing factors in research, Dr. Greenland and Dr. Schriger both indicated, would more closely resemble the “episcope” of Harvard epidemiologists Malcolm Maclure and Sebastian Schneeweiss.15 [Maclure M., Schneeweiss S. Causation of bias: the episcope. Epidemiology. 2001;12:114-122] This model posits 11 specific “filters” between a real object of investigation (an association, if it exists, between a causal agent and morbidity) and the ultimate use of information about that object (by decisionmakers aided by “knowledge brokers”: Cochrane Collaboration meta-analysts, guideline committees, or other experts). Schematizing the pathways through which information travels (including possible interactions and confounding effects among those paths), the episcope then points toward ways of making each form of bias measurable. Not all biases act like confounding variables, and this model does not offer easy answers about how to account for them, but it is decidedly more practical than either qualitative acknowledgements that bias is present (which the authors link to “pessimists who believe epidemiologic evidence is hopelessly biased,” assuming that every bias present is quantitatively important) or reductive mathematical procedures too sensitive to chance to accommodate nuance.
If Dr. Ioannidis and others close to his camp were skeptical about excessive certainty, Dr. Greenland was skeptical about excessive skepticism. “It depends on what one means by skepticism,” he observed. “They're skeptical, but only in 1 direction. You can't trust some safety result, but we can't trust some result which says something's unsafe; all the bias problems apply just as much in the other direction.” Dr. Greenland was especially cautious about research critics who “claim to be countering sensationalism” because industry influence ensures that “there's far more money behind underplaying risks than overplaying. In the end, skeptical of what? Skeptical of claims that something is safe and effective? I'm all for that—if you're going to be skeptical of claims that something is unsafe and ineffective. You need some healthy skepticism of everything in all directions, and in the end you just have to take risks and make your bet with the best evidence that you have.”

Tides of Opinion and the File-Drawer Effect

“I tend to agree with Greenland in this,” said Dr. Wears. “I think false positives are a problem, but I think false negatives are a more common problem … . [W]hen you get a negative, inquiry tends to stop.” Continuing investigation, assuming replication studies are performed and published, will tend to expose false-positive claims by failing to confirm them, he observed, but when false-negative conclusions put paid to an area of research, “you've got to wait for a generation to go away until people don't know that that's been disproved, and they start to look at it again.”

Research history is littered with areas in which inferences from faulty methodology (statistical or otherwise) have gone unexamined for years, affecting practice patterns. Dr. Wears cited various views on the utility of presurgical WBC counts in cases of suspected appendicitis: “There was a time when surgeons were very enamored of the white count … [and] then some papers came out that showed on the average, the white count in people with appendicitis was not really that much different from the white count of other people with abdominal pain. And so then the party line became ‘the white count is worthless; don't tell me about the white count.'” Later analysis, he observed, indicated that the count is relevant after all: “Even though the average is the same, it turns out people with extreme white counts have extreme problems, whether they're extremely high or extremely low, and people with moderate ones have moderate problems.
“We saw the same thing with central venous pressure: when I was an intern, the CVP [central venous pressure] was considered an important thing to follow; then it became kind of a silly thing to follow. Then right heart catheterization was the ultimate thing to follow, and that turned out that didn't work very well, and now the CVP is back again in the sepsis bundle. Some of that is noise in the data; some of it is this premature closure, the horserace effect that's been disproved, as opposed to looking further and saying, ‘How exactly does it not work? Under what circumstances does it work [or] not work?'”

Chalmers offers 2 further examples with relevance to emergency care: the use of human albumin solution in resuscitation of patients with severe burns, popular for roughly half a century after World War II but based on only a handful of initial observations, and the misuse of class I antiarrhythmics in cardiac patients in the 1970s and 1980s. The Saline versus Albumin Fluid Evaluation (SAFE) study16 [SAFE Study Investigators. A comparison of albumin and saline for fluid resuscitation in the intensive care unit. N Engl J Med. 2004;350:2247-2256; http://www.nejm.org/doi/full/10.1056/NEJMoa040232; accessed October 10, 2011] by intensivists in Australia and New Zealand, Chalmers commented, was “a large enough trial to be convincing that there was no evidence of advantage of the £20 more expensive human albumin solution, as compared to normal saline.” (Curiously, a more recent study, revealed as not simply faulty but also fraudulent,17 [Berger E. Journal editors retract scores of articles on colloid use: author investigated for fabricating data. Ann Emerg Med. 2011;58:A17-A19; http://www.annemergmed.com/article/S0196-0644%2811%2901416-8/fulltext; accessed October 13, 2011] purported to overturn the SAFE-supported consensus that colloids offer no appreciable mortality benefit over crystalloids.)

In the case of antiarrhythmics, Chalmers recalled, clinical observation overturned a deadly though biologically plausible theory but took years to do so. The review of trials in the early 1980s by Curt Furberg, MD, PhD, professor of public health sciences at Wake Forest School of Medicine, showed that suppression of arrhythmias conferred no survival benefit in cardiac arrest. A single 1993 report showed not only a lack of benefit but also a lethal effect, though the design of the study, actually performed in 1980, had not initially evaluated effects on survival.18 [Cowley A.J., Skene A., Stainer K., et al. The effect of lorcainide on arrhythmias and survival in patients with acute myocardial infarction: an example of publication bias. Int J Cardiol. 1993;40:161-166; http://www.sciencedirect.com/science/article/pii/016752739390279P; accessed October 14, 2011]
