Can Big Data Fulfill Its Promise?
2016; Lippincott Williams & Wilkins; Volume: 9; Issue: 6 Linguagem: Inglês
10.1161/circoutcomes.116.003097
ISSN1941-7705
AutoresPeter W. Groeneveld, John S. Rumsfeld,
Tópico(s)Chronic Disease Management Strategies
ResumoHomeCirculation: Cardiovascular Quality and OutcomesVol. 9, No. 6Can Big Data Fulfill Its Promise? Free AccessResearch ArticlePDF/EPUBAboutView PDFView EPUBSections ToolsAdd to favoritesDownload citationsTrack citationsPermissions ShareShare onFacebookTwitterLinked InMendeleyReddit Jump toFree AccessResearch ArticlePDF/EPUBCan Big Data Fulfill Its Promise? Peter W. Groeneveld, MD, MS and John S. Rumsfeld, MD, PhD Peter W. GroeneveldPeter W. Groeneveld From the Department of Veterans Affairs' Center for Health Equity Research and Promotion, Michael J. Crescenz Veterans Affairs Medical Center, Philadelphia, PA (P.W.G.); Division of General Internal Medicine, Department of Medicine, Perelman School of Medicine, Leonard Davis Institute of Health Economics, and Center for Cardiovascular Outcomes, Quality, and Evaluative Research, University of Pennsylvania, Philadelphia (P.W.G.); University of Colorado School of Medicine, Aurora, CO (J.S.R.); Veterans Affairs Eastern Colorado Health System, Denver (J.S.R.); and American College of Cardiology, Washington, DC (J.S.R.). and John S. RumsfeldJohn S. Rumsfeld From the Department of Veterans Affairs' Center for Health Equity Research and Promotion, Michael J. Crescenz Veterans Affairs Medical Center, Philadelphia, PA (P.W.G.); Division of General Internal Medicine, Department of Medicine, Perelman School of Medicine, Leonard Davis Institute of Health Economics, and Center for Cardiovascular Outcomes, Quality, and Evaluative Research, University of Pennsylvania, Philadelphia (P.W.G.); University of Colorado School of Medicine, Aurora, CO (J.S.R.); Veterans Affairs Eastern Colorado Health System, Denver (J.S.R.); and American College of Cardiology, Washington, DC (J.S.R.). Originally published1 Nov 2016https://doi.org/10.1161/CIRCOUTCOMES.116.003097Circulation: Cardiovascular Quality and Outcomes. 2016;9:679–682Other version(s) of this articleYou are viewing the most recent version of this article. Previous versions: January 1, 2016: Previous Version 1 IntroductionBig data analytics are widely touted as a key innovation to improve health care. Although the term big data is variably defined, it generally implies the application of advanced statistical analyses, with names such as machine learning, artificial intelligence, or cognitive computing, to data sources that greatly exceed the size and complexity of databases traditionally used for healthcare analyses.1 Somewhat counterintuitively, data volume alone does not qualify an analysis as big data analytics. The term more appropriately refers to analytic and computational techniques, combined with information technology innovations, that were specifically developed to yield insights from the large quantities of data that are increasingly common in the digital economy. Big data analytics offer the promise of turning large amounts of data into superior predictive models that can be used to improve healthcare quality and outcomes.Since its inception, Circulation: Cardiovascular Quality and Outcomes has focused on using scientifically rigorous data analyses to improve health care,2 thus the journal is a natural home for scientific investigations using big data analytic techniques. The current Special Issue highlights several high-quality studies that reveal the depth and breadth of these new methods and points toward a future where big data analytics are incorporated into the daily practice of all clinicians, much as big data analytics routinely impact the everyday online experiences of millions of users of Google, Amazon, and other internet-based companies.Although the term big data is a relatively new invention,1 many of the techniques of big data analytics have been in existence for decades, and the field of biomedical informatics has been critical to their development.3 Several recent phenomena have converged to move big data analytics to the forefront of health care, including the widespread adoption of electronic medical records and subsequent digitization of large volumes of healthcare data, advancements in computing and information technology capabilities, and a cadre of data scientists who are migrating to the healthcare sector and who bring know-how and experience gained in other data-intensive industries. However, it is not yet clear how big data should change healthcare delivery. Big data methods do not solve all data and analytic problems, and they also present unique challenges to the process of evaluating the methodological rigor and validity of data analysis and results. Healthcare interventions based on big data analytics will need thorough evaluation to prove their effectiveness because new methods of analyzing data do not address the real-world challenges of implementing new healthcare practices and improving quality.Advantages of Big Data AnalyticsThe potential for big data analytics to transform healthcare is tantalizing, including the development of clinical prediction models that outperform the standard paradigm where a relatively small (eg, <100) number of variables are preselected as potential predictors and an even smaller number of predictive variables (eg, <10) are retained in a final prediction model. Cardiovascular medicine has been a leader in the clinical use of models developed using these techniques. Examples include the criteria proposed by Goldman et al4 and Detsky et al5 for preoperative cardiac risk assessment, the thrombolysis in myocardial infarction risk score for prognosis in acute coronary syndrome,6 and the CHA2DS2-VASc score for stroke prediction in atrial fibrillation.7 Conversely, big data analytics can potentially use thousands of variables, with tens of thousands of permutations, to produce dynamic predictive models that are continuously informed by newly collected information, as opposed to the static models generated by standard multivariable regression methods. Big data analytics are also adaptive to temporal and geographic variation in the data streams that are being analyzed, thus predictions are tailored to the time and place that they are generated.The advantages of big data analytics are principally based on the following 2 premises:Premise 1: It is not possible to prespecify all possible causal and associational pathways in a data environment with thousands (or hundreds of thousands) of variables. In fact, efforts at prespecification inherently reduce the predictive ability of the data by artificially constraining the choice set of predictors to a limited subset of possible inputs.8 By instead allowing computational algorithms to test a large number of potential associations, the data are empowered to speak for itself, and patterns of association may be identified that could not have been predicted in advance, even by the most knowledgeable investigator(s).Premise 2: To maximize operational usefulness, clinical prediction models must adapt to changes in the data over both space and time.9 Logistic regression models (eg, CHA2DS2-VASc) produce the same prediction of atrial fibrillation stroke risk, given the same values on the model's 7 individual input variables, whether a patient is a low-income resident of West Philadelphia receiving care in a public health clinic or a wealthy resident of Atherton, California, receiving concierge care from her private physician. The model's prediction that these 2 hypothetical patients have identical stroke risk is almost certainly wrong because stroke risk is undoubtedly determined by many more factors than the 7 CHA2DS2-VASc inputs. In contrast, big data analytics that can learn on the job and adjust predictions based on variables unique to the time and place at which the data are generated may have a significant advantage in predictive accuracy. Furthermore, as risk factors and their relationship to adverse outcomes inevitably change over time, big data analytic models can naturally adapt, whereas static prediction models derived from a single derivation data set cannot.Blackjack ConundrumAlthough big data analytics have undeniable potential advantages, advocates frequently overlook some of the basic observational data analysis problems that are not solved by better algorithms, more computational power, and petabytes of data.1 The game of blackjack provides a simple example of the 2 fundamental problems with any exercise in predictive analysis, regardless of data set size or computational power or methods—namely, the problems of unobservability and randomness. Beating the house with 95% (or even 55%) frequency in blackjack would require both accurate knowledge of the identity of face-down cards and knowledge of the cards that are to be dealt next from the deck. But because face-down cards are (obviously) unobservable, and the cards coming out of the deck (in the absence of card-counting techniques) are essentially random, no amount of data, analytic complexity, or machine learning can increase a good blackjack player's long-term win (ie, predictive) percentage above ≈40%.10The same 2 issues of unobservability and randomness likewise bedevil prediction in health care. Not infrequently, important predictors of health outcomes may be unobservable—defined as neither measured directly nor measured by some combination of proxy measures, analogous to the face-down card in blackjack. It is tempting to assume that in the comprehensive electronic medical record era, all important predictors of health outcomes are captured in the data, but this is simply not true. Second, the impact of random events in determining many health outcomes cannot necessarily be ignored and in some instances is considerable. The occurrence of most adverse medical events is likely influenced in part by a multitude of essentially random processes that defy prediction, even with a superabundance of data. Thus, we should not be surprised when big data analytics fail to provide predictive information that greatly surpasses standard statistical methods in predictive accuracy.3,11Even in the age of the electronic health record, data quality remains a challenge to precise prediction at the point of care. Missing and inaccurate data, particularly on patient-centered outcomes, imperil the ability of any predictive method to accurately classify patients. Although machines can be programmed to learn from their previous mistakes, it is important to recognize that big data cannot fix bad data.Is "Data Science" Science?In addition to fundamental data problems that remain unsolved by big data analytics, the complexity and nonstatic nature of these methods are a challenge for traditional scientific reporting in journals. The fundamental idea behind scientific publication is the axiom of reproducibility, that is, a scientific team reports their methods and results in a manner that would permit another scientific team to reproduce the results of the study. By its nature, big data analytics defies this axiom by (1) the analytic process's continually adapting to changes in the data, thus there is no model to present in a scientific report, (2) the use of complex computational algorithms that are difficult to describe to the uninitiated in the constrained space of a typical scientific methods section, and (3) an ever-expanding lexicon of big data analytic techniques (ridge regression, the lasso, generalized additive models, random forests, boosting, support vector machines, k-means clustering, etc) that are unfamiliar to a large fraction of most medical journals' readership, even to those that are familiar with the basic tools of statistical analysis. In these ways, data science is more akin to engineering—in which the focus is typically on solving a set of unique design problems—rather than to biomedical sciences where the focus is on understanding the workings of the natural world.This difference poses a unique challenge to journal editors, reviewers (including statistical experts), and readers. How can the methodological quality of big data analytics be externally evaluated? Some of the most rigorous published scientific reports using big data analytics have provided lengthy appendices with careful documentation and voluminous code (R, Python, MATLAB, etc), but given the exponentially expanding syntax and variety of programming languages and techniques, even the most savvy reviewers will not be fluent in all languages. Nor are editors, reviewers, and readers likely to have the time necessary to deconstruct thousands of lines of code and assess its appropriateness for the task. The alternative to this level of review is even less appealing—namely, analytic methods written as a terse stream of statistical jargon that is incomprehensible to most readers and that reveals little of the inner workings of the modeling process, that is, the metaphoric black box. To ensure the integrity of the scientific peer review process of big data analytics, novel methods of evaluation will need to be developed.Concerns about the transparency of big data analytics are not confined to journal editors. In an era where healthcare quality report cards are typically reported as ratios of observed to predicted adverse events, it is important for healthcare providers, policymakers, and payers to clearly understand the methods that produce the predicted events denominators. Big data methods may increase the accuracy of the predicted number of adverse events per surgeon or hospital, but if the methods are buried in machine-learning algorithms, this will not enhance stakeholder trust in the accuracy and fairness of quality ratings.Beyond transparency, while big data analytics are incorporated into clinical practice, it will be vital to compare the clinical outcomes resulting from their use versus outcomes from previously developed clinical decision support tools. Although published comparisons of various predictive techniques should appropriately use the statistical methods (C statistics, etc) developed to quantify the performance of competing classification schemes, it is also important to recognize that numeric improvement in prediction performance may not necessarily translate into better health outcomes for patients. Thus, beyond demonstrating better C statistics, evidence that big data predictive models actually improve health outcomes when applied in real-world clinical settings, while not producing unintended consequences that harm patients, is essential. The stakes in healthcare decision making are arguably higher than the use of big data analytics applied in many other fields. The promise of big data analytics is not a valid excuse to bypass the requirement that innovations must be scientifically proven to be effective in practice.In this regard, 2 additional issues are worthwhile to highlight as big data analytics emerge in health care: causal inference and quality improvement. Both are critical to the operational, effective use of data to improve health outcomes, yet neither are necessarily enhanced by new analytic methods.In Healthcare, Causal Inference MattersAs with any observational data analysis, associations between predictors and outcomes in big data analytic models cannot be determined to be causal or noncausal. The risk of identifying spurious associations (classic examples include sunspot activity correlated with the stock market12 and coffee consumption associated with pancreatic cancer13) is higher when the number of associations being tested is greater, and when there is no previous knowledge of causal pathway plausibility guiding the selection of model predictors. This is important because in many (most) instances, clinicians, administrators, and policymakers want to know not only who the patients in their clinics/hospitals/health systems are that will have an adverse clinical outcome but also why the adverse outcome occurred—that is, the identity of causal variables and whether those variables are modifiable. So although it is important to predict, for example, which intensive care unit patients are highly likely to develop bloodstream infections because of central intravenous catheters, it is also essential to identify key modifiable causal factors that could reduce the future incidence of bloodstream infections.Quality Improvement Is More Than AnalyticsThe complexity of healthcare delivery creates barriers to the appropriate use of information. There is no assurance that the availability of new, accurate predictive information about patients will transform the process of health care so that adverse events are avoided and the probability of good outcomes is maximized. Big data analytics has the potential to generate useful predictive information that could be delivered to clinicians in a timely fashion (via electronic medical record pop-ups, text messages, e-mail, etc), but that information may be overlooked by clinicians bombarded with too much information or ignored because the appropriate action in response to the information is neither obvious nor simple.14 Thus, better predictive information is necessary but not sufficient for the optimization of healthcare outcomes, and lessons from the relatively new field of implementation science will be essential components of leveraging big data analytics.In summary, the rise of big data analytics in healthcare settings presents an exciting opportunity to leverage the power of increasingly voluminous healthcare data in ways that were simply impossible as recently as 10 years ago. However, it is critical to recognize that the fundamental pitfalls of observational data analysis cannot be ignored, and in fact, the risks of such pitfalls demand rigorous scientific testing and novel methods for peer review of big data analytic models. Neither enthusiasm for the potential of big data analytics to transform care, nor the complexity of big data methods, should obviate the need for rigorous scientific evaluation. Big data analytics are fundamentally a tool that will require, like any other medical intervention, clinical integration with quality improvement activities to have a meaningful, positive impact on health and health care. The field of cardiovascular outcomes research should embrace these new techniques, rigorously assess the scientific studies using these methods, and guide policymakers, clinicians, and patients on their meaningful implementation.DisclosuresDr Rumsfeld is Chief Innovation Officer for the American College of Cardiology. The other author reports no conflicts.FootnotesThe opinions expressed in this article are the authors' own and do not represent the official views of the US Department of Veterans Affairs or the American College of Cardiology.Correspondence to Peter W. Groeneveld, MD, MS, Perelman School of Medicine, University of Pennsylvania, 1201 Blockley Hall, 423 Service Dr, Philadelphia, PA 19104. E-mail [email protected]References1. Mayer-Schoönberger V, Cukier K. Big Data: A Revolution That Will Transform How We Live, Work, and Think. Boston: Houghton Mifflin Harcourt; 2013.Google Scholar2. Krumholz HM. Outcomes research: myths and realities.Circ Cardiovasc Qual Outcomes. 2009; 2:1–3. doi: 10.1161/CIRCOUTCOMES.108.844035.LinkGoogle Scholar3. Lippmann RP, Shahian DM. Coronary artery bypass risk prediction using neural networks.Ann Thorac Surg. 1997; 63:1635–1643.CrossrefMedlineGoogle Scholar4. Goldman L, Caldera DL, Nussbaum SR, Southwick FS, Krogstad D, Murray B, Burke DS, O'Malley TA, Goroll AH, Caplan CH, Nolan J, Carabello B, Slater EE. Multifactorial index of cardiac risk in noncardiac surgical procedures.N Engl J Med. 1977; 297:845–850. doi: 10.1056/NEJM197710202971601.CrossrefMedlineGoogle Scholar5. Detsky AS, Abrams HB, Forbath N, Scott JG, Hilliard JR. Cardiac assessment for patients undergoing noncardiac surgery. A multifactorial clinical risk index.Arch Intern Med. 1986; 146:2131–2134.CrossrefMedlineGoogle Scholar6. Antman EM, Cohen M, Bernink PJ, McCabe CH, Horacek T, Papuchis G, Mautner B, Corbalan R, Radley D, Braunwald E. The TIMI risk score for unstable angina/non-ST elevation MI: a method for prognostication and therapeutic decision making.JAMA. 2000; 284:835–842.CrossrefMedlineGoogle Scholar7. Lip GY, Nieuwlaat R, Pisters R, Lane DA, Crijns HJ. Refining clinical risk stratification for predicting stroke and thromboembolism in atrial fibrillation using a novel risk factor-based approach: the euro heart survey on atrial fibrillation.Chest. 2010; 137:263–272. doi: 10.1378/chest.09-1584.CrossrefMedlineGoogle Scholar8. Anderson C. The end of theory: the data deluge makes the scientific method obsolete.http://www.wired.com/2008/06/pb-theory/. Accessed June 1, 2016.Google Scholar9. Kanjilal PP; Institution of Electrical Engineers. Adaptive Prediction and Predictive Control. Hertfordshire, UK: P. Peregrinus on behalf of Institution of Electrical Engineers; 1995.CrossrefGoogle Scholar10. Baldwin R.R. CWE, Maisel H., McDermott J.P. The optimum strategy in blackjack.J Am Stat Assoc. 1956; 51:429–439.CrossrefGoogle Scholar11. Rumsfeld JS, Joynt KE, Maddox TM. Big data analytics to improve cardiovascular care: promise and challenges.Nat Rev Cardiol. 2016; 13:350–359. doi: 10.1038/nrcardio.2016.42.CrossrefMedlineGoogle Scholar12. Collins CJ. An inquiry into the effect of sunspot activity on the stock market.Finance Analyst J. 1965; 21:45–56.CrossrefGoogle Scholar13. MacMahon B, Yen S, Trichopoulos D, Warren K, Nardi G. Coffee and cancer of the pancreas.N Engl J Med. 1981; 304:630–633. doi: 10.1056/NEJM198103123041102.CrossrefMedlineGoogle Scholar14. Wachter RM. The Digital Doctor: Hope, Hype, and Harm at the Dawn of Medicine's Computer Age. New York: McGraw-Hill Education; 2015.Google Scholar Previous Back to top Next FiguresReferencesRelatedDetailsCited BySabbah M, Fayyaz A, de Denus S, Felker G, Borlaug B, Dasari S, Carter R and Redfield M (2020) Obese-Inflammatory Phenotypes in Heart Failure With Preserved Ejection Fraction, Circulation: Heart Failure, 13:8, Online publication date: 1-Aug-2020. Lopez-Jimenez F, Attia Z, Arruda-Olson A, Carter R, Chareonthaitawee P, Jouni H, Kapa S, Lerman A, Luong C, Medina-Inojosa J, Noseworthy P, Pellikka P, Redfield M, Roger V, Sandhu G, Senecal C and Friedman P (2020) Artificial Intelligence in Cardiology: Present and Future, Mayo Clinic Proceedings, 10.1016/j.mayocp.2020.01.038, 95:5, (1015-1039), Online publication date: 1-May-2020. Iannattone P, Zhao X, VanHouten J, Garg A and Huynh T (2020) Artificial Intelligence for Diagnosis of Acute Coronary Syndromes: A Meta-analysis of Machine Learning Approaches, Canadian Journal of Cardiology, 10.1016/j.cjca.2019.09.013, 36:4, (577-583), Online publication date: 1-Apr-2020. Ross E and Mell M (2020) Evaluation of regional variations in length of stay after elective, uncomplicated carotid endarterectomy in North America, Journal of Vascular Surgery, 10.1016/j.jvs.2019.02.071, 71:2, (536-544.e7), Online publication date: 1-Feb-2020. Winhusen T, Theobald J, Kaelber D and Lewis D (2019) Medical complications associated with substance use disorders in patients with type 2 diabetes and hypertension: electronic health record findings, Addiction, 10.1111/add.14607, 114:8, (1462-1470), Online publication date: 1-Aug-2019. Georga E, Tachos N, Sakellarios A, Pelosi G, Rocchiccioli S, Parodi O, Michalis L and Fotiadis D (2019) A Multimodal Machine Learning Approach to Omics-Based Risk Stratification in Coronary Artery Disease World Congress on Medical Physics and Biomedical Engineering 2018, 10.1007/978-981-10-9023-3_158, (879-882), . Georga E, Tachos N, Sakellarios A, Kigka V, Exarchos T, Pelosi G, Parodi O, Michalis L and Fotiadis D (2019) Artificial Intelligence and Data Mining Methods for Cardiovascular Risk Prediction Cardiovascular Computing—Methodologies and Clinical Applications, 10.1007/978-981-10-5092-3_14, (279-301), . An S, Malhotra K, Dilley C, Han-Burgess E, Valdez J, Robertson J, Clark C, Westover M and Sun J (2018) Predicting drug-resistant epilepsy — A machine learning approach based on administrative claims data, Epilepsy & Behavior, 10.1016/j.yebeh.2018.10.013, 89, (118-125), Online publication date: 1-Dec-2018. Oksel C, Haider S, Fontanella S, Frainay C and Custovic A (2018) Classification of Pediatric Asthma: From Phenotype Discovery to Clinical Practice, Frontiers in Pediatrics, 10.3389/fped.2018.00258, 6 Nallamothu B (2017) Better—Not Just Bigger—Data Analytics, Circulation: Cardiovascular Quality and Outcomes, 10:7, Online publication date: 1-Jul-2017. Lourenço A and Leite-Moreira A (2017) Cardiovascular precision medicine: Bad news from the front?, Porto Biomedical Journal, 10.1016/j.pbj.2017.03.006, 2:4, (99-101), Online publication date: 1-Jul-2017. Krumholz H (2016) The Promise of Big Data, Circulation: Cardiovascular Quality and Outcomes, 9:6, (616-617), Online publication date: 1-Nov-2016. November 2016Vol 9, Issue 6 Advertisement Article InformationMetrics © 2016 American Heart Association, Inc.https://doi.org/10.1161/CIRCOUTCOMES.116.003097PMID: 28263942 Originally publishedNovember 1, 2016 Keywordsdatabaseanalysisalgorithmspatient-specific modelingmachine learningPDF download Advertisement SubjectsQuality and Outcomes
Referência(s)