We Don’t Need More Data, We Need the Right Data
2020; Lippincott Williams & Wilkins; Volume: 142; Issue: 3; Language: English
DOI: 10.1161/circulationaha.120.045968
ISSN: 1524-4539
Authors: Rashmee U. Shah, MD, MS
Topic(s): Meta-analysis and systematic reviews
Article Commentary
Division of Cardiovascular Medicine, University of Utah School of Medicine, Salt Lake City. Correspondence: Rashmee Shah, MD, MS, 30 N 1900 E, Room 4A100, Salt Lake City, UT 84132. ORCID: https://orcid.org/0000-0002-7823-8540
Originally published 20 Jul 2020. Circulation. 2020;142:197–198.

In 2020, the world will generate 15 zettabytes of data.1 That is 15 trillion gigabytes—the storage capacity of 2 billion iPhones—all in 1 year.

Technology companies like Google are partnering with health systems to create massive data sets; Cerner Corporation and Epic Systems are aggregating data for analyses. The hype surrounding these massive electronic medical record–based data sets is deafening. In reality, well-curated, smaller data sets would be more valuable for scientific discovery. Massive data sets are noisy; the data elements do not always represent clinical reality. If the goal is to find population-level trends, such as which patients will be costly for the health system, massive data sets will serve the purpose. But if the goal is to create data-driven algorithms that help individual patients—that is, deliver precision medicine—accurate data are critical.

To illustrate the limitations of massive data sets, consider the following scenario: A 70-year-old woman being treated for lung cancer develops atrial fibrillation. She and her doctor need to make anticoagulation treatment decisions and choose a rate or rhythm control strategy. Randomized trials have yet to answer these questions for patients with cancer. Health system data could shed some light on decision making, but the following issues need consideration.

First, we need the right denominator. Many of our data collection methods are intervention-based, thus the denominator includes patients who received the intervention, not all patients who could have received it. In our case, identifying patients who receive ablation or cardioversion is relatively easy because procedural billing codes tend to be accurate. But to compare different treatment strategies, we would have to find all patients with atrial fibrillation, and atrial fibrillation diagnosis billing codes can be inaccurate up to a third of the time.2 The same issue plagues other types of analyses. The Transcatheter Valve Therapeutics Registry, for example, captures all patients who undergo transcatheter aortic valve replacement in the United States. These data can identify transcatheter aortic valve replacement recipients at increased risk for procedural complications, but not whether patients on dialysis will benefit from transcatheter aortic valve replacement. To answer the latter question, the denominator must include all patients with aortic stenosis, which allows comparison of treatment versus no treatment. We need disease-based denominators, not intervention-based denominators.
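As a minimal sketch of the distinction, the snippet below builds both denominators from hypothetical diagnosis and procedure tables. The table layout, column names, and billing codes are illustrative assumptions, not a real schema, and the example deliberately ignores the code-accuracy problem noted above.

```python
import pandas as pd

# Hypothetical extracts from an EHR or claims warehouse (illustrative only).
diagnoses = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "icd10_code": ["I48.0", "I48.91", "I48.2", "I10", "I48.91"],  # I48.x = atrial fibrillation/flutter
})
procedures = pd.DataFrame({
    "patient_id": [1, 3],
    "cpt_code": ["93656", "92960"],  # illustrative ablation and cardioversion codes
})

# Intervention-based denominator: only patients who already received a treatment.
intervention_cohort = set(procedures["patient_id"])

# Disease-based denominator: every patient carrying the diagnosis, treated or not.
af_rows = diagnoses["icd10_code"].str.startswith("I48")
disease_cohort = set(diagnoses.loc[af_rows, "patient_id"])

# Only the disease-based cohort retains the untreated comparison group.
untreated = disease_cohort - intervention_cohort
print(f"Intervention-based denominator: {len(intervention_cohort)} patients")
print(f"Disease-based denominator: {len(disease_cohort)} patients")
print(f"Untreated patients invisible to procedure codes: {sorted(untreated)}")
```

Only the disease-based denominator supports a treatment-versus-no-treatment comparison, because it is the only one that keeps the patients who never reached the intervention.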
Second, we need the right exposure and outcome variables. To compare oral anticoagulation to no oral anticoagulation, we must know the treatment exposure. Electronic medical record orders are unreliable. Patients like the one in our scenario have often been seen at multiple institutions before referral. Linking patients across health systems would allow the receiving hospital—using both humans and algorithms—to review all relevant data and devise treatment strategies. Electronic medical record orders may also fail to capture whether a medication was stopped. A patient may have been prescribed an anticoagulant with a 90-day supply, but subsequently discontinued the medication, perhaps because of bleeding. The electronic medical record may falsely classify the patient as exposed through day 90. The same misclassification concerns apply to all patient characteristics, but exposure and outcome variables are particularly important.

Third, we often need data that are absent from electronic sources. In our scenario, frailty is often an important consideration but is absent from administrative data sets. Predictive models for cardiogenic shock provide another example. The time between symptom onset and presentation is critical in prognosis but is generally absent from data collection. Social determinants of health, such as loneliness and isolation, fall into this trap as well.

Each of these issues highlights another critical concern: In big data jargon, null does not equal negative. Without clear documentation, positive or negative (eg, a formally documented "frailty" indicator), an abstracted variable may appear as null, which risks being interpreted as negative or absent. Interpreting the absence of evidence as evidence of absence can have important downstream treatment implications.
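A minimal sketch of the null-versus-negative trap, using a hypothetical frailty field (the data and column names are invented for illustration): naively filling missing values treats "never documented" as "documented absent," whereas keeping a third category preserves the distinction.

```python
import pandas as pd

# Hypothetical abstracted chart data; pandas' nullable boolean type keeps
# "never documented" (pd.NA) distinct from "documented as not frail" (False).
patients = pd.DataFrame({
    "patient_id": [101, 102, 103],
    "frail": pd.array([True, False, pd.NA], dtype="boolean"),
})

# Naive handling: null is silently interpreted as "not frail".
naive = patients["frail"].fillna(False)

# Explicit handling: keep three categories so analysts and models can see
# how often the variable was simply never documented.
explicit = patients["frail"].map(
    lambda v: "unknown" if pd.isna(v) else ("frail" if v else "not frail")
)

print(pd.DataFrame({
    "patient_id": patients["patient_id"],
    "naive": naive,
    "explicit": explicit,
}))
```

In the naive column, patient 103 looks identical to patient 102 even though nothing was ever documented; the explicit version keeps that missingness visible instead of converting absence of evidence into evidence of absence.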
To create unbiased algorithms, we need unbiased data. Clinicians, just like everyone else, are prone to conscious and subconscious biases.3 Biases in documentation will result in biases in extracted data and in the results of a treatment algorithm informed by those data. When this error is scaled to massive data sets, the algorithm may learn and perpetuate our biases. For example, researchers have found that predictive policing worsened bias by sending police back to the same neighborhoods repeatedly.4 Each time the police made an arrest in a neighborhood, the algorithm "learned" that the neighborhood had more crime and sent police back, perpetuating the cycle.

Big data and algorithms are undoubtedly changing the world. As algorithms grow in complexity and become more opaque, detecting and weeding out human bias will be increasingly difficult; the right data to feed the algorithm will be critical. Cleaning and verifying data is painstaking, time-consuming work that clinicians must lead, despite the current urge to churn out algorithms quickly. The resulting data sets may be smaller, but they will have more signal and less noise. If we want the right algorithms, we first need the right data.

Sources of Funding
Dr Shah is supported by a grant from the National Heart, Lung, and Blood Institute under award number K08HL136850.

Disclosures
Dr Shah has received support from Women As One and honoraria from the American College of Cardiology.

Footnotes
The opinions expressed in this article are not necessarily those of the editors or of the American Heart Association.

References
1. Siegele L. A deluge of data is giving rise to a new economy [published online February 20, 2020]. The Economist. https://www.economist.com/special-report/2020/02/20/a-deluge-of-data-is-giving-rise-to-a-new-economy. Accessed March 5, 2020.
2. Shah RU, Mukherjee R, Zhang Y, Jones AE, Springer J, Hackett I, Steinberg BA, Lloyd-Jones DM, Chapman WW. Impact of different electronic cohort definitions to identify patients with atrial fibrillation from the electronic medical record. J Am Heart Assoc. 2020;9:e014527. doi: 10.1161/JAHA.119.014527
3. Chapman EN, Kaatz A, Carnes M. Physicians and implicit bias: how doctors may unwittingly perpetuate health care disparities. J Gen Intern Med. 2013;28:1504–1510. doi: 10.1007/s11606-013-2441-1
4. Ensign D, Friedler SA, Neville S, Scheidegger C, Venkatasubramanian S. Runaway feedback loops in predictive policing. In: Friedler SA, Wilson C, eds. Proceedings of the 1st Conference on Fairness, Accountability and Transparency. New York: PMLR; 2018:160–171.

© 2020 American Heart Association, Inc. PMID: 32687448
Keywords: machine learning; electronic health records; artificial intelligence