Review. Open access. Peer reviewed.

Role of Big Data in Cardiovascular Research

2019; Wiley; Volume: 8; Issue: 14; Language: English

10.1161/jaha.119.012791

ISSN

2047-9980

Authors

William S. Weintraub

Topic(s)

Cardiovascular Health and Risk Factors

Abstract

William S. Weintraub, MD, MedStar Heart & Vascular Institute, Washington, DC
Originally published 11 Jul 2019. Journal of the American Heart Association. 2019;8:e012791

Big Data resemble people: interrogate them just so, and they will tell you whatever you want to hear.

Perhaps you, gentle reader, have noticed that there seems to be an awful lot more data in recent years, but perhaps not a lot more knowledge. Welcome to the world of Big Data. Just what is Big Data, and how is it changing the world of cardiovascular medicine? Big Data may be defined as large sets of data that are given to analytic approaches that may reveal underlying patterns, associations, or trends. Big Data has also been characterized by the 4 Vs of volume (a lot of data), variety (data from different sources and in different forms), velocity (data are accumulated rapidly), and veracity (uncertainty as to whether the data are correct). However, these characteristics do not adequately define Big Data; one might say that if you have seen one set of Big Data, you have seen one set of Big Data. It is perhaps more useful to think about Big Data as it relates to sources, repositories, and use (Figure 1, Table 1). Where do such data come from? How are they stored? How can they be analyzed and visualized? What can we learn from Big Data?

Figure 1. Flow of Big Data from sources to storage, analytics, and visualization. EHR indicates electronic health record.

Table 1. Examples of Big Data Studies From Various Sources

Type | Specific Source | Example
EHR | Academic medical center | EHR data used to develop and validate a prediction model for adverse outcomes3
Administrative data | Medicare Part A | Comparative effectiveness of percutaneous coronary intervention and coronary bypass surgery21
National registry | Get With The Guidelines-Resuscitation Registry | Derivation and validation of a mortality prediction tool after pediatric cardiac arrest25
Imaging | MRIs | Cardiac MRI acquisition plane recognition30
Integration of multiple data types | MRI, genetic, biomarker, and clinical registry | Hypertrophic Cardiomyopathy Registry31
Social media/internet | Smartwatch and internet application | Identification of cardiac arrhythmias using a smartwatch: the Apple Heart Study33
Embedded clinical trial | National Cardiovascular Data Registry | Comparison of radial and femoral approaches for percutaneous coronary intervention in women: SAFE-PCI for Women trial45
Machine learning | Hospital heart failure registry | Classification of patients with heart failure57

EHR indicates electronic health record; MRI, magnetic resonance imaging; SAFE-PCI, Study of Access Site for Enhancement of Percutaneous Coronary Intervention.

The Electronic Health Record

Perhaps the most ubiquitous source of Big Data in the realm of human health is the electronic health record (EHR).
After several decades of discussion, financial incentives resulted in the relatively rapid conversion from handwritten charts to EHRs.1 The EHR can improve communication and facilitate much in the way of care, such as electronic pharmaceutical prescribing and communication between providers, both locally and regionally, via health information exchanges. However, perhaps the area with greatest potential is for the EHR to be a source of data that may be aggregated as Big Data.2, 3 Much of these data, such as laboratory test results, are readily available in digital, structured form. However, much of the information, such as admission summaries and procedural notes, generally remains in the form of text. For such data to be analyzable, they generally, although perhaps not always, need to be converted from text to a more structured form, with or without natural language processing, an area of information science concerned with computer recognition and analysis of human language.4 The potential of such data is certainly grand, allowing patient-level data collected at the point of care to be aggregated into Big Data. Such data can be from a single healthcare provider, such as an outpatient clinic or hospital, but may also be aggregated across healthcare systems. This can be scaled to impressive levels. For instance, the EHR data from the entire Department of Veterans Affairs Health System have been, for several years, on a single EHR platform, permitting querying across the entire system.

Such data can be used for evaluation of quality, business intelligence, and medical research. However, EHR data are not collected with the aim of creating data sets for scientific discovery. Rather, the purpose remains to document and coordinate care and perform administrative functions. EHR data lack the particular rigor of prospectively collected data sets, in which the design is focused on creating data sets for analysis. Thus, there can be serious limitations to EHR data, including missing data and misclassification, as well as use of nonstandard definitions. Because the data were not collected for analytical purposes with predefined hypotheses and end points, analysis of EHR data is given to asking open-ended questions, more of a "fishing expedition," that may lead to unreliable, nonreproducible results.5, 6

Much can be done to improve Big Data sets coming from EHRs and thus provide information that can result in better analytics and more meaningful studies.7 The first step is to move toward more structured data and less free text. This would limit the amount of natural language processing that needs to be done, but it can also result in less missing data, depending on how the processes for collecting structured data are implemented. The use of structured data collection can also foster the use of data standards, such as those developed by the American Heart Association/American College of Cardiology Task Force on Data Standards.8 Laboratory data are already largely standardized by LOINC, and pharmaceutical data are standardized by RxNorm.9, 10 Using standard definitions for terms allows improved interpretation and more meaningful aggregation of data. Behind the clinical definitions lie the metadata (detailed descriptions of the data) specifying how the data standards are translated into code that can be programmed into EHRs.
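As a toy illustration of converting free text into structured, standardized data, the following minimal sketch extracts an ejection fraction from a note and stores it alongside a terminology code. The pattern, the element name, and the LOINC code shown are illustrative assumptions rather than a production mapping.

```python
import re

note = "Echo today. LVEF estimated at 35%. Continue medical therapy."

# Toy pattern for an ejection fraction mention; real clinical NLP is far more involved.
match = re.search(r"(?:LVEF|ejection fraction)\D{0,20}(\d{1,2})\s*%", note, re.IGNORECASE)

if match:
    structured = {
        "element": "left_ventricular_ejection_fraction",  # illustrative element name
        "loinc_code": "10230-1",  # assumed LOINC code for LVEF; verify against the LOINC database
        "value": int(match.group(1)),
        "unit": "%",
    }
    print(structured)
```

A production pipeline would rely on a validated natural language processing system and a curated terminology service rather than a single regular expression.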
Careful implementation of such standards can greatly improve interoperability (ie, data with shared metadata, content, and meaning across systems), facilitating communication; an example would be diabetes mellitus having the same name, definition, and computer representation across systems.11 Data can also be retrieved from EHRs using formal data structures, such as Fast Healthcare Interoperability Resources (FHIR), which was developed by the Health Level 7 (HL7) International healthcare standards organization.12, 13 There are also structured approaches for aggregating data from EHRs, such as the Observational Medical Outcomes Partnership (OMOP) Common Data Model, which can provide a framework for integrating data from disparate sources that then facilitates analysis.14 A Common Data Model is a metadata system providing consistency of data and their meaning across applications and processes that store data in conformance with the model. Although tools such as FHIR and the OMOP Common Data Model can greatly facilitate analysis of electronic health data, they do not substitute entirely for the use of structured data based on data standards within EHRs. However, collecting structured data within the workflow is difficult to implement, requiring a substantial investment in informatics resources.15 There is also the danger in EHRs of structured data being carelessly copied from note to note without proper evaluation.16

Administrative Data

The counterpart to the EHR is the administrative data set. Such data are collected by healthcare providers largely for billing. The foremost example in the United States is the UB-04, a standardized billing form for institutional providers containing diagnostic, procedural, and charge data. The counterpart of the UB-04 for professional billing is the Medicare Professional Claim Form (CMS-1500). The primary purpose of the UB-04 and the CMS-1500 is billing, by Medicare, Medicaid, and other payers. These data can also be aggregated, meeting anybody's definition of Big Data, for analytic purposes. The most important aggregation of these data sets is the Medicare set of databases. Medicare is also linked to the National Death Index, greatly facilitating analysis. There has probably been more analytic work from Medicare databases, both published in the peer-reviewed literature and used for assessment of quality and outcome, than from any other source. In particular, Medicare data have been used to develop risk models, alone or linked to registry data, for clinical and economic outcomes.17, 18, 19 Medicare data have also been used in observational comparative effectiveness evaluations of therapy.20, 21 However, administrative data have also been criticized for misclassification and missing data (eg, hypertension not coded when it should have been), for not capturing severity of illness, and for generally being less accurate than clinical registries.22
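As an illustration of working with administrative claims data of this kind, the following minimal sketch flags beneficiaries whose claims carry an acute myocardial infarction diagnosis code (ICD-10-CM I21.x). The column names and the in-memory toy table are hypothetical.

```python
import pandas as pd

# Hypothetical claims extract with ICD-10-CM diagnosis codes (I21.x = acute MI).
claims = pd.DataFrame({
    "beneficiary_id": [1, 1, 2, 3],
    "dx_code": ["I21.4", "E11.9", "I10", "I21.09"],
})

# Flag any acute myocardial infarction code, then summarize per beneficiary.
claims["acute_mi"] = claims["dx_code"].str.startswith("I21")
mi_by_beneficiary = claims.groupby("beneficiary_id")["acute_mi"].any()
print(mi_by_beneficiary)
```

A real claims analysis would, of course, work from the full Medicare files and validated code lists rather than a toy table.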
Clinical Registries

Beginning in the 1980s, professional societies began to organize clinical registries. The best known are from the Society of Thoracic Surgeons, the American College of Cardiology (National Cardiovascular Data Registry), and the American Heart Association (Get With The Guidelines).23, 24, 25 These registries collect structured data in various clinical settings (eg, percutaneous coronary intervention or cardiac surgery) using standardized definitions. However, the informatics metadata for interoperability of these registries are generally not available. These registries have grown tremendously, such that the National Cardiovascular Data Registry now has >60 million records, qualifying as Big Data. These databases are used for benchmarking quality at participating institutions and have resulted in many hundreds of publications. As noted above, these databases have also been linked to the Medicare databases, permitting assessment of long-term outcomes and economic evaluations.18, 20, 26 A major limitation of the society registries is that data collection is generally not linked to workflow. The data are often collected at a later time, meaning that data must be abstracted either partially or totally by hand from the EHR. In principle, it would be possible to build collection of structured data into the workflow, so that data would be collected once and then parsed to wherever they need to go (Figure 2).15 However, in most healthcare systems, this remains aspirational, and implementation would require considerable investment in informatics to develop and deploy. Thus, data end up being collected several times and generally at least partially hand transposed from one system to another. It is not clear that such an approach will remain viable.

Figure 2. Structured data collection and reporting. EHR indicates electronic health record; FDA, US Food and Drug Administration; HIE, health information exchange.

Imaging Data

We are accustomed to thinking about data as existing in a structure such as flat-file spreadsheets or relational databases, where the data are transformed from the original format into structured fields for analysis by classic statistical approaches, whether frequentist or bayesian. However, today almost all medical images, whether static or moving, are stored in digital rather than analog formats. Images are stored as pixels or voxels describing a small area or volume, with millions of pixels or voxels forming one image. Such data can be analyzed and interpreted using careful image annotation and artificial intelligence approaches, such as neural networks.27, 28, 29 This fits the paradigm of variability of data type in the Big Data framework. Cardiac magnetic resonance imaging yields large data sets, both for image analysis and for incorporation with other clinical data into registries.30, 31

Genomic and Other 'Omic Data

Perhaps the prototypical form of Big Data in medical research comes from large-scale databases of genomic, proteomic, and metabolomic data. This has produced the entire field of bioinformatics, which seeks to develop methods to store, retrieve, and analyze such data sets. It has been made possible by ever more rapid approaches to genetic sequencing. The human genome has >3 billion base pairs, each subject to variation. Furthermore, variation need not involve a single base pair within a single gene. Thus, the potential variation is large. Databases are now available with full genomes and millions of variations. Some 88 million variants have been identified, with ≈12 million being common. There are currently efforts to sequence 1 million genomes.32 We can expect increasing integration of 'omic databases with EHR, registry, imaging, and other clinical data, again providing the variety of Big Data (Figure 1).31
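As a minimal sketch of this kind of integration, the following links a genomic risk score table to EHR-derived clinical records on a shared patient identifier; the identifiers and variables are hypothetical.

```python
import pandas as pd

# Hypothetical extracts: clinical records from an EHR and genomic risk scores.
clinical = pd.DataFrame({
    "patient_id": [101, 102, 103],
    "age": [64, 71, 58],
    "ldl_mg_dl": [131, 98, 164],
})
genomic = pd.DataFrame({
    "patient_id": [101, 102, 103],
    "polygenic_risk_score": [0.82, 0.35, 0.67],
})

# Integrate the two sources on the shared patient identifier.
merged = clinical.merge(genomic, on="patient_id", how="inner")
print(merged)
```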
Big Healthcare Data From Outside the Healthcare System

Healthcare providers may be forgiven for thinking that they would remain the host of healthcare data. However, it is now entirely possible to assemble large healthcare studies developed partially or totally independent of the healthcare system, with subjects recruited from the internet, often by means of an application and/or social media. Such studies will include data from wearable devices as well as environmental data, such as air pollution. Perhaps the best-known example is the Apple Heart Study, in which 400 000 participants self-enrolled to study the ability of the Apple Watch to detect atrial fibrillation.33 Although this study may not have offered much insight into atrial fibrillation, it was certainly proof of concept that this could be done. We can expect all the major technology companies to be involved in Big Data healthcare projects involving data gathering, storage, and analytics.

It is of historical interest that the first scientific epidemiological study was of environmental exposure, in this case to cholera in London, UK, in the 1850s. Studies of environmental exposure to air and water pollution, toxic exposures (eg, lead or asbestos), and infectious disease (eg, influenza) can result in large data sets, especially as the studies become more detailed.34, 35

What Type of Health and Healthcare Issues Are Addressed by Big Data?

We see successful application of Big Data in our daily lives when internet map direction finders constantly update their information to alert us to travel time, accidents, and even speed detectors along the way. We see it when doing an internet search and often being rewarded with a highly informative response within a fraction of a second. The issues in medicine, however, can be a bit more complicated.

The types of issues that can be addressed by Big Data are not necessarily different conceptually from those addressed with other types of data, in that they can be descriptive, predictive, or prescriptive. The major advantages of Big Data relate to the size of the data sets, their variety, and their speed of accumulation. Thus, we could see descriptive data about healthcare use within a hospital or healthcare system, with updating on an essentially instantaneous level. We can imagine an integrated rapid response system, in which a patient experiencing an ST-segment–elevation myocardial infarction outside the hospital would have the diagnosis made in the field, with the ECG compared with previous ones for the patient. The ambulance would then be directed not just to the closest hospital, but to one with an available catheterization laboratory, ensuring the shortest door-to-balloon time and thus maximizing overall efficiency of the system.

Predictive models have been created from Big Data, where the size and reach of the data sets permit assessment of the influence of covariates, not otherwise available, in predicting outcome.19, 26 The size of the data sets also permits many more risk factors to be considered. This offers the potential of identifying patients in whom intervention may be warranted. In principle, comparative effectiveness studies with hundreds of thousands or even millions of patients could be performed at relatively low cost, permitting detailed assessment of subgroups.20 That is, Big Data offers the potential for precision medicine. There are, however, significant barriers and limitations. In particular, using Big Data for comparative effectiveness studies comes with its own particular difficulties.
When comparing nonrandomized diagnostic or therapeutic strategies, there is the potential for treatment selection bias, or bias by indication, resulting in the comparison of groups that are not similar and that may vary in critical ways. Various statistical techniques are used to try to overcome this bias, but none can account for unmeasured confounders.36 Furthermore, treatment selection bias will not be overcome or even reduced by the size of the data set. In addition, Big Data may have more misclassification and missing baseline covariate data, which cannot be assumed to be missing at random, making the analytic problems greater. Simply put, it is necessary to be skeptical of observational, nonrandomized comparisons of diagnostic or therapeutic strategies, whether the source is a small number of patients or Big Data.

Clinicians and medical scientists are used to thinking about data as they pertain to understanding disease and patient care. Data used for health services research are also being mined and analyzed by other entities concerned with health and the business of health care. Big Data in large healthcare systems is now used, and increasingly will be used, for business intelligence analysis. Decisions will be made on salaries and staffing, capital purchases, and strategic planning. Such decisions can have a profound impact on healthcare delivery and public health. Physician leadership, especially in resource-intensive areas such as cardiovascular medicine, will need to become skilled in interpreting such studies and understanding their impact on patient care. Big Data is currently used by multiple entities involved in health care, especially insurance companies and the Centers for Medicare and Medicaid Services. Health insurance companies use Big Data to guide rates and analyze markets for strategic investments (eg, the merger of Aetna and CVS). The Centers for Medicare and Medicaid Services uses Big Data to assess quality, watch for fraud and abuse, and guide incentives.

Embedded Randomized Clinical Trials

Big Data would seem to be the ideal platform on which to conduct randomized clinical trials (RCTs).37 Embedded clinical trials, compared with stand-alone trials, offer the economy of capitalizing on infrastructure and data collection already in place and ongoing, reducing development time and data collection effort. The main, perhaps only, advantage of RCTs is the ability to overcome treatment selection bias. Observational Big Data can have the most bias when considering alternative strategies.38 As noted above, size cannot overcome bias. What Big Data does offer is data that come from wide sources, perhaps offering greater generalizability. The size of Big Data sets will reduce stochastic error and may offer power to examine subsets, a frequent limitation of even large mega trials. However, the rules still apply.39 Subgroups should be defined in advance, there should be a small number of them, and there should be a physiologic reason for selecting each subgroup.40 Similarly, the choice of end points and analytic approaches should also be defined in advance.
Multiplicity, where investigators consider multiple end points and subgroups, must be accounted for.41, 42 When good design principles are not carefully followed, assessment of alternative therapies or strategies can devolve into an unreliable "fishing expedition."

An area that has already demonstrated that RCTs can effectively be conducted within larger data-gathering environments is the RCT embedded in a clinical registry.39, 43 Examples include the TASTE (Thrombus Aspiration in Myocardial Infarction) trial,44 embedded in the SWEDEHEART Registry, and the SAFE-PCI (Study of Access Site for Enhancement of Percutaneous Coronary Intervention) for Women trial,45 embedded in the National Cardiovascular Data Registry. In the TASTE trial, 7244 patients with ST-segment–elevation myocardial infarction were randomized to thrombus aspiration with PCI versus PCI alone. Thrombus aspiration was not shown to reduce the incidence of death or the composite of death, recurrent myocardial infarction, and stent thrombosis. In the SAFE-PCI for Women trial, 1718 women undergoing diagnostic cardiac catheterization or PCI were randomized to radial versus femoral arterial access. No advantage to radial access was noted for bleeding or vascular complications. Although these trials themselves are not particularly remarkable, they are excellent examples of proof of concept for RCTs embedded in registries.

Another platform for RCTs is the EHR.46 This offers the possibility of using the EHR both to screen for patients and as a platform for data gathering. Although appealing in principle, this has to date only rarely been put into practice. EHRs are not designed to capture data in the way RCTs require, and adapting each to the needs of the other so that EHRs can enable embedded RCTs remains something of a challenge. An example of full integration is a point-of-care trial comparing insulin administered by a sliding scale versus a weight-based approach, which was conducted in the Veterans Affairs Healthcare System.47, 48

Data Storage, Integration, and Retrieval

Hospitals can be expected to produce on the order of a petabyte of data (10^15 bytes, or a million gigabytes) a year. Nobody knows exactly, but systems concerned with human health can be expected to generate multiple petabytes of data a day. How can such voluminous, rapidly accumulating data of multiple types be managed? The systems that store the data must also allow ready access and integrate with applications to analyze Big Data. The systems also need to be able to scale to large sizes. Increasingly, health data, much like all other data, are stored in the cloud, with 3 companies leading: Amazon Web Services (https://aws.amazon.com), Azure from Microsoft (https://azure.microsoft.com/en-us), and Google Cloud (https://cloud.google.com), among others (https://www.softwaretestinghelp.com/cloud-computing-service-providers) (Figure 3). Cloud services from these companies are widely and publicly available. Much effort must go into curating data to make them retrievable and understandable. Sophisticated software, such as Hadoop, has been developed over the past several decades that permits distribution of Big Data across clusters of computers, allowing failure of subsystems without loss of data integrity or availability.49

Figure 3. Leading providers of cloud data storage. aws indicates Amazon Web Services.
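As a minimal sketch of querying data distributed across a cluster, the following uses PySpark, which commonly runs on Hadoop storage; the file path and column name are hypothetical.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; in practice this would point at a cluster.
spark = SparkSession.builder.appName("bigdata_example").getOrCreate()

# Read a large, distributed extract; the HDFS path is hypothetical.
df = spark.read.csv("hdfs:///warehouse/encounters.csv", header=True, inferSchema=True)

# Simple aggregate computed in parallel across the cluster.
df.groupBy("primary_dx_code").count().orderBy("count", ascending=False).show(10)
```

The aggregation expressed here is distributed automatically across however many nodes hold the underlying data, which is the practical payoff of cluster-based storage.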
Analytic Approaches

Large, complicated data sets may not be readily approachable with the types of statistical methods that have been developed over the past 50+ years. This is especially true for predictive or prescriptive studies. Such typical methods include logistic regression and Cox model analysis. These models are generally hand coded by statistical programmers. If hundreds of variables or subgroups are to be considered, this type of statistical programming becomes impractical. To address this issue, several data mining and statistical approaches have been developed that permit analysis of large data sets. An example is random forest, a machine-learning approach to classification and prediction.50 Random forest can be seen as an extension of decision tree approaches, such as classification and regression trees.51 Classification and regression tree models can be assembled by hand coding, yet offer the ability to consider interactions more readily than logistic regression or the Cox model. In random forest, multiple decision trees are created from a large data set by sampling with replacement, a process that has been called bagging, or bootstrap aggregation. Random forest is a method for both classification and regression. This approach can reduce overfitting by averaging multiple trees, reducing the chance finding of a predictive variable that performs well in a test data set but poorly in validation.

Predictive models may be further developed from covariates identified by data-mining approaches using advanced regression approaches. The first step might be to take variables generated from data mining and then use classic statistical methods like logistic regression and the Cox model. However, these methods may still not work well when there are potentially many variables and potential interactions. In such settings, machine-learning approaches such as ridge, lasso, and elastic net allow "penalized" approaches to systematically restrict the number of variables and interactions in models.52 Ridge regression is most useful when there is potential for multicollinearity between variables, bringing variables into a model but penalizing or limiting their influence in a continuous manner, rather than keeping them or eliminating them completely as in logistic regression. Lasso regression is conceptually similar, penalizing variables, but with many coefficients eliminated; thus, unlike ridge regression, this method allows reduction in the number of retained variables. Lasso allows selection of a small number of variables, whereas ridge works best when there are many predictors. However, in general, the best parameter estimates for a model are unknown, and both lasso and ridge may be too limiting, each in its own way. Elastic net integrates the approaches of lasso and ridge, allowing the degree of penalization of variables to vary between these extremes and allowing both some degree of penalization and some degree of elimination of variables. This can require multiple simulations to find the best model, which may not be easily defined by an algorithm (ie, it ultimately may require expert appraisal of a final model). There are potentially completely different approaches to modeling, such as support vector machines and bayesian machine learning, but the point should be clear.
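To make these approaches concrete, the following minimal sketch fits both a random forest and an elastic net-penalized logistic regression to synthetic data with scikit-learn; all settings are illustrative and do not represent a validated clinical model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a large set of candidate predictors.
X, y = make_classification(n_samples=5000, n_features=200, n_informative=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Random forest: bootstrap-aggregated (bagged) decision trees.
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

# Elastic net-penalized logistic regression: blends ridge (L2) and lasso (L1) penalties.
enet = LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5,
                          C=0.1, max_iter=5000).fit(X_train, y_train)

for name, model in [("random forest", rf), ("elastic net", enet)]:
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: validation AUC = {auc:.3f}")
```

In practice, the penalty strength (C) and the ridge/lasso mixing parameter (l1_ratio) would be tuned by cross-validation, and performance would be confirmed in truly independent validation data.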
These methods are complex and require considerable experience and expertise to implement and use appropriately.

There is a point of view that analysis of Big Data sets, by systematic searching, may uncover previously unknown predictive variables, or that by allowing the incorporation of more variables the precision of the modeling process may be enhanced, in particular by incorporating EHR, imaging, environmental, and genomic or other 'omic data (Figure 1). This may at times enhance prediction; in particular, genomic data have been shown to enhance cardiovascular risk prediction.53 However, at times great precision may not be particularly helpful, given the uncertainties in any model, which become greater the further we try to predict into the future. Rather than the model with all statistically significant variables, a more parsimonious model with a small number of variables may be considerably easier to implement and may capture most of the ability to predict outcome.

Visualization

Large, complicated data sets that may change rapidly over time offer particular challenges and opportunities for visualization. The static bar graph in black and white will not work in this new type of environment. It is also no longer reasonable, and often not possible, to visualize data using tools that require analysis first and then graphing of the data, often done by hand. Thus, visualization tools also have sophisticated analytic engines permitting integration of analysis and display. Visualization has moved beyond 2-dimensional printed graphs to 3-dimensional or multidimensional graphs and may involve rotation on any axis or graphs that move over time. The results can turn complex sets of data that are uninterpretable in raw form into easily understandable information.

Although the results can be dramatic, and it is relatively easy to begin using certain tools, creating and optimizing visual displays requires considerable expertise. Thus, there is a whole new field for professionals who specialize in Big Data visualization. In addition, considerable high-speed computer processing may be required to transform data into the desired display. Some Big Data displays are suitable for many personal computers, whereas others in clinical medicine may require specific workstations with specialized software. Big Data visualization often requires both the ability to manipulate the specialized software and a thorough understanding of the medical issues.

As with any issue related to data in general and Big Data in particular, the display is only as accurate and meaningful as the underlying data. Thus, before analysis and display, careful curation is generally necessary. This may be difficult when the data sets are large and difficult to understand in raw form and the velocity is high. Visualization, once developed for an application, may be easy to run, giving what may appear to be simple, visually appealing results that are actually erroneous. Nonetheless, we should expect that, on a regular basis, data will be offered in visual format and increasingly embedded into health care, with displays presented to us from multiple sources and in multiple ways.
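As a minimal sketch of the kind of interactive, multidimensional display described above, the following uses the Plotly library on synthetic data with illustrative variable names; the resulting figure can be rotated and explored in a browser.

```python
import numpy as np
import pandas as pd
import plotly.express as px

# Synthetic patient-level data standing in for a large clinical data set.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(40, 90, 1000),
    "systolic_bp": rng.normal(135, 20, 1000),
    "ldl_mg_dl": rng.normal(120, 30, 1000),
    "readmitted": rng.integers(0, 2, 1000).astype(str),
})

# Interactive 3-dimensional scatter plot that can be rotated and filtered.
fig = px.scatter_3d(df, x="age", y="systolic_bp", z="ldl_mg_dl",
                    color="readmitted", opacity=0.6)
fig.show()
```

In a production dashboard, the synthetic frame would be replaced by a curated, regularly refreshed extract, with the same caveat that the display is only as trustworthy as the curation behind it.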
