Deus ex machina? Demystifying rather than deifying machine learning
2021; Elsevier BV; Volume: 163; Issue: 3; Language: English
10.1016/j.jtcvs.2021.02.095
ISSN: 1097-685X
Authors: Michael Domaratzki, Biniam Kidane
Topic(s): Anomaly Detection Techniques and Applications
Abstract
Central Message: ML is a widely applicable class of computational tools that can aid in making predictions from data. It is important to learn the basic premise and common pitfalls of these techniques. See Commentaries on pages 1138 and 1140.
The concept of “deus ex machina” refers to a plot device in storytelling wherein some supernatural force suddenly comes in and resolves all plot lines, often in an improbable fashion that defies the internal logic of the story up to that point. Literally meaning “god from the machine,” deus ex machina draws its origins from ancient Greek theatre, where actors playing gods would descend onto the stage (ostensibly from the heavens) with the aid of machines at the end of the play, thereby magically resolving all plot lines; the audience is expected to accept that all improbable resolutions were possible because the gods could do anything they wanted. Machine learning (ML) is being increasingly used and reported in many areas, including surgery, giving the ability to make predictions for new, unseen data based on previous observations. While the use of ML is growing, many aspects of this new technology are not fully understood by clinical researchers. Despite the increasing presence of large datasets suitable for ML, the use of the models can be foreign to new users. ML also has the potential to be abused if not understood; as with biostatistics, care must be taken not to draw conclusions that are not supported by the tools. Moreover, many ML models can be opaque, in that the reasons for their predictions are not easily understood. In this review, we aim to illustrate the fundamentals of ML, the models that are commonly used, and some pitfalls; ultimately, our aim is to help clinicians and readers better understand and interpret (with caution) the growing use of ML in the literature. One particular issue with ML is the increasing prevalence of computational tools that allow experimentation with models without a thorough understanding of the underlying algorithms. This potential for relatively new users to build models adds to the risk of drawing improper conclusions from data sets. ML is a widely applicable class of computational tools that are suitable to aid in learning from existing data sets. Applications are common in many areas of society; in medicine alone, many applications have been reported in surgical outcome prediction,1Prasad V. Guerrisi M. Dauri M. Coniglione F. Tisone G. De Carolis E. et al.Prediction of postoperative outcomes using intraoperative hemodynamic monitoring data.Sci Rep. 2017; 7: 16376Crossref PubMed Scopus (4) Google Scholar, 2Cao Y. Fang X. Ottosson J. Näslund E. Stenberg E. A comparative study of machine learning algorithms in predicting severe complications after bariatric surgery.J Clin Med. 2019; 8: 668Crossref Scopus (24) Google Scholar, 3Lee H.C. Yoon H.K. Nam K. Cho Y.J. Kim T.K. Kim W.H. et al.Derivation and validation of machine learning approaches to predict acute kidney injury after cardiac surgery.J Clin Med. 2018; 7: 2018Google Scholar including cardiothoracic surgery,4Hernandez-Suarez D.F. Kim Y. Pedro Villablanca P. Gupta T. Wiley J. Nieves-Rodriguez B.G.
et al.Machine learning prediction models for in-hospital mortality after transcatheter aortic valve replacement.JACC Cardiovasc Interv. 2019; 12: 1328-1338Crossref PubMed Scopus (41) Google Scholar as well as in other areas such as precision medicine.5Bellot P. de los Campos G. Pérez-Enciso M. Can deep learning improve genomic prediction of complex human traits?.Genetics. 2018; 210: 809-819Crossref PubMed Scopus (66) Google Scholar, 6Ho D.S.W. Schierding W. Wake M. Saffery R. O'Sullivan J. Machine learning SNP based prediction for precision medicine.Front Genet. 2019; 10: 267Crossref PubMed Scopus (59) Google Scholar, 7Montaez CAC, Fergus P, Montaez AC, Hussain A, Al-Jumeily D, Chalmers C. Deep learning classification of polygenic obesity using genome wide association study SNPs. 2018 International Joint Conference on Neural Networks (IJCNN). Available at: https://arxiv.org/abs/1804.03198. Accessed August 24, 2018.Google Scholar, 8Fergus P. Montanez A. Abdulaimma B. Lisboa P. Chalmers C. Pineles B. Utilising deep learning and genome wide association studies for epistatic-driven preterm birth classification in African-American women.IEEE/ACM Trans Comput Biol Bioinform. 2020; 17: 668-678PubMed Google Scholar Because examples of use of ML for prediction related to cardiothoracic surgery are relatively uncommon, we have also used examples from other surgical disciplines to illustrate the concepts in this review. The same ML tools would be immediately applicable to cardiothoracic surgery as well, given suitable datasets. ML is not a single tool, but a collection of different approaches. The selection of which approach is most suitable for a problem depends on many factors, including the availability of data, the type of data, and the type of prediction that needs to be made. In many ways, modern ML models are elaborations of regression techniques (ie, generalized linear models, generalized estimating equations) with increased computational capacity and efficiency for handling large and complex datasets, and there is no clear dividing line between classic regression techniques in ML. For instance, logistic regression (LR), especially stepwise techniques available on any statistical software and commonly used by researchers, is actually a form of ML. For example, some tasks described in this review such as feature selection can be used with LR as the base ML model, with positive results.9Rajeswaran J. Blackstone E. Identifying risk factors: challenges of separating signal from noise.J Thorac Cardiovasc Surg. 2017; 153: 1136-1138Abstract Full Text Full Text PDF PubMed Scopus (30) Google Scholar,10Karim M. Epi M. Tran R.C.L. Cochrane A. Billah B. Variable selection methods for multiple regressions influence the parsimony of risk prediction for cardiac surgery.J Thorac Cardiovasc Surg. 2017; 153: 1128-1135Abstract Full Text Full Text PDF PubMed Scopus (6) Google Scholar This is critical in moving ML results from black box predictions to usable clinical tools. In considering whether to apply ML, it should be considered whether the problem contains sufficiently complex or large enough data to warrant ML, or whether conventional regression techniques might be more appropriate. So when should ML be considered as a tool instead of more classic approaches such as regression analysis? No hard rule exists, and classic methods can be compared with ML methods when possible. However, when considering ML, some aspects of the problem should be kept in mind. 
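As a rough illustration of comparing a classic method with an ML method on the same data (not drawn from any of the studies cited here), the following Python sketch assumes the scikit-learn library and a synthetic stand-in dataset; it simply reports cross-validated discrimination for a logistic regression and a random forest fit to the same data.

```python
# Hedged sketch: comparing logistic regression with a random forest
# on the same tabular data via cross-validated AUC (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a clinical dataset: 500 instances, 20 features,
# with the positive outcome (eg, a complication) in the minority.
X, y = make_classification(n_samples=500, n_features=20, n_informative=8,
                           weights=[0.8, 0.2], random_state=0)

models = {
    "logistic regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

for name, model in models.items():
    # 5-fold cross-validated area under the ROC curve for each model.
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {auc.mean():.3f} (+/- {auc.std():.3f})")
```

When the simpler regression model performs comparably in such a comparison, it may be preferable for its interpretability alone.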
As noted next, supervised learning techniques require a gold standard data set, and many of the techniques noted require a relatively large data set. Further, ML can be considered when the number of variables of interest is high (eg, 100s to 1000s), because ML can easily incorporate a high number of variables. Finally, as we note, the use of ML should be considered when the method for arriving at a prediction is not the primary interest. Users of ML models should also keep in mind the computational requirements of the models. In some cases, with substantial datasets and lacking specific computer hardware, training the models will be prohibitively expensive. In this methodological review, we discuss several ML models and techniques for improving results obtained through ML. We will also discuss the data in the datasets that are used to train the ML models. To avoid confusion, Table 1 defines several terms used throughout this review, which may not be familiar to readers more accustomed to traditional biostatistics.

Table 1. Common terms in machine learning
Instance: An observation or a series of observations for 1 discrete individual unit of study. In datasets and in the parlance of conventional statistics, this would be a single row that represents the unit of study (ie, a patient, a cell line, an animal, or a cluster).
Feature: A feature is a variable of interest. It is typically represented by a single column in a dataset. In statistics, it is what is commonly thought of as a predictor variable; however, it could very well be an outcome variable as well depending on the focus of the study question. In this review, we refer specifically to the “outcome variable” as the quantity we are interested in predicting in both regression and classification tasks.
Parameter: A model parameter is a model variable that is intrinsic to the model, and its value is derived or “learned” from the available dataset. In traditional regression techniques, the coefficients are considered parameters.
Hyper-parameter: A model hyper-parameter is a “higher-level” parameter that is extrinsic to the model. It is not derived from the dataset; it is chosen by the analyst. If one were to think of ML as a radio trying to pick up a signal, hyper-parameters are like the dials that can be adjusted manually by the user in order to “tune” the radio. The dials (ie, hyper-parameters) are not part of the signal (ie, data) but are a part of the machine that is trying to detect and interpret the signal.
Tuning: Tuning is the process of varying and manipulating hyper-parameters to optimize the predictive performance of the intrinsic model parameters.

Supervised ML is the application of a model where there is an outcome or output value that is labeled by a human supervisor. For example, providing a dataset including a wide array of different kinds of fruits with a labeled outcome variable (ie, presence of a Granny Smith apple) and example cases will allow a supervised ML algorithm/model to learn what input variables (ie, shape, color, texture, taste) predict the labeled outcome variable (ie, presence of a Granny Smith apple). It is supervised learning because the model/algorithm is “told” what is and is not a Granny Smith apple. In unsupervised ML, there is no labeling of the outcome/output variable, and thus the model/algorithm is not “told” what is and is not a Granny Smith apple.
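To make the premise of supervised learning concrete, the short sketch below recasts the Granny Smith example in code; it assumes scikit-learn, and the tiny dataset and feature encodings are invented purely for illustration.

```python
# Hedged sketch of the Granny Smith example as a supervised learning task
# (scikit-learn assumed; the tiny dataset below is invented for illustration).
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Each instance is one fruit described by categorical features
# (shape, color, texture, taste).
fruits = [
    ["round",  "green",  "crunchy", "tart"],   # Granny Smith
    ["round",  "red",    "crunchy", "sweet"],  # other apple
    ["curved", "yellow", "soft",    "sweet"],  # banana
    ["round",  "green",  "crunchy", "tart"],   # Granny Smith
    ["round",  "orange", "soft",    "sweet"],  # orange
]
# The human supervisor labels the outcome: 1 = Granny Smith, 0 = not.
labels = [1, 0, 0, 1, 0]

# Convert the categorical features into a numeric feature vector.
encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(fruits)

# The model is "told" the labels, which is what makes this supervised.
clf = DecisionTreeClassifier(random_state=0).fit(X, labels)

# Predict the label for a new, unseen fruit.
new_fruit = [["round", "green", "crunchy", "tart"]]
print(clf.predict(encoder.transform(new_fruit)))  # expected: [1]
```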
The unsupervised ML models/algorithms work to group together cases that share attributes with each other; they then extrapolate this to new cases and try to decide which group the new case belongs to, based on which attributes these new cases possess. Thus, these unsupervised ML models/algorithms will group together green, apple-shaped objects that are crunchy and taste sour/tart together as a specific cluster. Any new case that fits these attributes is then classified as part of this cluster. Unsupervised learning can allow for discovery of relationships that were not previously identified. In traditional biostatistics, this would be akin to cluster analysis and principal component analysis. In this review, we will only consider supervised ML techniques. Recent interest in ML has focused on supervised learning, particularly due to the introduction of deep learning,11LeCun Y. Bengio Y. Hinton G. Deep learning.Nature. 2015; 521: 436Crossref PubMed Scopus (35617) Google Scholar a relatively new class of tools that use significantly more sophisticated models (often so-called deep neural networks [DNNs]) and high computing power to learn more complex patterns in datasets. The descriptor “deep” is used because many layers of networks are used to transform data as you go from input to output (ie, results or prediction). In supervised learning, a training data set for the model is required. In the absence of a gold standard data set, supervised learning is not possible, and identifying an appropriate data set for supervised ML is a critical first step for researchers to determine if supervised learning can be applied. This training set consists of several instances with data on several features as well as the outcome variable. For example, in a medical setting, the training set may consist of surgical, perioperative, and demographic data (ie, the features), as well as postoperative outcome (the outcome variable), such as occurrence of a complication or death. Data for several instances (several surgeries) are needed to train the model. Supervised learning can thus be considered as similar in spirit to well-known statistical modeling tools such as regression analysis, where the outcome and predictor variables as well as their states (ie, absent or present) are known and well defined. This is differentiated from unsupervised learning that might be used in cases where the categories that the instances belong to are not known. Methods in unsupervised learning include clustering techniques such as k-means clustering and data mining techniques. The size of the training set will also affect the suitability of ML. Even when a training set exists, a small size may restrict its use. Classic ML tools, such as Random Forests (RFs)12Breiman L. Random forests.Mach Learn. 2001; 45: 5-32Crossref Scopus (56876) Google Scholar and Support Vector Machines (SVMs)13Cortes C. Vapnik V. Support-vector networks.Mach Learn. 1995; 20: 273-297Crossref Scopus (0) Google Scholar can be applied with smaller training sets, whereas modern tools, such as DNNs,11LeCun Y. Bengio Y. Hinton G. Deep learning.Nature. 2015; 521: 436Crossref PubMed Scopus (35617) Google Scholar typically require larger training sets, with hundreds of individuals needed for training at a minimum (E1Probst P. Boulestieux A.L. To tune or not to tune the number of trees in Random Forest.J Mach Learn Res. 2018; 18: 1-18Google Scholar, E2Rice T. Blackstone E. Rusch V. A cancer staging primer: esophagus and esophagogastric junction.J Thorac Cardiovasc Surg. 
2010; 139: 527-529Abstract Full Text Full Text PDF PubMed Scopus (31) Google Scholar, E3Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. In: KDD'16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY: Association for Computing Machinery; 2016:785-94.Google Scholar, E4Freund Y. Schapire R. A decision-theoretic generalization of on-line learning and an application to boosting.J Comput Syst Sci. 1997; 55: 119-139Crossref Scopus (9632) Google Scholar, E5Hsu CW, Chang CC, Lin CJ. A practical guide to support vector classification [technical report]. Taipei, Taiwan: Department of Computer Science, National Taiwan University; 2003. Available at: https://www.researchgate.net/profile/Chenghai-Yang/publication/272039161_Evaluating_unsupervised_and_supervised_image_classification_methods_for_mapping_cotton_root_rot/links/55f2c57408ae0960a3897985/Evaluating-unsupervised-and-supervised-image-classification-methods-for-mapping-cotton-root-rot.pdf. Accessed April 1, 2021.Google Scholar show definitions of these ML models as well as Figure 1, Figure 2, Figure 3). Larger data sets representing a wide variety of instances typically allow the model to generalize and improve prediction results.Figure 2Two-dimensional representation of an SVM. In this example, each instance is represented by only 2 features, and the instances are plotted as points in a 2-dimensional plane. The 2 colors of dots represent the 2 categories to be learned by the SVM. The solid line represents the learned hyperplane (in the 2-dimensional case, a line) that classifies examples. The so-called margin is denoted by the area between the 2 dotted lines; the SVM seeks to maximize the distance between the separating hyperplane and data instances on either side of the hyperplane. The instances on the dotted lines are called the support vectors.View Large Image Figure ViewerDownload Hi-res image Download (PPT)Figure 3Example of a feed-forward neural network. There are 3 input nodes and 1 output node. There are also 7 nodes that are neither input nor output; these are typically referred to as hidden nodes in the context of feed-forward networks. The presence of several layers of hidden nodes in a neural network is what typically characterizes that network as deep; for some tasks, a DNN can have more than 1000 hidden layers of nodes.View Large Image Figure ViewerDownload Hi-res image Download (PPT) On the other hand, increasing the size of the training set will also increase training time. While training times are affected by the number of features, training time is dominated by the number of instances in the training set. However, once a model is trained, obtaining new predictions for new instances is not computationally expensive. When new data were collected, predictions for these new instances can typically be obtained instantaneously. Supervised ML models are based on algorithms that, although computationally intensive, have been designed by humans to obtain a well-defined outcome. While many contain an element of randomness, such as selection of elements from the training set during the training algorithm, the algorithms for training can be understood and justified by humans. As with numerical computation, computers excel at performing the training of ML models quickly. There is no mystery as to how the algorithms that the ML models used for training have been designed. In this sense, ML models are transparent to human users. 
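The asymmetry between training cost and prediction cost can be illustrated with a brief sketch; it assumes scikit-learn, and the synthetic dataset and model choice (an SVM) are arbitrary stand-ins rather than a recommendation.

```python
# Hedged sketch: training is the expensive step; predicting for new
# instances afterward is comparatively cheap (scikit-learn assumed).
import time
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)
X_train, y_train = X[:4000], y[:4000]   # instances used for training
X_new = X[4000:]                        # "new" instances arriving later

model = SVC(kernel="rbf")

start = time.perf_counter()
model.fit(X_train, y_train)             # cost dominated by the number of instances
train_time = time.perf_counter() - start

start = time.perf_counter()
model.predict(X_new)                    # near-instantaneous by comparison
predict_time = time.perf_counter() - start

print(f"training: {train_time:.2f} s, prediction: {predict_time:.2f} s")
```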
However, one potential drawback of ML is that although the process for learning is transparent and replicable, some of the most well-known ML algorithms do not provide any insight into the structure of the data or the features that are relevant to prediction. This is true, for instance, of DNNs. Kuhn and Johnson14Kuhn M. Johnson K. Applied Predictive Modeling. 26. Springer, New York, NY2013Google Scholar refer to this as a “tension between prediction and interpretation,” which is especially present in medical fields: More accurate ML models are more complex and are thus less likely to be interpretable. For this reason, ML models are typically viewed as black box tools14Kuhn M. Johnson K. Applied Predictive Modeling. 26. Springer, New York, NY2013Google Scholar that are better able to perform prediction than understand causality. In some cases (see “Feature Importance”), the features that are relevant to a prediction may be extracted from an ML model. However, in general, when using ML, one should expect to obtain a model where new predictions are the primary achievement, rather than insight into the structure of the complex datasets used to train the model. Recent progress in this area has developed into the field of explainable artificial intelligence,15Holzinger A. Biemann C. Pattichis C.S. Kell D.B. What do we need to build explainable AI systems for the medical domain?.arXiv. December 28, 2017; (Preprint. Posted online) (1712.09923)Google Scholar,16Gordon L. Grantcharov T. Rudzicz F. Explainable artificial intelligence for safe intraoperative decision support.JAMA Surg. 2019; 154: 1064-1065Crossref PubMed Scopus (29) Google Scholar which focuses on developing tools to allow humans to understand the results obtained by ML and other artificial intelligence tools. To evaluate a supervised ML model, the performance of the model on a portion of the data available for training is considered. In other words, we are able to see how well the model will predict outcomes for instances where the outcome is known (but which has not been used for training). Thus, in supervised ML, analysts will take the data set that is available and divide it into separate sets, often referred to as the training set and the test set (ie, a validation set). Additionally, as part of the training, the model may require differentiation between subsets of the training set, as described in cross-validation next.17Liu Y. Chen C.P.H. Krause J. Peng L. How to read articles that use machine learning: users' guides to the medical literature.JAMA. 2019; 322: 1806-1816Crossref PubMed Scopus (165) Google Scholar The training set is used to train the model; this is how the model learns from existing data. The test set is not used to train the model and is instead used to evaluate the prediction of the trained model. Because the model will have not used the test set for training, it will evaluate the ability of the model to generalize its predictions to new instances and evaluate whether the model has been overfit to the training data. Overfitting is a common pitfall in conventional prediction biostatistics, and it is especially pertinent to ML because of the generally excellent capacity of many ML techniques to draw a line between predictors and outcome. A model may be perfect at predicting outcomes in a particular dataset but be useless at predicting outcomes in other datasets; this is a consequence of overfitting the model to the idiosyncrasies of the training or derivation dataset. 
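A minimal sketch of this pitfall, assuming scikit-learn and a synthetic noisy dataset, is shown below: an unconstrained decision tree can memorize its training data and score perfectly on it while doing considerably worse on held-out instances.

```python
# Hedged sketch: an overfit model can look perfect on the data it was
# trained on yet generalize poorly to a held-out test set (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy labels (flip_y) invite overfitting in an unconstrained model.
X, y = make_classification(n_samples=300, n_features=30, n_informative=5,
                           flip_y=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)

# An unconstrained tree can memorize the training set.
tree = DecisionTreeClassifier(max_depth=None, random_state=1).fit(X_train, y_train)

print("training accuracy:", tree.score(X_train, y_train))  # typically 1.0
print("test accuracy:    ", tree.score(X_test, y_test))    # noticeably lower
```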
This is why it is important to have both a training/derivation dataset and a separate test/validation dataset. This requires splitting existing datasets into 2 such datasets. A common technique to mitigate the loss of data for creating the test set and for training ML models in general is cross-validation. In cross-validation, the entire dataset is randomly split into k disjoint, equal-sized subsets, called the “folds,”18Kohavi R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In: IJCAI'95: Proceedings of the 14th International Joint Conference on Artificial Intelligence. Vol 2. San Francisco, CA: Morgan Kaufmann Publishers; 1995:1137-43.Google Scholar for some integer k > 1. Then, for each of the k folds, training of the model is performed on the remaining k-1 folds, and tuning is performed on the remaining fold that was not used for training. This gives k estimations of the accuracy of the ML model, which are averaged to give an overall view of the accuracy of the model. For classification problems, stratified cross-validation may be further used to ensure that each of the k folds has a proportion of each of the levels of the outcome variable that matches the full dataset.18Kohavi R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In: IJCAI'95: Proceedings of the 14th International Joint Conference on Artificial Intelligence. Vol 2. San Francisco, CA: Morgan Kaufmann Publishers; 1995:1137-43.Google Scholar When analysts use the tuning dataset as part of the test dataset (ie, validation dataset), this “double-dipping” can increase the risk of systematic errors by overfitting the model to the data used. This has the potential of compounding errors, both types I and II. Readers should keep a wary eye on the use of training, tuning, and testing sets; if they all appear to be coming from the same source (especially if it is a small dataset to begin with), this should raise a red flag. The goal of an ML model is to minimize both type I and II errors. However, care must be taken in how to evaluate the success of an ML algorithm. An imbalanced dataset used as a training set for an ML algorithm will generally cause the model to predict the more prevalent outcome. This can inflate some typical measures of performance, such as sensitivity or specificity, depending on the imbalance. As such, measures such as F1 score, accuracy, or AUC for the ROC should be considered, especially when imbalanced datasets are present. In particular, the F1 score is considered robust in cases where a binary outcome variable appears in unbalanced proportion; it is calculated as the harmonic mean of precision (ie, TP/(TP + FP)) and recall (ie, TP/(TP + FN), also known as “sensitivity”). In some cases, where the cost of misdiagnosis is asymmetrical, sensitivity and specificity (ie, TN/(TN + FP)) can provide insight into the relative performance of an ML model. In general, there is a classic tension between sensitivity and specificity. In some situations, it is more important to avoid incorrectly ruling out a condition than it is to be able to accurately diagnose it. This is a concept to which ML is agnostic and must be decided by the clinicians/scientists designing the ML models. Most ML algorithms work on a feature vector, which is a representation of instances in both the training and testing sets. 
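The metrics defined above can be computed directly from the 4 cells of a confusion matrix; the short sketch below does so for an invented set of predictions, with scikit-learn assumed only for the convenience functions.

```python
# Hedged sketch: computing the metrics defined above from a confusion
# matrix (scikit-learn assumed; the labels below are invented).
from sklearn.metrics import confusion_matrix, f1_score

y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1]   # 1 = complication occurred
y_pred = [1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0]   # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision   = tp / (tp + fp)            # TP / (TP + FP)
recall      = tp / (tp + fn)            # TP / (TP + FN), ie, sensitivity
specificity = tn / (tn + fp)            # TN / (TN + FP)
f1          = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"specificity={specificity:.2f} F1={f1:.2f}")
print("F1 via scikit-learn:", round(f1_score(y_true, y_pred), 2))
```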
A feature vector can be thought of as the combination of features in an instance, that is, a feature vector is a row of data that represent the individual of study. The feature vectors are typically the basic unit of training in ML models: Feature vectors for instances are provided to the model, which then learns from the instance. The process can be either iterative (ie, progressively refined and improved models from seeing additional inputs) or through other techniques such as repeated sampling or solving an optimization problem. In the majority of these cases, the feature vector for an instance is the basic unit of the training phase, and as the representation of an instance, the feature vectors must be consistent with each other. For example, all the feature vectors need to be the same length. In some cases, the data that have been collected can be used directly as the feature vector, such as clinical data (eg, heart rate over time). If a fixed, consistent number of features are present for each instance, then these can be used to train the ML model. These features can be continuous numerical data or categorical data. Modern ML algorithms can handle large feature vectors with thousands of features, provided adequate computing power and training time. However, in many situations, summarizing instances into appropriate feature vectors simplifies the analysis. One particular case where summarization may be considered is when different instances have differing amounts of data. This can occur with time series data, where the length of the data may differ between instances. In our research, for example, we explored the association between intraoperative heart rate variability and postoperative outcomes after thoracic surgery. Because of the variation in operative time, the heart rate data inputs had different lengths (ie, 1 hour worth of data vs 3 hours). Many ML algorithms, including NNs, RFs and SVMs, require all instances to have the same number of features, and in the case of uneven features, summarization of the data is necessary. For example, Prasad and colleagues,1Prasad V. Guerrisi M. Dauri M. Coniglione F. Tisone G. De Carolis E. et al.Prediction of postoperative outcomes using intraoperative hemodynamic monitoring data.Sci Rep. 2017; 7: 16376Crossref PubMed Scopus (4) Google Scholar in predicting postoperative outcomes for orthotopic liver transplant using perioperative data such as blood pressure and heart rate, summarized time series data using the mean of the observations and the median absolute deviation. No rules exist for designing feature vectors. Changes to the feature vector can result in significant changes to the performance of ML models. However, a general guideline is to use raw data for feature vectors. Using summarized data does not allow the ML algorithm to learn from the raw data, and as a result, it provides a chance for preconceptions about the data to be incorporated into the model. In other words, if we replace raw data in a feature vector with summary statistics, clinical scoring systems, or similar derived data, the ML model must learn from these data, which may include preconceptions or bias, rather than the raw data itself. A major benefit of ML techniques in surgical outcomes research is the ability to model multitudinous and large amounts of intraoperative data inputs. However, these inputs themselves present modeling challenges. Cao and colleagues2Cao Y. Fang X. Ottosson J. Näslund E. Stenberg E. 
A comparative study of machine learning algorithms in predicting severe complications after bariatric surgery.J Clin Med. 2019; 8: 668Crossref Scopus (24) Google Scholar discuss the performance of ML models using data from a large Swedish bariatric surgery registry. They note that previous studies of the same dataset using multivariable LR had poor performance in prediction of postoperative complications. The authors note that preoperative predictor variables have proven to be largely insufficient for driving predictions, whether one uses ML or classic regression techniques. They note that the inclusion of intraoperative variables would likely improve prediction rates. However, there are 2 barriers to incorporating intraoperative variables/features to increase prediction accuracy. First, reliance on intraoperative data for model creation would preclude the ability to provide prediction at the preoperative stage; this is a major, if not prohibitive, limitation if the sole purpose of prediction tools is to inform decisions before an operation is performed. The second barrier is more of a technologic issue and relates to the difficulty of reliably capturing all important intraoperative variables and data. However, advances in intraoperative monitoring and black box technology16Gordon L. Grantcharov T. Rudzicz F. Explainable artificial intelligence for safe intraoperative decision support.JAMA Surg. 2019; 154: 1064-1065Crossref PubMed Scopus (29) Google Scholar,19Jung J.J. Jüni P. Lebovic G. Grantcharov T. First-year analysis of the operating room black box study.Ann Surg. 2020; 271: 122-127Crossref PubMed Scopus (70) Google Scholar may make it possible to capture and analyze real-time intraoperative data. A significant aspect of preparing data for use in ML is generally known as data cleaning. This process, designed to handle inconsistencies in the data, may involve manual intervention to delete instances that contain errors or repair inconsistent data. One automated aspect of cleaning is imputation, where missing features from an instance are inferred from the dataset. Relatively elementary techniques, such as using the mode of nonmissing values for that feature, typically suffice for sufficiently detailed datasets. In many medical applications, datasets will be imbalanced, without equal incidence of outcome states. For instance, in a binary outcome variable for presence/absence of a disease or a complication, we expect to have fewer positive cases than negative cases. These imbalanced datasets present challenges to ML tools; the dominant negative cases are overrepresented in the data, and ML algorithms will be inclined to predict new instances as negative ones (ie, it is easier to predict not having the disease or not having the complication). Thus, such models will have high specificity (with few false-positives). However, they will have a higher risk of false-negatives and thus have a moderate rather than a high degree of sensitivity. To solve this issue of dataset imbalance, additional tools can be applied to the dataset before the ML algorithm is trained. In E6Fernández A. Garcia S. Herrera F. Chawla N.V. SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary.J Artif Intell Res. 2018; 61: 863-905Crossref Scopus (470) Google Scholar, we describe some of these tools. Many different models of supervised ML algorithms exist, and it is impossible to survey all of them. In E1Probst P. Boulestieux A.L.
To tune or not to tune the number of trees in Random Forest.J Mach Learn Res. 2018; 18: 1-18Google Scholar, E2Rice T. Blackstone E. Rusch V. A cancer staging primer: esophagus and esophagogastric junction.J Thorac Cardiovasc Surg. 2010; 139: 527-529Abstract Full Text Full Text PDF PubMed Scopus (31) Google Scholar, E3Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. In: KDD'16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY: Association for Computing Machinery; 2016:785-94.Google Scholar, E4Freund Y. Schapire R. A decision-theoretic generalization of on-line learning and an application to boosting.J Comput Syst Sci. 1997; 55: 119-139Crossref Scopus (9632) Google Scholar, E5Hsu CW, Chang CC, Lin CJ. A practical guide to support vector classification [technical report]. Taipei, Taiwan: Department of Computer Science, National Taiwan University; 2003. Available at: https://www.researchgate.net/profile/Chenghai-Yang/publication/272039161_Evaluating_unsupervised_and_supervised_image_classification_methods_for_mapping_cotton_root_rot/links/55f2c57408ae0960a3897985/Evaluating-unsupervised-and-supervised-image-classification-methods-for-mapping-cotton-root-rot.pdf. Accessed April 1, 2021.Google Scholar, we describe 5 tools that take different approaches to prediction: Naïve Bayes, RFs, Gradient Boosting, SVM, and Deep Learning. These 5 models are not intended to be an exhaustive review, and many variations of the basic descriptions given here exist in the literature. In E1Probst P. Boulestieux A.L. To tune or not to tune the number of trees in Random Forest.J Mach Learn Res. 2018; 18: 1-18Google Scholar, E2Rice T. Blackstone E. Rusch V. A cancer staging primer: esophagus and esophagogastric junction.J Thorac Cardiovasc Surg. 2010; 139: 527-529Abstract Full Text Full Text PDF PubMed Scopus (31) Google Scholar, E3Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. In: KDD'16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY: Association for Computing Machinery; 2016:785-94.Google Scholar, E4Freund Y. Schapire R. A decision-theoretic generalization of on-line learning and an application to boosting.J Comput Syst Sci. 1997; 55: 119-139Crossref Scopus (9632) Google Scholar, E5Hsu CW, Chang CC, Lin CJ. A practical guide to support vector classification [technical report]. Taipei, Taiwan: Department of Computer Science, National Taiwan University; 2003. Available at: https://www.researchgate.net/profile/Chenghai-Yang/publication/272039161_Evaluating_unsupervised_and_supervised_image_classification_methods_for_mapping_cotton_root_rot/links/55f2c57408ae0960a3897985/Evaluating-unsupervised-and-supervised-image-classification-methods-for-mapping-cotton-root-rot.pdf. Accessed April 1, 2021.Google Scholar, we describe some applications of these technologies, including predicting mortality risk in cardiac surgery patients,20Kilic A. Goyal A. Miller J. Predictive utility of a machine learning algorithm in estimating mortality risk in cardiac surgery.Ann Thorac Surg. 2020; 109: 1811-1819Abstract Full Text Full Text PDF PubMed Scopus (31) Google Scholar complication in cardiothoracic surgery,4Hernandez-Suarez D.F. Kim Y. Pedro Villablanca P. Gupta T. Wiley J. Nieves-Rodriguez B.G. et al.Machine learning prediction models for in-hospital mortality after transcatheter aortic valve replacement.JACC Cardiovasc Interv. 
2019; 12: 1328-1338Crossref PubMed Scopus (41) Google Scholar acute kidney injury after surgery,3Lee H.C. Yoon H.K. Nam K. Cho Y.J. Kim T.K. Kim W.H. et al.Derivation and validation of machine learning approaches to predict acute kidney injury after cardiac surgery.J Clin Med. 2018; 7: 2018Google Scholar and esophageal cancer staging.21Ishwaran H. Blackstone E. Apperson-Hanson C. Rice T. A novel approach to cancer staging: application to esophageal cancer.Biostatistics. 2009; 10: 603-620Crossref PubMed Scopus (56) Google Scholar In general, ML models are viewed as black box solutions that are incapable of yielding insight into the reasons for the predictions the model gives. However, there are some tools that can be used to gain some insight into the relative importance of the features for prediction. These tools are referred to as feature importance tools. To differentiate feature importance from feature selection, feature importance tools are generally model-specific and are used after a model is trained. The importance is derived from the strength that a feature has in aiding the model to make a prediction, so training is necessary. Feature importance tools also can be used along with feature selection tools: Statistically significant or theoretically important features can be selected before training, and the remaining features can be ranked using a feature importance tool. Examples of techniques for feature importance are recursive feature elimination, which can be used with tools such as linear kernel SVMs, and permutation techniques,12Breiman L. Random forests.Mach Learn. 2001; 45: 5-32Crossref Scopus (56876) Google Scholar which can be applied more generally. In this latter technique, originally designed for RFs, a model is evaluated on data that have had some features permuted to evaluate the importance of features that are not. Variable importance (VIMP) for RFs uses impurity of nodes in trees to estimate importance for features.22Breiman L. Friedman J.H. Olshen R.A. Stone C.J. Classification and Regression Trees. Wadsworth, Belmont, CA1984Google Scholar In a recent study, Wojnarski and colleagues23Wojnarski C.M. Roselli E.E. Idrees J.J. Zhu Y. Carnes T.A. Lowry A.M. et al.Machine-learning phenotypic classification of bicuspid aortopathy.J Thorac Cardiovasc Surg. 2018; 155: 461-469.e4Abstract Full Text Full Text PDF PubMed Scopus (28) Google Scholar used RFs to predict aortic phenotype for 656 patients with bicuspid aortic valves. The features were 56 preoperative features, including demographic data, noncardiac morbidity, laboratory data, valve pathology, and others. The authors demonstrated through VIMP that the most relevant features are generally related to echocardiographic measurements, including peak and mean aortic valve gradient (mm Hg) and left ventricular inner diameter (diastole). In another example, Lu and Ishwaran24Lu M. Ishwaran H. A prediction-based alternative to P values in regression models.J Thorac Cardiovasc Surg. 2018; 155: 1130-1136Abstract Full Text Full Text PDF PubMed Scopus (13) Google Scholar used VIMP with Cox regression to identify relevant features for predicting the risk of systolic heart failure in a dataset of 2231 patients who had undergone cardiopulmonary stress tests. The authors introduce the use of a VIMP index to replace P values in determining the importance of a feature to a regression model. 
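As a hedged sketch of the permutation approach (not the specific VIMP implementation used in these studies), the following assumes scikit-learn; the feature names are hypothetical and the dataset is synthetic.

```python
# Hedged sketch: permutation importance after training a random forest
# (scikit-learn assumed; feature names are invented for illustration).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=6, n_informative=3,
                           random_state=0)
feature_names = ["age", "bmi", "ef", "creatinine", "hb", "op_time"]  # hypothetical

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

# Each feature is shuffled in turn; the drop in held-out performance
# estimates how much the trained model relies on that feature.
result = permutation_importance(model, X_test, y_test, n_repeats=20, random_state=0)

for name, score in sorted(zip(feature_names, result.importances_mean),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{name:12s} importance = {score:.3f}")
```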
In predicting systolic heart failure, the VIMP index identifies peak oxygen consumption, blood urea nitrogen, and exercise time as of top importance. ML techniques are being more commonly used in the medical/surgical literature. Although they are to a certain extent “black box” techniques, they can be understood and should be interpreted with caution.16Gordon L. Grantcharov T. Rudzicz F. Explainable artificial intelligence for safe intraoperative decision support.JAMA Surg. 2019; 154: 1064-1065Crossref PubMed Scopus (29) Google Scholar There is a growing interest in this, and some recent articles are excellent resources.17Liu Y. Chen C.P.H. Krause J. Peng L. How to read articles that use machine learning: users' guides to the medical literature.JAMA. 2019; 322: 1806-1816Crossref PubMed Scopus (165) Google Scholar Readers and reviewers can and should educate themselves on the basic premises and common pitfalls of ML techniques so that they do not fall victim to the ruse of the deus ex machina.