Logistic Regression
2008; Lippincott Williams & Wilkins; Volume: 117; Issue: 18; Language: Italian
10.1161/circulationaha.106.682658
ISSN 1524-4539
Topic(s): Statistical Methods and Inference
Review Article

Michael P. LaValley
From the Department of Biostatistics, Boston University School of Public Health, Boston, Mass.
Originally published 6 May 2008
https://doi.org/10.1161/CIRCULATIONAHA.106.682658
Circulation. 2008;117:2395–2399

Like contingency table analyses and χ² tests, logistic regression allows the analysis of dichotomous or binary outcomes with 2 mutually exclusive levels.¹ However, logistic regression permits the use of continuous or categorical predictors and provides the ability to adjust for multiple predictors. This makes logistic regression especially useful for analysis of observational data when adjustment is needed to reduce the potential bias resulting from differences in the groups being compared.²

Use of standard linear regression for a 2-level outcome can produce very unsatisfactory results. Predicted values for some covariate values are likely to be either above the upper level (usually 1) or below the lower level of the outcome (usually 0). In addition, the validity of linear regression depends on the variability of the outcome being the same for all values of the predictors. This assumption of constant variability does not match the behavior of a 2-level outcome.
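Both problems can be seen in a small simulation. The following Python sketch is an illustration only and is not part of the Framingham analysis: the data are simulated, and the coefficients (−1 and 2), sample size, and variable names are assumptions chosen for the example. It fits an ordinary least-squares line to a 0/1 outcome, checks where the fitted values leave the 0 to 1 range, and lists the variance of a 2-level outcome at several event probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: one continuous predictor x and a 0/1 outcome y whose
# true event probability follows a logistic curve in x (assumed coefficients).
n = 2000
x = rng.normal(size=n)
true_p = 1.0 / (1.0 + np.exp(-(-1.0 + 2.0 * x)))
y = rng.binomial(1, true_p)

# Ordinary least-squares fit of y on x: a straight line through 0/1 data.
X = np.column_stack([np.ones(n), x])
intercept, slope = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = intercept + slope * x

# Problem 1: fitted "probabilities" can fall outside the 0-1 range.
n_below = int(np.sum(fitted < 0))
n_above = int(np.sum(fitted > 1))
pred_low = intercept + slope * (-3.0)
pred_high = intercept + slope * 3.0
print(f"fitted values below 0: {n_below}, above 1: {n_above}")
print(f"fitted line at x=-3: {pred_low:.2f}, at x=+3: {pred_high:.2f}")

# Problem 2: the variance of a 0/1 outcome is p * (1 - p), so it changes
# with the event probability instead of staying constant.
variance_at = {p: p * (1 - p) for p in (0.1, 0.5, 0.9)}
print(variance_at)
```

Under these assumed coefficients, the fitted line should drop below 0 for low values of x and rise above 1 for high values. The variance p(1 − p) is 0.25 at p=0.5 but only 0.09 at p=0.1 or 0.9, so it cannot be constant across predictor values that change the event probability.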
So, linear regression is not adequate for such data, and logistic regression has been developed to fill this gap.

Some recent examples of the use of logistic regression in Circulation include the assessment of gender as a predictor of operative mortality after coronary artery bypass grafting surgery,³ an evaluation of the relationship between the TaqIB genotype and risk of cardiovascular disease in a meta-analysis,⁴ and an examination of the relationship between lipoprotein abnormalities and the incidence of diabetes.⁵

The Logistic Regression Model

The logistic regression model has its basis in the odds of a 2-level outcome of interest. For simplicity, I assume that we have designated one of the outcome levels the event of interest and in the following text will simply call it the event. The odds of the event is the probability of the event happening divided by the probability of the event not happening. Odds often are used for gambling, and "even odds" (odds=1) correspond to the event happening half the time. This would be the case for rolling an even number on a single die. The odds for rolling a number <5 would be 2 because rolling a number [...]

[...] 0.05).

Table 1. Unadjusted and Adjusted Odds Ratios for Development of Angina

| Predictor | Unadjusted Odds Ratio | Unadjusted 95% CI | Unadjusted P | Adjusted Odds Ratio | Adjusted 95% CI | Adjusted P |
|---|---|---|---|---|---|---|
| Cholesterol (1 SD) | 1.412 | 1.297–1.537 | <0.001 | 1.404 | 1.284–1.535 | <0.001 |
| Sex | | | | 1.415 | 1.173–1.705 | <0.001 |
| Current smoking | | | | 1.035 | 0.854–1.255 | 0.728 |
| Diabetes | | | | 1.437 | 0.891–2.320 | 0.138 |
| Age (10 y) | | | | 1.088 | 0.973–1.216 | 0.139 |
| Body mass index (1 SD) | | | | 1.299 | 1.190–1.419 | [...] |

Odds ratios, 95% CIs, and probability values for predictors of angina in the Framingham data. Columns 2 through 4 present results from the unadjusted model; columns 5 through 7 show results from the adjusted model. The respective SDs for cholesterol, body mass index, and heart rate are 44.622 mg/dL, 4.077 kg/m², and 12.033 bpm.

[...] >3 and <−3 would be considered potential problems, although for large data sets we should expect some values beyond those limits. There also are several measures of influence for logistic regression. Here, I use the logistic regression version of Cook's distance, which provides a measure of how much the model estimates change when each point is removed. Neither outliers nor influence points should be discarded automatically, but knowing that they are present allows targeted data checking and cleaning or sensitivity analyses.

The Figure is a residual plot for the adjusted model. The horizontal axis shows the predicted probability of angina for each observation; the vertical axis shows the Pearson residual. The size of the plotted circle is proportional to the Cook's distance for the observation. The upper curve consists of subjects who developed angina, and the lower curve of subjects who did not. Because the number of subjects who developed angina is smaller, their observations are generally more influential, and their circles tend to be larger. From the Figure, we can identify several possible problems. First, there are 2 observations with predicted probabilities of angina between 0.75 and 0.80. These come from 2 subjects with unusually high cholesterol values (600 and 696 mg/dL). The subject with 696 mg/dL did not develop angina and so is fit rather poorly by the model; this is the most influential observation in these data, as shown by its having the largest circle. There are also subjects who developed angina despite having a very low predicted probability in the model.
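The 2 curves in such a residual plot follow from the usual definition of the Pearson residual for a 2-level outcome, (y − p)/√(p(1 − p)), where y is 1 for an event and 0 otherwise and p is the predicted probability. That definition is the standard one and is assumed here rather than quoted from the surviving text; the function name is hypothetical. A minimal Python sketch:

```python
import math


def pearson_residual(y: int, p: float) -> float:
    """Pearson residual for a 0/1 outcome y with predicted probability p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return (y - p) / math.sqrt(p * (1.0 - p))


# Upper curve (event occurred, y = 1) and lower curve (no event, y = 0),
# evaluated at a few predicted probabilities.
for p in (0.05, 0.10, 0.25, 0.50, 0.75):
    upper = pearson_residual(1, p)
    lower = pearson_residual(0, p)
    print(f"p={p:.2f}  event: {upper:+.2f}  no event: {lower:+.2f}")
```

For a subject who had the event, the residual simplifies to √((1 − p)/p), which exceeds 3 whenever p is below 0.10. Events with a low predicted probability therefore sit in the upper left of the plot, and the residual grows quickly as p approaches 0.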
The low predicted probabilities for these subjects were primarily due to low cholesterol values. The mismatch between the observed angina rates and low predicted probability of angina in the regression model for these subjects creates large residuals, and these are the points in the upper left region of the Figure. A substantial number of these subjects have residual values >3 and might be considered outliers.

Figure. Residual plot from the adjusted model for angina in the Framingham data. The horizontal axis shows the predicted probability of angina; vertical axis, the value of the Pearson residual. The size of the plotted circle is proportional to the influence of an observation.

So, although the Hosmer and Lemeshow test does not reject the hypothesis that the adjusted model fits the data, the R² and c values are still rather low. In addition, the Figure makes it clear that there are some subjects with low cholesterol who develop angina and are not well fit by the model. There are also some subjects with very high cholesterol who may have excessive influence on the model estimates. As a sensitivity analysis, we might want to remove subjects with cholesterol of ≥600 mg/dL and see if the model results change substantially. We also might consider adding more predictors or allowing a nonlinear effect of cholesterol to see if we can better predict angina for subjects with low cholesterol levels.

Extensions to the Logistic Regression Model

Here, I have considered only outcomes with 2 levels, but there are extensions to the logistic regression model that allow analysis of outcomes with ≥3 ordered levels such as no pain, moderate pain, or severe pain.
Such data often are analyzed with proportional odds logistic regression,²² although other models also are possible.²³,²⁴ Multinomial logistic regression may be used if the outcome consists of ≥3 unordered categories.¹ The standard form of logistic regression presented here also presumes that observations are independent. This would not be the case for longitudinal or clustered data, and analyzing such data as independent could give misleading conclusions.²⁵ Methods such as generalized estimating equations²⁶ or random-effects models²⁷ can be used for such data. Finally, survival analysis methods¹⁴ provide an extension for studies in which subjects have been followed up for events across extended and varying follow-up times.

Disclosures

None.

Footnotes

Correspondence to Dr Michael P. LaValley, Department of Biostatistics, Boston University School of Public Health, 715 Albany St, Crosstown Center Room 322, Boston, MA 02118. E-mail [email protected]

References

1. Hosmer DW, Lemeshow S. Applied Logistic Regression. 2nd ed. New York, NY: John Wiley & Sons, Inc; 2000.
2. Kirkwood BR, Sterne JAC. Essential Medical Statistics. Oxford, UK: Blackwell Science Ltd; 2003.
3. Blankstein R, Ward RP, Arnsdorf M, Jones B, Lou YB, Pine M. Female gender is an independent predictor of operative mortality after coronary artery bypass graft surgery: contemporary analysis of 31 Midwestern hospitals. Circulation. 2005; 112 (suppl): I-323–I-327.
4. Boekholdt SM, Sacks FM, Jukema JW, Shepherd J, Freeman DJ, McMahon AD, Cambien F, Nicaud V, de Grooth GJ, Talmud PJ, Humphries SE, Miller GJ, Eiriksdottir G, Gudnason V, Kauma H, Kakko S, Savolainen MJ, Arca M, Montali A, Liu S, Lanz HJ, Zwinderman AH, Kuivenhoven JA, Kastelein JJ. Cholesteryl ester transfer protein TaqIB variant, high-density lipoprotein cholesterol levels, cardiovascular risk, and efficacy of pravastatin treatment: individual patient meta-analysis of 13,677 subjects. Circulation. 2005; 111: 278–287.
5. Festa A, Williams K, Hanley AJ, Otvos JD, Goff DC, Wagenknecht LE, Haffner SM. Nuclear magnetic resonance lipoprotein abnormalities in prediabetic subjects in the Insulin Resistance Atherosclerosis Study. Circulation. 2005; 111: 3465–3472.
6. Bland JM, Altman DG. Statistics notes: the odds ratio. BMJ. 2000; 320: 1468.
7. Breslow NE, Day NE. Statistical methods in cancer research, volume I: the analysis of case-control studies. IARC Sci Publ. 1980: 5–338.
8. Holcomb WL Jr, Chaiworapongsa T, Luke DA, Burgdorf KD. An odd measure of risk: use and misuse of the odds ratio. Obstet Gynecol. 2001; 98: 685–688.
9. Davies HT, Crombie IK, Tavakoli M. When can odds ratios mislead? BMJ. 1998; 316: 989–991.
10. McNutt LA, Wu C, Xue X, Hafner JP. Estimating the relative risk in cohort studies and clinical trials of common outcomes. Am J Epidemiol. 2003; 157: 940–943.
11. Lee J. An insight on the use of multiple logistic regression analysis to estimate association between risk factor and disease occurrence. Int J Epidemiol. 1986; 15: 22–29.
12. Harrell FE. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York, NY: Springer-Verlag; 2001.
13. Fox CS, Pencina MJ, Meigs JB, Vasan RS, Levitzky YS, D'Agostino RB Sr. Trends in the incidence of type 2 diabetes mellitus from the 1970s to the 1990s: the Framingham Heart Study. Circulation. 2006; 113: 2914–2918.
14. Hosmer DW, Lemeshow S. Applied Survival Analysis. New York, NY: John Wiley & Sons; 1999.
15. Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol. 1996; 49: 1373–1379.
16. Hosmer DW, Taber S, Lemeshow S. The importance of assessing the fit of logistic regression models: a case study. Am J Public Health. 1991; 81: 1630–1635.
17. Bagley SC, White H, Golomb BA. Logistic regression in the medical literature: standards for use and reporting, with particular attention to one medical domain. J Clin Epidemiol. 2001; 54: 979–985.
18. Bender R, Grouven U. Logistic regression models used in medical research are poorly presented. BMJ. 1996; 313: 628.
19. Hosmer DW, Lemeshow S. A goodness-of-fit test for the multiple logistic regression model. Commun Stat. 1980; A10: 1043–1069.
20. Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford, UK: Oxford University Press; 2003.
21. Friendly M. Visualizing Categorical Data. Cary, NC: SAS Institute Inc; 2000.
22. Bender R, Grouven U. Ordinal logistic regression in medical research. J R Coll Physicians Lond. 1997; 31: 546–551.
23. Harrell FE Jr, Margolis PA, Gove S, Mason KE, Mulholland EK, Lehmann D, Muhe L, Gatchalian S, Eichenwald HF. Development of a clinical prediction model for an ordinal outcome: the World Health Organization Multicentre Study of Clinical Signs and Etiological Agents of Pneumonia, Sepsis and Meningitis in Young Infants: WHO/ARI Young Infant Multicentre Study Group. Stat Med. 1998; 17: 909–944.
24. Scott SC, Goldberg MS, Mayo NE. Statistical assessment of ordinal outcomes in comparative studies. J Clin Epidemiol. 1997; 50: 45–55.
25. Cannon MJ, Warner L, Taddei JA, Kleinbaum DG. What can go wrong when you assume that correlated data are independent: an illustration from the evaluation of a childhood health intervention in Brazil. Stat Med. 2001; 20: 1461–1467.
26. Lipsitz SR, Kim K, Zhao L. Analysis of repeated categorical data using generalized estimating equations. Stat Med. 1994; 13: 1149–1163.
27. Twisk JWR. Applied Longitudinal Data Analysis for Epidemiology. Cambridge, UK: Cambridge University Press; 2003.