Use of Brier score to assess binary predictions
Journal of Clinical Epidemiology, 2010; Elsevier BV; Volume 63, Issue 8; Language: English
DOI: 10.1016/j.jclinepi.2009.11.009
ISSN: 1878-5921
The use of the Brier score [1] in medical research to assess and compare the accuracy of binary predictions or prediction models is increasingly popular; see, for example, [2-5]. An overview of a variety of measures of model performance is offered in [6, Box 1], and the authors of [7] propose cutoffs for appraising the value of a computed score. How Brier scores can be formally compared is detailed in [8]. Because of the growing number of applications, and in light of the description in [9, p. 1253], we would like to briefly discuss the Brier score and its connection to Spiegelhalter's calibration test [10].

For n predictive probabilities p = (p_1, ..., p_n) with 0 ≤ p_i ≤ 1 and n realizations x = (x_1, ..., x_n) of Bernoulli random variables X_i ~ Ber(π_i) with 0 ≤ π_i ≤ 1, π = (π_1, ..., π_n), and x_i ∈ {0, 1}, the Brier score, defined as

B(p, x) = n^{-1} \sum_{i=1}^{n} (x_i - p_i)^2 = n^{-1} \sum_{i=1}^{n} (x_i - p_i)(1 - 2 p_i) + n^{-1} \sum_{i=1}^{n} p_i (1 - p_i),   (1)

equals the mean squared error of prediction. As a proper scoring rule, the Brier score simultaneously addresses calibration, that is, the statistical consistency between the predicted probabilities and the observations, as well as sharpness, which refers to the concentration of the predictive distribution; see [11].
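As a minimal illustration of definition (1), not part of the original letter, the Brier score and the two summands of its decomposition can be computed directly; the function names below are our own.

```python
import numpy as np

def brier_score(p, x):
    """Brier score B(p, x): mean squared error of binary predictions."""
    p, x = np.asarray(p, dtype=float), np.asarray(x, dtype=float)
    return np.mean((x - p) ** 2)

def brier_decomposition(p, x):
    """Two summands of decomposition (1): a term with expectation 0 under
    perfect calibration (p = pi) and the average of p_i * (1 - p_i)."""
    p, x = np.asarray(p, dtype=float), np.asarray(x, dtype=float)
    calibration_term = np.mean((x - p) * (1 - 2 * p))
    spread_term = np.mean(p * (1 - p))
    return calibration_term, spread_term

# Toy example: two predictions, realization x = (1, 0)
x = [1, 0]
print(brier_score([0.2, 0.2], x))          # 0.34
print(sum(brier_decomposition([0.2, 0.2], x)))  # 0.18 + 0.16 = 0.34, matches (1)
```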
This feature is also nicely illustrated by Murphy's decomposition of the Brier score; see [12]. When assessing the predictive accuracy of different binary predictions, for example, several logistic regression models, the Brier score can be used to compare model performances; see [6, Box 1]. The Brier score is mainly a relative measure: a lower score points to a superior model, whereas the actual value of the score seems of limited use.

In the decomposition (1), the first summand has expectation 0 under perfect calibration, that is, if p = π. This is exploited in the construction of Spiegelhalter's z-statistic [10, 13], which enables a formal assessment of the calibration of binary predictions. The z-statistic is defined as

Z(p, x) = \frac{\sum_{i=1}^{n} (x_i - p_i)(1 - 2 p_i)}{\sqrt{\sum_{i=1}^{n} (1 - 2 p_i)^2 p_i (1 - p_i)}}.   (2)

The null hypothesis of calibration, that is, p = π, is rejected at significance level α if |Z(p, x)| > q_{1-α/2}, where q_α is the α-quantile of the standard normal distribution. This short summary makes it clear that
- calibration is not equal to prediction error, and
- a lower Brier score does not necessarily indicate better calibration,
as is suggested in [9, p. 1253]. Indeed, suppose two Bernoulli experiments are performed, two forecasters issue predictions p1 = (0.2, 0.2) and p2 = (0.4, 0.5), and x = (1, 0) materializes. The resulting Brier scores and values of Spiegelhalter's z-statistic for these two competing models are provided in Table 1.

Table 1. Brier score and Spiegelhalter's z-statistic for the two models and the realization x = (1, 0)

Model   p            B(p, x)   Z(p, x)
1       (0.2, 0.2)   0.34      1.06
2       (0.4, 0.5)   0.30      1.22

According to the value of the Brier score, the second model is to be preferred over the first; however, because |Z(p2, x)| > |Z(p1, x)|, this second model is less well calibrated than the first. This simple example reveals that it is not generally true that a lower Brier score implies better model calibration. The reason is that the Brier score simultaneously addresses calibration and sharpness, as discussed above. To exclusively address calibration, Spiegelhalter's z-test should be used.
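The following sketch, our own illustration rather than part of the original letter (function names are ours), computes Spiegelhalter's z-statistic (2) with its two-sided p-value and reproduces the values in Table 1.

```python
import numpy as np
from scipy.stats import norm  # standard normal distribution

def spiegelhalter_z(p, x):
    """Spiegelhalter's z-statistic (2) for predictions p and binary outcomes x."""
    p, x = np.asarray(p, dtype=float), np.asarray(x, dtype=float)
    num = np.sum((x - p) * (1 - 2 * p))
    den = np.sqrt(np.sum((1 - 2 * p) ** 2 * p * (1 - p)))
    return num / den

def calibration_test(p, x, alpha=0.05):
    """Two-sided test of the calibration null hypothesis p = pi."""
    z = spiegelhalter_z(p, x)
    p_value = 2 * norm.sf(abs(z))
    reject = abs(z) > norm.ppf(1 - alpha / 2)
    return z, p_value, reject

# Reproduce Table 1 with x = (1, 0)
x = [1, 0]
for p in ([0.2, 0.2], [0.4, 0.5]):
    z, p_value, reject = calibration_test(p, x)
    print(p, round(z, 2), round(p_value, 2), reject)
# Expected z-values: 1.06 for p1 = (0.2, 0.2) and 1.22 for p2 = (0.4, 0.5);
# neither rejects the calibration hypothesis at the 5% level in this toy example.
```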
To us, it is therefore unclear what the purpose is of testing whether the Brier score differs from 0, as indicated in [9, Table 2]. Instead, we hazard a guess that the authors actually intended to say that the models marked with a "∗" in [9, Table 2] are not well calibrated. However, this corresponds not to a Brier score that differs significantly from 0, but to a Spiegelhalter z-statistic that does.

To conclude, we advocate the use of the Brier score to assess the predictive accuracy of binary prediction models, and we agree that calibration of such models is an important issue that should be addressed when comparing models via the Brier score. With this short note, we intended to clarify some aspects of using these tools.
References
[1] Brier GW. Verification of forecasts expressed in terms of probability. Mon Weather Rev 1950;78:1-3.
[2] Itoh S, Ikeda M, Mori Y, Suzuki K, Sawaki A, Iwano S, et al. Lung: feasibility of a method for changing tube current during low-dose helical CT. Radiology 2002;224:905-912.
[3] Vergouwe Y, Steyerberg EW, Eijkemans MJ, Habbema JD. Substantial effective sample sizes were required for external validation studies of predictive logistic regression models. J Clin Epidemiol 2005;58:475-483.
[4] Huo D, Senie RT, Daly M, Buys SS, Cummings S, Ogutha J, et al. Prediction of BRCA mutations using the BRCAPRO model in clinic-based African American, Hispanic, and other minority families in the United States. J Clin Oncol 2009;27:1184-1190.
[5] Steyerberg EW. Clinical prediction models. New York, NY: Springer; 2009.
[6] Harrison DA, Brady AR, Parry GJ, Carpenter JR, Rowan K. Recalibration of risk prediction models in a large multicenter cohort of admissions to adult, general critical care units in the United Kingdom. Crit Care Med 2006;34:1378-1388.
[7] Steyerberg EW, Harrell FE, Borsboom GJ, Eijkemans MJ, Vergouwe Y, Habbema JD. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol 2001;54:774-781.
[8] Redelmeier DA, Bloch DA, Hickam DH. Assessing predictive accuracy: how to compare Brier scores. J Clin Epidemiol 1991;44:1141-1146.
[9] Lix LM, Yogendran MS, Leslie WD, Shaw SY, Baumgartner R, Bowman C, et al. Using multiple data features improved the validity of osteoporosis case ascertainment from administrative databases. J Clin Epidemiol 2008;61:1250-1260.
[10] Spiegelhalter DJ. Probabilistic prediction in patient management and clinical trials. Stat Med 1986;5:421-433.
[11] Gneiting T, Raftery AE. Strictly proper scoring rules, prediction, and estimation. J Am Stat Assoc 2007;102:359-378.
[12] Murphy AH. Scalar and vector partitions of the probability score: Part I. Two-state situation. J Appl Meteorol 1972;11:273-282.
[13] StataCorp. STATA, reference A-F. Stata Corporation; 2003.