Peer-reviewed article

Correlation, Agreement, and Bland–Altman Analysis: Statistical Analysis of Method Comparison Studies

2009; Elsevier BV; Volume: 148; Issue: 1; Language: English

10.1016/j.ajo.2008.09.032

ISSN

1879-1891

Authors

Catey Bunce

Topic(s)

Reliability and Agreement in Measurement

Abstract

Technology seems to evolve at a rapid pace these days, and new methods for measuring ocular characteristics seem to be emerging constantly. Although once there might have been a single method to assess intraocular pressure (IOP), ophthalmic researchers today are presented with a variety of tools: dynamic contour tonometry, Goldmann applanation tonometry, and hand-held tonometers such as the Tono-Pen XL, Perkins tonometer, and Draeger tonometer. The same is true for visual field assessment (Humphrey Field Analyzer [Humphrey Instruments, Dublin, California, USA], Octopus perimeter [Interzeag, Schlieren, Switzerland] with their various threshold strategies) and optic disc evaluation (confocal laser ophthalmoscope, scanning laser polarimeter). It would be unwise simply to assume that measures made on the same person using different methods of measurement will agree, and so studies are designed to address the question. Such studies may compare measurement with a new piece of equipment (perhaps cheaper, faster, or smaller) with the so-called true measurement, but more often they compare two different measuring devices where neither can be said to offer the truth.

In 1983, Altman and Bland set out their views regarding the correct analysis of the data gathered in studies of this type and drew attention to a common misconception that computation of the Pearson correlation coefficient between the two measurements is appropriate [1. Altman D.G., Bland J.M. Measurement in medicine: the analysis of method comparison studies. Statistician. 1983; 32: 307-317] [2. Bland J.M., Altman D.G. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986; 1: 307-310]. They explained that the Pearson correlation coefficient measures linear association rather than agreement and pointed out that methods can correlate well yet disagree greatly, as would occur if one method read consistently higher than the other. Bland and Altman commented that correlation typically depends on the range of measures being assessed: a wider range often yields a higher correlation, but not because the methods agree any better. They concluded that correlation coefficients can be misleading in method agreement studies and put forward their alternative method, the limits of agreement (LoA) technique.

Bland and Altman stress the need to assess two aspects of agreement: how well the methods agree on average and how well the measurements agree for individuals. If one method reads lower than the other for half of the subjects but higher than the other for the remaining subjects, then overall the average discrepancy (the difference between measures on the same subject) may be close to 0, despite discrepancy for individuals being high. Average agreement, or bias, can be estimated by the mean of the differences for individuals, and commonly a t test is conducted against the null hypothesis of no bias. Estimates of bias then can be reported with 95% confidence intervals (CIs) computed as the mean difference ± 1.96 × standard error of the mean difference. Agreement for individuals is summarized in terms of LoA, which involves an examination of the variability of the differences. If the distribution of the differences is reasonably normal, that is, symmetric and without long tails (assessed by a histogram), and provided that the level of discrepancy does not depend on the level of the characteristic being measured, then 95% LoA can be computed as the mean of the differences ± 1.96 × standard deviation (SD) of the differences.
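The contrast between correlation and agreement, and the bias and LoA formulas just described, can be illustrated with a minimal Python sketch. The IOP readings are hypothetical values invented for illustration (they are not data from any study), and the helper names `pearson_r` and `bias_and_limits` are assumptions of this sketch. The 1.96 multiplier follows the text; with a sample this small, a t quantile would give somewhat wider intervals.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient: a measure of linear association."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bias_and_limits(method_a, method_b, z=1.96):
    """Bias, its 95% CI, and 95% limits of agreement for paired readings."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    n = len(diffs)
    bias = statistics.mean(diffs)        # average agreement
    sd = statistics.stdev(diffs)         # sample SD of the differences
    se_bias = sd / math.sqrt(n)          # standard error of the mean difference
    return {
        "bias": bias,
        "sd": sd,
        "t_statistic": bias / se_bias,   # compare with t on n - 1 df
        "bias_ci": (bias - z * se_bias, bias + z * se_bias),
        "loa": (bias - z * sd, bias + z * sd),
    }


# Hypothetical paired IOP readings (mm Hg): the new device reads
# consistently about 4 mm Hg higher than the standard.
standard = [12, 14, 15, 17, 18, 20, 22, 24, 27, 30]
offsets = [3.5, 4.5, 4.0, 3.0, 5.0, 4.0, 4.5, 3.5, 4.0, 4.0]
new_device = [s + o for s, o in zip(standard, offsets)]

r = pearson_r(standard, new_device)
result = bias_and_limits(new_device, standard)

print(f"Pearson r = {r:.3f}")
print(f"bias = {result['bias']:.2f} mm Hg")
print("95% CI for bias = ({:.2f}, {:.2f})".format(*result["bias_ci"]))
print("95% LoA = ({:.2f}, {:.2f})".format(*result["loa"]))
```

In this invented data set the correlation is close to 1 even though every reading from the new device is several mm Hg higher than the standard, which is the situation the text warns about.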
The second assumption is assessed by examination of the Bland–Altman plot, a scatterplot of the difference between measurements against their average. The plot should be inspected for any relationship between discrepancy and the level of measurement (eg, increasing discrepancy between standard and test A with increasing IOP [Figure 1] or increasing variability of differences between instruments with increasing IOP [Figure 2]). Figure 3 shows the situation where there is no relationship between discrepancy and level of measurement, in which case 95% LoA would be appropriate. Where relationships are observed, Bland and Altman make recommendations as to how to remove these by transformation or regression [3. Bland J.M., Altman D.G. Measuring agreement in method comparison studies. Stat Methods Med Res. 1999; 8: 135-160].

FIGURE 2. Bland–Altman plot showing the difference against the average of test B and standard measurements with LoA (broken lines); simulated data. This plot shows evidence of increasing variability of differences between instruments with increasing IOP; here the LoA clearly are too wide at lower levels of IOP.

FIGURE 3. Bland–Altman plot showing the difference against the average of test C and standard measurements with LoA (broken lines); simulated data. This plot shows no relationship between discrepancy and the level of measurement, so that LoA are valid.

Ninety-five percent LoA quantify the range of values that can be expected to cover agreement for most of the subjects, thereby guiding the clinician as to whether methods agree sufficiently for use in clinical assessment.
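The Bland–Altman check described above is visual, but a rough numeric companion can be sketched in Python. The sketch below computes the plotted points (average, difference) and two least-squares slopes: the slope of the differences on the averages (a trend in discrepancy, as in the Figure 1 scenario) and the slope of the absolute deviations from the bias on the averages (a trend in variability, as in the Figure 2 scenario). The data, the helper names, and the choice of these two slopes are assumptions of this sketch; it is not a substitute for looking at the plot and not a reproduction of the author's figures.

```python
import statistics


def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def bland_altman_check(test, standard):
    """Return the Bland-Altman plot points plus two simple trend summaries."""
    averages = [(t + s) / 2 for t, s in zip(test, standard)]
    diffs = [t - s for t, s in zip(test, standard)]
    bias = statistics.mean(diffs)
    abs_dev = [abs(d - bias) for d in diffs]
    return {
        "averages": averages,        # x axis of the plot
        "differences": diffs,        # y axis of the plot
        "trend_slope": ols_slope(averages, diffs),
        "spread_slope": ols_slope(averages, abs_dev),
    }


# Hypothetical IOP readings (mm Hg), invented for illustration.
standard = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
signs = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]

test_a = [s + 0.1 * s for s in standard]                          # discrepancy grows with IOP
test_b = [s + sign * 0.1 * s for s, sign in zip(standard, signs)]  # variability grows with IOP
test_c = [s + sign * 1.0 for s, sign in zip(standard, signs)]      # no relationship

check_a = bland_altman_check(test_a, standard)
check_b = bland_altman_check(test_b, standard)
check_c = bland_altman_check(test_c, standard)

for name, check in (("A", check_a), ("B", check_b), ("C", check_c)):
    print(f"test {name}: trend slope = {check['trend_slope']:.3f}, "
          f"spread slope = {check['spread_slope']:.3f}")
```

A slope near 0 on both summaries is consistent with the assumptions behind the simple 95% LoA; a clearly non-zero slope suggests inspecting the plot closely and considering the transformation or regression approaches mentioned in the text.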
For example, 95% LoA between two methods of (−1 mm Hg, 6 mm Hg) would mean that for 95% of individuals, a measurement made by one method would be between 1 mm Hg less and 6 mm Hg more than a measurement made by the other method. It should be understood that “how small LoA should be to conclude that methods agree sufficiently” is a clinical, not a statistical, decision, and it is a decision that ideally is made in advance of the analysis [4. Bland J.M., Altman D.G. Applying the right statistics: analyses of measurement studies. Ultrasound Obstet Gynecol. 2003; 22: 85-93]. It is not possible to provide a formulaic approach that automatically classifies agreement into good or poor or to provide guidance on which method to use when disagreement is considerable, because this will depend on the particular purpose for which measurements are being made. The question that needs consideration is whether the largest likely differences are small enough for the particular purpose for which measurements are wanted. Because one method comparison study provides a single estimate of LoA, ideally these should be reported with their 95% CIs, computed as the lower or upper limit ± 1.96 × standard error (limit), where the standard error (limit) is given approximately by $\sqrt{3s^{2}/n}$, s being the SD of the differences between measurements by the two methods and n being the sample size.

It is important that studies comparing methods of measurement are adequately sized: if the number of subjects is small, then even large discrepancies between methods may not be detected. Such studies typically require 100 to 200 subjects.
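The effect of sample size on the precision of the LoA can be sketched with the approximate standard error quoted above, $\sqrt{3s^{2}/n}$. In the Python sketch below, the bias of 2.5 mm Hg and SD of 1.8 mm Hg are hypothetical values chosen so that the limits fall near the (−1, 6) mm Hg example in the text, and the function name `limits_with_ci` is an assumption of this sketch.

```python
import math


def limits_with_ci(bias, sd, n, z=1.96):
    """95% limits of agreement with approximate 95% CIs for each limit."""
    lower = bias - z * sd
    upper = bias + z * sd
    se_limit = math.sqrt(3 * sd ** 2 / n)   # approximate SE of each limit
    half_width = z * se_limit
    return {
        "se_limit": se_limit,
        # each entry is (CI lower bound, estimated limit, CI upper bound)
        "lower": (lower - half_width, lower, lower + half_width),
        "upper": (upper - half_width, upper, upper + half_width),
    }


# Hypothetical summary statistics (mm Hg), invented for illustration.
for n in (20, 100, 200):
    out = limits_with_ci(bias=2.5, sd=1.8, n=n)
    lo_ci, lo, hi_ci = out["lower"][0], out["lower"][1], out["lower"][2]
    print(f"n = {n:3d}: lower limit {lo:.2f} (95% CI {lo_ci:.2f} to {hi_ci:.2f}), "
          f"SE(limit) = {out['se_limit']:.2f}")
```

Because the standard error shrinks with the square root of n, the CI around each limit is much wider at n = 20 than at n = 200, which is one way to see why small studies can miss clinically important disagreement.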
Without large numbers, there is a very real potential for incorrectly finding a new method acceptable and for such methods to be recommended for widespread use without justification.

Although there does seem to be evidence of increasing awareness of the methodology put forward by Altman and Bland [5. Patton N., Aslam T., Murray G. Statistical strategies to assess reliability in ophthalmology. Eye. 2006; 20: 749-754], there also seems to be evidence of some common misunderstandings [6. Dewitte K., Fierens C., Stöckl D., Thienpont L.M. Application of the Bland-Altman plot for interpretation of method-comparison studies: a critical investigation of its practice. Clin Chem. 2002; 48: 799-801]. Many authors use correlation in addition to limits of agreement, suggesting that they view these as complementary rather than alternatives [7. King A.J., Taguri A., Wadood A.C., Azuara-Blanco A. Comparison of two fast strategies, SITA Fast and TOP, for the assessment of visual fields in glaucoma patients. Graefes Arch Clin Exp Ophthalmol. 2002; 240: 481-487] [8. Allen R.J., Dev Borman A., Saleh G.M. Applanation tonometry in silicone hydrogel contact wearers. Cont Lens Anterior Eye. 2007; 30: 267-269]. Other authors seem to believe that the Bland–Altman plot is the analysis rather than a check on the assumptions necessary for validation of the LoA.

The methods put forward by Bland and Altman seem simple, and yet their message that the Pearson correlation coefficient is not the correct tool for assessing method agreement does not seem to have been fully acknowledged. The Bland–Altman plot is commonly included, yet its raison d'être, to check that the assumptions necessary for valid use of LoA hold, does not seem to be understood.
Ophthalmic researchers are encouraged not to use the Pearson correlation coefficient when analyzing data from method agreement studies and to report LoA and bias with their respective CIs.

This study was supported by the National Institute for Healthcare Research (NIHR), London, United Kingdom and Development funding. The author indicates no financial conflict of interest. The author was involved in the design of the study; the collection, management, analysis, and interpretation of data; and the preparation and review of the manuscript.

Reference(s)