Letter · Open access · Peer reviewed

The difference between reliability and agreement

2011; Elsevier BV; Volume: 64; Issue: 6; Language: English

10.1016/j.jclinepi.2010.12.001

ISSN

1878-5921

Authors

Jan Kottner, David L. Streiner

Topic(s)

Meta-analysis and systematic reviews

Abstract

In their article, Costa-Santos et al. [1] provide a valuable example of the difficulties in comparing and interpreting reliability and agreement coefficients arising from the same measurement situation. Debates and proposals about the correct coefficients for measuring agreement and reliability can be traced back to the early 1980s [2,3]. Various approaches have been discussed to overcome the "limitations" and "drawbacks" of reliability measures (e.g., Refs. [4,5]), and even today, new alternatives are being proposed (e.g., Refs. [6,7]). However, it seems that much of the confusion around reliability and agreement estimation was, and still is, caused by conceptual ambiguities. There are important differences between the concepts of agreement and reliability (e.g., Refs. [8,9]). Agreement addresses the question of whether diagnoses, scores, or judgments are identical or similar, or the degree to which they differ. In this situation, the absolute degree of measurement error is of interest. Consequently, variability between subjects, or the distribution of the rated trait in the population, does not matter. For instance, percent agreement for nominal data or limits of agreement for interval and ratio data are excellent measures because they provide exactly this kind of information in a simple manner. Reliability coefficients, on the other hand, are different. Reliability is typically defined as the ratio of the variability between subjects (e.g., as scored by different raters or at different times) to the total variability of all scores in the sample. Therefore, reliability coefficients (e.g., kappa, the intraclass correlation coefficient) provide information about the ability of the scores to distinguish between subjects.
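To make the distinction concrete, the two concepts can be written in their most common forms (a brief illustration added here, not part of the original letter): reliability as a variance ratio, and, for interval or ratio data, agreement as Bland and Altman's 95% limits of agreement, where d-bar and s_d denote the mean and standard deviation of the within-subject differences between two measurements.

```latex
% Reliability: the share of total variance attributable to differences between subjects
\text{Reliability} \;=\; \frac{\sigma^{2}_{\text{subjects}}}{\sigma^{2}_{\text{subjects}} + \sigma^{2}_{\text{error}}}

% Agreement (interval/ratio data): 95% limits of agreement for two measurements per subject
\text{LoA} \;=\; \bar{d} \;\pm\; 1.96\, s_{d}
```

The exact composition of the error term depends on which intraclass correlation model is chosen, but in every variant the coefficient shrinks toward zero as the between-subject variance shrinks, whereas the limits of agreement are unaffected by it.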
From this, it also follows that reliability coefficients must be low when there is little variability among the scores or diagnoses obtained from the instrument under investigation. This occurs when the range of obtained scores is restricted or when prevalence is very high or very low. For example, if all raters rate all medical students as "excellent," agreement is perfect, but the reliability of the scale is zero because there is no between-subject variance. It should also be noted that exact agreement among raters or over time does not enter into (most of) the formulas for reliability, because all that matters is that the subjects are rank ordered similarly across time or by different raters. Interestingly, in their introduction, Costa-Santos et al. [1] refer to Burdock et al. [10], stating that these authors proposed a "cutoff value of 0.75 … to signify good agreement" (page XX). In fact, Burdock et al. said the following: "A high intraclass correlation coefficient, e.g. R≥0.75, means that there is relatively little residual variability to confound good discriminations among subjects…" [10, p. 376]; that is, reliability, not agreement. In accordance with Vach [11] and many others, we suggest ending the debate regarding the "inconsistencies" between reliability and agreement measures, because the two provide different types of information. We disagree with the conclusion that "Agreement remains a difficult concept to represent mathematically, and further development of statistical methods needs to be considered" [1, p. XX]. Compared with many other statistical techniques, the computation of agreement is rather simple and leads to straightforward interpretations. Today, we can also choose among several reliability coefficients for various types of data and sampling designs. A clear distinction between the conceptual meanings of agreement and reliability is necessary to select appropriate statistical approaches and to enable adequate interpretations.
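The "all excellent" example can be reproduced numerically. The following sketch (our illustration, not code from the letter; the data and helper names are invented) contrasts percent agreement with a one-way intraclass correlation, ICC(1), for two toy rating tables:

```python
# Minimal sketch: percent agreement versus ICC(1) for two illustrative scenarios.

def percent_agreement(r1, r2):
    """Proportion of subjects on whom two raters give identical scores."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def icc_oneway(ratings):
    """ICC(1) from a subjects x raters table: (MSB - MSW) / (MSB + (k-1)*MSW)."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    subj_means = [sum(row) / k for row in ratings]
    msb = k * sum((m - grand) ** 2 for m in subj_means) / (n - 1)
    msw = sum((x - m) ** 2 for row, m in zip(ratings, subj_means) for x in row) / (n * (k - 1))
    denom = msb + (k - 1) * msw
    return 0.0 if denom == 0 else (msb - msw) / denom  # no variance at all -> reported as 0

# Scenario A: both raters rate every student "excellent" (5 on a 5-point scale).
a = [[5, 5]] * 6
# Scenario B: students differ, and the two raters agree exactly.
b = [[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [3, 3]]

for name, data in [("A (all excellent)", a), ("B (variable scores)", b)]:
    r1, r2 = [row[0] for row in data], [row[1] for row in data]
    print(name, "agreement:", percent_agreement(r1, r2), "ICC(1):", round(icc_oneway(data), 2))
# A: agreement = 1.0 but ICC = 0.0 (no between-subject variance to distinguish subjects)
# B: agreement = 1.0 and ICC = 1.0
```

As the letter argues, the two indices answer different questions: scenario A shows perfect agreement with zero reliability, because there is nothing left for the scale to discriminate.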

Linked article: Costa-Santos C, et al. Observer reliability and agreement: differences, difficulties, and controversies. J Clin Epidemiol. 2011;64(6) (authors' reply to this letter).

Reference(s)

1. Costa-Santos C, Bernardes J, Ayres-de-Campos D, Costa A, Costa C. The limits of agreement and the intraclass correlation coefficient may be inconsistent in the interpretation of agreement. J Clin Epidemiol. 2011;64:264-269.
2. House AE, House BJ, Campbell MB. Measures of interobserver agreement: calculation formulas and distribution effects. J Behav Assess. 1981;3:37-57.
3. Goodwin LD, Prescott PA. Issues and approaches to estimating interrater reliability in nursing research. Res Nurs Health. 1981;4:323-337.
4. Zwick R. Another look at interrater agreement. Psychol Bull. 1988;103:374-378.
5. Cicchetti DV, Feinstein AR. High agreement but low kappa: II. Resolving the paradoxes. J Clin Epidemiol. 1990;43:551-558.
6. Gwet KL. Computing inter-rater reliability and its variance in the presence of high agreement. Br J Math Stat Psychol. 2008;61:29-48.
7. Costa-Santos C, Antunes L, Souto A, Bernardes J. Assessment of disagreement: a new information-based approach. Ann Epidemiol. 2010;20:555-561.
8. de Vet HC, Terwee CB, Knol DL, Bouter LM. When to use agreement versus reliability measures. J Clin Epidemiol. 2006;59:1033-1039.
9. Streiner DL, Norman GR. Health measurement scales. 4th ed. Oxford, UK: Oxford University Press; 2008.
10. Burdock EI, Fleiss JL, Hardesty AS. A new view of inter-observer agreement. Pers Psychol. 1963;16:373-384.
11. Vach W. The dependence of Cohen's kappa on the prevalence does not matter. J Clin Epidemiol. 2005;58:655-661.