Journal Club: The Measurement of Reliability
2009; Elsevier BV; Volume: 54; Issue: 1; Language: English
DOI: 10.1016/j.annemergmed.2009.05.012
ISSN: 1097-6760
Authors: Frank C. Day, David L. Schriger
Topic(s): Health Systems, Economic Evaluations, Quality of Life
Abstract

Editor's Capsule Summary for Cruz et al [1]

What is already known on this topic
Valid clinical research requires high-quality data collection. Physicians are commonly considered the standard by which valid prospective data are obtained.

What question this study addressed
This study determined whether non–medically trained research assistants could reliably collect subjective historical data from emergency department patients with chest pain.

What this study adds to our knowledge
This prospective comparative study included 33 research assistants, 39 physicians, and 143 patients. Research assistants demonstrated fair to excellent reliability (as defined by crude agreement and kappa) when obtaining cardiac histories and cardiac risk factors.

How this might change clinical practice
The results of this study will not change clinical practice. They do, however, provide evidence to support the use of trained research assistants for the collection of certain types of clinical data.

Discussion Points

1. Cruz et al [1] contains 2 parts: a comparison of the values gathered by trained research assistants and by physicians for historical information in chest pain patients, and a comparison of these participants' recordings with a "correct" value for each item.
A. For each part, indicate whether the authors are studying reliability or validity, and explain the difference between these concepts.
B. What did the authors use as their criterion standard for the validity analysis?
C. What are potential problems with their method of defining the criterion (gold) standard? Can you think of alternative approaches?
D. The authors report crude agreement and interquartile range for their validity analysis. What part of a distribution is described by the interquartile range? List other statistics used to describe the validity of a measure and explain why they might be preferable to reporting crude agreement.

2. Crude percentage agreement is a simple way to report reliability. Consider the contingency table for the question "Was the quality of the chest pain crushing?" (yes or no):

Table 1.
                      MD Recorded "Yes"   MD Recorded "No"   Total
RA recorded "yes"            117                  6            123
RA recorded "no"              18                  2             20
Total                        135                  8            143
MD, medical doctor; RA, research assistant.

A. Calculate the crude percentage agreement for this table. What is the range of possible values for percentage agreement?
B. Calculate Cohen's κ for this table. What is the formula for κ for raters making a binary assessment (eg, yes/no or true/false)? Discuss the purpose of Cohen's κ, its range, and the interpretations of key values such as –1, 0, and 1. (A short computational sketch for parts A and B follows this point.)
C. What other measures can be used to assess reliability for binary, categorical, and continuous data?
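The calculations in 2A and 2B can be checked with a few lines of code. The sketch below is our own illustration, not part of the original article; it applies the standard definitions: observed agreement p_o is the proportion of cases on the main diagonal, chance-expected agreement p_e is computed from the row and column marginals, and Cohen's κ = (p_o − p_e) / (1 − p_e).

```python
# Minimal sketch (assumed layout follows Table 1 above): rows are the RA's answers,
# columns are the MD's answers, cell [0][0] = both "yes", cell [1][1] = both "no".

def agreement_and_kappa(table):
    """Return (crude agreement, Cohen's kappa) for a 2 x 2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    p_o = (a + d) / n                              # observed (crude) agreement
    ra_yes, ra_no = (a + b) / n, (c + d) / n       # RA marginal proportions
    md_yes, md_no = (a + c) / n, (b + d) / n       # MD marginal proportions
    p_e = ra_yes * md_yes + ra_no * md_no          # agreement expected by chance
    return p_o, (p_o - p_e) / (1 - p_e)

# Table 1 from discussion point 2
p_o, k = agreement_and_kappa([[117, 6], [18, 2]])
print(f"crude agreement = {p_o:.1%}, kappa = {k:.2f}")
# Expect roughly 83% crude agreement but a kappa near 0.07: because "no" answers
# are rare, chance-expected agreement is already high and kappa is heavily penalized.
```

The contrast between these two numbers previews the issues explored in discussion points 4 and 5.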
3. Cruz et al quote the oft-cited Landis and Koch [2] article, which states that a κ of "less than 0.2 represents poor agreement; 0.21 to 0.40, fair agreement; 0.41 to 0.60, moderate agreement; 0.61 to 0.80, good agreement; and 0.81 to 1.00, excellent agreement." Consider studies of the agreement of airline pilots deciding whether it is safe to land and of psychologists deciding whether interviewees have type A or type B personalities. If the studies produced the same numeric κ value, would the adjectives assigned by Landis and Koch be equally appropriate?

4. A. Imagine 2 blindfolded, intelligent individuals who are sitting in distant corners of a room and listening to 100 easy true/false statements such as "red is a color," "2+2=5," etc, over a loudspeaker. Each indicates his or her choice by pressing a button in the left hand for "false" and in the right hand for "true." Questions are not repeated, and the respondents are expected to offer a response for each statement. Verify that if they agree on all 100 answers, percentage agreement is 100% and κ is 1.0, regardless of how many statements are true and how many are false. Now imagine that the testing site is under the final approach for a major airport and that, at times, noise from jets flying overhead drowns out the statements from the loudspeaker. When this occurs, respondents agree, on average, only half the time (as one would expect from guessing). Recalculate percentage agreement and κ for the same 100-statement test conducted under the following sets of conditions: (1) half the statements are true and 1% of the statements are rendered incomprehensible by the planes; (2) 90% of the statements are true and 1% of the statements are rendered incomprehensible by the planes; (3) half the statements are true and 20% of the statements are rendered incomprehensible by the planes; (4) 90% of the statements are true and 20% of the statements are rendered incomprehensible by the planes. Discuss the meaning of percentage agreement and κ in these 4 settings. (A simulation sketch for these 4 conditions follows this point.)
B. Imagine that the right-hand table in the accompanying figure (two 2 × 2 tables, not reproduced here) was from the true/false experiment described above and that planes were flying so frequently that every question was somewhat difficult to hear. Imagine 2 scenarios: in the first, both raters are told that there are 80 true statements and 20 false statements; in the second, raters are told that there could be 100 true statements with no false statements, 100 false statements with no true statements, or any combination in between, with each having an equal probability of occurring. Does κ mean the same thing in these 2 situations?
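The 4 conditions in 4A can also be explored by simulation. The sketch below is our own and rests on 2 assumptions that go slightly beyond the text: an audible statement is answered correctly by both listeners, and an inaudible statement is answered by an independent coin flip from each listener. Percentage agreement and κ are then averaged over many repetitions of a 100-statement test.

```python
# Simulation sketch for discussion point 4A (assumptions: audible statements are
# answered correctly by both listeners; inaudible ones get independent 50/50 guesses).
import math
import random

def agreement_and_kappa_from_pairs(pairs):
    """Crude agreement and Cohen's kappa for paired binary (True/False) answers."""
    n = len(pairs)
    p_o = sum(1 for x, y in pairs if x == y) / n
    p1 = sum(x for x, _ in pairs) / n          # rater 1's proportion of "true"
    p2 = sum(y for _, y in pairs) / n          # rater 2's proportion of "true"
    p_e = p1 * p2 + (1 - p1) * (1 - p2)
    if p_e == 1:                               # kappa is undefined for degenerate marginals
        return p_o, math.nan
    return p_o, (p_o - p_e) / (1 - p_e)

def simulate(p_true, p_noise, n_items=100, n_reps=2000, seed=0):
    rng = random.Random(seed)
    agreements, kappas = [], []
    for _ in range(n_reps):
        pairs = []
        for _ in range(n_items):
            truth = rng.random() < p_true
            if rng.random() < p_noise:         # drowned out by a plane: both guess
                pairs.append((rng.random() < 0.5, rng.random() < 0.5))
            else:                              # heard clearly: both answer correctly
                pairs.append((truth, truth))
        p_o, k = agreement_and_kappa_from_pairs(pairs)
        agreements.append(p_o)
        if not math.isnan(k):                  # skip the rare degenerate repetitions
            kappas.append(k)
    return sum(agreements) / len(agreements), sum(kappas) / len(kappas)

for p_true, p_noise in [(0.5, 0.01), (0.9, 0.01), (0.5, 0.20), (0.9, 0.20)]:
    p_o, k = simulate(p_true, p_noise)
    print(f"P(true) = {p_true:.0%}, noise = {p_noise:.0%}: agreement ≈ {p_o:.2f}, kappa ≈ {k:.2f}")
```

In runs like this, conditions 3 and 4 give essentially the same percentage agreement, but κ is noticeably lower when 90% of the statements are true, because the skewed answer distribution raises the agreement expected by chance.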
5. Finally, the accompanying graph (not reproduced here) plots percentage agreement against κ for the first 50 items in Table 1 of Cruz et al. The points are shaded to indicate how many subjects fall into the smallest cell of the 2 × 2 table.
A. Four lines in the table are denoted with square markers (near the arrow) on the graph (Is pain burning? Does it radiate to the back? Does it radiate to the jaw? Does it radiate to the left arm?). Create approximate 2 × 2 tables for these 4 points. Can you explain why these tables have similar percentage agreement but varying κs? Which do you believe is the better measure? Why do the κs differ?
B. Can you comment on the relationship between the size of the smallest cell in the 2 × 2 table and the extent to which κ may deviate from percentage agreement? (A brief numeric sketch follows this point.)
C. Given the problems with both percentage agreement and κ illustrated in these examples, do you think it would be better if investigators presented the 4 numbers in the inner cells of each 2 × 2 table instead of reporting the percentage agreement or κ?
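To make the pattern asked about in 5B concrete, the following sketch (ours; the cell counts are invented and are not taken from Cruz et al) compares 2 hypothetical tables that have nearly identical crude agreement but very different smallest cells.

```python
# Two invented 2 x 2 tables with similar crude agreement but different smallest cells.
# The helper is the same one used in the earlier sketch, repeated so this runs on its own.

def agreement_and_kappa(table):
    """Crude agreement and Cohen's kappa for a 2 x 2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    p_o = (a + d) / n
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return p_o, (p_o - p_e) / (1 - p_e)

rare_yes   = [[3, 6], [7, 127]]    # smallest cell = 3; "yes" answers are rare
even_split = [[65, 6], [7, 65]]    # smallest cell = 6; "yes" and "no" both common

for name, table in [("rare 'yes'", rare_yes), ("even split", even_split)]:
    p_o, k = agreement_and_kappa(table)
    print(f"{name:>10}: agreement = {p_o:.1%}, kappa = {k:.2f}")
# Both tables agree on about 91% of cases, yet kappa is far lower when one answer
# is rare, because chance-expected agreement is already close to the observed agreement.
```

When the smallest cell is tiny relative to the sample, shifting a handful of patients can move κ substantially while barely changing percentage agreement, which bears on the question raised in 5C.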
Linked Article: Interrater Reliability and Accuracy of Clinicians and Trained Research Assistants Performing Prospective Data Collection in Emergency Department Patients With Potential Acute Coronary Syndrome. Annals of Emergency Medicine, Vol. 54, Issue 1. Preview: Clinical research requires high-quality data collection. Data collected at the emergency department evaluation is generally considered more precise than data collected through chart abstraction but is cumbersome and time consuming. We test whether trained research assistants without a medical background can obtain clinical research data as accurately as physicians. We hypothesize that they would be at least as accurate because they would not be distracted by clinical requirements.

Correction: Annals of Emergency Medicine, Vol. 54, Issue 6. In the July 2009 issue, in the Journal Club ("The Measurement of Reliability," page 10, table on the left in answer 4C), in column 3, row 3, it should have said "4," not "40." We apologize for this error.
References
1. Cruz CO, Meshberg EB, Shofer FS, et al. Interrater reliability and accuracy of clinicians and trained research assistants performing prospective data collection in emergency department patients with potential acute coronary syndrome. Ann Emerg Med. 2009;54:1-7.
2. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159-174.