Article; Open access; Peer reviewed

Examining Subgroup Differences on the Computer-based Case-Simulation Component of USMLE Step 3

2002; Lippincott Williams & Wilkins; Volume: 77; Issue: Supplement; Language: English

10.1097/00001888-200210001-00027

ISSN

1938-808X

Authors

Melissa J. Margolis, Brian E. Clauser, Polina Harik, M. Jeanne Guernsey

Topic(s)

Innovations in Medical Education

Abstract

During recent decades, consequential validity has become an increasingly important standard by which tests are evaluated.1 Consequential validity is the component of the overall validity argument that deals with the social consequences of testing. One aspect that should be addressed is how a test may affect examinees from different definable groups (e.g., gender, ethnicity, English-language status). Research into gender differences on achievement tests and on undergraduate, graduate, and professional school admission tests documents a long-standing history of performance differences between men and women.2,3,4 Additional evidence suggesting that essay tests tend to yield smaller gender differences than do multiple-choice assessments provides valuable information for test developers who want to address potential gender-subgroup differences.3,5 Investigations of performance in a high-stakes medical licensure context also report differences between gender subgroups6,7,8 and between native and non-native English speakers.8,9 On Step 1 of the United States Medical Licensing Examination (USMLE), performance advantages were observed for men,7,8 while on USMLE Step 2 the performances of men and women tended to be quite similar.6,7 For both Steps 1 and 2, native English speakers performed better than examinees for whom English is not the native language.8,9

Although most of the available research deals with the impact of subgroup differences on scores from traditional multiple-choice question (MCQ) and essay examinations, the growing inclusion of performance assessments in high-stakes tests makes investigating the consequential validity of these components an important step in validating these alternative testing formats. Beginning in 1999, a dynamic computer simulation was added to the USMLE Step 3 examination, which had previously been composed entirely of MCQ items. This paper reports efforts to collect consequential validity evidence relevant to scores produced using this simulation with respect to gender, English as a second language (ESL) status, and Liaison Committee on Medical Education (LCME) status.

Method

The National Board of Medical Examiners has developed a dynamic simulation of the patient-care environment. The simulation comprises individual computer-based case simulations (CCSs) and was developed to assess physicians' patient-management skills (the format has been described in detail previously).10 Briefly, each case begins with an opening scenario describing the patient's location and presentation. The examinee then has the opportunity to (1) access history and physical examination information; (2) order tests, treatments, and consultations by making free-text entries; (3) advance the case through simulated time; and (4) change the patient's location. The system recognizes over 12,000 abbreviations, brand names, and other terms that represent more than 2,500 unique actions. Within the dynamic framework, the patient's condition changes based both on the actions taken by the examinee and on the underlying problem. The simulations are scored using a computer-automated scoring algorithm that produces a score designed to approximate the score that would have been assigned if the examinee's performance had been reviewed and rated by a group of experts.11
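The order-recognition step described above can be pictured as a normalization-and-lookup table that maps many surface forms (abbreviations, brand names, synonyms) to a much smaller set of canonical actions. The sketch below is only an illustration under that assumption; the terms, action names, and matching rules are invented and are not taken from the operational CCS system.

```python
# Toy illustration of free-text order recognition: many entered terms
# map to one canonical action. All terms and action names are invented.
TERM_TO_ACTION = {
    "cbc": "COMPLETE_BLOOD_COUNT",
    "complete blood count": "COMPLETE_BLOOD_COUNT",
    "cxr": "CHEST_X_RAY",
    "chest x-ray": "CHEST_X_RAY",
    "asa": "ASPIRIN_ORAL",
    "aspirin": "ASPIRIN_ORAL",
}

def recognize_order(entry):
    """Normalize a free-text entry and return its canonical action, or None."""
    key = " ".join(entry.lower().split())  # lowercase, collapse whitespace
    return TERM_TO_ACTION.get(key)

for entry in ["CBC", "Chest  X-ray", "tincture of time"]:
    print(entry, "->", recognize_order(entry))
```

In the operational system, each recognized action feeds the evolving patient state and, ultimately, the automated scoring algorithm described above.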
Step 3 is usually taken during the first or second year of post-graduate training. In its current configuration, the examination requires two days of testing. Each form of the examination includes nine case simulations and 500 MCQ items. Multiple forms are used to enhance test security, and each form is constructed to meet complex content specifications with respect to the sampling of MCQs and case simulations. The simulations are completed in the afternoon of the second day, with a maximum of 25 minutes of testing time allocated for each case.

The sample used in this research included the responses of over 20,000 MDs who completed the Step 3 examination during the first year of computer administration. Approximately 60% of the examinees were men. English was identified as the native language for 43% of examinees; 30% identified some other language as their native language, and 27% did not respond. Approximately 60% of the examinees came from LCME-accredited schools. Self-reported information about the residency programs in which the examinees had trained (i.e., discipline choice) was also available.

The first step in the analysis was to calculate descriptive statistics for the MCQ and CCS scores for the groups of interest. Scaling and equating for the examination are implemented using the Rasch model, which allows scores from different examinees to be placed on the same scale even when the examinees have taken different sets of test items. To put scores from alternate forms on the same scale, ability estimates were produced for the MCQ and CCS components of the test using the difficulty estimates from operational scaling.

Analysis of variance (ANOVA) was used to examine the variability of scores across examinee subgroups. Because it was assumed that examinee groups would differ in proficiency, the scaled MCQ score was used as an examinee-background variable in a model designed to study performance differences across groups defined by gender, ESL status, and whether the examinee attended an LCME-accredited school. Additionally, previous research examining performance on MCQ items suggested that performance may vary significantly as a function of examinees' residency training: Dillon et al.12 reported substantial performance differences across residency-based subgroups of examinees, and Clauser, Nungester, and Swaminathan13 showed that residency training was a significant explanatory variable with respect to differences in item-level performance by gender groups. Consistent with these previous findings, residency was dummy coded as an additional examinee-background variable. To examine subgroup performance differences on CCSs, the resulting ANOVA model was used to estimate expected scores for subgroup levels after accounting for other differences in group status. In addition to test-score-level analyses, analyses were run at the individual case level for 18 CCS cases.
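For reference, the dichotomous Rasch model mentioned above expresses the probability of a correct response as a function only of the difference between examinee ability and item difficulty; holding the item difficulties fixed at their operational values and estimating each ability by maximum likelihood places examinees who took different forms on a common scale. In standard notation, with ability θ_i and difficulty b_j:

```latex
P(X_{ij} = 1 \mid \theta_i, b_j) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)},
\qquad
\hat{\theta}_i = \arg\max_{\theta} \sum_{j \in \mathrm{form}(i)}
\Bigl[ x_{ij}(\theta - b_j) - \ln\bigl(1 + \exp(\theta - b_j)\bigr) \Bigr].
```

The covariate-adjusted comparison can likewise be sketched in a few lines. The code below is a minimal illustration on synthetic data, assuming pandas and statsmodels; the variable names, group codes, and simulated effect sizes are invented, and the exact model specification used operationally (which was fit to the Rasch-scaled Step 3 scores) is not reproduced here.

```python
# Sketch of an ANOVA/ANCOVA-style comparison of CCS scores across subgroups,
# adjusting for MCQ proficiency and dummy-coded residency. Synthetic data only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "gender": rng.choice(["M", "F"], n),
    "esl": rng.choice(["native", "non_native"], n),
    "lcme": rng.choice(["lcme", "non_lcme"], n),
    "residency": rng.choice(["IM", "FP", "Surg", "Peds"], n),
    "mcq": rng.normal(0.0, 1.0, n),  # scaled MCQ score as background covariate
})
# Simulate CCS scores that depend only on MCQ proficiency and residency.
res_effect = df["residency"].map({"IM": 0.10, "FP": 0.00, "Surg": -0.10, "Peds": 0.05})
df["ccs"] = 1.1 + 0.4 * df["mcq"] + res_effect + rng.normal(0.0, 0.7, n)

# OLS with dummy-coded categorical predictors; anova_lm gives factor-level tests.
model = smf.ols("ccs ~ mcq + C(residency) + C(gender) + C(esl) + C(lcme)", data=df).fit()
print(anova_lm(model, typ=2))

# Model-based expected means for gender, holding the other predictors constant.
grid = pd.DataFrame({
    "gender": ["M", "F"],
    "esl": "native",
    "lcme": "lcme",
    "residency": "IM",
    "mcq": df["mcq"].mean(),
})
print(model.predict(grid))
```

The expected mean scores reported in the Results section correspond to model-based predictions of this kind, evaluated for each subgroup after the other background variables have been taken into account.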
Results

Mean scores for all subgroups and for the total test are shown in Table 1. The mean CCS test score was 1.10 (SD = .76). The mean score was 1.09 for men and 1.12 for women; this difference was generally consistent with performance on the MCQ component of the test, although slightly smaller in standardized score units. Mean scores were .93 for candidates reporting that English was not their native language and 1.27 for those reporting that English was their native language. For LCME status, mean CCS scores were 1.26 for LCME graduates and .87 for non-LCME graduates. Mean scores by residency program ranged from .51 to 1.38.

TABLE 1: Mean CCS Score by Candidate Subgroup for MDs Who Completed the USMLE Step 3 Examination during the First Year of Computer Administration

Results of the ANOVA for the test-level analysis show nonsignificant effects for gender, LCME, and English-language status.* Expected mean scores were 1.12 for men and 1.16 for women. The expected mean for both LCME-accredited and non-accredited schools was 1.14. Expected means for ESL status were 1.13 and 1.15 for native and non-native English speakers, respectively. These modeled score means were not significantly different, suggesting that, after accounting for group differences, CCS scores did not differ significantly by gender, LCME, or ESL status. There was a significant effect of residency training program; this was an anticipated result, as residency was included as part of the design rather than as a study variable.

The results for the 18 CCS cases indicated that the performances of men and women differed significantly on two cases. Mean scores favored women for both cases, while nonsignificant differences on other cases favored women in some instances and men in others. As noted previously, the dependent measure in these case-level analyses was a continuous raw case score. In the first case displaying a significant gender difference, the estimated expected mean scores for men and women were 2.81 and 2.91, respectively.† For the second case, the estimated expected mean scores were 3.58 and 3.79, respectively. In both cases the expected difference across gender was modest relative to the variability between examinees. A significant difference related to LCME status was found for one case; estimated expected mean scores for LCME and non-LCME examinees were 3.02 and 2.71, respectively. A significant difference related to ESL status was also found for one case; expected mean scores were 3.93 for examinees who indicated that English was not their native language and 3.65 for examinees who indicated that English was their native language.

Discussion

The results reported in this paper have important implications for the use of CCSs as part of the Step 3 examination. Of central importance to the purpose of this paper is the finding that, after accounting for examinee characteristics represented by the MCQ score and choice of residency training, there was no significant difference in test-level performance across subgroups defined by gender, ESL status, or LCME accreditation of the examinee's medical school. This is an important finding because examinees within these subgroups may differ in reading proficiency, English-language proficiency, or access to computers. If these classifications were associated with performance differences, those differences could be evidence of construct-irrelevant influences on the CCS scores, and such influences would be a threat to validity. The absence of evidence of such effects in this study must be considered supportive evidence of the validity of CCS scores. While evidence of this type does not shed light on what proficiency is measured by CCSs, it does set aside some important competing hypotheses that could undermine the interpretation of CCS scores. When the analysis was implemented at the case level, across the 54 significance tests (18 cases, each tested for the effects of gender, ESL, and LCME status) significant group differences were found for a total of four cases.
This type of case-level analysis is conceptually equivalent to the differential item functioning (DIF) analysis that is routinely performed in many standardized testing contexts.14 DIF is generally considered a necessary (but not sufficient) condition for determining that an item unfairly disadvantages some subgroup of examinees; there are clearly circumstances in which subgroups differ in their performance on items that validly measure appropriate content material. For example, the case identified as favoring LCME graduates dealt with a diagnosis that may be more prevalent in the United States than in many other countries and is likely to receive more emphasis within U.S. medical curricula. Inclusion of such a diagnosis is clearly appropriate for a U.S. licensure examination. Similarly, one of the cases that favored women over men focused on a diagnosis that is more prevalent in women. It is not surprising that these cases might favor one subgroup over another, and yet both focus on material that is appropriately represented on the USMLE.

Although differential performance across groups is not necessarily evidence that a case is flawed, the potential for cases to perform in this way (based on content) highlights the importance of carefully developed test specifications. Cases that confirm that some subgroups are more familiar with certain content areas are not inappropriate; however, a test form that disproportionately represented such cases would be problematic. This conclusion is even more critical when consideration is given to the results for subgroups defined by residency training. In this study, residency training was included as an examinee-background variable rather than as a focus of study because it was assumed that a close match between the content of a case and the focus of training within a specific residency would give examinees within that residency an advantage on that case. In fact, in 14 of the 18 cases studied, residency proved to be a significant factor in explaining examinee performance. This result is consistent with previous studies examining performance on MCQs. The point is that even when a test is developed to measure competence for general practice, an examinee's background and training will inevitably make him or her more familiar with some content areas than others.

Conclusion

The evidence regarding examinee-group differences presented in this paper represents an important part of the overall validity argument. It is, however, only one part of that argument. In fact, within the context of the present study, even the interpretation of the expected CCS group means (i.e., expected marginal means within the ANOVA framework) requires the assumption that the MCQs measure a relevant proficiency. The present research was an important step in providing support for the validity of the CCS component of the USMLE. In the context of a high-stakes examination, ongoing investigation of test validity is a critical part of the overall testing process. Although the results presented in this paper are part of a growing body of research supporting the validity of CCSs, other aspects of the validity argument are still, and will continue to be, under study. For example, two recent studies15,16 examined the extent to which CCS scores provide information not available from the MCQ component of Step 3.
Of course, the ultimate and elusive piece of validity evidence would show a direct relationship between CCS scores and subsequent outcomes for the examinees' patients. Studies to provide this sort of evidence remain in the planning stage.
