Bloom’s dichotomous key: a new tool for evaluating the cognitive difficulty of assessments
2017; American Physiological Society; Volume: 41; Issue: 1; Language: English
DOI: 10.1152/advan.00101.2016
ISSN: 1522-1229
Authors: Katharine Semsar, Janet L. Casagrand
Topic(s): Education and Critical Thinking Development
Illuminations

Katharine Semsar and Janet Casagrand
Department of Integrative Physiology, University of Colorado, Boulder, Boulder, Colorado
Published online: 23 Feb 2017. https://doi.org/10.1152/advan.00101.2016

One of the more widely used tools to both inform course design and measure expert-like skills is Bloom's taxonomy of educational objectives for the cognitive domain (2, 13, 22). This tool divides assessment of cognitive skills into six levels: knowledge/remember, comprehension/understand, application/apply, analysis/analyze, synthesis/create, and evaluation/evaluate (2, 6). The first two levels are generally considered to represent lower levels of mastery (lower-order cognitive skills), and the last three represent higher-order levels of mastery involving critical thinking (higher-order cognitive skills), with apply-level questions often bridging the gap between the two (e.g., Refs. 5, 8, 10, 11, 23, and 24). While Bloom's taxonomy is widely used by science educators, learning the concepts of the cognitive domain and categorizing educational materials into the taxonomy's six levels are not trivial tasks.

As with any complex task, experts and novices differ in the key abilities needed to cue into and evaluate information (4, 7, 9). Across disciplines, novices are less adept at noticing salient features and meaningful patterns, recognizing the contexts in which concepts apply, and using organized conceptual knowledge rather than superficial cues to guide their decisions. Newer users of Bloom's taxonomy demonstrate similar difficulties as they work to gain expertise, leading to inconsistencies in Bloom's ratings (1, 8, 15) (see BDK Development for examples).

To help novices gain expertise in a discipline, a common educational strategy is scaffolding (7, 17, 21). Scaffolding aims to control the elements of a task so that a novice learner can complete the easier parts of the task and build up to its more complete and complex elements (17). In the context of "Blooming," a scaffolding structure would help the rater cue into the salient elements of a question that relate to the skill level of the problem and aid in using those elements to categorize the specific skill being tested. A scaffolding tool therefore provides a structure on which the novice can model their identification of key elements and their decision making.

One example of a scaffolding tool for Bloom's taxonomy is the Biology Blooming Tool (BBT) (8). The BBT is a conventional rubric for developing and identifying biology-specific skills and questions based on Bloom's taxonomy. The rubric is organized as a table in which each column outlines the key skills assessed at a given Bloom's level (starting with the lowest level, "remember"), provides examples of exam questions, and delineates the types of exam questions that can be asked at that level. Unfortunately, in our own attempt to "Bloom" exam questions and course materials using a modified BBT, we had difficulty getting three independent raters to rate materials consistently.
Therefore, we set out to design a new Bloom's training tool that would provide additional, specific scaffolding directly addressing the sources of inconsistency we observed and thus might lead to greater consistency among raters. Here, we present a description of the development and evaluation of that tool: Bloom's dichotomous key (BDK).

BDK Development

The development and analysis of the BDK were conducted under Institutional Review Board protocol 0108.9 (exempt status).

Rationale and initial independent rater training.

The development of the BDK grew from an attempt to evaluate the Bloom's level of course content before and after course reform efforts in a neurophysiology course (J. Casagrand and K. Semsar, Ref. 7a). One way we sought to assess the effectiveness of the course reform was to use Bloom's taxonomy to categorize the cognitive level of course exams and other course materials before and after reform, providing an indirect, retrospective measure of how the course had changed over time and whether students were being asked to demonstrate deeper levels of understanding of course content.

To reduce potential bias while "Blooming" course materials, we began by recruiting three independent raters. One rater was a current graduate student in the department who had previously been a teaching assistant for the course, and two raters were former graduate students who remained in the department as postgraduates, one of whom had been a teaching assistant for the course and the other of whom had taken the course as an undergraduate. In selecting these raters, we were careful to choose people who were familiar with the course content and knowledgeable about neurophysiology, because they needed a sufficient understanding of the content to recognize what knowledge and problem-solving skills a student would need to answer each question, such as whether students were being asked to apply concepts in new contexts or to remember material exactly as presented in class. However, the three raters had varying expertise and experience in using Bloom's taxonomy to assess the cognitive skill level of course material. One rater had been extensively trained in Bloom's taxonomy and had used it over several years of work as a science education specialist. Another rater was also working as a science education specialist but had received minimal training before working on this project. Our third rater had no prior exposure to Bloom's taxonomy before this project.

To familiarize our raters with the process of "Blooming" course materials, we initially provided them with an overview of Bloom's taxonomy, associated terms, and sample questions in a conventional rubric modeled after the BBT. We then had the raters practice categorizing 26 sample neurophysiology questions. Unfortunately, we were dissatisfied with the degree of categorization similarity among the three raters. On average, the raters matched the authors' question categorizations only 46% of the time (Table 1). In addition, raters deviated from the average rating by 0.65 Bloom's categories on average (Fig. 1), and the percentage of questions for which all three raters agreed was only 19% (Table 1). These results prompted discussions between the authors and raters, which revealed substantial discrepancies and variability in the reasoning raters used to assign ratings to the sample questions.
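As context for these numbers, the average deviation metric (defined in the Fig. 1 legend below, following Zheng et al., Ref. 23) can be computed with a few lines of code. The sketch below is ours, not the authors' analysis code; the rater names and rating values are hypothetical, and Bloom's levels are coded 1 (remember) through 6 (evaluate).

```python
# Minimal sketch (not the authors' code): average deviation of each rater
# from the group-mean rating, following the formula in the Fig. 1 legend.

def average_deviation(ratings_by_rater):
    """ratings_by_rater: dict mapping rater name -> list of numeric ratings,
    one rating per question, with all lists in the same question order."""
    raters = list(ratings_by_rater)
    n_questions = len(ratings_by_rater[raters[0]])
    # Mean rating of the rater group for each question.
    means = [sum(ratings_by_rater[r][q] for r in raters) / len(raters)
             for q in range(n_questions)]
    # Average absolute deviation of each rater from those per-question means.
    return {r: sum(abs(means[q] - ratings_by_rater[r][q])
                   for q in range(n_questions)) / n_questions
            for r in raters}

# Hypothetical ratings for three raters on four questions (1 = remember ... 6 = evaluate).
example = {
    "rater_A": [1, 3, 4, 2],
    "rater_B": [1, 4, 4, 2],
    "rater_C": [2, 3, 5, 3],
}
print(average_deviation(example))
```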
Table 1. Percent rater agreement

              Number of Exam Questions   Agreement With Authors, %   At Least Two Raters Agree, %*   All Three Raters Agree, %
Rubric        26                         46                          88                              19
BDK           26                         67                          88                              38
BDK (exams)   155                        Not applicable              81                              41

BDK, Bloom's dichotomous key. *Not always the same two raters.

Fig. 1. Comparison of the mean average deviation scores (±SD) for the 3 raters on the initial 26 sample questions with the conventional rubric (a modified Biology Blooming Tool) and with the Bloom's dichotomous key (BDK) and on the 155 exam questions [BDK (exam)]. Average deviation scores (i.e., how much each rater deviated from the average rating of the three raters) were calculated for each rater by the method described in Zheng et al. (23), as follows: $\text{average deviation} = \frac{1}{n}\sum_{i=1}^{n}\left|\bar{r}_i - r_i\right|$, where $\bar{r}_i$ is the average rating of the three raters on question $i$, $r_i$ is the individual rater's rating on question $i$, and $n$ is the number of questions.

Think-aloud interviews.

These discussions led us to perform individual think-aloud interviews with each rater to better discern how raters were using the rubric to make decisions. During each interview, the rater verbalized his/her thought processes and reasoning while using the rubric to categorize each sample exam question. (Raters were familiar with the course, and questions were labeled with the exam they came from so that raters would know what had been taught.) If the rater did not provide reasoning, he/she was prompted to explain the choice, but that was the only prompt given by the interviewer. When all three interviews were complete, we examined the raters' reasoning during their decision-making processes, looking specifically at the reasons given for categorizations on which raters disagreed with each other.

During the think-aloud interviews, we observed several inconsistencies in rater decision making, most of which were consistent with published accounts of difficulties in "Blooming" (1, 3, 5, 8, 15, 23). First, raters did not always take into consideration what information had previously been provided to students, an important aspect of determining the appropriate Bloom's level (also described in Refs. 1, 3, and 8). For example, if students are given the answer to a specific higher-level question in lecture and then asked the same question on an exam, answering the question only requires recall, not a higher level of understanding (see example 1 in Fig. 2). The remaining inconsistencies all centered on raters focusing on different information within a question. For example, as also described in detail by Lemons and Lemons (15), raters would sometimes categorize questions based on the perceived difficulty of a question rather than on what students would need to do to answer it (i.e., the cognitive skills required). If a rater thought a problem was more difficult, the rater might skew the rating to a higher level without reference to what students were actually being asked to do with the information, and vice versa (see examples 2 and 3 in Fig. 2). Another common source of inconsistency among our raters, similar to that described by others (5, 8, 23), stemmed from raters cueing in to different skills or pieces of information needed to answer a single question. This often involved questions in which more than one concept or piece of information was being tested and the different concepts/information required different cognitive skills.
Most often, raters would stop at the lower-level categories and not take into account that higher-order questions about a concept also include mastery of lower-level cognitive skills related to that concept (see example 4 in Fig. 2). In addition, category inconsistencies were commonly related to the raters' use of buzzwords or action verbs for categorization rather than the specific information and context in the question. For example, questions asking for the "best" answer were sometimes categorized as evaluate because of the appearance of making a judgment, even if, based on the context of what was taught, the question was at a remember, comprehend, or apply level (see example 5 in Fig. 2). From other experiences, we know the term "predict" also commonly leads to similar inconsistencies (see example 6 in Fig. 2). Finally, like Crowe et al. (8), we found that questions involving data were especially prone to large categorization variation. Questions with data sets sometimes led raters to jump directly to the Bloom's category of apply or analyze without giving close attention to the question. However, when we examined how students can be asked to interpret data in different questions, nearly all Bloom's skill levels could be represented, from deciding whether data are consistent with a hypothesis (evaluate), to drawing conclusions about what the data mean (analyze), to simply redescribing the data (comprehend).

Table 2. The BDK

Categorize the question based on what students are being asked to do, not on how challenging the question may be. (For example, a "comprehend" question on a difficult concept could be a more challenging problem than an "analyze" question on an easier concept.) Evaluate questions with reference to the material we know students were exposed to.

Question 1. Could students memorize the answer to this specific question?
Yes: go to question 2. No: go to question 4.

Question 2. To answer the question, are students repeating nearly exactly what they have heard or seen in class materials (including lecture, textbook, laboratory, homework, clicker questions, etc.)?
Yes → see Remember. No: go to question 3.

Question 3. Are students demonstrating a conceptual understanding by putting the answer in their own words, matching examples to concepts, representing a concept in a new form (words to graph, etc.), etc.?
Yes → see Comprehend. No: go back to question 1. If you are sure the answer to question 1 is yes, the question should fit into "remember" or "comprehend."

Question 4. Is there potentially more than one valid solution* (even if a "better" one exists or if there is a limit to what solutions can be chosen)?
Yes: go to question 5. No: go to question 8.

Question 5. Are students making a judgment and/or justifying their answer?
Yes → see Evaluate. No: go to question 6.

Question 6. Are students synthesizing information into a bigger picture (coherent whole) or creating something they haven't seen before (a novel hypothesis, novel model, etc.)?
Yes → see Synthesize/create. No: go to question 7.

Question 7. Are students being asked to compare/contrast information?
Yes → see Analyze. No: go to question 16.†

Question 8. To answer the question, do students have to interpret data (graph, table, figure, story problem, etc.)?
Yes: go to question 9. No: go to question 14.

Question 9. Are students determining whether the data are consistent with a given scenario or whether conclusions are consistent with the data? Are students critiquing the validity or quality of experimental data/methods?
Yes → see Evaluate. No: go to question 10.

Question 10. Are students building up a model or novel hypothesis from the data?
Yes → see Synthesize/create. No: go to question 11.

Question 11. Are students coming to a conclusion about what the data mean (they may or may not be required to explain the conclusion) and/or having to decide what data are important to solve the problem (i.e., picking out relevant from irrelevant information)?
Yes → see Analyze. No: go to question 12.

Question 12. Are students using the data to calculate the value of a variable?
Yes → see Apply. No: go to question 13.

Question 13. Are students redescribing the data to demonstrate that they understand what the data represent?
Yes → see Comprehend. No: go back to questions 4 and 8.

Question 14. Are students putting information from several areas together to create a new pattern/structure/model/etc.?
Yes → see Synthesize/create. No: go to question 15.

Question 15. Are students predicting the outcome or trend of a fairly simple change to a scenario?
Yes → see Apply. No: go to question 16.

Question 16. Are students demonstrating that they understand a concept by putting it into a different form (new example, analogy, comparison, etc.) than they have seen in class?
Yes → see Comprehend. No: go back through each category or refer to the category descriptions to see which fits best.

*This question originally had the word "answer" in place of the word "solution." In subsequent use of the BDK, we found that the word "solution" led to less confusion about the application of this question. This was not an issue in our initial use of the BDK for this report.

†Originally, if answering "no" to question 7, we had raters go back to question 4; if they were sure the answer there was "yes," they should have been able to answer "yes" to question 5, 6, or 7. This did not lead to any difficulties in our initial use of the BDK for this report. However, in subsequent use of the key, we found examples of questions for which a comprehension-level categorization was also possible. Therefore, we revised the BDK to lead raters to question 16 here to account for those question types.

Building the dichotomous key.

Through the think-aloud process, we noticed that the issues listed above paralleled the cognitive tendencies of novices in general. For example, novices generally either fail to notice, or do not discriminate well among, the salient features within complex patterns. Novices also organize their knowledge based on surface features rather than underlying structure, jump quickly to conclusions, and do not always recognize the entire context of a problem. Likewise, in their initial categorizations, our raters chose different features of problems on which to base their categorization decisions, relied on buzzwords to categorize items, and misclassified questions because they had not considered what had previously been taught (Fig. 2).

To address these specific issues and provide raters with additional scaffolding for the Bloom's categorization process, we developed a new training tool: the BDK (Table 2). For categorization processes such as these, a dichotomous key is a natural scaffolding tool because it allows users to identify and categorize items in a systematic and reproducible fashion (12). Unlike a conventional rubric or flowchart, a dichotomous key is a series of steps, each with two choices, that focuses on the key characteristics of a particular group of items to reproducibly sort them into taxonomic groups.
While experts can make these categorizations quickly using patterns of knowledge, novices can use this stepwise series of questions to focus on salient information and make identifications consistently. For example, an expert in phylogenetic identification can use salient features and patterns of knowledge that have become second nature to identify organisms without the help of taxonomic descriptions. Meanwhile, novice biologists can use dichotomous keys to help them develop recognition of the salient features that lead to taxonomic identification. Rather than sifting through taxonomic descriptions of each species and then trying to match a specimen to those descriptions, the novice looks at the specimen and answers a series of questions. For example, the key may start with the following query: "Does the organism have cell walls?" If yes, Kingdom Plantae, go to question 2; if no, Kingdom Animalia, go to question 5. From there, the key follows a series of such salient features that progressively narrow the classification choices. In this way, the dichotomous key scaffolds the pattern recognition of identification into specific steps, feature by feature. In the same manner, we created the BDK to scaffold the process of categorizing cognitive skill levels using Bloom's taxonomy.

When developing a dichotomous key, one first identifies classifying characteristics, those features of the items that create large distinctions among groups of items, until all items can be uniquely referenced. These classifying characteristics are then organized from broadest to narrowest, such that raters answer a series of yes-or-no questions that guide them through common elements of questions and ultimately to a Bloom's level for the question being categorized. Using our observations from the think-aloud interviews, we determined that the three broadest classifying characteristics guiding Bloom's categorization decisions were 1) whether or not the answer to the specific question could have been memorized, 2) whether there was more than a single plausible/valid solution to the problem, and 3) whether the problem required data interpretation. Yes or no responses to the prompts associated with these characteristics then lead to further distinguishing features of specific Bloom's taxonomic groups. The BDK begins with the two broadest classifying characteristics: whether or not the answer could be memorized (question 1; if yes, the question falls into the lower-order categories of "remember" or "comprehend") and whether there is more than a single plausible solution (question 4; nearly every time there is more than a single correct way to approach a problem, one is working at a higher cognitive level, such as "analyze," "evaluate," or "create/synthesize"). These prompts sort most exam/homework questions into the lower-order or higher-order cognitive skill levels of Bloom's taxonomy. From there, the BDK moves to the third broad classifying characteristic: whether the question requires interpretation of data (question 8). If the rater answers yes to this question, the BDK guides the rater through the different cognitive skills that can be tested under the broader context of interpreting data [e.g., describing data ("comprehend") or using data to calculate an answer ("apply")].
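To make the structure of such a key concrete, the sketch below encodes the first several BDK prompts (paraphrased from Table 2) as a simple yes/no decision procedure. This is our illustration only, not a tool provided by the authors; the function name and boolean flags are hypothetical judgments a rater would make about an exam question.

```python
# Illustrative sketch only: the first few BDK prompts (paraphrased from
# Table 2) written as a yes/no decision procedure. The boolean arguments
# are hypothetical rater judgments about a given exam question.

def bloom_level(memorizable, verbatim_recall, own_words,
                multiple_solutions, judging, synthesizing, comparing):
    if memorizable:                      # Question 1
        if verbatim_recall:              # Question 2
            return "remember"
        if own_words:                    # Question 3
            return "comprehend"
        return "recheck question 1"
    if multiple_solutions:               # Question 4
        if judging:                      # Question 5
            return "evaluate"
        if synthesizing:                 # Question 6
            return "synthesize/create"
        if comparing:                    # Question 7
            return "analyze"
        return "go to question 16"
    return "go to question 8 (data-interpretation branch)"

# Example: a question whose answer was given verbatim in lecture.
print(bloom_level(memorizable=True, verbatim_recall=True, own_words=False,
                  multiple_solutions=False, judging=False,
                  synthesizing=False, comparing=False))  # -> "remember"
```

In practice, the BDK is applied by human raters working through the prompts in Table 2; the point of the sketch is only that each prompt is a binary branch leading either to a Bloom's category or to another prompt.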
The last few BDK prompts help sort the remaining question types we encountered.

In addition to using these classifying characteristics to aid raters in their categorizations, we also designed the BDK to clarify other common sources of inconsistency. First, to resolve the issue of raters categorizing questions based on perceived difficulty rather than on the skills needed to solve a problem, all BDK prompts are specifically worded to ask raters what a student is required to do to answer a question (e.g., recall a fact, calculate a number, or interpret data). Second, to address the fact that raters using the conventional rubric had sometimes stopped at the lower Bloom's levels and not taken into account that higher-order questions include mastery of lower-order cognitive skills, we designed the BDK to guide raters to consider higher-order skills before lower-order skills within each section of the key. Third, some rater discrepancies arose because raters did not take into account all of the information students had to work with. Thus, many of the BDK prompts specifically ask the rater to consider whether students are working with only a single piece of information (generally lower-order cognitive skills) or multiple pieces of information (generally higher-order cognitive skills).

To ensure the prompts were being interpreted appropriately and consistently, we then performed additional think-aloud interviews as our three raters used the BDK to rerate the original 26 sample questions. (No feedback was given to raters between their use of the conventional rubric and the BDK during think-alouds.) Based on the interviews and additional rater feedback, the wording of some BDK prompts was revised. For example, the first prompt was changed from "Have students seen the answer to this question in the course materials?" to "Could students memorize the answer to this specific question?" In addition, we changed the fourth prompt from "Is there potentially more than one valid answer?" to "Is there potentially more than one valid solution?" This distinction was necessary to avoid confusion in cases in which a single solution to a question included multiple components that appeared to be separate answers (see examples 7 and 8 in Fig. 2). Finally, based on feedback we later received during workshop sessions using the BDK, we added a prompt (question 16: "Are students demonstrating that they understand a concept by putting it into a different form than they have seen in class?") to address question types that were not represented in our original sample questions but appeared in other materials subsequently "Bloomed" with the BDK.

BDK Evaluation

Statistical analysis.

To evaluate whether the BDK was meeting our goal of creating greater categorization similarity among raters, we statistically compared mean average deviation scores and SDs of deviation scores (23) between use of the more conventional, BBT-style rubric and use of the BDK. Our raters produced significantly more similar categorizations when using the BDK than when using the conventional rubric to rate the same 26 sample questions: the mean average deviation score dropped from 0.65 to 0.48 (Fig. 1). As another measure of rater agreement, we looked at the percentage of questions for which multiple raters agreed on a categorization.
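For readers who want to tally these agreement measures for their own rating data, a minimal sketch is given below; it is ours, not the authors' analysis code, and the ratings shown are hypothetical.

```python
# Minimal sketch (ours, with hypothetical data): percentage of questions
# on which at least two, or all three, raters assign the same Bloom's level.
from collections import Counter

def agreement_percentages(ratings_by_rater):
    raters = list(ratings_by_rater)
    n_questions = len(ratings_by_rater[raters[0]])
    at_least_two = all_three = 0
    for q in range(n_questions):
        counts = Counter(ratings_by_rater[r][q] for r in raters)
        top = counts.most_common(1)[0][1]   # size of the largest agreeing group
        at_least_two += top >= 2
        all_three += top == len(raters)
    return (100 * at_least_two / n_questions, 100 * all_three / n_questions)

example = {
    "rater_A": ["remember", "apply", "analyze", "comprehend"],
    "rater_B": ["remember", "apply", "evaluate", "evaluate"],
    "rater_C": ["comprehend", "apply", "analyze", "apply"],
}
print(agreement_percentages(example))  # -> (75.0, 25.0)
```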
Although at least two raters agreed on a categorization for 88% of the questions with both the conventional rubric and the BDK, the percentage of questions for which all three raters agreed doubled, from 19% to 38%, when using the BDK (Table 1; comparable with Ref. 23). Furthermore, use of the BDK reduced the SD of the average deviation scores by more than half, from 0.31 to 0.14, indicating that when a rater deviated from the average rating, he/she did not deviate as far. While it is possible that ratings were simply becoming more consistent with practice, we do not believe this is the case, because the degree of variation was lower with the BDK than with the rubric despite no additional discussion or training occurring between the two rating sessions. Finally, in addition to becoming more consistent with each other, raters were also more likely to match the authors' categorization of a question, with the average match between raters and authors increasing from 46% to 67% (Table 1).

Think-aloud interview observations.

The statistical improvement in consistency was supported by the think-aloud interviews conducted while raters used the BDK. First, we observed that the BDK specifically brought raters' attention to what had been taught and helped them consider that information when categorizing questions. Second, use of the BDK focused raters on the skills a student would use in answering the question rather than on nonsalient features, such as perceived item difficulty. [For example, while one rater was settling on "analyze" based on question 11 (rather than "comprehend," which "felt" more right based on perceived difficulty), they said, "I'm not totally happy with this, it seems really simple. But they do have to decide what's important to solve the problem. That's more than just a calculation."] Third, use of the BDK improved the rating consistency of questions involving data. Before the BDK, raters would often see data in an exam question and immediately jump to the Bloom's category of analyze because they conflated the presence of data with "analyzing data." The BDK helped raters focus on what students were being asked to do with the data rather than simply noting that data were involved. (For example, while one rater was deciding what students were being asked to do with the information, they said, "They're given information, but they aren't interpreting all of it [to answer the question].") Fourth, when using the BDK, raters focused on the highest cognitive level of a question rather than on a lower-level component embedded within a higher-level question, as they sometimes had with the conventional rubric. Finally, all three raters reported that the dichotomous key was easier and faster to use than the conventional rubric. In addition to the observer's notes that raters spent much less time going back and forth between category descriptions, raters commented, "This went a lot faster, and I'm more confident [in my answers] too" and "[Questions] seem to bin well, definitely quicker."

Utility.

We examined the utility of the BDK in two ways. First, we had raters use the BDK to categorize course materials from a neurophysiology course to examine the effectiveness of course reforms (J. Casagrand and K. Semsar, Ref. 7a).
Briefly, our three raters used the BDK to categorize 155 exam questions, finding that the proportion of higher-order Bloom's questions on exams more than doubled, from 24% to 67%, after the introduction of several research-based teaching methods. In addition, a single rater used the BDK to categorize 394 homework and clicker questions to assess the degree of alignment of course materials.

Fig. 2. Challenges faced in using Bloom's taxonomy and example missteps made by raters. Quotes from raters (in italics) are closely paraphrased.

Second, we presented the BDK in several Bloom's taxonomy workshops. In one such workshop, attendees were surveyed about their prior level of experience with Bloom's taxonomy and their opinions about the utility of the BDK (Table 3). Of the 25 attendees, 19 had limited previous experience with Bloom's taxonomy before the workshop. All but one of these attendees rated the BDK as easy to use, and all but one would both use the BDK themselves in the future and recommend it to people who are new to Bloom's taxonomy. Of the four attendees who had extensive experience with Bloom's taxonomy, three also agreed that they would recommend the BDK to people who are new to using it. Meanwhile, the two attendees who had no previous experience with Bloom's taxonomy found the tool difficult to use.

Table 3. Feedback about the BDK from a Bloom's taxonomy training workshop

                                                                               Prior Experience With Bloom's Taxonomy
                                                                               Extensive   Limited   None
Number of workshop attendees                                                   4           19        2
Mean reported difficulty of using the BDK (1 = very easy, 5 = very difficult)  1.75        2.22      4
Number who would use the BDK in the future to help "Bloom" materials           3           18        0
Number who would recommend the BDK to others who are new to "Blooming"         3           18        0

Discussion

The BDK as a Bloom's taxonomy training tool.

Learning to use Bloom's taxonomy efficiently and fluently is a challenging cognitive task. Thus, not surprisingly, many of the categorization inconsistencies demonstrated by both our rater team and others (5, 8, 23) are typical of those that novices face in general when performing cognitively complex tasks (7). While more conventional rubrics and guides, like Anderson's guide to Bloom's taxonomy (2) and the BBT (8), can aid in learning the complexities of Bloom's taxonomy, they were not sufficient for our raters during