Open-access, peer-reviewed article

Mismeasurement in Social Work Practice: Building Evidence-Based Practice One Measure at a Time

2019; University of Chicago Press; Volume 10, Issue 3; Language: English

10.1086/704363

ISSN

2334-2315

Author

Craig Winston LeCroy

Topic(s)

Resilience and Mental Health

Abstract

Craig W. LeCroy, Arizona State University

This invited article is based on the 2019 Aaron Rosen Lecture presented by Craig LeCroy at the Society for Social Work and Research 23rd Annual Conference—"Ending Gender Based, Family and Community Violence"—held January 16–20, 2019, in San Francisco, CA. The annual Aaron Rosen Lecture features distinguished scholars who have accumulated a body of significant and innovative scholarship relevant to practice, the research base for practice, or effective use of research in practice.

"While you and I have lips and voices which are for kissing and to sing with who cares if some oneeyed son of a bitch invents an instrument to measure Spring with?" The poet E. E. Cummings wrote this line in 1926. The poem—one of 2,900 he crafted—captures important points I want to address: Measurement is elusive, it attempts to capture some very difficult things (like spring), and sometimes we miss the mark without fully realizing it.

Social science research depends on the effective application of measurement principles. It is no coincidence that the cover of David Wootton's 2016 book The Invention of Science: A New History of the Scientific Revolution features an image of a measuring instrument. And it is not an overstatement to say that measurement is at the very heart of scientific discovery and innovation. According to Hauser (1969, p. 127), "… it is inadequate measurement, more than inadequate concept or hypothesis, that has plagued social researchers and prevented fuller explanations of the variances with which they are confounded." Similarly, Nunnally (1972) believed that measurement was the central issue for social science research.

In my view, advances in social work have been limited by mismeasurement. If I were to ask researchers to identify the most important criteria in selecting a measure, they would likely respond that it must be reliable and valid. It is surprising (and problematic) how few scientists question the soundness of their methods and their fit to the research question. To advance social work knowledge, we must take a more critical stance on how we do research (LeCroy, 2010). Other disciplines have rightly recognized mismeasurement as a topic of serious study: Tavris (1992), The Mismeasure of Woman; Gould (1996), The Mismeasure of Man; Mosher, Miethe, and Hart (2011), The Mismeasure of Crime; and Horn and Wilburn (2013), The Mismeasure of Education. Without a doubt, my next book will be titled The Mismeasure of Social Work!

I address the mismeasure of social work by responding to three related questions:

1. Why is evidence so elusive in much of social work practice?
2. Why do rigorous social work studies often reveal mixed or no results?
3. Why is evidence-based practice making, in some cases, limited progress?

Some of the roadblocks researchers confront may be related to the way we think about and do research. In particular, I believe that convention is an enemy of the advancement of social work knowledge.
Consider, for example, how students who are new to social work research can be influenced into adopting conventional ideas as starting points to ground their assumptions and hypotheses (Becker, 1998). When academic process (and progress) is dictated by adherence to conventional scientific methods, original scientific thinking is not cultivated. For social work research to advance, we must solve problems creatively and break from conventional thinking. The reality is that social work research frequently leads to "a banal analysis of variables that repeat dozens of similar studies. Young researchers often use past studies as a guide to justify their study, select similar variables, and use a similar conceptualization to analyze the results" (LeCroy, 2010, pp. 321–322). The bigger picture of how a study contributes to knowledge building in social work becomes obscured.

Measurement Issues in Home Visitation

My adventures with measurement and mismeasurement began in earnest when I worked for a home visitation program. Many years ago, I received a call from an executive director who had implemented a statewide home visitation program. She asked whether I would be willing to document evidence for the program, and I agreed. The program had collected baseline and follow-up data using a standardized primary outcome instrument. Although I was excited to report back to the executive director with information about the program's progress, I discovered there was no progress! As I examined the primary outcome measure that had been selected, I realized that the program participants were not making any meaningful changes. This was the beginning of a long quest to uncover significant issues with mismeasurement. I came to believe that the program outcomes were being mismeasured, and I wanted to do something about it.

In my review of home visitation studies, I discovered that many of the measures used for program evaluation were well established and well known, such as the Adult-Adolescent Parenting Inventory (Bavolek, 1999), the Parenting Stress Index (Abidin, 1995), and the Child Abuse Potential Inventory (Milner, 1994). Although these measures each have good reliability and have been validated through hundreds of studies supporting their use, they are not good outcome measures. Moreover, researchers who use these measures as outcome indicators commit measurement malpractice: If researchers do not use the correct measure in the evaluation of social work interventions, the consequences are so significant that mismeasurement can be considered malpractice.

Popular Home Visitation Measures

The Parenting Stress Index and the Child Abuse Potential Inventory are commonly used in home visitation research. The Parenting Stress Index was designed to measure a construct (parenting stress), not to measure impact or change resulting from an intervention. It consists of 120 items with individual subscales related to children and parenting that are, ultimately, combined by the researcher to create a total scale score. Many of the items are static, meaning they may measure stress but are not expected to change over time. For example, respondents are asked to indicate whether they have experienced events such as marriage, relocation, divorce, or the death of a close family member or friend in the previous 12 months. If these items are repeated on a posttest or follow-up assessment, they would not be expected to change as the result of an intervention.
Another concern is that the child items relate solely to underlying personality traits and temperament (LeCroy & Krysik, 2010). It stands to reason that these items tap into heritable traits that are stable and less responsive to change. An instrument designed as an outcome instrument should focus on items that are likely to be changed by an intervention.

The Child Abuse Potential Inventory appears to be an ideal instrument to use when evaluating a home visitation program whose primary goal is to prevent the maltreatment of children. However, this instrument also includes several items—"I have a child who is clumsy," "My telephone number is unlisted," "I have a physical handicap," "As a child I was abused," and "As a child I was knocked around by my parents"—that are static and unlikely to change as the result of an intervention. Although these items may help to predict child maltreatment in the future, they are not appropriate items for an outcome instrument. As LeCroy and Krysik (2010, p. 1484) noted, "No amount or type of intervention is going to change one's status on these items over time." Therefore, using this instrument as an outcome indicator limits the opportunity to find the true impact of an intervention.

To be fair, those who developed these instruments focused on measuring constructs or predicting future behaviors; their objective was not to document change. It is the application of these measures that is problematic, not the measures themselves. This raises the question: Why were these measures selected for use in an intervention study? It turns out that in social work research, most measures are selected because someone used them in the past, documented their use, and claimed they are good measures; they were not selected because they meet specific criteria. To advance the field of social work, we must improve the way measures are selected. Habituation—doing things out of habit—most likely influences the selection of measures. To select a good outcome measure, researchers must avoid habituation and resist doing things that simply appear to make sense (e.g., selecting a measure because someone else used it in a similar experiment).

Mismeasurement

Misuse of Language in Measurement

It is helpful to think about mismeasurement in the context of an overall study. For example, the evaluation of the Comprehensive Child Development Program (St. Pierre, Layzer, Goodson, & Bernstein, 1997) covered 21 sites and 4,410 families over 5 years and found the program had no statistically significant impact on the families involved. Researchers who have examined this study and its lack of impact have focused on several common factors: It was a poorly defined program, the program was poorly implemented, the wrong intervention strategy was selected, and it was characterized by poor service quality. Given that the measures used in the study did not reveal any significant outcomes, researchers typically conclude that the program was not effective. Yet there was no mention of measurement as an explanation for why none of the outcomes showed an effect. This is an example of the misuse of language. It is only accurate to conclude that the program showed no results on these particular measures! A different measure might have generated positive outcomes.

Measurement instruments express information in numbers, and numbers are considered to reflect objectivity and trustworthiness. Converting concepts into numbers calls for expert interpretation.
The bottom line is that we rely on measurement instruments for credible and useful information, and we believe they have immense authority to speak the truth. The subtext is that measurement constitutes a powerful discourse that has consequences for social work research. We often accept conclusions as truth, but if we want to advance social work knowledge, we should question the methods used in support of those conclusions (LeCroy, 2010).

Because we have much more to learn about measurement and outcomes, we must be more precise in how we judge the effectiveness of programs. The potential consequences of mismeasurement are frightening: Many projects—particularly federally funded efforts—want researchers to use the same measures. Imagine what would happen if everyone used the same inappropriate measures!

Different Measures, Different Results

Consider the following example from Lipsey's book Design Sensitivity (1990), which is a critical read for anyone doing intervention research. Lipsey presented four similar language-acquisition measures based on research by Schery (1981). Pre-post contrasts were conducted on special education students who had completed a test battery at the time of enrollment; the same children were given follow-up assessments. Although similar measures were used, they showed dramatic differences in their ability to measure change (see Table 1). If a large, federally funded program selected the Wechsler Intelligence Scale for Children-Verbal as the primary outcome measure, the program would be unlikely to show any effects. However, if another measure were chosen—such as the Illinois Test of Psycholinguistic Abilities—strong effects might be demonstrated.

Table 1. Comparison of Language Acquisition Measures

Language Acquisition Measure     Effect Size
Peabody (scaled scores)          0.20 (small)
Peabody (raw scores)             0.81 (large)
ITPA (verbal)                    1.06 (large)
WISC (verbal)                    0.00 (no effect)

Note. ITPA = Illinois Test of Psycholinguistic Abilities; WISC = Wechsler Intelligence Scale for Children. Adapted from "Selecting assessment strategies for language disordered children" by T. K. Schery, 1981, Topics in Language Disorders, 1, pp. 59–73. Copyright 1981 Aspen Systems Corporation.
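To make the arithmetic behind Table 1 concrete, here is a minimal sketch in Python. The pre/post scores are invented for illustration (they are not Schery's data), and the effect size is a simple standardized mean gain: the mean pre-to-post change divided by the pretest standard deviation. The point is only that the same intervention can register a large effect on one measure and essentially none on another.

    # Illustrative only: invented scores, not Schery's (1981) data.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 30  # hypothetical number of children assessed pre and post

    def standardized_gain(pre, post):
        """Mean pre-to-post gain divided by the pretest standard deviation."""
        return (post.mean() - pre.mean()) / pre.std(ddof=1)

    # Measure A is sensitive to the change the program produces;
    # Measure B is dominated by stable traits and barely moves.
    pre_a = rng.normal(50, 10, n)
    post_a = pre_a + rng.normal(8, 5, n)
    pre_b = rng.normal(100, 15, n)
    post_b = pre_b + rng.normal(0.5, 5, n)

    print("Measure A effect size:", round(standardized_gain(pre_a, post_a), 2))
    print("Measure B effect size:", round(standardized_gain(pre_b, post_b), 2))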
How Do We Choose Measures?

Most social work researchers rely on the California Evidence-Based Clearinghouse for Child Welfare (California Department of Social Services Office of Child Abuse Prevention, 2019) to identify suitable methods. Listed on the website under measurement tools, I found the Child Abuse Potential Inventory, which received an "A" grade. Granted, the psychometrics are well demonstrated. However, since this website is meant to curate evidence-based practices, the tacit implication is that this inventory may be a suitable tool for studying social work interventions. Instead, I would suggest using another inventory, since the Child Abuse Potential Inventory was not designed as an outcome instrument.

Similarly, I visited the U.S. Department of Health and Human Services Administration for Children and Families (n.d.) website section dedicated to evidence on the effectiveness of home visiting and examined the measures and outcomes obtained from various studies. Since the federal government is investing millions of dollars in home visitation, there is a financial incentive to carefully track outcomes. Interestingly, I found two studies that used either the Child Abuse Potential Inventory or the Parenting Stress Index as outcome measures. The results? Not surprisingly, both instruments found no significant differences between the home visitation program and the control group. Would these studies show more impact if instruments designed to measure outcomes had been used instead? My short answer is "possibly yes," and my longer answer follows.

I recently worked with a nonprofit organization that received a federal grant, which included contracted evaluation oversight for every program. On a consultation call, the evaluation group expressed concern that there might not be enough power for the experiment. Of interest, the evaluation consultants—who were experts in research and design—were worried about power, yet they made only a single suggestion: "Can you increase the numbers?" As it turned out, the nonprofit had an approved budget for the number of participants it proposed to serve and was already doing everything it could to recruit and retain as many participants as possible. What was not discussed or considered were aspects of the research design that could maximize power and improve the chances of obtaining true outcomes, such as making sure the measures accurately captured the intervention effects. One of the simplest and most effective ways to address the concern of an underpowered experiment is through the careful selection of measures.
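Measure choice and statistical power are two sides of the same calculation. As a rough sketch (assuming the Python statsmodels package, and plugging in illustrative effect sizes that simply echo the range in Table 1), the sample size required per group for a conventional two-group comparison at 80% power falls sharply as the measure becomes more sensitive to the intervention's effect.

    # Sketch: required group size at alpha = .05 and 80% power for a
    # two-group t test, across effect sizes a measure might register.
    from statsmodels.stats.power import TTestIndPower

    power_analysis = TTestIndPower()
    for d in (0.20, 0.50, 0.81, 1.06):  # illustrative effect sizes
        n_per_group = power_analysis.solve_power(effect_size=d, alpha=0.05, power=0.80)
        print(f"effect size {d:.2f}: about {n_per_group:.0f} participants per group")

In other words, an agency that cannot recruit more families may still rescue an underpowered design by choosing a measure that can actually register the change its program produces.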
After reviewing research on home visitation experiments, I listed the outcome measures used in evaluations of home visitation programs. I stopped counting after I found 75 different measures that have been used. This is both promising and problematic: promising because it gives us the chance to run more experiments with different outcomes and learn what impacts might result from different measures, and problematic insofar as the various measures seemed to comprise a haphazard list. Perhaps research is too easy these days: If you need a measure, there are websites, books, and laundry lists to choose from. You do not have to investigate long before you can select a measure that may determine whether the intervention or program is deemed evidence based or gets continued funding.

Two measures that have been selected without much thought in home visitation research are the Conflict Tactics Scale (Straus, 1979) and emergency room (ER) visits. The Conflict Tactics Scale was not designed as an outcome measure, and it captures low-frequency events. It lacks sensitivity to change because it is scaled in units too uncommon to detect change (e.g., slapped him in the face, threw him down, burned him on purpose). It can be modified to include less severe items and perhaps capture change, but if the original is used it ends up being a measure that is insensitive to change.

I was recently on a call about home visitation research in which ER visits were proposed as an evaluation outcome. Many home visitation program evaluations have used emergency room visits as a potential outcome indicator, but what do ER visits measure? Are they a proxy for children who have been harmed by caregivers and need emergency treatment? In some communities, parents are told to go to the ER when they need help with their child and the doctor's office is not available. So, if I am a cautious and concerned parent and I take my child to the ER, is this bad or good? And what is the definition of an ER? Does it include the Walgreens urgent-care clinic on the corner?

If you cannot see a problem, you cannot solve it. In science, discovery and problem-solving often require seeing things we typically do not see. Discovery is the act of detecting something new, or something previously unrecognized as meaningful. Dunbar and Fugelsang (2005) estimated that 30%–50% of scientific discoveries are unexpected. To benefit from the unexpected, one must be attentive and clever (Baumeister, 2006). Similarly, Taleb (2011) called science "antifragile" in that gains can be made from unexpected events or even disorder. It follows that we should embrace null or negative findings in our experiments as an opportunity to think more critically about our work. Unexpected findings may lead to new conceptualizations or new approaches to methodology.

Contributing original ideas to a body of knowledge is challenging, in part, because many scientists follow convention. To be sure, research is embedded within a circumscribed infrastructure. As LeCroy (2010) noted,

This infrastructure pushes us to look to the conventional, work to identify the patterns, follow research that has been previously published—all the kinds of things that make it difficult to "see" important new information. To accomplish this requires a critical assessment of the concepts being used in the study, the measures being used to define the concepts, the logic of the study, the clarity of the premises, and the underlying assumptions. (p. 324)

Sensitivity to Change

A major cross-cutting theme of this discussion is sensitivity to change—that is, an assessment of a measure's ability to capture change from an intervention. In his book Design Sensitivity (1990), Mark Lipsey referred to sensitivity to change as validity for change. In other words, a social work researcher cannot detect a change in outcome if the measure being used cannot capture change. Reliability and validity are one thing; sensitivity to change is a separate but critical (and often neglected) part of the picture. As we have seen with many of the measures discussed in this paper, a measure can have good reliability and good validity but still fail to detect the true impact of an intervention because it is not sensitive to change. What I am suggesting, then, is that an outcome measure's power to detect change is reduced to the extent that the measure lacks sensitivity to change.

Enhancing Sensitivity to Change in Outcome Measures

Several strategies can be used to increase the sensitivity of measures to change. Fok and Henry (2015, p. 81) outlined four ways to do this: (a) increase the comprehensibility and cultural validity of items in the instrument, (b) measure the full range of latent traits in the population, (c) exclude items that do not seem relevant to what is being evaluated, and (d) ask directly about change.

Specifically, if participants in an intervention have difficulty differentiating between anchor points on a scale that are close in meaning—such as "occasionally," "sometimes," and "not often"—the measure will be less able to detect meaningful change (Fok & Henry, 2015). For this reason, we need to improve measurement precision in outcome measures.

Another strategy that can dramatically increase a measure's capacity to capture change is to examine the items on any instrument being used as an outcome measure. Consider the following two items that may appear on the same measure: "I have negative thoughts" and "I have considered hurting myself." With respect to the first item, change would likely be detected if someone attended a cognitive–behavioral therapy workshop. However, change is unlikely to be detected on the second item unless participants represent a population with more serious symptoms of depression. Prior to deciding whether a scale will work for your study, it is imperative to evaluate the items that will appear on the scale and engage in perspective-taking. Alternatively, pilot test the measure and examine the instrument's ability to capture change before initiating your experiment. The take-away message is that the measure can be improved by removing items that do not represent the outcome you are trying to impact.
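One way to act on this advice is to screen items during a pilot test. The sketch below is a hypothetical helper, not an established procedure from the measurement literature; it assumes pilot pre and post responses stored in two pandas data frames with one column per item, and it reports each item's standardized pre-to-post change along with the share of respondents who start at the top of the scale (the ceiling issue taken up in the next paragraph).

    # Hypothetical pilot-test screening: flag items that show little change
    # or that start at the ceiling. Assumes wide-format data, items scored 1-5.
    import pandas as pd

    def screen_items(pre: pd.DataFrame, post: pd.DataFrame, scale_max: int = 5) -> pd.DataFrame:
        rows = []
        for item in pre.columns:
            spread = pre[item].std(ddof=1)
            rows.append({
                "item": item,
                "std_change": (post[item] - pre[item]).mean() / spread if spread > 0 else 0.0,
                "pct_at_ceiling": (pre[item] == scale_max).mean(),
            })
        # Items with std_change near zero, or with most respondents already at
        # scale_max, contribute little to an outcome analysis.
        return pd.DataFrame(rows).sort_values("std_change")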
Another sensitivity-to-change issue common in social work research is using measures that have floor or ceiling effects (Lipsey, 1990). When items are scored from 1 to 5 on a Likert scale (with 5 being the desired response) and almost every item is already rated 5, only a limited amount of change can be captured over time.

As noted earlier, comprehension and cultural validity can also limit a measure's ability to capture change. Many of the measures we use were developed with college students and often do not transfer to social work populations. As a result, cultural adaptation and easier comprehension may be needed. For example, measures often contain negatively worded items that add measurement error and decrease the power to detect change. As Fok and Henry (2015) suggested, if these items do not serve the intended purpose, they should be removed wholesale or—at the very least—carefully reviewed prior to inclusion. For instance, an item from one of Fok and Henry's standardized scales—"Family members rarely become openly angry"—was not clearly comprehensible to Alaska Native adolescents. After working with cultural consultants, this item was changed to "In our family we are really mad at each other a lot." The measure now has more change validity and greater reliability. Because negatively worded items require a cognitively complex response, social work researchers should carefully consider whether such items add error that obscures the measure's true score.

Challenging Psychometric Theory

Here is the issue with reliability and validity: Psychologists have led us astray. Whereas in psychology the convention is to establish a measure's reliability first and then its validity (Nunnally, 1972), to advance evidence-based social work practice a measure must first be shown to be valid—to measure what it intends to measure—before its reliability is examined (Rossiter, 2018). Rossiter further noted that "far too often, researchers justify their use of a new measure by reporting only coefficient alpha, a reliability statistic, when the measure's validity should have been established first" (2018, p. 930).

One issue with psychometric theory is that it relies on the scores for validation and neglects the content validation needed on the front end of measurement (Rossiter, 2011). According to Rossiter, psychometrics validates the measure in reverse, assessing the validity of a construct by examining the relationship between the measure and the scores produced. As a strong critic of psychometric theory, Rossiter went on to suggest that Cronbach and Meehl's (1955) concept of construct validity is a misnomer. A "construct" is a hypothetical entity, and "only a measure can be validated" (Rossiter, 2018, p. 930). To "validate" means to establish the truth of something. But a construct is a definition, and a definition can be judged as reasonable or unreasonable, not as true or false.
Substantiating this claim, Rossiter presented the following analogy, which I have paraphrased and adapted for social work:

A team of social workers used psychometric theory to measure the construct of a buffalo. They did not begin by defining what a buffalo is. Instead, they found a horse, which was convenient because many other social workers had published items used to measure the horse construct. They factor-analyzed the horse items and found that the four-leg items correlated nicely, forming a unidimensional factor. They detected two items measuring the large head and the shaggy coat, but these did not load well on the leg factor. Upon further investigation, the four legs converged well with other measures of land mammals. There was clear discriminant validity with measures of reptiles and fish. Reviews of the measure found good psychometric properties for the "four-legs measure" of the buffalo, and it received an "A" grade for measurement standards. It became the leading measure of buffalos in all of the top journals that had high impact factors. The statistical results saved a lot of time for social work researchers who did not want to bother with rational analysis and critical thinking. Instead, they used the Likert scales to measure agreement that the object had four legs. Everyone is describing a buffalo, wouldn't you agree? (Adapted from Rossiter, 2005, p. 23)

In social work's search for legitimacy, rather than create our own approach to measurement, we have followed too closely in the footsteps of psychology. Psychology has long provided expertise in measurement, yet psychological work on measurement has misled social workers, whose mission is not to measure constructs. A key focus of social work is to use measurement to evaluate treatments or services. In our research we have also taken established measures and accepted their definitions as correct, which is problematic. For example, by accepting the Parenting Stress Index as a valid measure of parenting stress, we have accepted the measure's definition, which may not correspond with our intention to measure outcomes.

Developing Socialworkmetric Theory

Rather than continuing to employ (or adapt) psychometric measures, I recommend that social work researchers develop socialworkmetric measures. Socialworkmetric refers to the theory and technique of social work measurement with the goal of improving the lives of individuals, families, or groups. Designed to measure change, a socialworkmetric measure should represent the measurement of an outcome, have content validity, and be sensitive to change (i.e., have change validity).

Suppose, for example, that to evaluate the outcome of my family communication therapy intervention I selected the Family Adaptability and Cohesion Evaluation Scale, also known as FACES (Olson, Gorall, & Tiesel, 2006). The FACES measure is widely used to examine the impact of family-based interventions and is designed to evaluate adaptability and cohesion in family interactions. Examples of items from the FACES measure include "We approve of each other's friends," "It's hard to tell who does which household chores," and "Children have a say in their discipline." This instrument can be useful for family assessments, but if it were used to evaluate a family communication therapy intervention, the items would appear to have little content validity because they have no clear correspondence with the program objectives or goals. However, if we used the Family Communication Patterns Measure (Ritchie & Fitzpatrick, 1990)—which includes the items "In our family we often talk about our feelings and emotions," "My family encourages me to express my feelings," and "My family often has relaxed conversations"—we would have a socialworkmetric instrument with solid content validity because of the strong correspondence between the program goals and intervention objectives.
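The difference between the two instruments can be made explicit before any data are collected with a simple a priori mapping of candidate items to program objectives. In the sketch below, the objectives and the item-to-objective tags are hypothetical stand-ins rather than part of either published measure's documentation; the exercise simply forces the correspondence (or lack of it) into the open.

    # Hypothetical a priori content-validity check for a family communication
    # intervention: which candidate items map to which program objectives?
    # Items without a mapping are weak candidates for an outcome measure.
    program_objectives = {"express feelings", "open family conversation"}

    candidate_items = {
        "In our family we often talk about our feelings and emotions": {"express feelings", "open family conversation"},
        "My family encourages me to express my feelings": {"express feelings"},
        "It's hard to tell who does which household chores": set(),
        "We approve of each other's friends": set(),
    }

    for item, objectives in candidate_items.items():
        matched = objectives & program_objectives
        status = "keep" if matched else "reconsider (no link to program goals)"
        print(f"{status}: {item}")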
That said, using and modifying measures so they best correspond to the evaluation of a social work intervention is not conventional practice. Conventional practice follows explicit rules that endorse the need for fidelity—using measures exactly the way they were originally designed; it is less acceptable to develop one's own measure or change a measure, even if that is what is needed to properly evaluate an intervention. Indeed, a colleague recently sent me a statement written by the outgoing editors of a leading social science journal (Mustillo, Lizardo, & McVeigh, 2018), who noted concern regarding

authors who use validated scales in a manner that is inconsistent with published validity work on those scales. We receive submissions in which items are cherry-picked from the scale, or the coding of such items is changed, or the summing scheme is altered. Any of these changes may undermine the validity of the scale. (p. 2)

This is tantamount to "no dogs allowed here." Such rules may not make sense when attempting to enhance a measure's capacity to be content valid and to function as a socialworkmetric measure. More researchers need to experiment with variations of established measures and tweak those measures so they function as intended in an outcome study—to document true outcomes. Researchers do not need to cherry-pick items. Rather, they should determine a priori which items make the most sense to include for the measure to accomplish what it is designed to accomplish: measuring outcomes.

Do We Understand How People Change Over Time?

Reviewing the issues inherent in outcome measurement feeds into a larger question: Do we adequately understand how to measure change in social work research? When we c
