Article · Open access · Peer reviewed

Mine Is Bigger Than Yours

2012; Elsevier BV; Volume: 141; Issue: 3; Language: English

DOI

10.1378/chest.11-2473

ISSN

1931-3543

Authors

David L. Streiner, Geoffrey R. Norman

Topic(s)

Musculoskeletal pain and rehabilitation

Abstract

Results of studies that use laboratory tests are often easy to interpret, because we are familiar with the units and how to interpret them. However, this is not the case when the results are presented as ORs, relative risks, correlations, or scores on an unfamiliar scale. This article explains these various indices of effect size—how they are calculated, what they mean, and how they are interpreted.

Abbreviations: AR = absolute risk; ARR = absolute risk reduction; CABG = coronary artery bypass graft; ES = effect size; NNT = number needed to treat; RR = relative risk

In a previous article about P levels and CIs [1], we mentioned in passing that the important information in any study is actually the effect size. However, we did not say what we meant by "effect size"—what it is, how it is measured, and how we interpret it. We omitted these key points deliberately for two reasons: first, so as not to detract from the main message of the previous article; and second, to give ourselves an excuse to write this one. But, before we start describing it, let us begin by giving some examples of why it is (sometimes) needed.

If an article states that a certain intervention reduces the Paco2 level from 50 mm Hg to 40 mm Hg, no further information is needed for you to determine that the intervention was successful; Paco2 moved from a level indicating acidosis to being within the normal range. You know what these values mean and how much of a change is clinically important. But let us consider some other results:

1. The correlation between salt intake and hypertension is 0.46 [2].
2. The OR for having a myocardial infarction was 0.48 for physicians who took aspirin compared with those who did not (sorry, it did not affect mortality) [3].
3. The mean Duke Activities Status Index was two points higher among patients treated at sites that have angiography compared with sites that do not [4].

Does that correlation of 0.46 indicate a strong or a weak relationship between salt and hypertension? Does the OR of 0.48 mean we should all start swallowing aspirin immediately or not? Is that two-point difference a clinically meaningful one, or do we need a magnifying glass to see it?
The answers are not as obvious as with the Paco2 example, because we are not as familiar with these other units of measurement, be they correlation coefficients, ORs (or their cousin on their mother's side, relative risks [RRs]), or especially differences between groups on scales we have not encountered before. So, let us take a guided tour of indices that measure how big an effect is. For obvious reasons, these indices together are called measures of effect size, commonly abbreviated as ES.

Ask a statistician anything, and he or she will immediately draw one of two things—either a normal curve or a 2 × 2 table. In this case, we will do the latter. The reason is that we have two types of data and two types of measures. The data can be divided into (1) dichotomous variables, such as dead or alive, better or worse, and the like; and (2) continuous variables, such as various serum levels, scores on a scale, and so on. Similarly, we have two types of measures: (1) those that indicate how big the difference is and (2) those that tell us if the variables are related to one another. The resulting four types of ES are shown in Table 1. Let us go over them one at a time.

Table 1 — Four Types of Effect Sizes

Type of Measure | Dichotomous Data        | Continuous Data
Difference      | OR; relative risk (RR)  | Standardized mean difference (d)
Relationship    | Phi (ϕ)                 | Correlation (r, R)

When we have a dichotomous outcome and we want to know how big the difference is between groups, we have two alternatives: the RR and the OR, which is sometimes called the relative odds. Which one to use depends on the type of study we are dealing with; the RR is used with randomized controlled trials and cohort studies, and the OR is used with case-control studies. (If you need help regarding what these terms mean, or why we cannot use the RR with case-control studies, see reference 5.) Both begin with a 2 × 2 table (yet again), as in Table 2, with the rows indicating group membership and the columns the outcome.

Table 2 — Results of a Hypothetical Study

Group     | Dead       | Alive       | Total
Treatment | 20 (A)     | 80 (B)      | 100 (A + B)
Control   | 40 (C)     | 60 (D)      | 100 (C + D)
Total     | 60 (A + C) | 140 (B + D) | 200

In the treatment group, the absolute risk (AR) of death is the number who have died (20 in cell A) divided by the number of people in the treatment group (100; the total of cells A and B); that is, A/(A + B), or 0.20. Similarly, the AR of death in the control group is 40 (cell C) divided by the number of people in that group (100; cells C + D), which is 0.40. The RR is simply the ratio of these two risks:

$$RR = \frac{A/(A+B)}{C/(C+D)} = \frac{A(C+D)}{C(A+B)} = \frac{AR_{\text{Treatment}}}{AR_{\text{Control}}} = \frac{0.20}{0.40} = 0.50$$

This means that the risk of dying in the treatment group is half of that in the control group.

When we are dealing with case-control studies or with the output from some statistical tests such as logistic regression, we get a related measure, called the OR. Here, we first calculate the odds of dying if a person is in one of the two groups: it is 20 (cell A) divided by 40 (cell C), or 0.5. Similarly, the odds of not dying in the two groups is 80 (cell B) divided by 60 (cell D), or 1.33. So, the OR is the ratio of the two:

$$OR = \frac{A/C}{B/D} = \frac{AD}{BC} = \frac{20 \times 60}{80 \times 40} = 0.375$$

(We calculated the OR based on data for which we should have used the RR, but that is acceptable; we just cannot go the other way.)
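To make the arithmetic concrete, here is a minimal sketch (ours, in Python, not code from the article) that computes the absolute risks, the RR, and the OR directly from the cell counts in Table 2; the variable names a, b, c, d simply mirror the table's cell labels.

```python
# Cell counts from Table 2 (hypothetical study): rows are groups, columns are outcomes.
a, b = 20, 80   # treatment group: dead (A), alive (B)
c, d = 40, 60   # control group:   dead (C), alive (D)

ar_treatment = a / (a + b)        # absolute risk of death in the treatment group -> 0.20
ar_control = c / (c + d)          # absolute risk of death in the control group  -> 0.40

rr = ar_treatment / ar_control    # relative risk -> 0.50
odds_ratio = (a / c) / (b / d)    # OR = (A/C)/(B/D) = AD/BC -> 0.375

print(f"AR(treatment) = {ar_treatment:.2f}, AR(control) = {ar_control:.2f}")
print(f"RR = {rr:.2f}, OR = {odds_ratio:.3f}")
```

Note that the OR (0.375) is further from 1.0 than the RR (0.50) computed from the same table; the two indices agree closely only when the outcome is rare.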
Although the RR and OR are interpreted differently (again, see reference 5 for a fuller explanation), their importance is assessed similarly. An OR or RR of 1.0 means that nothing is going on; there is no difference in outcome between the two groups. As a rough rule of thumb (and realize that it is only a rule of thumb), RRs and ORs < 0.50 or > 2.0 are seen as clinically important. Note that we are not speaking about statistical significance, but rather importance. For example, comparing coronary artery bypass graft (CABG) with stents in people with stable angina or acute coronary syndromes, the OR for restenosis at 6 months was 0.29, leading the authors of the meta-analysis to conclude that CABG is more effective in reducing the rates of major adverse cardiac events [6].

As we said in the previous article [1], statistical significance is greatly affected by the sample size, so in very large studies or surveys, an RR of 1.02 or 0.98 can be statistically significant, but the results are likely clinically trivial. Also bear in mind that if a disease has a bad outcome and the intervention is relatively noninvasive or inexpensive, ORs or RRs closer to 1.0 can point to useful treatments; rules of thumb should never trump clinical judgment.

Before leaving the RR and OR, we must give a word of warning. These ES indices are much beloved by purveyors of expensive pharmaceuticals, which in itself should serve as a red flag that they are like bikinis—what they reveal is interesting, but what they conceal is crucial. Let us go back to Table 2, where we found that the RR was 0.5 and the intervention looked like it might be worthwhile. Imagine we do exactly the same study with 10,000 patients in each group rather than 100, but the number who died was exactly the same: 20 in the treatment group and 40 in the control group. If we go through the calculations, we find that the AR in the treatment group is 0.002, and it is 0.004 in the control group. And what is the RR? It is 0.50, just what we found before. Now, though, the intervention looks much less promising—it saves only 20 more people out of 20,000, rather than 20 out of 200. Put very briefly, the RR and OR do not take into account the baseline risk of succumbing to the disease, and that is the crucial part that is concealed. Never evaluate an intervention simply on the basis of the OR or RR; you also need some other index that takes the baseline into account. One simple way to get a better idea of how much real benefit results from the treatment is simply to take the difference in ARs rather than a ratio. This is called the absolute risk reduction (ARR), or the attributable risk.

In the first example, the AR was 0.20 in the treatment group and 0.40 in the comparison group, so the ARR is the difference between them: 0.40 − 0.20 = 0.20. That means that 20% of patients avoided death. By contrast, in the second example, although the RR was the same (0.50), the ARR was way down at 0.2%: two patients in 1,000 avoided death.

One other twist is to ask the question, "How many people do I have to treat to avoid one death?" This is called the number needed to treat, or NNT [7]. We have already found in the first example that if we treat 100 people, we will avoid 20 deaths. If we treat 100/20 = 5 people, we will avoid one death. So the NNT is simply the reciprocal of the ARR:

$$NNT = \frac{1}{ARR} = \frac{1}{0.20} = 5$$

If we go through the same calculations for the second example, we find that the ARR is 0.002, and the NNT is 500. Now the difference between the two studies becomes clear: despite identical RRs, they are clearly different in terms of effectiveness. (Why is the NNT not 1? For two reasons: first, not everyone who gets the treatment does well; more importantly, despite advertisements to the contrary, not everyone who does not get the treatment succumbs to the dread disease.)
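The point about baseline risk is easy to see in a short sketch (again ours, not from the article): the two hypothetical studies above have identical RRs, but the ARR and NNT expose how different they really are.

```python
def arr_and_nnt(deaths_treated, n_treated, deaths_control, n_control):
    """Absolute risk reduction and number needed to treat from raw counts."""
    ar_treated = deaths_treated / n_treated
    ar_control = deaths_control / n_control
    arr = ar_control - ar_treated     # absolute risk reduction
    return arr, 1 / arr               # NNT is the reciprocal of the ARR

# Study 1: 100 patients per arm (Table 2)
arr1, nnt1 = arr_and_nnt(20, 100, 40, 100)
print(f"ARR = {arr1:.3f}, NNT = {nnt1:.0f}")   # ARR = 0.200, NNT = 5

# Study 2: 10,000 patients per arm, same number of deaths, same RR of 0.50
arr2, nnt2 = arr_and_nnt(20, 10_000, 40, 10_000)
print(f"ARR = {arr2:.3f}, NNT = {nnt2:.0f}")   # ARR = 0.002, NNT = 500
```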
Now let us move to the next box, which is the ES for differences in continuous measures. The problem is that for measures with which we are unfamiliar, such as scales tapping pain, quality of life, activities of daily living, and so forth, we do not know how many points between groups, or within a group from preintervention to postintervention, is a big or a small difference. It could be two points for one scale and 20 points for a different scale. We get around this problem by "translating" them so they all use a common yardstick, called the standardized mean difference, or Cohen's d, in honor of its developer. The formula for d is:

$$d = \frac{M_T - M_C}{SD}$$

where M_T is the mean of the treatment group, M_C that of the control group, and SD the standard deviation. (We will leave aside the question of whether it is the SD of the control group or that of both groups combined, but most people use the latter.)

The value of d is expressed in units of the SD of whatever scale was used in the study. So, if d is 0.5, that means that the groups differ by one-half an SD. That is simple enough, but is that good or bad? Cohen said that a value of d of 0.2 is small, 0.5 is moderate, and 0.8 and above is large (there is no upper limit). And why those values? Because Jake Cohen speaketh from on high, and we did tremble and listen. Actually, Cohen was not quite that dogmatic; he proposed these as guidelines only, which could change from one field to the next, or as we gained more knowledge. It is the rest of the world that is fairly dogmatic now and has engraved these numbers in stone.

Jumping to cell D of Table 1, the simple correlation (r) and the multiple correlation (R) are used to tell us the strength of the relationship between two variables (r) or between one variable and two or more predictors (R). They can range between −1.0 and +1.0. Forgetting about the sign (which tells us only the direction of the relationship), the guidelines for interpreting r and R are: a value of 0.100 is small, 0.243 is moderate, and 0.371 and higher are large. These numbers are exactly equivalent to d values of 0.2, 0.5, and 0.8, respectively, and come from the formula that converts d to r.

In our opinion, though, these suggested values are too small by about half. Our reasoning is that squaring r or R tells us the amount of variance in one variable explainable by the other variable(s). An r of 0.371 means that only 14% of the variance is explained; 86% thus remains unexplained, and this does not seem high to us at all. We would like at least 25% of the variance in one variable to be accounted for by another, which means a correlation of at least 0.50. But when we speak, nobody trembles and listens.
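As a small illustration of this arithmetic (ours, with made-up scale scores), the sketch below computes d from two group means and a pooled SD, and converts d to an r-type effect size with the standard equal-group-sizes formula r = d / sqrt(d² + 4), which reproduces the 0.100, 0.243, and 0.371 benchmarks quoted above.

```python
import math

def cohens_d(mean_treatment, mean_control, pooled_sd):
    """Standardized mean difference: the group difference in SD units."""
    return (mean_treatment - mean_control) / pooled_sd

def d_to_r(d):
    """Convert d to an r-type effect size (assumes equal group sizes)."""
    return d / math.sqrt(d ** 2 + 4)

# A 5-point difference on a scale whose SD is 10 is a "moderate" effect:
print(cohens_d(75, 70, 10))                              # -> 0.5

# Cohen's small / moderate / large benchmarks expressed as correlations:
print([round(d_to_r(d), 3) for d in (0.2, 0.5, 0.8)])    # -> [0.1, 0.243, 0.371]
```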
Meyer et al [8] calculated r-type ES for a number of interventions and other phenomena. They found that the effect of antihypertensives in reducing stroke was 0.03, and the effectiveness of CABG on stable heart disease and survival at 5 years was 0.08. These are not overly impressive, especially in comparison with the effects of sildenafil for improving sexual functioning (0.38) or even the correlation between the ratings of prominent movie critics and box office success (0.17).

The last ES, ϕ (phi), is used to reflect the magnitude of the relationship between two dichotomous variables. It is based on the χ² test:

$$\phi = \sqrt{\frac{\chi^2}{N}}$$

It yields a number between 0 and 1, although one drawback is that its maximum value is 1 only if (A + B) = (C + D) and (A + C) = (B + D). The guidelines for interpreting ϕ are the same as for r and R. We have rarely encountered this index outside of the psychologic literature.
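For completeness, a brief sketch (ours, not the authors') of ϕ for a 2 × 2 table, using the closed-form χ² for a fourfold table so that no statistics library is needed; with the Table 2 counts it gives ϕ ≈ 0.22, which sits between the small and moderate benchmarks above.

```python
import math

def phi_coefficient(a, b, c, d):
    """phi = sqrt(chi^2 / N) for a 2x2 table, via the closed-form chi-square."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return math.sqrt(chi2 / n)

# Table 2 counts: treatment (20 dead, 80 alive), control (40 dead, 60 alive)
print(round(phi_coefficient(20, 80, 40, 60), 3))   # -> 0.218
```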
So there it is (or rather, there they are). The ES is a handy way of going beyond P levels (and even CIs) and answering the question, "So how big was the effect?" Some guidelines exist for saying whether the effect was small, moderate, or large, but these must be supplemented by your own clinical judgment and knowledge of the area.

References

1. Norman GR, Streiner DL. Do confidence intervals give you confidence? Chest. 2012;141:17-19.
2. Berglund G. The role of salt in hypertension. Acta Med Scand Suppl. 1983;672:117-120.
3. Steering Committee of the Physicians' Health Study Research Group. Final report on the aspirin component of the ongoing Physicians' Health Study. N Engl J Med. 1989;321:129-135.
4. Pilote L, Lauzon C, Huynh T, et al. Quality of life after acute myocardial infarction among patients treated at sites with and without on-site availability of angiography. Arch Intern Med. 2002;162:553-559.
5. Streiner DL, Norman GR. PDQ Epidemiology. 3rd ed. Shelton, CT: PMPH USA; 2009.
6. Bakhai A, Hill RA, Dundar Y, Dickson RC, Walley T. Percutaneous transluminal coronary angioplasty with stents versus coronary artery bypass grafting for people with stable angina or acute coronary syndromes. Cochrane Database Syst Rev. 2005;(1):CD004588.
7. Laupacis A, Sackett DL, Roberts RS. An assessment of clinically useful measures of the consequences of treatment. N Engl J Med. 1988;318:1728-1733.
8. Meyer GJ, Finn SE, Eyde LD, et al. Psychological testing and psychological assessment: a review of evidence and issues. Am Psychol. 2001;56:128-165.