Article Open access Peer-reviewed

Clinical Epidemiology and Biostatistics

2004; Wolters Kluwer; Volume: 86; Issue: 3; Language: English

10.2106/00004623-200403000-00024

ISSN

1535-1386

Autores

Mininder S. Kocher, David Zurakowski,

Topic(s)

Primary Care and Health Outcomes

Abstract

Epidemiology is the study of the distribution and determinants of disease frequency1. In the fifth century BC, Hippocrates suggested that the development of human disease might be related to the external and internal environment of an individual1. In the 1600s and 1800s in England, John Graunt and William Farr quantified vital statistics on the basis of birth and death records1. In the 1850s, John Snow associated cholera with water contamination in London by observing higher cholera rates in homes supplied by certain water sources1. Epidemiological methods gradually evolved with use of the case-control study to demonstrate an association between smoking and lung cancer, use of the prospective cohort study to determine risk factors for cardiovascular disease in the Framingham Heart Study, and use of the randomized clinical trial for the poliomyelitis vaccine1. The evidence-based medicine and patient-derived outcomes assessment movements burst onto the scene of clinical medicine in the 1980s and 1990s as a result of contemporaneous medical, societal, and economic influences. Pioneers such as Sackett and Feinstein emphasized levels of evidence and patient-centered outcomes assessment2-10. Work by Wennberg and colleagues revealed large small-area variations in clinical practice, with some patients being thirty times more likely to undergo an operative procedure than other patients with identical symptoms merely because of their geographic location11-16. Additional critical research suggested that up to 40% of some surgical procedures might be inappropriate and up to 85% of common medical treatments were not rigorously validated17-19. Meanwhile, the costs of health care were rapidly rising to over two billion dollars per day, increasing from 5.2% of the gross domestic product in 1960 to 16.2% in 199720. Health maintenance organizations and managed care emerged. 
In addition, increasing federal, state, and consumer oversight was brought to bear on the practice of clinical medicine. These forces have led to an increased focus on the effectiveness of clinical care. Clinical epidemiology provides the methodology with which to assess this effectiveness. This article presents an overview of the concepts of study design, hypothesis testing, measures of treatment effect, diagnostic performance, evidence-based medicine, outcomes assessment, data, and statistical analysis. Examples from the orthopaedic literature and a glossary of terminology (terms italicized throughout the text) are provided.

Study Design

In observational studies researchers observe patient groups without allocation of the intervention, whereas in experimental studies researchers allocate the treatment. Experimental studies involving humans are called trials. Research studies may be retrospective, meaning that the direction of inquiry is backward from the cases and that the events of interest transpired before the onset of the study. Alternatively, studies may be prospective, meaning that the direction of inquiry is forward from the cohort inception and that the events of interest transpire after the onset of the study (Fig. 1). Cross-sectional studies are used to survey one point in time. Longitudinal studies follow the same patients over multiple points in time.

Fig. 1: Comparison of prospective and retrospective study designs on the basis of the direction of inquiry and the onset of the study.

All research studies are susceptible to invalid conclusions due to bias, confounding, and chance. Bias is the non-random systematic error in the design or conduct of a study. Bias usually is not intentional; however, it is pervasive and insidious.
Forms of bias can corrupt a study at any phase, including patient selection (selection and membership bias), study performance (performance and information bias), patient follow-up (nonresponder and transfer bias), and outcome determination (detection, recall, acceptability, and interviewer bias). Frequent biases in the orthopaedic literature include selection bias, when dissimilar groups are compared; nonresponder bias, when the follow-up rate is low; and interviewer bias, when the investigator determines the outcome. A confounder is a variable that has independent associations with both the independent (predictor) and dependent (outcome) variables, thus potentially distorting their relationship. For example, an association between knee laxity and anterior cruciate ligament injury may be confounded by female sex since women may have greater knee laxity and a higher risk of anterior cruciate ligament injury. Frequent confounders in clinical research include gender, age, socioeconomic status, and co-morbidities. As discussed below in the section on hypothesis testing, chance may lead to invalid conclusions based on the probability of type-I and type-II errors, which are related to p values and power. The adverse effects of bias, confounding, and chance can be minimized by study design and statistical analysis. Prospective studies minimize bias associated with patient selection, quality of information, attempts to recall preoperative status, and nonresponders. Randomization minimizes selection bias and equally distributes confounders. Blinding can further decrease bias, and matching can decrease confounding. Confounders can sometimes be controlled post hoc with the use of stratified analysis or multivariate methods. The effects of chance can be minimized by an adequate sample size based on power calculations and use of appropriate levels of significance in hypothesis testing. 
The ability of study design to optimize validity while minimizing bias, confounding, and chance is recognized by the adoption of hierarchical levels of evidence on the basis of study design (see Table [Levels of Evidence for Primary Research Question] in Instructions to Authors of this issue of The Journal). Furthermore, the standard to prove cause-effect is set higher than the standard to suggest an association. Inference of causation requires supporting data from non-observational studies such as a randomized clinical trial, a biologically plausible explanation, a relatively large effect size, reproducibility of findings, a temporal relationship between cause and effect, and a biological gradient demonstrated by a dose-response relationship. Observational study designs include case series, case-control studies, cross-sectional surveys, and cohort studies. A case series is a retrospective, descriptive account of a group of patients with interesting characteristics or a series of patients who have undergone an intervention. A case series that includes one patient is a case report. Case series are easy to construct and can provide a forum for the presentation of interesting or unusual observations. However, case series are often anecdotal, are subject to many possible biases, lack a hypothesis, and are difficult to compare with other series. Thus, case series are usually viewed as a means of generating hypotheses for additional studies but not as conclusive. A case-control study is a study in which the investigator identifies patients with an outcome of interest (cases) and patients without the outcome (controls) and then compares the two groups in terms of possible risk factors. The effects in a case-control study are frequently reported with use of the odds ratio. Case-control studies are efficient (particularly for the evaluation of unusual conditions or outcomes) and are relatively easy to perform. 
However, an appropriate control group may be difficult to identify, and preexisting high-quality medical records are essential. Moreover, case-control studies are susceptible to multiple biases, particularly selection and detection biases based on the identification of cases and controls. Cross-sectional surveys are often used to determine the prevalence of disease or to identify coexisting associations in patients with a particular condition at one particular point in time. The prevalence of a condition is the number of individuals with the condition divided by the total number of individuals at one point in time. Incidence, in contradistinction, refers to the number of individuals with the condition divided by the total number of individuals over a defined time period. Thus, prevalence data are usually obtained from a cross-sectional survey creating a proportion, whereas incidence data are usually obtained from a prospective cohort study and a time value is contained in the denominator. Surveys are also frequently performed to determine preferences and treatment patterns. Because cross-sectional studies represent a snapshot in time, they may be misleading if the research question involves the disease process over time. Surveys also present unique challenges in terms of adequate response rate, representative samples, and acceptability bias. A traditional cohort study is one in which a population of interest is identified and is followed prospectively in order to determine outcomes and associations with risk factors. Retrospective, or historical, cohort studies can also be performed; in those studies, cohort members are identified on the basis of records, and the follow-up period is entirely or partly in the past. Cohort studies are optimal for studying the incidence, course, and risk factors of a disease because they are longitudinal, meaning that a group of subjects is followed over time. 
The effects in a cohort study are frequently reported in terms of relative risk (RR). Because traditional cohort studies are prospective, they can optimize follow-up and data quality and can minimize bias associated with selection, information, and measurement. In addition, they have the correct time-sequence to provide strong evidence regarding associations. However, these studies are costly, are logistically demanding, often require a long time-period for completion, and are inefficient for the assessment of unusual outcomes or diseases. Experimental study designs may involve the use of concurrent controls, sequential controls (crossover trials), or historical controls. The randomized clinical trial (RCT) with concurrent controls is the so-called gold standard of clinical evidence as it provides the most valid conclusions (internal validity) by minimizing the effects of bias and confounding. Rigorous randomization with enough patients is the best means of avoiding confounding. The performance of a randomized clinical trial involves the construction of a protocol document that explicitly establishes eligibility criteria, sample size, informed consent, randomization, rules for stopping the trial, blinding, measurement, monitoring of compliance, assessment of safety, and data analysis. Because allocation is random, selection bias is minimized and confounders (known and unknown) theoretically are equally distributed between groups. Blinding minimizes performance, detection, interviewer, and acceptability bias. Blinding may be practiced at four levels: participants, investigators applying the intervention, outcome assessors, and analysts. Intention-to-treat analysis minimizes nonresponder and transfer bias, while sample-size determination ensures adequate power. The intention-to-treat principle states that all patients should be analyzed within the treatment group to which they were randomized in order to preserve the goals of randomization.
Although the randomized clinical trial is the epitome of clinical research designs, the disadvantages of such trials include their expense, logistics, and time to completion. Accrual of patients and acceptance by clinicians may be difficult. With rapidly evolving technology, a new technique may quickly become well accepted, making an existing randomized clinical trial obsolete or a potential randomized clinical trial difficult to accept. Ethically, randomized clinical trials require clinical equipoise (equality of treatment options in the clinician's judgment) for enrollment, interim stopping rules to avoid harm and to evaluate adverse events, and truly informed consent. Finally, while randomized clinical trials have excellent internal validity, some have questioned their generalizability (external validity) because the practice pattern and the population of patients enrolled in a randomized clinical trial may be overly constrained and nonrepresentative. Ethical considerations are intrinsic to the design and conduct of clinical research studies. Informed consent is of paramount importance, and it is the focus of much of the activity of institutional review boards. Investigators should be familiar with the Nuremberg Code and the Declaration of Helsinki as they pertain to ethical issues of risks and benefits, protection of privacy, and respect for autonomy21,22.

Hypothesis Testing

The purpose of hypothesis testing is to permit generalizations from a sample to the population from which it came. Hypothesis testing confirms or refutes the assertion that the observed findings did not occur by chance alone but rather occurred because of a true association between variables. By default, the null hypothesis of a study asserts that there is no significant association between variables whereas the alternative hypothesis asserts that there is a significant association.
If the findings of a study are not significant, we cannot reject the null hypothesis, whereas if the findings are significant, we can reject the null hypothesis and accept the alternative hypothesis. Thus, all research studies that are based on a sample make an inference about the truth in the overall population. By constructing a 2 × 2 table of the possible outcomes of a study (Table I), we can see that the inference of a study is correct if a significant association is not found when there is no true association or if a significant association is found when there is a true association. However, a study can have two types of errors. A type-I or alpha (α) error occurs when a significant association is found although there is no true association (resulting in a false-positive study that rejects a true null hypothesis). A type-II or beta (β) error occurs when no significant association is found although there is a true association (resulting in a false-negative study that fails to reject a false null hypothesis).

TABLE I - Hypothesis Testing*

                              Truth
Experiment          No Association         Association
No association      Correct                Type-II (β) error
Association         Type-I (α) error       Correct

*P value = probability of type-I (α) error. Power = 1 - probability of type-II (β) error.

The alpha level refers to the probability of a type-I (α) error. By convention, the alpha level of significance is set at 0.05, which means that we accept the finding of a significant association if there is less than a one in twenty chance that the observed association was due to chance alone. Thus, the p value, which is calculated with a statistical test, is a measure of the strength of the evidence provided by the data against the null hypothesis. If the p value is less than the alpha level, then the evidence against the null hypothesis is strong enough for us to reject it and conclude that the result is significant.
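The meaning of the 0.05 alpha level can be illustrated by simulation: when the null hypothesis is true, roughly one comparison in twenty will appear significant by chance alone. The sketch below is purely hypothetical (the data are synthetic and the normal-approximation test is a simplification), not an analysis from any cited study.

```python
import random
import statistics
from statistics import NormalDist

random.seed(42)

def two_sample_z_p(a, b):
    """Approximate two-sided p value for a difference in means (normal approximation)."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

trials = 2000
false_positives = 0
for _ in range(trials):
    # Both groups are drawn from the same population, so the null hypothesis is true
    # and any "significant" result is a type-I (alpha) error.
    a = [random.gauss(0, 1) for _ in range(30)]
    b = [random.gauss(0, 1) for _ in range(30)]
    if two_sample_z_p(a, b) < 0.05:
        false_positives += 1

print(false_positives / trials)  # close to the 0.05 alpha level
```

The observed false-positive fraction hovers near 0.05, which is exactly what the alpha level promises under a true null hypothesis.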
P values frequently are used in clinical research and are given great importance by journals and readers; however, there is a strong movement in biostatistics to deemphasize p values because a significance level of p < 0.05 is arbitrary, a strict cutoff point can be misleading (there is little difference between p = 0.049 and p = 0.051, but only the former is considered "significant"), the p value gives no information about the strength of the association, and the p value may be statistically significant without the results being clinically important. Alternatives to the traditional reliance on p values include the use of variable alpha levels of significance based on the consequences of the type-I error and the reporting of p values without using the term "significant." Use of 95% confidence intervals in lieu of p values has gained acceptance, as these intervals convey information regarding the significance of findings (if the 95% confidence intervals do not overlap, the difference is significant), the magnitude of differences, and the precision of measurement (indicated by the range of the 95% confidence interval). Whereas the p value is often interpreted as being either significant or not, the 95% confidence interval provides a range of values that allows the reader to interpret the implications of the results. In addition, while p values have no units, confidence intervals are presented in the units of the variable of interest, which helps the reader to interpret the results.
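A 95% confidence interval for a mean can be sketched as the sample mean plus or minus roughly 1.96 standard errors (a normal approximation). The function name and the outcome scores below are hypothetical, chosen only to illustrate the calculation:

```python
from statistics import NormalDist, mean, stdev

def ci95_mean(sample):
    """95% confidence interval for a mean, using the normal approximation."""
    m = mean(sample)
    se = stdev(sample) / len(sample) ** 0.5   # standard error of the mean
    z = NormalDist().inv_cdf(0.975)           # ≈ 1.96 for a 95% interval
    return (m - z * se, m + z * se)

# Hypothetical outcome scores for a group of patients
scores = [72, 75, 71, 78, 74, 76, 73, 77, 75, 74]
lo, hi = ci95_mean(scores)
print(f"mean {mean(scores):.1f}, 95% CI {lo:.1f} to {hi:.1f}")
```

Note that the interval is reported in the units of the variable itself, which is what makes confidence intervals easier to interpret clinically than a bare p value.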
For example, the authors of a study of the duration of the hospital stay for children with septic arthritis of the hip managed according to a clinical practice guideline may state that "there was a significantly shorter hospital stay for patients treated according to the guideline" with the addition of either "p = 0.003" if p values are used or "95% confidence intervals, 3.8 to 5.8 days for patients treated according to the guideline and 7.3 to 9.3 days for patients not treated according to the guideline" if 95% confidence intervals are used23. The p-value approach conveys statistical significance only, whereas the confidence-interval approach conveys statistical significance (the confidence intervals do not overlap), clinical significance (the magnitude of the values), and precision (the range of the confidence intervals). Power is the probability of finding a significant association if one truly exists and is defined as 1 – the probability of a type-II (β) error. By convention, acceptable power is set at ≥80%, which means that there is ≤20% chance that the study will demonstrate no significant association when there is a true association. In practice, when a study demonstrates a significant association, the potential error of concern is the type-I (α) error as expressed by the p value. However, when a study demonstrates no significant association, the potential error of concern is the type-II (β) error as expressed by power—that is, in a study that demonstrates no significant effect, there may truly be no significant effect or there may actually be a significant effect but the study was underpowered because the sample size was too small or the measurements were too imprecise. Thus, when a study demonstrates no significant effect, the power of the study should be reported. 
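Under a normal approximation for a two-group comparison of means, alpha, power, effect size, and sample size are linked by a single equation. The sketch below (function name and approximation are assumptions, not the method of any cited study) solves that equation for the per-group sample size:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sample comparison of means.

    effect_size is the difference in means divided by the pooled standard
    deviation (a dimensionless standardized effect size).
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)   # two-sided type-I error criterion
    z_beta = z(power)            # power = 1 - type-II error probability
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

print(n_per_group(0.5))   # a moderate effect size needs roughly 63 per group
print(n_per_group(0.2))   # prints 393: small effects need far larger samples
```

The formula makes the trade-offs concrete: halving the effect size roughly quadruples the required sample size, which is why underpowered studies of small effects are so common.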
The calculations for power analyses differ depending on the statistical methods that are utilized for the analysis; however, four elements are involved in a power analysis: α, β, effect size, and sample size (n). Effect size is the difference that you want to be able to detect with the given α and β. It is based on a clinical sense of how large a difference would be clinically meaningful. Effect sizes are often defined in dimensionless terms, on the basis of a difference in mean values divided by the pooled standard deviation for a comparison of two groups. Small sample sizes, small effect sizes, and large variances all decrease the power of a study. An understanding of power issues is important in clinical research, to minimize the use of resources when planning a study and to ensure the validity of a study. Sample-size calculations are performed when a study is being planned. Typically, power is set at 80%, alpha is set at 0.05, the effect size and variance are estimated from pilot data or the literature, and the equation is solved for the necessary sample size. Calculation of power after the study has been completed—that is, post-hoc power analysis—is controversial and is discouraged.

Diagnostic Performance

A diagnostic test can result in four possible scenarios: (1) true positive if the test is positive and the disease is present, (2) false positive if the test is positive and the disease is absent, (3) true negative if the test is negative and the disease is absent, and (4) false negative if the test is negative and the disease is present (Table II). The sensitivity of a test is the percentage (or proportion) of patients with the disease who are classified as having a positive result of the test (the true-positive rate). A test with 97% sensitivity implies that, of 100 patients with the disease, ninety-seven will have a positive test. Sensitive tests have a low false-negative rate. A negative result of a highly sensitive test rules disease out (SNout).
The specificity of a test is the percentage (or proportion) of patients without the disease who are classified as having a negative result of the test (the true-negative rate). A test with 91% specificity implies that, of 100 patients without the disease, ninety-one will have a negative test. Specific tests have a low false-positive rate. A positive result of a highly specific test rules disease in (SPin). Sensitivity and specificity can be combined into a single parameter, the likelihood ratio (LR), which is the probability of a true positive divided by the probability of a false positive. Sensitivity and specificity can be established in studies in which the results of a diagnostic test are compared with those of the "gold standard" of diagnosis in the same patients—for example, by comparing the results of magnetic resonance imaging with arthroscopic findings24.

TABLE II - Diagnostic Test Performance*

                 Disease Positive        Disease Negative
Test positive    a (true positive)       b (false positive)
Test negative    c (false negative)      d (true negative)

*Sensitivity = a/(a + c), specificity = d/(b + d), accuracy = (a + d)/(a + b + c + d), false-negative rate = 1 - sensitivity, false-positive rate = 1 - specificity, likelihood ratio (+) = sensitivity/false-positive rate, likelihood ratio (-) = false-negative rate/specificity, positive predictive value = [(prevalence)(sensitivity)]/[(prevalence)(sensitivity) + (1 - prevalence)(1 - specificity)], and negative predictive value = [(1 - prevalence)(specificity)]/[(1 - prevalence)(specificity) + (prevalence)(1 - sensitivity)].

Sensitivity and specificity are technical parameters of diagnostic testing performance and have important implications for screening and clinical practice guidelines25,26; however, they are less relevant in the typical clinical setting because the clinician does not know whether the patient has the disease.
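The Table II formulas can be sketched directly from the four cell counts; the counts and the assumed prevalence below are hypothetical. Note that the predictive values use the population prevalence supplied by the caller, via Bayes' theorem, rather than the prevalence implicit in the study sample:

```python
def diagnostic_performance(a, b, c, d, prevalence):
    """Diagnostic test metrics from a 2x2 table: a = true positives,
    b = false positives, c = false negatives, d = true negatives.
    Predictive values are adjusted to an assumed population prevalence."""
    sens = a / (a + c)
    spec = d / (b + d)
    accuracy = (a + d) / (a + b + c + d)       # correct classifications / total
    lr_pos = sens / (1 - spec)                 # sensitivity / false-positive rate
    lr_neg = (1 - sens) / spec                 # false-negative rate / specificity
    ppv = (prevalence * sens) / (prevalence * sens + (1 - prevalence) * (1 - spec))
    npv = ((1 - prevalence) * spec) / ((1 - prevalence) * spec + prevalence * (1 - sens))
    return dict(sensitivity=sens, specificity=spec, accuracy=accuracy,
                lr_pos=lr_pos, lr_neg=lr_neg, ppv=ppv, npv=npv)

# Hypothetical study: 97/100 diseased patients test positive, 91/100
# nondiseased patients test negative, assumed population prevalence 10%
result = diagnostic_performance(a=97, b=9, c=3, d=91, prevalence=0.10)
print(result)
```

With these numbers the positive predictive value is only about 0.54 despite 97% sensitivity and 91% specificity, illustrating how strongly predictive values depend on prevalence.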
The clinically relevant issues are the probability of the patient having the disease when the result is positive (positive predictive value [PPV]) and the probability of the patient not having the disease when the result is negative (negative predictive value [NPV]). The positive and negative predictive values are probabilities that require an estimate of the prevalence of the disease in the population, and they can be calculated with use of equations that utilize Bayes' theorem27. There is an inherent trade-off between sensitivity and specificity. Because there is typically some overlap between the diseased and nondiseased groups with respect to a test distribution, the investigator can select a positivity criterion with a low false-negative rate (to optimize sensitivity) or a low false-positive rate (to optimize specificity) (Fig. 2). In practice, positivity criteria are selected on the basis of the consequences of a false-positive or a false-negative diagnosis. If the consequences of a false-negative diagnosis outweigh the consequences of a false-positive diagnosis of a condition (such as septic arthritis of the hip in children28), a more sensitive criterion is chosen. This relationship between the sensitivity and specificity of a diagnostic test can be portrayed with use of a receiver operating characteristic (ROC) curve. A receiver operating characteristic graph shows the relationship between the true-positive rate (sensitivity) on the y axis and the false-positive rate (1 – specificity) on the x axis plotted at each possible cutoff (Fig. 3). Overall diagnostic performance can be evaluated on the basis of the area under the receiver operating characteristic curve29.

Fig. 2: Selection of positivity criterion. Because there is typically overlap between the diseased population and the nondiseased population over a range of diagnostic values (x axis), there is an intrinsic trade-off between sensitivity and specificity.
When the positive test results are identified as those to the right of cutoff point A, there is high sensitivity because most patients with the disease are correctly identified as having a positive result. However, there is lower specificity because some of the patients without the disease are incorrectly identified as having a positive result (false positives). When the positive test results are identified as those to the right of cutoff point B, there is lower sensitivity because some patients with the disease are incorrectly identified as having a negative result (false negatives). However, there is high specificity because most patients without the disease are correctly identified as having a negative result.

Fig. 3: Receiver operating characteristic (ROC) curve for a clinical prediction rule for differentiating septic arthritis from transient synovitis of the hip in children28. The false-positive rate (1 – specificity) is plotted on the x axis, and sensitivity is plotted on the y axis. The area under the curve represents the overall diagnostic performance of a prediction rule or a diagnostic test. For a perfect test, the area under the curve is 1.0. For random guessing, the area under the curve is 0.5.

Measures of Effect

Measures of likelihood include probability and odds. Probability is a number, between 0 and 1, that indicates how likely an event is to occur on the basis of the number of events per the number of trials. The probability of heads on a coin toss is 0.5. Odds is the ratio of the probability of an event occurring to the probability of the event not occurring. The odds of heads coming up on a coin toss is 1 (0.5/0.5). Because probability and odds are related, they can be converted, where odds = probability/(1 – probability). Relative risk (RR) can be determined in a prospective cohort study, where relative risk equals the incidence of disease in the exposed cohort divided by the incidence of disease in the nonexposed cohort (Table III).
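These likelihood and effect measures follow directly from event counts. The sketch below implements the standard Table III definitions (the function names and event counts are hypothetical, chosen for illustration):

```python
def odds_from_probability(p):
    """odds = probability / (1 - probability)"""
    return p / (1 - p)

def probability_from_odds(o):
    return o / (1 + o)

def relative_risk(exposed_events, exposed_total, unexposed_events, unexposed_total):
    """RR = incidence in the exposed cohort / incidence in the nonexposed cohort."""
    return (exposed_events / exposed_total) / (unexposed_events / unexposed_total)

def odds_ratio(exposed_events, exposed_total, unexposed_events, unexposed_total):
    """OR = odds of the event in the exposed group / odds in the nonexposed group."""
    eeo = exposed_events / (exposed_total - exposed_events)
    ceo = unexposed_events / (unexposed_total - unexposed_events)
    return eeo / ceo

def nnt(eer, cer):
    """Number needed to treat = 1 / absolute risk reduction."""
    return 1 / abs(eer - cer)

print(odds_from_probability(0.5))       # a coin toss: probability 0.5 gives odds of 1.0
# Hypothetical cohort: 20/100 events without treatment, 10/100 with treatment
print(relative_risk(20, 100, 10, 100))  # RR = 2.0
print(odds_ratio(20, 100, 10, 100))     # OR = 2.25, larger than the RR
print(nnt(0.10, 0.20))                  # ARR of 0.10: treat 10 to prevent one event
```

Note that the odds ratio (2.25) overstates the relative risk (2.0) here; the two converge only when the event is rare.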
For example, if a prospective cohort study of skiers with deficiency of the anterior cruciate ligament shows a significantly higher proportion of subsequent knee injuries in skiers who are not treated with a brace (12.7%) than in those who are treated with a brace (2.0%), the risk ratio is 6.4 (12.7%/2.0%)30. This can be interpreted as a 6.4 times higher risk of subsequent knee injury in a skier with anterior cruciate ligament deficiency who is not treated with a brace than in such a skier who is treated with a brace. A similar measurement in a retrospective case-control study (in which incidence cannot be determined) is the odds ratio (OR), which is the ratio of the odds of having the disease in the study group to the odds of having the disease in the control group (Table III).

TABLE III - Treatment Effects*

                      Adverse Events       No Adverse Events
Experimental group    a                    b
Control group         c                    d

*Control event rate (CER) = c/(c + d), experimental event rate (EER) = a/(a + b), control event odds (CEO) = c/d, experimental event odds (EEO) = a/b, relative risk (RR) = EER/CER, odds ratio (OR) = EEO/CEO, relative risk reduction (RRR) = (EER - CER)/CER, absolute risk reduction (ARR) = EER - CER, and number needed to treat (NNT) = 1/ARR.

Factors that are likely to increase the incidence, prevalence, morbidity, or mortality of a disease are called risk factors. The effect of a factor that reduces the probability of an adverse outcome can be quantified by the relative risk reduction (RRR), the absolute risk reduction (ARR), and the number needed to treat (NNT) (Table III). The effect of a factor that increases the probability of an adverse outcome can be quantified by the relative risk increase (RRI), the absolute risk increase (ARI), and the number needed to harm (NNH) (Table III).

Outcomes Assessment

Process refers to the medical care that a patient receives, whereas outcome refers to the result of that medical care.
The emphasis of the outcomes assessment movement has been patient-derived outcomes assessment. Outcome measures include generic measures, condition-specific measures, and measures of patient satisfaction31. Generic measures, such as the Short Form-36 (SF-36), are used to assess health status or health-related quality of life, as based on the World Health Organization's multiple-domain definition of health32,33. Condition-specific measures, such as the International Knee Documentation Committee (IKDC) knee score or the Constant shoulder score, are used to assess aspects of a specific condition or body system. Measures of patient satisfaction are used to assess various components of care and have diverse applications, including the evaluation of quality of care, health-care delivery, patient-centered models of care, and continuous quality improvement34-37. The process of developing an outcomes instrument involves identifying the construct, devising items, scaling responses, selecting items, forming factors, and creating scales. A large number of outcomes instruments have been developed and used without formal psychometric assessment of their reliability, validity, and responsiveness to change. Reliability refers to the repeatability of an instrument. Interobserver reliability and intraobserver reliability refer to the repeatability of the instrument when used by different observers and by the same observer at different time-points, respectively. Test-retest reliability can be assessed by using the instrument to evaluate the same patient on two different occasions without an interval change in the patient's medical status. These results are usually reported
