Article | Open access | Peer reviewed

Risk of Bias Assessments and Evidence Syntheses for Observational Epidemiologic Studies of Environmental and Occupational Exposures: Strengths and Limitations

2020; National Institute of Environmental Health Sciences; Volume: 128; Issue: 9; Language: English

10.1289/ehp6980

ISSN

1552-9924

Authors

Kyle Steenland, Mary K. Schubauer‐Berigan, Roel Vermeulen, Ruth M. Lunn, Kurt Straif, Shelia Hoar Zahm, Patricia A. Stewart, Whitney D. Arroyave, Suril Mehta, Neil Pearce

Topic(s)

Chemical Safety and Risk Management

Abstract

Vol. 128, No. 9 | Commentary | Open Access

Authors and affiliations:
Kyle Steenland, Rollins School of Public Health, Emory University, Atlanta, Georgia, USA
M.K. Schubauer-Berigan, International Agency for Research on Cancer (IARC), Lyon, France
R. Vermeulen, Institute for Risk Assessment Sciences, University of Utrecht, Utrecht, Netherlands
R.M. Lunn, Division of the National Toxicology Program (NTP), NIEHS, Research Triangle Park, North Carolina, USA
K. Straif, Global Observatory on Pollution and Health, Boston College, Boston, Massachusetts, USA; ISGlobal, Barcelona, Spain
S. Zahm, Shelia Zahm Consulting, Hermon, Maine, USA
P. Stewart, Stewart Exposure Assessments, LLC, Arlington, Virginia, USA
W.D. Arroyave, Integrated Laboratory Systems, Morrisville, North Carolina, USA
S.S. Mehta, Division of the National Toxicology Program (NTP), NIEHS, Research Triangle Park, North Carolina, USA
N. Pearce, London School of Hygiene and Tropical Medicine, London, UK

Correspondence: Kyle Steenland, Rollins School of Public Health, Emory University, 1518 Clifton Rd., Atlanta, GA 30322 USA. Telephone: (404) 727-0196. Email: [email protected]

Published: 14 September 2020 | CID: 095002 | https://doi.org/10.1289/EHP6980

Abstract

Background: Increasingly, risk of bias tools are used to evaluate epidemiologic studies as part of evidence synthesis (evidence integration), often involving meta-analyses. Some of these tools consider hypothetical randomized controlled trials (RCTs) as gold standards.

Methods: We review the strengths and limitations of risk of bias assessments, in particular for reviews of observational studies of environmental exposures, and we also comment more generally on methods of evidence synthesis.

Results: Although RCTs may provide a useful starting point for thinking about bias, they do not provide a gold standard for environmental studies. Observational studies should not be considered inherently biased vs. a hypothetical RCT. Rather than a checklist approach when evaluating individual studies with risk of bias tools, we call for identifying and quantifying possible biases, their direction, and their impacts on parameter estimates. As is recognized in many guidelines, evidence synthesis requires a broader approach than simply evaluating risk of bias in individual studies followed by synthesis of studies judged unbiased, or with studies given more weight if judged less biased. It should include the use of classical considerations for judging causality in human studies, as well as triangulation and integration of animal and mechanistic data.

Conclusions: Bias assessments are important in evidence synthesis, but we argue that they can and should be improved to address the concerns raised here. Simplistic, mechanical approaches to risk of bias assessments, which may particularly occur when these tools are used by nonexperts, can result in erroneous conclusions and may sometimes be used to dismiss important evidence.
Evidence synthesis requires a broad approach that goes beyond assessing bias in individual human studies and then synthesizing only the narrow range of studies judged to be unbiased.

Introduction

Evidence synthesis (or evidence integration) is widely used to summarize findings of epidemiologic studies of environmental and occupational exposures. Such syntheses are part of systematic reviews of observational epidemiologic study findings.

Systematic reviews are defined by Cochrane guidelines as reviews that "identify, appraise and synthesize all the empirical evidence that meets pre-specified eligibility criteria to answer a specific research question. They use explicit, systematic methods that are selected with a view to minimizing bias, to produce more reliable findings to inform decision making" ( https://www.cochranelibrary.com/about/about-cochrane-reviews). Systematic reviews should ideally include a statement of the goals of the review and a clear description of a) how studies relevant to those goals are identified; b) how individual studies are evaluated for potential biases; and c) the method used to synthesize evidence across studies (which sometimes includes a meta-analysis). Assessments of biases and their impact play a useful role in both b) and c). Figure 1 shows a schematic of a systematic review. Boxes 4 and 5 of this figure (evaluate evidence, integrate evidence) depict where risk of bias assessments come into play, via evaluations of individual studies and evidence synthesis across studies; they are the subject of this paper.

Figure 1. Schematic for systematic review. Adapted from National Research Council (2014).

Systematic reviews play a similar role today to that of literature reviews in the past: both attempt to provide an overview of the literature on a particular topic, either within a discipline (e.g., epidemiology) or across disciplines, and typically assess the evidence for causality of the association between exposure and disease. Systematic reviews are often done in conjunction with a meta-analysis. A meta-analysis yields a quantitative effect estimate, such as the strength of the association between an exposure and an outcome, and provides an opportunity to explore heterogeneity across studies, e.g., by study design, type of population under study, or other characteristics (a minimal pooling example is sketched below).

Subjectivity (value-based judgment) is inevitably present in the assessments of the quality of the individual studies (including whether they suffer from biases) and in the decisions to include or exclude studies in evidence syntheses and meta-analyses. It is present in the degree to which the authors interpret the reported association to be causal. It is also present in the degree to which the meta-analysis authors account for other evidence not considered in the meta-analysis itself, such as studies with exposure effect estimates not compatible with those in the meta-analysis (e.g., prevalence rather than incidence measures), ecological studies, animal data, and mechanistic data. The existence of such subjectivity is generally recognized as inherent to systematic reviews, and the goal is to make such judgments transparent (Whaley et al. 2016; Savitz et al. 2019). There is a tension, however, between the need for expert (necessarily subjective) judgment and the consistency and replicability of such reviews.

Risk of bias tools have been developed with the intention of increasing transparency and reducing subjectivity.
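As noted above, a meta-analysis yields a pooled effect estimate and a view of heterogeneity across studies. The following minimal sketch shows one common way such a pooled estimate is computed, a DerSimonian-Laird random-effects model on the log-RR scale; the five study estimates are hypothetical values chosen for illustration, not data from any actual review.

```python
import math

def dersimonian_laird(log_rrs, ses):
    """Pool study-level log relative risks with DerSimonian-Laird random effects."""
    w = [1 / se ** 2 for se in ses]                      # inverse-variance (fixed-effect) weights
    y_fixed = sum(wi * yi for wi, yi in zip(w, log_rrs)) / sum(w)
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, log_rrs))  # Cochran's Q
    df = len(log_rrs) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                        # between-study variance
    w_re = [1 / (se ** 2 + tau2) for se in ses]          # random-effects weights
    y_pooled = sum(wi * yi for wi, yi in zip(w_re, log_rrs)) / sum(w_re)
    se_pooled = math.sqrt(1 / sum(w_re))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0  # heterogeneity, in percent
    ci = (math.exp(y_pooled - 1.96 * se_pooled), math.exp(y_pooled + 1.96 * se_pooled))
    return math.exp(y_pooled), ci, tau2, i2

# Hypothetical study results: (RR, lower 95% CI, upper 95% CI)
studies = [(1.4, 1.1, 1.8), (1.2, 0.9, 1.6), (1.8, 1.2, 2.7), (1.1, 0.8, 1.5), (1.5, 1.0, 2.2)]
log_rrs = [math.log(rr) for rr, lo, hi in studies]
ses = [(math.log(hi) - math.log(lo)) / (2 * 1.96) for rr, lo, hi in studies]  # SE from CI width

rr, ci, tau2, i2 = dersimonian_laird(log_rrs, ses)
print(f"Pooled RR = {rr:.2f} (95% CI {ci[0]:.2f}-{ci[1]:.2f}), tau^2 = {tau2:.3f}, I^2 = {i2:.0f}%")
```

The between-study variance (tau-squared) and the I-squared statistic quantify the heterogeneity that, as noted above, a reviewer can then explore by study design, population type, or other characteristics.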
Such risk of bias tools are now often used in systematic reviews to evaluate individual epidemiologic studies for bias and, based on ranking systems, to determine which studies should be given more or less weight in evidence synthesis. Risk of bias tools include ROBINS-I (Sterne et al. 2016), the Newcastle-Ottawa Scale ( http://www.ohri.ca/programs/clinical_epidemiology/oxford.asp), the Navigation Guide (Woodruff and Sutton 2014), the Office of Health Assessment and Translation (OHAT) tool (NTP 2019), and a new tool to be used with the Grading of Recommendations Assessment, Development and Evaluation (GRADE) approach (Morgan et al. 2019) (see below). Another tool, ROBINS-E, is under development and not yet available ( http://www.bristol.ac.uk/population-health-sciences/centres/cresyda/barr/riskofbias/robins-e/). The risk of bias tool used in the Navigation Guide combines methods described by Viswanathan et al. (2008) and Higgins and Green (2011).

GRADE ( https://www.gradeworkinggroup.org/) is a method to assess the overall certainty of evidence in a set of studies, developed in the context of making clinical decisions based on human studies. GRADE has advocated risk of bias assessment as one part of this process without, until recently, proposing a specific tool. There has been some discussion about improving the certainty-of-evidence criteria in GRADE (Norris and Bero 2016).

We recognize that not all risk of bias tools in current use are alike (Losilla et al. 2018; Rooney et al. 2016). Different tools include different bias domains and/or define the same domains differently. Typically, all include consideration of exposure or outcome misclassification/mismeasurement, confounding, and selection bias. Some are accompanied by guidelines for evidence synthesis, and others are not. Furthermore, some partly address the concerns we outline below (see Table 1 for differences between the risk of bias tools).

Table 1. Comparing risk of bias tools.
| RoB within individual studies | ROBINS-I(a) | Newcastle-Ottawa Scale(b) | Morgan (GRADE)(c) | Navigation Guide(d) | OHAT(e) |
|---|---|---|---|---|---|
| RCT/target experiment as ideal study design | Yes | No | Yes | No | No |
| Consider direction or magnitude of bias, and importance for effect estimate | Optional, but not formally incorporated into tool | No | Optional(f) | No(g) | Optional, but not formally incorporated into tool |
| Assign highest domain risk of bias to entire study | Yes | No (but commonly done when used, by summing stars/scores across domains) | Yes | No study-level bias summary | No, but used to assign tiers in study synthesis |
| Consider statistical methodology as a separate domain | No | No | No | No | Optional |
| Evidence synthesis: rank observational studies as inherently suffering from bias | Not applicable (no formal presentation of evidence synthesis) | Not applicable (no formal presentation of evidence synthesis) | Yes, indirectly because of RCT comparison, but under development | Yes, start at moderate certainty | Yes, start at low to moderate certainty |
| Evidence synthesis: possibly reject some studies based on bias | Not applicable (no formal presentation of evidence synthesis) | Not applicable (no formal presentation of evidence synthesis) | Yes, although may be allowed in sensitivity analysis | Yes, although may be included in sensitivity analysis | Yes, although may be included in sensitivity analyses |

Note: Tools included in this table are risk of bias tools for individual studies with an algorithm-based component. GRADE, Grading of Recommendations Assessment, Development and Evaluation; OHAT, Office of Health Assessment and Translation; RCT, randomized controlled trial; RoB, risk of bias.
(a) Sterne et al. 2016. (b) http://www.ohri.ca/programs/clinical_epidemiology/oxford.asp. (c) Morgan et al. 2019. (d) Woodruff and Sutton 2014; the risk of bias tool used in the Navigation Guide combines methods described by Viswanathan et al. (2008) and Higgins and Green (2011). (e) NTP 2019. (f) Direction of bias considered, but not magnitude or eventual impact on effect estimate. (g) Not mentioned in the five published case studies ( https://prhe.ucsf.edu/navigation-guide) nor in the original paper by Woodruff and Sutton (2014).

Assigning actual scores to individual studies based on risk of bias is not done in most of these tools; it has been shown not to be effective and is discouraged in reviews by Jüni et al. (1999) and Stang (2010) and on the Cochrane website [although scoring was recently resurrected in a new systematic review method being implemented for the U.S. Environmental Protection Agency (EPA)'s Toxic Substances Control Act (TSCA) program; see Singla et al. (2019)]. All of the above-cited risk of bias tools evaluate individual studies by level of bias (e.g., low, moderate, serious, and critical) in different domains (e.g., confounding, selection bias, and information bias), and the evaluations may potentially result in exclusion from evidence synthesis of studies deemed too biased in one or more domains.
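To make the mechanics just described concrete, the sketch below illustrates a domain-based rating of a single observational study in which, as Table 1 notes for ROBINS-I and the Morgan (GRADE) tool, the worst domain rating is carried forward as the study-level rating. The domain names and judgments are hypothetical, and the rule shown is a schematic of this class of tools, not any published tool's exact algorithm.

```python
from enum import IntEnum

class RoB(IntEnum):
    """Ordered bias ratings, as used in several risk of bias tools."""
    LOW = 1
    MODERATE = 2
    SERIOUS = 3
    CRITICAL = 4

# Hypothetical domain-level judgments for one observational study
domains = {
    "confounding": RoB.MODERATE,
    "selection_bias": RoB.LOW,
    "exposure_misclassification": RoB.SERIOUS,
    "outcome_misclassification": RoB.LOW,
}

# Worst-rating-carried-forward rule: the study-level rating equals the
# highest (worst) domain rating, regardless of the likely direction or
# magnitude of each bias.
study_rating = max(domains.values())
print(study_rating.name)  # SERIOUS

# Under tools that exclude studies rated SERIOUS or worse, this study would
# be dropped from evidence synthesis even if its dominant bias (here, say,
# nondifferential exposure misclassification) would tend to push the effect
# estimate toward the null rather than create a false positive.
if study_rating >= RoB.SERIOUS:
    print("Study excluded from (or down-weighted in) evidence synthesis")
```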
However, these tools do not consistently assess the direction, magnitude, or overall importance (for the effect estimate) of the various types of bias, nor do they bring such considerations directly into the tools themselves. Note also that a risk of bias does not mean that a study is actually biased. The Navigation Guide (Woodruff and Sutton 2014) and OHAT (NTP 2019) both suggest using the direction of confounding, and the results of controlling for confounding, to upgrade or downgrade estimates of confounding bias, but neither formally builds this into its tool. The Report on Carcinogens Handbook incorporated direction and magnitude of bias into its study quality assessment guidance and evidence integration steps (NTP 2015). Other risk of bias tools also mention this issue but do not tackle it directly. For example, in ROBINS-I, the authors note, "It would be highly desirable to know the magnitude and direction of any potential biases identified, but this is considerably more challenging than judging the risk of bias" (Sterne et al. 2016).

Assessing individual study quality is an essential part of systematic review, and risk of bias tools are one way to do this that may increase transparency and replicability in reviews. These tools differ from one another, and we do not discuss each tool individually in detail but, rather, comment more generally on the limitations of their current use, both in the evaluation of individual studies and in evidence synthesis. Although we agree that if risk of bias tools are to be used they must have a list of domains and some overall system for evaluating potential bias, we note that there is no consensus on which domains should be analyzed or how risk of bias should be ranked.

In this paper, we first critically review the benefits and pitfalls of using risk of bias assessments for individual studies. We argue, along with other authors (Savitz et al. 2019; Stang 2010; Arroyave et al. 2020), that although risk of bias assessments can be a useful tool to improve transparency and limit a priori value judgments in evidence synthesis, they can also be applied as a mechanical exercise that leads to erroneous conclusions, because the assessments may consider individual studies out of context, may discriminate poorly between studies with minimal and substantial potential bias (i.e., may not evaluate the magnitude and direction of bias and its eventual possible impact on a study's effect estimates), and may have other potential shortcomings, as detailed below. Second, we consider broad types of evidence synthesis, such as those proposed by Bradford Hill (Hill 1965) and programs such as the International Agency for Research on Cancer (IARC) Monographs, and then discuss the use of triangulation (Lawlor et al. 2016). Finally, we reflect on some recent evidence syntheses and their risk of bias assessments.

Risk of Bias Assessments for Individual Studies

As previously noted, a risk of bias assessment provides a formal mechanism to systematically evaluate study quality regarding potential biases, using the same approach across all studies, and hence can add to the transparency of systematic reviews.
Here, we discuss risk of bias assessments in more detail and also make some recommendations to improve them (Table 2).

Table 2. Some common practices and suggested improvements to risk of bias assessments for individual environmental epidemiologic studies and evidence synthesis.

| Current practice | Suggested improvement |
|---|---|
| Individual studies: compare to RCTs as the ideal study | Do not consider RCTs as the ideal study |
| Individual studies: evaluate bias in different domains (e.g., confounding, selection bias, measurement error) | Consider the magnitude and direction of different biases and evaluate the net likely effect |
| Individual studies: rank potential biases (e.g., low, moderate, high) | Rank biases considering the suggestions in the rows above |
| Individual studies: no evaluation of statistical methods | Add a domain for statistical methodology similar to IARC's, i.e., assess the ability to obtain unbiased estimates of exposure–outcome associations, confidence intervals, and test statistics, and the appropriateness of methods used to investigate and control confounding |
| Evidence synthesis: in some instances, downgrade all observational studies to weak or moderate quality | Assume observational studies are high quality unless important biases are likely |
| Evidence synthesis: reject some studies based on the ranking of bias across their domains, often making the overall judgment based on meta-analyses after rejection of those studies | Retain most studies in evidence synthesis; use methods such as sensitivity analyses and triangulation to consider the net effect of possible biases; consider evidence from studies not included in the meta-analysis because of different designs or parameters |

Note: IARC, International Agency for Research on Cancer; RCT, randomized controlled trial.

Randomized controlled trials as the ideal when assessing bias vs. observational studies.

Some currently available risk of bias tools propose using a hypothetical randomized controlled trial (RCT) as a thought experiment to help judge potential biases in individual observational studies (e.g., Sterne et al. 2016; Morgan et al. 2019). We recognize that the RCT model, coupled with thinking about confounding based on counterfactuals and the use of directed acyclic graphs (DAGs) to depict causal relations, has helped advance causal inference in many instances in observational epidemiology. The relative strengths and weaknesses of RCTs vs. observational studies have long been discussed in the literature (e.g., Eden et al. 2008; Sørensen et al. 2006) but are worth reemphasizing here with respect to environmental epidemiologic studies. RCTs, if properly conducted, can in theory avoid or minimize some of the main potential limitations of observational studies (e.g., selection bias, confounding, and differential information bias). However, comparing studies to an RCT gold standard inevitably begins by classifying observational studies as of lower quality, as these studies have the potential to suffer from biases theoretically avoided by RCTs. GRADE, for example, states that "Evidence from randomized controlled trials starts at high quality and, because of residual confounding, evidence that includes observational data starts at low quality" ( https://bestpractice.bmj.com/info/us/toolkit/learn-ebm/what-is-grade/).
The Navigation Guide and OHAT consider observational studies to provide evidence of moderate quality (Woodruff and Sutton 2014; NTP 2019).

The RCT gold-standard assumption can lead to extremes in which observational studies are dismissed in their entirety. For example, the current chair of the EPA Clean Air Scientific Advisory Committee has argued that a) all observational studies quantifying an exposure–response relationship are subject to a critical level of bias (Cox 2017, 2018); and b) all air pollution epidemiology studies lack adequate control for confounding and are therefore subject to high risk of bias [see the review by Goldman and Dominici (2019) and the commentary by Balmes (2019)]. However, the EPA has said it will maintain its traditional approach of considering all observational studies without prejudice when evaluating scientific evidence for hazard identification of criteria air pollutants to meet its mandate for clean air (Parker 2019).

We argue, in contrast, that RCTs are not the gold standard for judging observational studies, particularly occupational and environmental studies. RCTs of most environmental and occupational exposures are in practice not possible, as one cannot ethically randomize people to potentially harmful exposures with no perceived benefit. Beyond that, RCTs typically involve limited sample sizes and short follow-up times, which are often inadequate for observing chronic disease or rare outcomes. RCTs deliver the exposure (e.g., a medication) at the beginning of follow-up, typically at a limited number of dose levels, which does not mimic the real-life circumstances of environmental observational studies. An RCT may also involve highly selected study groups meeting particular criteria, with little generalizability to other populations.

In contrast, in real life, and thus in observational studies, uncontrolled exposures are often present before follow-up begins, occur at many different exposure levels, and may vary by intensity, time of first exposure, and duration of exposure. Observational studies often involve outcomes (e.g., cancer and neurodegeneration) with long latencies following exposure, necessitating long follow-up periods with evaluation of lagged exposures and latency periods. They often also include long exposure histories (important for assessing cumulative exposure), require retrospective exposure assessment, and include people who change exposure categories over time. A proportion of the population is likely to have other concomitant exposures, some of which may have similar effects. Observational studies often focus on exposure–response relationships rather than simple comparisons of an outcome between exposed and nonexposed populations. As a result, exposure–response models have been developed that address complex issues such as control for confounders, the importance of measurement error in parameter estimation, model misspecification, and the possible use of Bayesian methods to incorporate prior beliefs.

We believe there should be no a priori assumption that observational studies are weaker than RCTs for studying occupational and environmental exposures, and it should be acknowledged that they generally represent the best available evidence for assessing causality. Others have concluded the same. For example, the Institute of Medicine (Eden et al.
2008) report concluded, "Randomized controlled trials can best answer questions about the efficacy of screening, preventive, and therapeutic interventions while observational studies are generally the most appropriate for answering questions related to prognosis, diagnostic accuracy, incidence, prevalence, and etiology." Thus, in our view, observational studies should be considered the norm and assigned a lower quality only if substantial biases are likely that would affect the parameter estimates.

Identifying, describing, and ranking biases.

Absent an RCT, reviewers need to ask, "Among observational studies, what are the possible biases, how likely are they, in what direction do they operate, and how much are they likely to affect the parameter estimate?" Well-conducted assessments of biases and their potential impact are essential in evaluating the contribution of individual studies to evidence on causality, but their implementation is sometimes problematic. In some cases, it may be difficult to estimate the magnitude of a bias or to appropriately assign direction or weights to the various biases or their impact on the outcome estimates. Often, the original study authors do not describe their methods completely enough to permit evaluation of bias issues. Thus, this process necessarily involves subjectivity, which should be informed by expert judgment.

The quantification of possible biases and of their relative importance is a critical emerging field of epidemiology (Lash et al. 2014), although detailed quantification may be beyond the purview of most current systematic reviews. However, we believe it is important to incorporate these methods to the extent possible in risk of bias tools and to specify them (or the lack thereof) in the methods section of systematic reviews.

For example, many cohort studies of occupational exposures have historically been criticized for not having smoking data when considering smoking-related diseases such as lung cancer. However, relatively early in modern epidemiology, Cornfield et al. (1959) explained how to quantitatively assess the likely importance of confounding factors that differ between exposed and nonexposed populations. Later, it was shown both theoretically (Axelson 1980) and empirically (Siemiatycki et al. 1988) that confounding by smoking is unlikely to explain relative risks (RRs) for lung cancer that exceed 1.2 to 1.4 in the occupational setting. This estimation was based on analyses comparing differences in smoking status between workers and general-population referents (with workers often smoking more), accounting for the large RR for smoking and lung cancer. Hence, although smoking is a very strong lung cancer risk factor and a strong potential confounder, it is not likely to account for the large RRs found for classic occupational lung carcinogens. The impact of lack of control for smoking is expected to be even smaller for outcomes to which smoking is more weakly related or, in general, for any confounder only weakly related to disease.

The work of Cornfield et al. (1959) and others has been followed by methods to estimate the strength of association that a hypothetical unmeasured confounder, about which the investigator has no prior knowledge, must have with both the outcome and the exposure to have an impact equal to the observed effect (VanderWeele and Ding 2017; Ding and VanderWeele 2016; Lubin et al. 2018).
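These methods amount to simple calculations. The sketch below implements two of them: an indirect bias-factor computation in the spirit of Axelson (1980) for a known but unmeasured binary confounder such as smoking, and the E-value of VanderWeele and Ding (2017) for a hypothetical unmeasured confounder. The smoking prevalences and the smoking-lung cancer RR used here are illustrative assumptions, not values from any particular study.

```python
import math

def confounding_bias_factor(p_exposed, p_referent, rr_confounder):
    """Ratio of the confounded RR to the true RR for a binary confounder
    (indirect adjustment in the spirit of Axelson 1980)."""
    return ((p_exposed * rr_confounder + (1 - p_exposed)) /
            (p_referent * rr_confounder + (1 - p_referent)))

def e_value(rr):
    """E-value (VanderWeele and Ding 2017): the minimum RR an unmeasured
    confounder must have with both exposure and outcome to fully explain
    an observed RR > 1."""
    return rr + math.sqrt(rr * (rr - 1))

# Illustrative smoking scenario: 50% smokers among exposed workers vs. 40%
# among referents, with a smoking-lung cancer RR of 10
print(round(confounding_bias_factor(0.5, 0.4, 10.0), 2))  # 1.2, consistent with
# the 1.2-1.4 ceiling for smoking confounding noted above

print(round(e_value(2.0), 2))  # 3.41, i.e., roughly the 3.5 discussed next
```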
Such estimation enables us to say, for example, that for an observed RR of about 2.0 to be fully explained by an unmeasured confounder, that confounder would need an RR of at least about 3.5 with both the exposure and the outcome to reduce the observed association to the null value of 1.0. A factor with such a strong effect on the outcome and such a strong association with the exposure, yet unknown, is unlikely in most settings. Yet even today, smoking, other identified confounders, and unknown confounders continue to be raised as possible sources of bias to explain positive findings (VanderWeele and Ding 2017). Thus, at a minimum, estimating the likely maximum extent of (unmeasured) confounding and discussing its likely impact on the observed effect estimate can and should be done.

These same considerations of magnitude and direction apply to potential biases beyond confounding, e.g., selection bias and measurement error. For example, classical nondifferential measurement error will generally bias effect measures toward the null, so that if an elevated risk is found, it is not likely due to this source of error (although other biases might operate away from the null). In contrast, Berkson measurement error generally affects the precision of the findings but not the point estimates themselves (Armstrong 1998). Moreover, mismeasurement may have little consequence for exposure–response parameters when there are large exposure contrasts in the population (Avanasi et al. 2016). The sensitivity and specificity of outcome classification may also play a role. Selection bias, occurring during recruitment, can affect estimates only to the degree that the parameter of interest (i.e., the association between exposure and outcome) among people not included in a study differs substantially from that parameter in the population studied. Furthermore, if two sources of bias operate in opposite directions, they may approximately cancel each other out. A reviewer needs to consider multiple sources of possible bias and their relative importance, and whether their net effect is likely to bias the effect estimate toward or away from the null. Taking such considerations into account requires going beyond some current ranking schemes (Table 2). We note that these same issues are at the heart of triangulation, discussed in more detail below. We believe it is possible to improve risk of bias tools to formally incorporate the magnitude and direction of bias, but doing so will take considerable work.

Other possible domains in risk of bias tools.

Another possible bias domain is conflict of interest, which can create a potential bias and is not always assessed in risk of bias tools. There is strong evidence that studies authored by those with vested interests are generally favorable to those interests, hence the need to disclose potential conflicts of interest. The effects of conflicts of interest are well documented in clinical medicine (Angell 2008; Krauth et al. 2014; Lundh et al. 2012), and biased results from similar conflicts of interest have been documented in occupational and environmental epidemiology (Michaels 2009). Evaluation of risk of bias regarding conflict of interest may also address selective reporting of study results (e.g., reporting only results that are statistically significant).
In our view, however, a potential conflict of interest does not define a specific bias in and of itself, and if specific biases are present, reviewers should be able to detect them when evaluating studies. Hence, we do not argue for including conflict of interest as a separate domain in risk of bias tools, although such potential conflicts must be clearly acknowledged by authors.

Another domain, generally not included in current risk of bias tools, is potential bias from problems in statistical methodology. Concerns include the choice of an inappropriate and badly fitting model, failure to model exposure–response or to evaluate different exposure–response models, incorrect use of mixed models, incorrect use of Bayesian techniques, violation of statistical assumptions (e.g., normal residuals in linear regression), overadjustment for covariates related to exposure but not to outcome, adj
