The Changing Face of Epidemiology in the Genomics Era
2002; Lippincott Williams & Wilkins; Volume: 13; Issue: 4; Language: English
10.1097/00001648-200207000-00017
ISSN 1531-5487
Topic(s): BRCA gene mutations in cancer
Abstract
The year 2001 marked the first publication of draft sequences of the human genome. 1–3 When finalized, the DNA sequence will provide a blueprint for the location and structure of all human genes. Epidemiologists and geneticists plan to use the human genome as a reference sequence to catalogue the extent of variation in chromosomal DNA. Knowledge of gene function and the effects of genetic variation will provide greater understanding of the pathophysiology of disease, and could yield new treatments and other clinical interventions for persons with specific genetic characteristics. 4,5
The long-term prospects for the genomics enterprise have been much discussed. Epidemiologists are among those taking a generally positive view of genomics, 6–8 while biologists and philosophers are more skeptical. 9–12 Population geneticists point to the difficulty of integrating genetics and environmental risk factors. 13 Some ask whether the massive scaling-up of genetic epidemiology studies is going to provide much new information on the genetic components of health, ". . .when a major justification is that nothing else has worked so far." 14,p. 155
As genomics and epidemiology begin to intersect, there is the potential for both fields to be altered in ways that are mutually beneficial. The first section of this commentary discusses ways in which the study of genomics is affecting epidemiology. By the very nature of the questions it asks, genomics brings intense scrutiny to certain well-known limitations of epidemiology, and challenges epidemiologists to seek new solutions to these old problems. The second section identifies current issues in genomics research that epidemiology is particularly well suited to address. If epidemiology stands to benefit from its engagement with genomics, genomics is likely to benefit as well. Unfortunately, there is also the possibility of harm.
Incorporation of genomics into epidemiology could undermine the work of public health; improvement in the health of whole populations will not come about unless epidemiologists engage with the greater scientific community in discussions about the most effective ways to use genetic information.
Genomics and Its Implications for Epidemiology
The term genomics is used in at least two ways. In its more limited sense, genomics refers to the laboratory techniques and information processing used to assemble the chromosomal DNA sequence of an organism, the genome. 3 More broadly, the term is used to refer to what we might call the "-omics" approach. Genomics in this sense provides the centerpiece for "discovery science" (or more aptly, exploratory science), which is "research in which one generates large resources of information on biologic molecules in aggregate without necessarily knowing which pieces of information and which correlations will prove most important" (emphasis added). 15,p. 343 What some of us once scorned as "fishing expeditions" 16 has acquired legitimacy with the arrival of genomics. 17
The "-omics" approach (in the sense of brute-force generation of vast amounts of data) extends beyond the domain of chromosomal DNA. It is possible to investigate gene expression (messenger RNA or "transcriptomics"), protein production ("proteomics"), levels of metabolites ("metabonomics"), and a host of other biologic molecules in the aggregate. Subdisciplines are being founded on the principles of "-omics" research. One prominent example is "pharmacogenomics," a burgeoning field based upon genome-wide evaluation of genetic factors involved in metabolism of prescription medications, as explanations for variations in drug efficacy and side-effects. 4
The marriage of broad-sweep genomics with population-based epidemiology is both inevitable and daunting. In the process, questions that lie at the very foundation of epidemiology come under fresh scrutiny.
The intellectual ferment that currently surrounds genetic epidemiology owes as much to these fundamental challenges as to the more distant promise of biological insights. 13,18–20
Misclassification: The Problem of Mistaken Genes
One of the great myths of laboratory science is that laboratory assays provide exact results. The myth is especially strong for genes, which by definition appear stable and definitive. However, like all methods of data collection, laboratory assays for detecting genetic variation are subject to error. Errors can arise in many ways. The nucleotide misincorporation rate using polymerase chain reaction (PCR)-based methods has been as high as 10⁻⁴ bases per cycle. 21 Newer reagents have lowered the error rate by several orders of magnitude, but have not eliminated it. Foreign DNA can be picked up and replicated in the process of PCR. These PCR artifacts can appear as spurious "rare alleles," 22 particularly when the amount of starting template is low and the number of cycles of amplification is high. Genotyping assays may detect the different alleles within a locus with varying efficiency, and differential PCR amplification can occur even when primers do not overlap the site of DNA variation. 23
Epidemiologists are turning to buccal swabs, mouth washes, and other non-blood sources of DNA as a way to increase response rates. However, we need to be aware that these methods yield lower amounts of DNA and may be more subject to contamination and PCR artifacts than blood samples. Now, more than ever, proper sample handling is a critical issue. 24
As epidemiologists, we do not often run our own laboratories, but we should closely evaluate the quality of genotyping data. Some useful rules of thumb include the following: To minimize laboratory error, genotyping reactions should be conducted with known positive and negative controls.
For example, synthetic oligonucleotides representing each of the alleles to be tested can be used as positive controls in most PCR-based genotyping assays. At least 10% of genotyping reactions should be repeated at random, in addition to repeating any samples with ambiguous results. As part of quality control, allele and genotype frequencies should be compared with previously published values, and genotype frequencies should be examined for departures from Hardy-Weinberg equilibrium. 25 (The equilibrium assumption is required for the Pearson chi-square statistic comparing allele frequencies in cases and controls, 26 but it is often overlooked.) Departures from Hardy-Weinberg equilibrium may occur due to population dynamics or laboratory error. To address the latter, a sample of genotypes should be repeated using a gold standard laboratory method (such as restriction fragment length polymorphism analysis or direct DNA sequencing).
As more diverse populations are studied, the range of allelic variation will increase. A number of on-line catalogs provide repositories for variations in human genes (eg, dbSNP: http://www.ncbi.nlm.nih.gov; NCI Genome Anatomy Initiative: http://cgap.nci.nih.gov/GAI; Human Genome Project Database: http://gdbwww.gdb.org; SNP Consortium website: http://snp.schl.org; Environmental Genome Project website: http://www.genome.utah.edu/genesnps). But epidemiologists should be aware that errors are found in many databases, that not all DNA sequence information has been confirmed, and that the order and location of all human genes are still unknown. 27–29
Even if the laboratory results are correct, errors can be introduced by the complexities of information processing. Computer programs used to predict the location of coding sequences within the genome can mis-specify the beginning or end of coding regions, or can miss genes entirely.
30 With genetic data as with any other, errors of measurement produce misclassification and its consequent distortions. Random (stochastic) as well as directed-error models have been developed to account for the mistakes that occur using automated laboratory methods for single nucleotide polymorphism (SNP) detection. 31,32 Garcia-Closas et al. 33 showed that even small errors in genotyping may bias estimates of interaction and may reduce power. Errors of genetic information, like other sources of error, should be addressed in epidemiologic studies as part of a formal sensitivity analysis. 34
Confounding by Unmeasured Genes: The Issue of Linkage Disequilibrium
An epidemiologic association between an allele and a health endpoint can arise when the true allele of interest lies elsewhere in the genome, in linkage disequilibrium with the measured allele. 35–37 Linkage disequilibrium occurs either because the relevant predisposing allele emerged relatively recently in the evolutionary history of a population, or because particular combinations of alleles are selected for (or against) in persons with disease. Linkage disequilibrium occurs along wide regions of human chromosomes, and may involve multiple genes. 38 Thus, epidemiologists will need to conduct laboratory assays to detect multiple alleles within and across genes, not just single genetic variants.
Combinations of alleles can be difficult to measure. One method for identifying specific combinations of alleles that are co-inherited within a locus (haplotypes) is to obtain DNA from family members, but this is usually impractical in population-based studies. Another solution is to conduct allele-specific PCR amplification followed by direct DNA sequencing for all subjects in a study, which is expensive and time-consuming. Haplotypes may prove useful to epidemiologists who conduct genome-wide association studies, because they may reduce the number of markers required for genotyping.
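To make the notion of linkage disequilibrium concrete, the following is a minimal sketch of the standard pairwise measures D, D' and r² for two biallelic loci, computed from haplotype and allele frequencies. The function name and the example frequencies are illustrative assumptions, not values from the commentary, and the sketch presumes haplotype frequencies are already known (which, as noted above, is often the hard part).

```python
def linkage_disequilibrium(p_ab: float, p_a: float, p_b: float) -> dict:
    """Pairwise linkage disequilibrium for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a:  frequency of allele A at the first locus
    p_b:  frequency of allele B at the second locus
    (Hypothetical helper for illustration only.)
    """
    # D is the excess of the observed haplotype frequency over the
    # frequency expected if the two alleles were independent.
    d = p_ab - p_a * p_b

    # D' rescales D by its largest attainable magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0

    # r^2 is the squared correlation between the alleles at the two loci.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = (d * d) / denom if denom > 0 else 0.0

    return {"D": d, "D_prime": d_prime, "r2": r2}


if __name__ == "__main__":
    # Illustrative frequencies only: D is about 0.09, D' about 0.64, r^2 about 0.24.
    print(linkage_disequilibrium(p_ab=0.15, p_a=0.30, p_b=0.20))
```

A marker with high r² relative to an unmeasured functional allele will show much of that allele's association with disease, which is why a measured variant can stand in for, or be confounded by, a neighbor.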
Maximum likelihood 39,40 and Bayesian methods 41 for estimating haplotypes are available, but these methods have not been tested on the large scale necessary for epidemiologic studies. Once the principal haplotypes have been identified in human populations, the next step will be to determine the effects on biologic function of specific haplotypes. Bridging the immense gap between genotype and phenotype is one of the "fundamental issues of the genomics era." 42 Genotype-phenotype relationships are further complicated by the fact that unmeasured genetic and environmental factors can influence gene expression. 43–49 Thus, the effects of a given variant or haplotype may differ across populations with different genetic backgrounds and among individuals with different environmental exposures. 14
Expanding the Epidemiologic Repertoire: Emergence of Novel Study Designs
Genomics has given new life to obscure epidemiologic designs and stimulated the development of new approaches. An example of a neglected design is the case-only study. Prentice et al. 50 proposed case-only studies as a way to identify risk factors for chronic diseases by comparing exposures among several categories of the chronic disease. While there are limitations to the interpretation of case-only data, the fact that the design functions without controls permits analysis in settings where more conventional studies might be difficult or impossible. Piegorsch et al. 51 showed that case-only studies could similarly be used to investigate gene-environment interaction. In this setting, exposures are compared among categories of cases distinguished by genotype, rather than by clinical subtype. The main assumption of such analyses is that the environmental exposure and genotype are not correlated in the underlying population.
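The case-only comparison just described can be sketched as a 2×2 table of cases cross-classified by exposure and genotype. The following is a minimal illustration, assuming a rare disease and independence of genotype and exposure in the source population; the function name, the cell labels, and the Woolf-type confidence interval are choices made here for illustration, not details given in the commentary.

```python
import math


def case_only_interaction(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Case-only odds ratio for gene-environment interaction.

    Cells count CASES only (hypothetical labels):
      a: exposed,   susceptible genotype
      b: exposed,   other genotype
      c: unexposed, susceptible genotype
      d: unexposed, other genotype
    Returns (odds_ratio, lower, upper) using a Woolf (log-based) interval.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for this simple estimator")

    # Exposure-genotype odds ratio among cases. Under the independence and
    # rare-disease assumptions it estimates departure from multiplicative joint effects.
    odds_ratio = (a * d) / (b * c)

    # Standard error of the log odds ratio (Woolf's method).
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Made-up counts for illustration.
    print(case_only_interaction(a=40, b=60, c=20, d=80))
```

A value near 1 is compatible with multiplicative joint effects; the estimate says nothing about the main effect of either the genotype or the exposure.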
When the disease outcome of interest is rare, case-only studies offer greater precision for estimating gene-environment interaction than case-control studies of comparable size – by one estimate, roughly twice the power. 52 Case-only studies offer similar advantages for investigating gene-gene interactions. 53 There are obvious practical advantages of not having to collect controls, 54–56 but the limitations of case-only studies should also be kept in mind. 57 This design does not permit estimation of genetic or environmental effects directly, but only gene-environment or gene-gene interaction. Also, case-only studies are only capable of detecting departures from a model of multiplicative risks (ie, from a model of homogeneous or constant relative risks). 54 We have not accumulated enough information on gene-environment (or gene-gene) interaction to know how often such interactions are greater than additive but less than multiplicative. Sub-multiplicative effects are likely to be important for public health, 58 but would be missed with the case-only approach. Genomics provides further justification for epidemiologists to consider additive models for interaction, 59 because the use of additive models for gene-gene interaction (or epistasis) is already well-established among geneticists. 19
A final nagging question has to do with the general validity of the assumption of independence of genotype and environmental exposure (or, for gene-gene interaction, the independence of genotypes at two loci). Empirical data from existing case-control studies are needed to evaluate whether environmental exposures and genotypes are generally as independent as intuition would suggest.
A New Type of Confounding: Population Admixture
Case-only studies are one of several novel study designs employed to avoid population admixture.
Admixture (or population substructure) is a particular type of confounding that occurs when subgroups within a population have both a higher risk of disease and a different frequency for a genetic marker but the marker plays no role in disease susceptibility. The higher disease risk may be due to environmental or nutritional factors unrelated to genetics. Like other sources of confounding, admixture can produce a spurious association between the genetic marker and disease. 6 Case-only studies attempt to avoid admixture by avoiding controls. Other study designs have been proposed to minimize the potential for population admixture, including the use of sibling controls and case-parent studies. 60–62 Case-parent studies comprise "triads" of affected children and their parents; the parental alleles not transmitted to the child form the "control" genotype. Case-parent studies are not vulnerable to the problems of confounding by admixture, and they provide certain other advantages of power and flexibility. 63
However, even as the unexpected advantages of case-parent triad analyses are being realized, the original concerns about confounding by admixture are being questioned. There are few real examples of distortion by population admixture, 63 and data simulations and exploration of existing data have provided little evidence that admixture is a major threat to epidemiologic inference. 64–66
Race and Ethnicity: Pulling Epidemiology into the Modern Age
There is no single gene or set of genes that defines race or ethnicity. 67 The lingering misunderstanding that race is simply a surrogate for "genetics" can finally be put to rest with the opportunity to study genes directly. 68,69 Recent data have shown that when allele frequencies differ significantly across racial or ethnic groups, most of the variability occurs at anonymous loci that do not encode alleles relevant to disease.
70,71 In situations where different alleles or allele frequencies are observed in coding regions that may affect disease susceptibility, excluding racial or ethnic subgroups will be counter-productive. This is because the more genetic diversity that exists within a study population, the greater the potential for uncovering new clues to disease etiology. 67 Statistical methods for detecting and controlling for admixture have been developed, 72,73 and in some cases, population mixing actually increases, rather than decreases, the probability of finding disease susceptibility alleles. 70,74,75 For epidemiologists, the advent of genomics forces us away from nineteenth century reductionism, 9 and brings new perspective on race as a social construct rather than a biologic reality. 76
Getting What We Ask For: The Problem of Too Much Information
One of the hallmarks of the "-omics" approach is the vast amount of information it generates. The complete human genome DNA sequence, when combined with high-throughput laboratory methods, will enable us to identify variation in any gene deemed relevant to disease. Epidemiologists will soon have the capacity to determine a "genome-wide" genotype for every participant in a study. Having the entire human genome available permits a comprehensive, global view of human genetics. But how can epidemiology meaningfully assess the health effects of 30,000 or more genes, plus their interactions with each other, plus their interaction with environmental factors? Furthermore, within any given gene, numerous alleles or allelic combinations are likely to contribute to susceptibility for common human diseases. 77 Even if an epidemiologic study is limited to the most plausible 0.1% of genes, this still leaves hundreds of genes, thousands of alleles, and untold combinations among them for statistical analysis within a single epidemiologic study.
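The scale of the problem can be illustrated with simple counting. The sketch below, under the simplifying assumption of one variant per gene and pairwise interactions only, tallies the number of gene main effects and gene-gene pairs for a given number of genes. The function name is hypothetical, and the Bonferroni-style threshold is included only as a familiar yardstick, not as a method the commentary recommends.

```python
from math import comb


def analysis_burden(n_genes: int, alpha: float = 0.05) -> dict:
    """Count main-effect and pairwise gene-gene comparisons for n_genes.

    Assumes (for illustration) one variant per gene and ignores
    higher-order and gene-environment terms, so it is a lower bound.
    """
    pairwise = comb(n_genes, 2)          # n * (n - 1) / 2 unordered gene pairs
    total = n_genes + pairwise           # main effects plus pairwise interactions
    return {
        "main_effects": n_genes,
        "pairwise_interactions": pairwise,
        "total_comparisons": total,
        # Per-comparison threshold that would hold the family-wise error at alpha.
        "bonferroni_alpha": alpha / total,
    }


if __name__ == "__main__":
    for n in (300, 30_000):
        print(n, analysis_burden(n))
```

With 30,000 genes the pairwise count alone is 449,985,000, before alleles within genes or environmental factors are considered.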
The problem of massive amounts of data has been brewing on the epidemiologic horizon for decades without being urgent enough to attract intensive scrutiny. Genomics thrusts these issues to the forefront.
Epidemiologists must now choose from a massive number of genetic markers that are available. One option is to limit epidemiologic investigations to genetic variants that plausibly affect phenotype. 20,78 Cargill et al. 79 recommend that priority go to variants that result in non-conservative changes in amino acids within evolutionarily conserved domains. Alterations in 5′ promoter and 3′ untranslated regions are also likely to be functionally relevant. 14 As we gain a better grasp of gene function and of biology as a whole (including protein-protein interactions and the active sites of enzymes), we will be able to limit our gene search more rationally. However, non-functional variants may still be useful in chromosomal regions where functional alleles have not been identified, because the marker alleles may be in linkage disequilibrium with a true functional allele. 80
Another way to deal with increasing complexity is through improved statistical methods. 18 Hierarchical regression or Bayes estimation may help to deal with large samples and the issue of "multiple comparisons," a particularly important aspect of genomics. 81 Bayes estimation can be used to incorporate prior expectations, especially for gene-environment and gene-gene interaction. 82
Statistical methods are also needed to help us deal more explicitly with the role of chance in disease causation. From an evolutionary perspective, chance events play a fundamental role in generating genetic diversity, because "without random mistakes in the copying of genetic information, new life forms would not evolve." 83,p. 4 As Finch and Kirkwood point out, 83,p. 206 "the ubiquity of noise merits attention in its own regard as part of basic biological mechanisms."
The authors propose "intrinsic chance as a third factor to the conventional two factor model, which attributes genes and the environment as the main determinants of life history." 83,p. 3 Bayesian analysis and hierarchical regression can be used to address the role of random error in disease etiology. While these methods are becoming more accessible, 84,85 they have yet to gain wide acceptance. Additional methods may be needed to model stochastic processes. As we comprehensively uncover the genetic and environmental underpinnings of disease, we will be forced to ask, what other explanations are there?
Scaling Up: The Movement Toward Mega-Studies and Meta-Analysis
One limitation of many genetic epidemiology studies has been their relatively small sample size. In studies that lack power, a false positive association is more likely to be large (and therefore impressive). This problem is exacerbated when estimating interactions, because analyses must be conducted in subsets of data. 86 Naïve authors might conclude that the observed interactions must be very important to have been detected in such a small study. Already there has been a spate of published association studies with non-reproducible results. 19,87,88
One proposed solution is larger studies. Thousands of participants are needed to pursue gene-environment interaction in case-control studies. 89,90 However, studies with more than 2,000 cases and 2,000 controls are seldom feasible (or fundable). More complex stratified sampling schemes may be needed to increase power. 91
Another solution is to pool data across studies. Population geneticists warn that "the chromosomal position and the genotype-phenotype relationship of a locus cannot be estimated reliably by use of a single data set of currently realistic size, at least for loci of small effect size. . ." 81,p. 1357 Thus, it will become increasingly important for epidemiologists to share data.
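As one simple illustration of what pooling across studies can look like, the sketch below combines study-specific odds ratios with fixed-effect inverse-variance weights on the log scale. This is a common textbook approach chosen here as an assumption; the commentary does not prescribe a pooling method, and the function name and inputs are hypothetical. It also ignores between-study heterogeneity, which real genetic association data often show.

```python
import math


def pool_log_odds_ratios(estimates):
    """Fixed-effect inverse-variance pooling of odds ratios.

    estimates: iterable of (odds_ratio, se_of_log_odds_ratio) pairs,
               one per study (hypothetical input format).
    Returns (pooled_odds_ratio, pooled_se_of_log_odds_ratio).
    """
    estimates = list(estimates)
    if not estimates:
        raise ValueError("at least one study is required")

    # Each study is weighted by the inverse of its variance on the log scale,
    # so larger, more precise studies contribute more.
    weights = [1.0 / (se ** 2) for _, se in estimates]
    total_weight = sum(weights)

    pooled_log = sum(
        w * math.log(odds_ratio) for w, (odds_ratio, _) in zip(weights, estimates)
    ) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return math.exp(pooled_log), pooled_se


if __name__ == "__main__":
    # Made-up study results: one small study with a large estimate, two larger ones.
    studies = [(3.0, 0.60), (1.3, 0.20), (1.2, 0.25)]
    print(pool_log_odds_ratios(studies))
```

In the made-up example the small study's large odds ratio receives little weight, which mirrors the concern above about impressive but imprecise estimates from underpowered studies.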
The Human Genome Epidemiology Network is one systematic attempt to consolidate epidemiologic genotype data (http://www.cdc.gov/genetics/hugenet). 92 This database includes estimates of allele frequencies, results of association studies, and systematic reviews.
Epidemiology and Its Implications for Genomics
To this point, discussion has focused on the ways genomics is pushing epidemiologists to develop new methods and refine old ones. But influence is not a one-way street. Epidemiology will contribute to, and perhaps help to define, genomics research in the coming era.
Too Many Options: The Problem of Complexity
There is something thrilling about the sheer volume of data that can be generated involving human genes. But the practical matter of setting priorities must eventually be addressed. Epidemiology's public health perspective may help. Rather than tackle every possible research question, a concern for public health would suggest that we investigate connections between genes and disease that have a major impact on society. Certainly identifying major loci of susceptibility for high-risk families is an important area of investigation, but we also need to explore the connections between more common genetic variants and more common diseases. 14 Genetic variants that are rare and observed in only select populations may help locate certain disease genes, but the attributable fraction for such variants is likely to be negligible.
One way to prioritize the hunt for alleles relevant to health is to give special attention to genes that interact with modifiable environmental exposures. It may not be obvious from a genetic perspective, but one of the most promising benefits of genetic research is the identification of genetic subgroups that are susceptible to environmental causes of disease. 93,94 The damaging effects of preventable environmental exposures may be more apparent among those who are genetically susceptible.
95 Incorporating genetic markers into epidemiologic studies will help to identify environmental risk factors that are too weak to be discovered on their own. Genetically susceptible subgroups may also clarify dose-response relationships that are otherwise inconclusive or controversial. 93
Keeping Things in Perspective: Maintaining a Focus on Populations
Epidemiology has a defined research agenda. Epidemiologists seek to identify the determinants of health outcomes in human populations. Research in human genomics is less well-defined at present, and could benefit from implementation of some principles of population-based epidemiology: for example, how to sample human populations to obtain valid estimates of allele frequencies, how to identify cases and controls, and how to integrate genetic data with environmental exposure information. 96 Indeed, the chief contribution of epidemiology to genomics research may be to emphasize the importance of populations.
Other disciplines can help to reinforce the population perspective, including population genetics, social history, and anthropology. 9–11,14,97 When interpreting genetic data, we need to take into account cultural and environmental factors that influence human health. Genetic markers reflect human origins and population dynamics, and represent a tool for uncovering the complicated interrelationships between environment, culture, and genetics in human history. 67,71 Indeed, each human disease has a unique genetic architecture that depends on human evolutionary history. 97 Genetics has been used to locate the source of disease outbreaks among human populations, 97 including plague, parasitic infections and sexually transmitted diseases. 98,99 Polymorphisms in drug and carcinogen metabolism genes originally evolved in humans and animals as an adaptation to plant toxins consumed in different parts of the world.
100 Thus, as we explore the role of genetics in human disease, we need to take into account not just who we are as individuals, but our origins and future as dynamic human populations. 71,97
Getting the Full Picture: Interpreting Results in the Context of Public Health
Epidemiology can help bring the public perspective to future research in human genetics. For public health professionals, "health" is not defined as the absence of a particular genetic or biochemical defect, but rather as complete physical, mental, and social well-being. 101 This emphasis on broadly defined health can add an important dimension to genomics research by widening the scope of investigation. It is possible for a genetic disadvantage in one setting to be a genetic advantage when confronted with a different set of environmental exposures. For example, some alleles may increase risk of a given disease but positively affect other aspects of health, such as fertility. 102 In the initial stages of research, it may be necessary to focus on how an allele affects a single disease in a single group, but this is not adequate for final conclusions. This is a particularly important point when interventions are being considered at the level of an individual's genotype. 53,103,104 We need to consider the full spectrum of health outcomes, and how they are affected positively or negatively by our genetic makeup.
The End of Determinism: Dealing with Uncertainty
Epidemiology can also offer insights into causal inference. Basic scientists currently express frustration because "genetic factors are not likely to explain diseases in the usual causal sense." 14,p. 153 Multi-generation pedigrees segregating highly penetrant alleles can be difficult enough to understand. Causal pathways to human disease become increasingly complex when genetic factors are neither necessary nor sufficient for disease to occur.
Epidemiologists are accustomed to understanding events in a probabilistic fashion, and have developed models for disease causation that take into account multiple contributory causes as well as unmeasured risk factors. 105,106 These tools are directly relevant to the types of genes currently being investigated for complex diseases such as type 2 diabetes, hypertension and asthma, where no single locus or allele is likely to explain a majority of cases of illness in any given population.
The Challenges Facing Genomics and Epidemiology
Genomics and epidemiology represent two powerful approaches for understanding the causes of human disease. In one sense, they stand as two distinct enterprises, each with its own culture. But as genomics and epidemiology intersect, the two fields will depend increasingly on the methods of the other. Epidemiologists rely on genomic scientists to obtain DNA sequence data and suggest interesting new genes. The field of genomics needs epidemiologists to unravel the complex role of genetic variation in human disease. Each field will inevitably change the other. Much of this change is likely to be for the better.
There are pitfalls, however. Not everyone is convinced that "fishing expeditions" are good science 107 or a valid way to improve public health. 14 On the one hand, clinical interventions for carriers of rare, highly penetrant mutations represent an important application of genomics to clinical medicine. 108 Epidemiologists play an important role by collaborating with clinicians to develop tests for such mutations that have maximum sensitivity, specificity, and predictive value. 109–111 However, highly penetrant mutations are rare, and most strong genetic effects are rare. 19,97 The majority of cases of chronic disease will be carriers of common, low penetrance alleles or polymorphisms. 14 Like all weak risk factors, the predictive ability of such polymorphisms will be low, even in combination.
112,113 Thus, in any given population, relatively few persons will be candidates for genetic testing and clinical intervention. 114 Interventions to reduce environmental exposures could be targeted to those with particular polymorphisms, 115 but the utility of such an approach has yet to be demonstrated. For many complex diseases, the number of genes that contribute to susceptibility is likely to be quite large, and the effects of each gene or allele will be weak. For example, suppose there are a dozen or more genes that contribute to type 2 diabetes. Attempting to identify susceptible subgroups for public health interventions would be too complex to be of practical value. Thus, for most chronic diseases, it is likely that more persons will benefit from modification of lifestyle or environmental factors than from knowledge of their genotypes. 14,116,117
If epidemiologists direct their efforts toward a comprehensive search for the genetic underpinnings of every discrete health outcome, and ignore environmental exposures and attributable risk, we will miss an opportunity to prevent disease. We must maintain the integrated perspective that distinguishes epidemiology from other health disciplines; otherwise, we will abandon public health.
Conclusions
Epidemiologists will not make meaningful contributions to genomics simply by providing DNA samples and generating more data. Epidemiologists need to contribute to the development of a conceptual framework that incorporates genomics into public health. Establishing such a framework requires new solutions to methodologic problems that have challenged epidemiologists for decades, and thus will benefit all aspects of epidemiologic research. These issues have been neglected owing to their difficulty, but epidemiologists need to address them in order to move forward. Genetics may be able to reveal much about human biology, but it will not be able to explain human health.
Health depends not just upon individual genotypes but also upon the organization of communities and societies. 118,119 Genomics provides a new window on biological complexity that should stimulate epidemiologists to achieve a greater understanding of disease and health at all levels. Complex interactions occur not only between genes and environment, but also among social, economic, and political determinants of health. 120 As our field eagerly adopts the tools of modern molecular genetics, epidemiologists must convey the public-health perspective to the community of geneticists who venture into the study of human health. This is not merely an opportunity; it is our essential – and unique – contribution.
Acknowledgments
I thank Kenneth Weiss, Beverly Rockhill, David Savitz, Andrew Olshan, Beth Newman, Jack Taylor, and Laura Beskow for insightful comments on the manuscript.
Reference(s)