Biases and Reconciliation in Estimates of Linkage Disequilibrium in the Human Genome
2006; Elsevier BV; Volume: 78; Issue: 4; Language: English
10.1086/502803
ISSN 1537-6605
Authors: Itsik Pe’er, Yves Chrétien, Paul I. W. de Bakker, Jeffrey C. Barrett, Mark J. Daly, David Altshuler
Topic(s): Genomics and Chromatin Dynamics
Abstract: Genetic association studies of common disease often rely on linkage disequilibrium (LD) along the human genome and in the population under study. Although understanding the characteristics of this correlation has been the focus of many large-scale surveys (culminating in genomewide haplotype maps), different studies have yielded wide-ranging estimates. Since understanding these differences (and whether they can be reconciled) has important implications for whole-genome association studies, in this article we dissect biases in these estimates that are due to known aspects of study design and analytic methodology. In particular, we document in the empirical data that the long-known complicating effects of allele frequency, marker density, and sample size largely reconcile all large-scale surveys. Two exceptions are an underappraisal of redundancy among single-nucleotide polymorphisms (SNPs) when evaluation is limited to short regions (as in candidate-gene resequencing studies) and an inflation in the extent of LD in HapMap phase I, which is likely due to oversampling of specific haplotypes in the creation of the public SNP map. Understanding these factors can guide the interpretation of empirical LD surveys and has implications for genetic association studies.

Genetic association studies offer a powerful strategy for dissecting the contribution of common alleles to complex diseases (Risch and Merikangas 1996). Because whole-genome resequencing is not yet practical, widely used study designs rely extensively (either explicitly or implicitly) on linkage disequilibrium (LD): researchers count on associations being detected through correlations between causal variants and neighboring genotyped markers, since they are not always fortunate enough to have discovered and genotyped the causal variant in the patient samples. Thus, the extent and structure of LD acutely affect the economics, performance, design, and analysis of genetic association studies (Kruglyak 1999; Reich et al. 2001; de Bakker et al.
2005; Hirschhorn and Daly 2005; Wang et al. 2005). The design of such studies requires a thorough and robust understanding of LD patterns in the human genome. Many recent studies have independently attempted to quantify these patterns by large-scale examination of SNP alleles in representative human populations (Johnson et al. 2001; Patil et al. 2001; Reich et al. 2001; Gabriel et al.
2002; Carlson et al. 2004; Ke et al. 2004; Hinds et al. 2005), culminating in two genomewide surveys: one by Perlegen, with 1.6 million SNPs in 71 samples (Hinds et al. 2005), and one by the International HapMap Project, with 1 million SNPs in phase I (Altshuler et al. 2005), rising to >3 million SNPs in phase II (International HapMap Project Web site), in 269 samples.
Since it has not been practical to exhaustively examine each base pair in very large and diverse populations, efforts to characterize variation have, by necessity, required a variety of trade-offs: sequencing a small set of chromosomes over long regions (Patil et al. 2001), sequencing a larger set of samples for a selected set of candidate genes (Crawford et al. 2004), or typing subsets of SNPs from public maps in larger samples and longer genomic spans (Gabriel et al. 2002; Ke et al. 2004; Altshuler et al. 2005; Hinds et al. 2005; International HapMap Project). The use of incomplete data sets introduces biases in principle, including overrepresentation of high-frequency alleles, artifacts of small samples, and analysis of short genomic regions that underestimates long-range LD. Moreover, the particular regions examined may not be representative of genomewide LD patterns, potentially because of extreme natural selection at these regions. It is, therefore, unsurprising that the different choices made in the design of the above studies result in appreciably different estimates of allelic correlation in the genome (Ke et al. 2004) (see below). Moreover, although the influences of known biases on LD have individually been predicted theoretically and observed empirically (Hedrick 1987; Devlin and Risch 1995; Ardlie et al. 2002; Teare et al.
2002; Ke et al. 2004), a systematic comparison of the available large-scale empirical data after adjustment for biases (uncovering what differences, if any, are not explained by known aspects of study design) has not, to our knowledge, been published elsewhere. In this article, we examine a variety of large-scale, publicly available data sets through an identical analysis process, and we iteratively examine the impact of biases in study design on a variety of LD measures. The results reveal that most, but not all, differences among surveys can straightforwardly be reconciled and can offer insight into the limitations and interpretations of available genomewide data sets. Finally, they reassure us with regard to the objectivity of upcoming final HapMap data, only a glimpse of which was available at the time of this article's submission.

We analyze the following data sets (see table 1): (1) 166 genes resequenced across 47 individuals from two population panels, as part of the SeattleSNPs project (Carlson et al. 2004; SeattleSNPs Variation Discovery Resource Web site); (2) five 500-kb regions from the HapMap ENCODE project, resequenced in 16 individuals and genotyped for every known or discovered marker in the corresponding 90 HapMap individuals (Altshuler et al. 2005; HapMap ENCODE Web site); (3) one 10-Mb region of chromosome 20 typed by the Sanger Center for almost all public SNPs across at least 42 individuals per population panel (Ke et al. 2004; Wellcome Trust Sanger Institute, Human Chromosome 20 Web site); (4) 62 autosomal regions typed for >2,500 public SNPs, most of which are now double-hit, in a pilot study of haplotype structure (Gabriel et al. 2002; Structure of Haplotype Blocks in the Human Genome Web site); (5) genomewide SNP data with 1 million public, mostly double-hit (Reich et al. 2003), SNPs typed in 90 individuals per population panel (Altshuler et al. 2005); and (6) genomewide SNP data released by Perlegen with 1.6 million SNPs typed in 71 individuals from three population panels (Hinds et al.
2005; Perlegen Genotype Browser). All of our data sets are publicly available at their respective Web sites. Although we realize that some of these data sets are continually updated and that other data sets exist, we examined specific data freezes that usually follow one of the designs we already examined.

Table 1. Public Data Sets Have Different Designs

| Data set | Population: sample size (no. of chromosomes per population) | Density (SNPs/kb) | SNPs in study (ascertainment in resequencing/public databases) | Total length (Mb) | Region length (Mb) | Reference and Web site |
|---|---|---|---|---|---|---|
| SeattleSNPs | CEPH European (Utah): 48; African American: 46 | 3 | Resequencing | 4 | .025 (average) | Crawford et al. 2004; SeattleSNPs Variation Discovery Resource |
| ENCODE | CEPH European (Utah): 120 (in trios); Yoruban (Nigeria): 120 (in trios); Han (China) and Japanese: 178 | 2.5 | Resequencing and public | 5 | .5 | Altshuler et al. 2005; HapMap ENCODE |
| Chromosome 20 | CEPH European (Utah): 96; Beni (Nigeria): 96; Han (China): 96 | .5 | Public | 10 | 10 | Ke et al. 2004; Wellcome Trust Sanger Institute Chromosome 20 |
| Gabriel | CEPH European (Utah): 96; Yoruban: 96; Han (China): 96 | .12 | Public | 13 | .2 (average) | Gabriel et al. 2002; Structure of Haplotype Blocks in the Human Genome |
| HapMap | CEPH European (Utah): 120 (in trios); Yoruban: 120 (in trios); Han (China) and Japanese: 178 | .2 | Public | 2,800 (entire genome) | 100 (average; entire chromosome) | Altshuler et al. 2005; International HapMap Project |
| Perlegen | CEPH European (Utah): 48; African American: 46; Han (China): 48 | .6 | Resequencing and public | 2,800 (entire genome) | 100 (average; entire chromosome) | Hinds et al. 2005; Perlegen Genotype Browser |

All SNPs not polymorphic in each panel were deleted when we analyzed that panel. Initially, all polymorphic SNPs (without any frequency cutoff) were used for analysis. In subsequent analysis, SNPs were used per the specific random thinning of data sets. Pairwise statistics of LD (absolute D′ and r²) were computed by Haploview (Barrett et al. 2005).
For markers A/a and B/b with allele frequencies $P_A \geq P_a$ and $P_B \geq P_b$ and allele-combination frequencies $P_{AB}$, $P_{Ab}$, $P_{aB}$, and $P_{ab}$, such that $D = P_{AB} - P_A P_B \geq 0$, these pairwise LD statistics are defined as $D' = D / [\min(P_A, P_B) - P_A P_B]$ and $r^2 = D^2 / (P_A P_B P_a P_b)$. Allele-combination frequencies were computed by the expectation-maximization algorithm, constrained by family-based obligate-phasing data (Barrett et al. 2005). Data sets were thinned to equate minor-allele frequency spectra, region lengths, marker density, and sample size (see appendix A). The main tool for analysis, Haploview (Barrett et al. 2005), is open-source, freely available software. All source code used for the analysis in this article is available from the authors' Web site.

We analyzed six large-scale, publicly available genotype data sets (see table 1 for nomenclature). These data sets represent different trade-offs of data set design, with regard to parameters known to affect measurements of LD, including marker density, proportion of variation ascertained, number of chromosomes genotyped, physical lengths of individual regions, and total physical length spanned (see table 1). Notably, the analyzed data sets differently represent rare versus common alleles and have different allele-frequency spectra (see fig. 1). Since LD patterns vary among populations studied (Gabriel et al. 2002; Evans and Cardon 2005), we limit comparison of LD across data sets to individuals of similar continental origin (Rosenberg et al. 2002) (see fig. 1).

There are various ways to quantify LD: on the basis of pairs (Devlin and Risch 1995) or haplotypes consisting of larger sets of markers (Daly et al. 2001; Gabriel et al. 2002; Phillips et al. 2003) and by other methods (Hill and Weir 1994; Morton et al. 2001; Nothnagel and Ott 2002; Sabatti and Risch 2002). For simplicity, we compared three straightforward and widely used measures of LD applied to the available data sets.

1. Pairwise relative disequilibrium, known as Lewontin's D′ (Lewontin 1964), as a function of distance. This metric reflects the extent of recombination between the pair of alleles in the history of the sample.
2. The pairwise correlation coefficient between a pair of SNPs (r²), which is related to study power under a multiplicative model.
3. The fraction of all SNPs that are highly redundant (exceeding a pairwise r² threshold) with one or more others (Carlson et al. 2004), such that another could proxy for the SNP in a genotyping experiment, which we refer to as the "proxy rate" (Altshuler et al. 2005).
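The three measures listed above can be illustrated with a minimal Python sketch. It is not the Haploview implementation: the function names `pairwise_ld` and `proxy_rate` are hypothetical, the haplotype frequencies are assumed to be already estimated, and the default r² threshold of 0.8 is an assumption, since the text only speaks of "a pairwise r² threshold".

```python
from typing import Sequence, Tuple


def pairwise_ld(p_AB: float, p_A: float, p_B: float) -> Tuple[float, float]:
    """Return (|D'|, r^2) for two biallelic markers.

    p_AB is the frequency of the haplotype carrying alleles A and B;
    p_A and p_B are the frequencies of alleles A and B.
    """
    if not (0.0 < p_A < 1.0 and 0.0 < p_B < 1.0):
        raise ValueError("both markers must be polymorphic")
    d = p_AB - p_A * p_B
    # Largest |D| attainable given the allele frequencies and the sign of D.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max
    r2 = d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return d_prime, r2


def proxy_rate(r2_matrix: Sequence[Sequence[float]], threshold: float = 0.8) -> float:
    """Fraction of SNPs with at least one other SNP at r^2 >= threshold.

    The threshold default is an illustrative assumption.
    """
    n = len(r2_matrix)
    with_proxy = sum(
        1
        for i in range(n)
        if any(r2_matrix[i][j] >= threshold for j in range(n) if j != i)
    )
    return with_proxy / n
```

For D ≥ 0 the denominator min(P_A(1 − P_B), (1 − P_A)P_B) equals min(P_A, P_B) − P_A P_B, so the sketch agrees with the definition of D′ given in the text while also handling negative D.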
Figure 2 shows the distributions of these three statistics across the six data sets, revealing wide variability among estimates, even within samples drawn from the same continental origin. From these analyses alone, it is not possible to be sure whether the different surveys present a consistent and robust set of estimates of the true values.

We next attempted to correct for a set of the known biases detailed in table 1. Each difference in experimental design was evaluated by data reduction: two data sets were aligned with respect to each parameter by reduction of the one that was more complete.

It is well understood that SNPs of different minor-allele frequencies (MAFs), on average, have different LD properties (Pritchard and Przeworski 2001) due both to population genetic effects (common alleles are, on average, older than less common alleles) and to effects of sampling, in that rare SNPs tend to have higher pairwise D′ values and lower pairwise r² values than do common SNPs (fig. B1). Similarly, sample size affects estimation of D′, with smaller samples failing to sample rare fourth gametes and, therefore, inflating estimated D′ (Jorde 2000) (fig. A3). Figure 1 illustrates how the depth of resequencing determines which MAF strata are represented in each study. The SeattleSNPs and ENCODE data sets, each based on extensive sequencing, are most enriched in rare alleles and have the highest average D′ and the lowest average r² values; data sets based on dbSNP underrepresent low-frequency alleles and show the inverse pattern.
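The sample-size effect on D′ described above can be checked with a small simulation. This is a sketch under stated assumptions, not the analysis used in the article: the function names are hypothetical, haplotypes are drawn independently from fixed true frequencies, and the example frequencies (two markers in linkage equilibrium, each with minor-allele frequency 0.1) are illustrative values chosen so that the rare fourth gamete is often missed in small samples.

```python
import random
from collections import Counter
from typing import Dict


def abs_d_prime(p_AB: float, p_A: float, p_B: float) -> float:
    """Absolute D' from the AB haplotype frequency and the A and B allele frequencies."""
    d = p_AB - p_A * p_B
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    return min(1.0, abs(d) / d_max)


def mean_sampled_d_prime(hap_freqs: Dict[str, float], n_chromosomes: int,
                         replicates: int = 500, seed: int = 0) -> float:
    """Average |D'| over random samples of n_chromosomes haplotypes.

    hap_freqs maps 'AB', 'Ab', 'aB', 'ab' to true haplotype frequencies.
    Replicates in which either marker is monomorphic are skipped, mirroring
    the removal of non-polymorphic SNPs described in the text.
    """
    rng = random.Random(seed)
    haps = list(hap_freqs)
    weights = [hap_freqs[h] for h in haps]
    total, used = 0.0, 0
    for _ in range(replicates):
        counts = Counter(rng.choices(haps, weights=weights, k=n_chromosomes))
        p_AB = counts["AB"] / n_chromosomes
        p_A = (counts["AB"] + counts["Ab"]) / n_chromosomes
        p_B = (counts["AB"] + counts["aB"]) / n_chromosomes
        if not (0 < p_A < 1 and 0 < p_B < 1):
            continue  # monomorphic in this sample
        total += abs_d_prime(p_AB, p_A, p_B)
        used += 1
    if used == 0:
        raise ValueError("no polymorphic replicate was drawn")
    return total / used


# Illustrative truth: linkage equilibrium (true D' = 0), minor-allele frequency 0.1.
truth = {"AB": 0.81, "Ab": 0.09, "aB": 0.09, "ab": 0.01}
small = mean_sampled_d_prime(truth, n_chromosomes=46)
large = mean_sampled_d_prime(truth, n_chromosomes=2000)
print(f"mean |D'| with 46 chromosomes: {small:.2f}; with 2000: {large:.2f}")
```

With 46 chromosomes the rarest haplotype is absent from most samples, which forces |D′| to 1 in those replicates, so the small-sample average is expected to sit well above the large-sample average even though the true value is 0.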
We examined whether the observed differences in pairwise LD are solely the result of these well-understood differences in frequency spectrum and in the number of chromosomes sampled. Specifically, by randomly selecting individuals and ascertaining subsets of markers, we reduced the data sets to achieve the same sample size and allele-frequency spectrum in each (see the "Methods" section and appendix A). Whereas the common ground for sample size was naturally chosen to be the smallest data set sample size (23 unrelated subjects), choosing a standard MAF spectrum is somewhat arbitrary, since there is no single "correct" spectrum. Since LD does, in fact, differ around alleles of different frequencies, the question is best examined as a function of allele frequency. If data were more abundant, it would be ideal to establish the true distribution of pairwise LD for each pair of allele frequencies. Another option is to observe LD only within specific slices of the frequency spectrum (see fig. B1). However, for a baseline that could be evaluated in all studies and still use most of the data, we examined a uniform frequency distribution, which is also the contribution of each frequency bin to the heterozygosity in theoretical predictions under neutral model assumptions (Kimura and Crow 1964) and in empirical data (Cargill et al. 1999).
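One way the data reduction described above could be carried out is sketched below. The actual procedure is given in appendix A, which is not reproduced here, so everything in this sketch is an assumption: the function names, the number of MAF bins, and the rule of keeping as many SNPs per bin as the sparsest bin contains.

```python
import random
from typing import Dict, List, Sequence


def subsample_individuals(ids: Sequence[str], n: int = 23, seed: int = 0) -> List[str]:
    """Randomly keep n individuals (23 is the smallest sample size named in the text)."""
    return random.Random(seed).sample(list(ids), n)


def thin_to_uniform_maf(mafs: Dict[str, float], n_bins: int = 10,
                        seed: int = 0) -> List[str]:
    """Return SNP ids thinned so every equal-width MAF bin over (0, 0.5] is equally filled.

    Each bin is randomly reduced to the size of the sparsest bin, so an empty
    bin yields an empty result.
    """
    rng = random.Random(seed)
    width = 0.5 / n_bins
    bins: List[List[str]] = [[] for _ in range(n_bins)]
    for snp, maf in mafs.items():
        if not 0.0 < maf <= 0.5:
            continue  # monomorphic or invalid frequency
        index = min(int(maf / width), n_bins - 1)  # MAF of exactly 0.5 goes in the last bin
        bins[index].append(snp)
    target = min(len(b) for b in bins)
    kept: List[str] = []
    for b in bins:
        kept.extend(rng.sample(b, target))
    return kept
```

Equalizing bin counts discards data from well-populated bins, which is the price of comparing data sets on a common frequency spectrum; a finer binning gives a flatter spectrum but keeps fewer SNPs.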
Figure 3 demonstrates that these adjustments largely reconcile the different estimates of D′ and r², thus giving reassurance that the largest component of the observed differences in these measures (fig. 2) is simply the different proportions of high-frequency and low-frequency SNPs typed across a range of sample sizes.

In contrast to average values of D′ and r², redundancy in the different data sets (measured using the proxy rate) remained quite variable, despite reconciliation of sample size and allele-frequency distribution. The most obvious explanation is simply marker density (Ke et al. 2004), since increasing marker density increases the
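References

Altshuler D, Brooks LD, Chakravarti A, Collins FS, Daly MJ, Donnelly P (2005) A haplotype map of the human genome. Nature 437:1299-1320
Ardlie KG, Kruglyak L, Seielstad M (2002) Patterns of linkage disequilibrium in the human genome. Nat Rev Genet 3:299-309
Barrett JC, Fry B, Maller J, Daly MJ (2005) Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 21:263-265
Cargill M, Altshuler D, Ireland J, Sklar P, Ardlie K, Patil N, Shaw N, Lane CR, Lim EP, Kalyanaraman N, Nemesh J, Ziaugra L, Friedland L, Rolfe A, Warrington J, Lipshutz R, Daley GQ, Lander ES (1999) Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat Genet 22:231-238
Carlson CS, Eberle MA, Kruglyak L, Nickerson DA (2004) Mapping complex disease loci in whole-genome association studies. Nature 429:446-452
Crawford DC, Carlson CS, Rieder MJ, Carrington DP, Yi Q, Smith JD, Eberle MA, Kruglyak L, Nickerson DA (2004) Haplotype diversity across 100 candidate genes for inflammation, lipid metabolism, and blood pressure regulation in two populations. Am J Hum Genet 74:610-622
Daly MJ, Rioux JD, Schaffner SF, Hudson TJ, Lander ES (2001) High-resolution haplotype structure in the human genome. Nat Genet 29:229-232
de Bakker P, Yelensky R, Pe'er I, Gabriel SB, Daly M, Altshuler D (2005) Tagging efficiency and study-wide power in genetic association studies. Nat Genet 37:1217-1223
Devlin B, Risch N (1995) A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics 29:311-322
Evans DM, Cardon LR (2005) A comparison of linkage disequilibrium patterns and estimated population recombination rates across multiple populations. Am J Hum Genet 76:681-687
Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, Liu-Cordero SN, Rotimi C, Adeyemo A, Cooper R, Ward R, Lander ES, Daly MJ, Altshuler D (2002) The structure of haplotype blocks in the human genome. Science 296:2225-2229
Hedrick PW (1987) Gametic disequilibrium measures: proceed with caution. Genetics 117:331-341
Hill WG, Weir BS (1994) Maximum-likelihood estimation of gene location by linkage disequilibrium. Am J Hum Genet 54:705-714
Hinds DA, Stuve LL, Nilsen GB, Halperin E, Eskin E, Ballinger DG, Frazer KA, Cox DR (2005) Whole-genome patterns of common DNA variation in three human populations. Science 307:1072-1079
Hirschhorn JN, Daly MJ (2005) Genome-wide association studies for common diseases and complex traits. Nat Rev Genet 6:95-108
Johnson GC, Esposito L, Barratt BJ, Smith AN, Heward J, Di Genova G, Ueda H, Cordell HJ, Eaves IA, Dudbridge F, Twells RC, Payne F, Hughes W, Nutland S, Stevens H, Carr P, Tuomilehto-Wolf E, Tuomilehto J, Gough SC, Clayton DG, Todd JA (2001) Haplotype tagging for the identification of common disease genes. Nat Genet 29:233-237
Jorde LB (2000) Linkage disequilibrium and the search for complex disease genes. Genome Res 10:1435-1444
Ke X, Hunt S, Tapper W, Lawrence R, Stavrides G, Ghori J, Whittaker P, Collins A, Morris AP, Bentley D, Cardon LR, Deloukas P (2004) The impact of SNP density on fine-scale patterns of linkage disequilibrium. Hum Mol Genet 13:577-588
Kimura M, Crow JF (1964) The number of alleles that can be maintained in a finite population. Genetics 49:725-738
Kruglyak L (1999) Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat Genet 22:139-144
Lewontin RC (1964) The interaction of selection and linkage. II. Optimum models. Genetics 50:757-782
Morton NE, Zhang W, Taillon-Miller P, Ennis S, Kwok PY, Collins A (2001) The optimal measure of allelic association. Proc Natl Acad Sci USA 98:5217-5221
Nothnagel M, Ott J (2002) Statistical gene mapping of traits in humans - hypertension as a complex trait: is it amenable to genetic analysis? Semin Nephrol 22:105-114
Patil N, Berno AJ, Hinds DA, Barrett WA, Doshi JM, Hacker CR, Kautzer CR, Lee DH, Marjoribanks C, McDonough DP, Nguyen BT, Norris MC, Sheehan JB, Shen N, Stern D, Stokowski RP, Thomas DJ, Trulson MO, Vyas KR, Frazer KA, Fodor SP, Cox DR (2001) Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294:1719-1723
Phillips MS, Lawrence R, Sachidanandam R, Morris AP, Balding DJ, Donaldson MA, Studebaker JF, et al (2003) Chromosome-wide distribution of haplotype blocks and the role of recombination hot spots. Nat Genet 33:382-387
Pritchard JK, Przeworski M (2001) Linkage disequilibrium in humans: models and data. Am J Hum Genet 69:1-14
Reich DE, Cargill M, Bolk S, Ireland J, Sabeti PC, Richter DJ, Lavery T, Kouyoumjian R, Farhadian SF, Ward R, Lander ES (2001) Linkage disequilibrium in the human genome. Nature 411:199-204
Reich DE, Gabriel SB, Altshuler D (2003) Quality and completeness of SNP databases. Nat Genet 33:457-458
Risch N, Merikangas K (1996) The future of genetic studies of complex human diseases. Science 273:1516-1517
Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA, Feldman MW (2002) Genetic structure of human populations. Science 298:2381-2385
Sabatti C, Risch N (2002) Homozygosity and linkage disequilibrium. Genetics 160:1707-1719
Teare MD, Dunning AM, Durocher F, Rennart G, Easton DF (2002) Sampling distribution of summary linkage disequilibrium measures. Ann Hum Genet 66:223-233
Wang WY, Barratt BJ, Clayton DG, Todd JA (2005) Genome-wide association studies: theoretical and practical concerns. Nat Rev Genet 6:109-118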