Article Open access Peer reviewed

Proportioning Whole-Genome Single-Nucleotide–Polymorphism Diversity for the Identification of Geographic Population Structure and Genetic Ancestry

2006; Elsevier BV; Volume: 78; Issue: 4 Language: English

10.1086/501531

ISSN

1537-6605

Authors

Óscar Lao, Kate van Duijn, Paula Kersbergen, Peter de Knijff, Manfred Kayser

Topic(s)

Genetic Associations and Epidemiology

Abstract

The identification of geographic population structure and genetic ancestry on the basis of a minimal set of genetic markers is desirable for a wide range of applications in medical and forensic sciences. However, the absence of sharp discontinuities in the neutral genetic diversity among human populations implies that, in practice, a large number of neutral markers will be required to identify the genetic ancestry of one individual. We showed that it is possible to reduce the number of markers required for detecting continental population structure to only 10 single-nucleotide polymorphisms (SNPs), by applying a newly developed ascertainment algorithm to Affymetrix GeneChip Mapping 10K SNP array data that we obtained from samples of globally dispersed human individuals (the Y Chromosome Consortium panel). Furthermore, this set of SNPs was able to recover the genetic ancestry of individuals from all four continents represented in the original data set when applied to an independent, much larger, worldwide population data set (Centre d'Etude du Polymorphisme Humain–Human Genome Diversity Project Cell Line Panel). Finally, we provide evidence that the unusual patterns of genetic variation we observed at the respective genomic regions surrounding the five most informative SNPs are in agreement with local positive selection being the explanation for the striking SNP allele-frequency differences we found between continental groups of human populations.
The identification of geographic population (sub)structure is an important prerequisite for finding genes of complex traits through association mapping (Ziv and Burchard 2003; Freedman et al. 2004; Marchini et al. 2004; McKeigue 2005), and the identification of genetic ancestry and recent genetic admixture is crucial for admixture mapping (Chakraborty and Weiss 1988; Parra et al. 1998; Montana and Pritchard 2004). Furthermore, DNA testing in forensics can potentially use genetic ancestry identification to predict the geographic origin of a person, which might help police direct investigations (Shriver and Kittles 2004; Ray et al. 2005).
However, finding genetic markers that clearly differentiate populations has turned out to be difficult in practice, because the neutral genetic diversity in human populations tends to be distributed across continents worldwide without sharp discontinuities (Cavalli-Sforza et al. 1994; Ramachandran et al. 2005) and with only small discontinuities due to geographic barriers (Rosenberg et al. 2005). Although small, the observed genetic differentiation between continents is statistically significant (Romualdi et al. 2002), suggesting that, if the number of loci analyzed is large enough, it may be possible to correctly infer the geographic origin of an individual (Edwards 2003).
However, this weak genetic differentiation also implies that powerful clustering algorithms are required to detect existing genetic population (sub)structure by use of a large number of markers, and the final result depends not only on the number of markers used but also on the evolutionary assumptions of the procedure applied (Corander et al. 2004). Although it has been shown that nonrecombining markers from paternally inherited Y chromosomes and maternally transmitted mtDNA are highly suitable for detecting geographic population (sub)structure and genetic ancestry (Jobling and Tyler-Smith 2003; Jobling et al. 2004), their analysis can yield conflicting results for populations that have experienced sex-mediated genetic admixture throughout their history (Carvalho-Silva et al. 2001). Thus, it seems logical to use autosomal genetic markers in addition to sex-specific markers to correctly identify geographic population structure and genetic ancestry.
Elsewhere, it has been shown that individuals from different geographic origins can be classified according to their continental region of sampling by use of the genetic information of several hundred autosomal microsatellites (Rosenberg et al. 2002, 2003) as well as autosomal Alu-insertion polymorphisms and microsatellites (Bamshad et al. 2003). Although the number of microsatellites can be reduced to ∼40 when the statistical parameter I_n ("informativeness of assignment" index) is applied to marker ascertainment (Rosenberg et al. 2003), their relatively high mutation rates (Kayser et al. 2000; Holtkemper et al. 2001; Xu and Fu 2004) keep the number of markers relatively high. In principle, the number of genetic markers could be reduced by using SNPs, which mutate ∼100,000 times more slowly than do microsatellites (Thomson et al. 2000). Recent studies suggest that a considerable number of autosomal SNPs show a geographically restricted allele-frequency distribution (Hinds et al. 2005). However, only a small number of populations from a small number of geographic regions have been analyzed so far (HapMap 2003; Hinds et al. 2005). In this study, we used global whole-genome SNP variation and a newly developed ascertainment algorithm to identify a minimal set of markers with maximal ability to detect geographic population structure and genetic ancestry.
In principle, genetic markers with the largest genetic distances between populations (determined, for example, by applying the genetic distance F_ST [Weir et al. 2005]) are the best candidates for population differentiation (Shriver et al. 2004). However, the redundancy of ancestry information between markers needs to be considered when aiming to minimize the number of genetic markers. Therefore, we developed a new method that is based on the informativeness of assignment index I_n (Rosenberg et al. 2003) to find a set of markers that tends to maximize the genetic differentiation between populations while minimizing the number of markers needed. This statistic computes the amount of (nonredundant) assignment information that a particular locus or set of loci contains, to differentiate a particular set of groups defined a priori. Since I_n computes the nonredundant amount of ancestry information, we thereby avoid the usually observed ascertainment bias toward markers that only differentiate between African and non-African groups, caused by the fact that genetic differences are usually largest between African and non-African populations (Hinds et al. 2005). This index ranges from 0 (when the frequencies of all alleles of one locus are equally distributed between populations) to the natural logarithm of the number of considered populations (when the different alleles are able to unequivocally differentiate the populations). The informativeness of assignment index under the assumption of a population model without admixture (I_n) was computed because it has been shown that the I_n statistic produces similar estimates and has properties similar to those of the informativeness of assignment index under the assumption of a population model with admixture (I_a) (Rosenberg et al. 2003). The I_n statistic is preferred over the I_a statistic for defining informative markers, because I_a can produce denominators of 0 when two or more populations have the same allele frequencies, whereas I_n cannot (Rosenberg et al. 2003). We have overcome the problem of extremely large computational efforts needed to consider all possible allele combinations for a large number of loci by applying a genetic algorithm (Haupt and Haupt 2004).
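As an illustration of the range just described, the following minimal sketch computes the informativeness of assignment index (written I_n here) for a single locus from per-population allele frequencies. It assumes the single-locus form attributed to Rosenberg et al. (2003), I_n = Σ_j [ −p_j ln p_j + Σ_i (p_ij / K) ln p_ij ], where p_ij is the frequency of allele j in population i and p_j is its mean over the K populations. The function names and example frequencies are hypothetical and are not taken from the study; the multilocus computation and the genetic algorithm used by the authors are not reproduced.

```python
import math


def informativeness_in(freqs):
    """Single-locus I_n (assumed form, see lead-in).

    freqs: list of K lists, one per population; each inner list holds the
    relative frequencies of the N alleles in that population (sums to 1).
    """
    k = len(freqs)
    n_alleles = len(freqs[0])
    total = 0.0
    for j in range(n_alleles):
        # Mean frequency of allele j across the K populations.
        p_j = sum(pop[j] for pop in freqs) / k
        if p_j > 0:
            total -= p_j * math.log(p_j)
        for pop in freqs:
            if pop[j] > 0:  # convention: 0 * ln(0) is treated as 0
                total += (pop[j] / k) * math.log(pop[j])
    return total


def max_informativeness(k):
    """Upper bound ln(K) quoted in the text for K predefined groups."""
    return math.log(k)


# Hypothetical example frequencies (not data from the study).
uninformative = [[0.3, 0.7]] * 4                   # same frequencies everywhere
diagnostic = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]     # one private allele per group

low = informativeness_in(uninformative)
high = informativeness_in(diagnostic)
bound = max_informativeness(3)
```

In this sketch, `low` is zero up to rounding and `high` equals `bound`, matching the two ends of the range. Note that a single biallelic SNP cannot exceed ln 2 under this formula, whatever the number of groups, so reaching ln K with SNPs requires combining several loci.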
We analyzed >11,500 SNPs, using the Affymetrix 10K Array Xba 131, in 76 human individuals from 21 sampling localities representing six worldwide geographic areas: Africa, South Africa, America, Asia, North Asia, and Europe (Y Chromosome Consortium [YCC] panel). In short, 250 ng of DNA from each individual was digested, ligated, and amplified. PCR products were fragmented and biotin-labeled after pooling and purification. The biotin-labeled DNA fragments were hybridized to the probes on the Affymetrix GeneChip Mapping 10K array. Finally, the arrays were washed, stained, scanned, and analyzed. All procedures were done in accordance with the recommendations of Affymetrix (Sellick et al. 2003; Shriver et al. 2005). SNPs typed in <90% of the individuals or located on the X chromosome were removed from the final data set, which resulted in usable genotypes for 8,491 SNPs per individual.
One might expect that the relatively small number of individuals we used per locality (on average, 3.5 individuals) would tend to decrease the observed genetic variance within populations and spuriously increase the amount of genetic variance explained between populations and continents. To test whether the sample size used here could spuriously increase the genetic differences between continents, we applied an analysis-of-molecular-variance approach (Excoffier et al. 1992), using the Arlequin 2.000 software (Schneider et al. 2000) and the 8,491 SNPs and grouping the populations into five continental regions: Europe, Africa, America, Asia, and Oceania. The amount of genetic variance explained within populations was 85.5% (P<.0005), which is within the range of the usually observed values of genetic variance within populations when neutral markers are used (Romualdi et al. 2002). This result seems to contradict expectations, but it can be explained by the fact that all SNPs used on the Affymetrix arrays were originally selected from The SNP Consortium repository (Matsuzaki et al. 2004), showing a similar degree of genetic variation based on a small set of population samples from different continents (Hao et al. 2004). In fact, this array has been successfully used for linkage mapping in different human populations (Kelsell et al. 2005). Thus, these SNPs do not represent the true underlying genetic variation of human populations, and it can be expected that the ascertainment bias would tend to increase the genetic variance within populations, compensating for the expected reduction of the genetic variance within populations when small sample sizes are used.
Each individual of the YCC panel could be genetically assigned, by use of all 8,491 SNPs and the STRUCTURE analysis, to one of the four groups considered. These four genetic groups correlate with four geographical regions: western Eurasia, East Asia, Africa, and America. Each group was then considered artificially as a population, and the I_n statistic was computed per marker. Loci with I_n […] 10% missing genotypes (individuals 737 and 1007), on the basis of our SNP genotyping. As a result, we used 1,046 individuals from 51 populations. By use of the obtained HGDP data, I_n was computed for each SNP, with seven groups considered, and was compared with the I_n computed on the basis of YCC data, with four groups considered.
A positive Spearman's correlation (r=0.564; P=.09) was observed between the measures of informativeness of ancestry in both data sets; however, the slope of the linear regression reached only 0.316, with a 95% CI (−0.026 to 0.658) that did not include 1. This indicates a substantial loss of information from the markers to differentiate groups of populations in the new HGDP population data set compared with the original YCC data set. As expected, when the same comparative analysis was performed but with HGDP samples assigned to the four groups used for the YCC panel, a higher and statistically significant correlation (r=0.66; P=.04) was obtained, with a higher linear-regression slope of 0.655 and a 95% CI (0.046–1.264) that did include 1. However, since these analyses are based on the use of single markers, it is not justifiable to exclude the possibility of differentiating the seven groups when all 10 SNPs are considered at the same time. The I_n value considering seven groups and using all 10 markers simultaneously was 1.01 (95% CI 0.989–1.063), representing only 52% of the total amount of information that could be obtained with seven groups. Given that different clustering algorithms can produce different results (Corander et al. 2004), we applied three different ways of detecting clusters of populations in the CEPH panel. In the first approach, the I_n statistic was computed for each pair of populations on the basis of the genotypes of the 10 ascertained SNPs, and the matrix was plotted by means of multidimensional scaling (MDS) (Kruskal and Wish 1978) with the STATISTICA 6.0 software (StatSoft 2001). Although the I_n statistic is an index of the informativeness of markers for ancestral inference, it correlates with classical measures of genetic distances (such as F_ST) when computed between pairs of populations;

Reference(s)