A Simple and Improved Correction for Population Stratification in Case-Control Studies

Article; Open access; Peer reviewed

2007; Elsevier BV; Volume: 80; Issue: 5; Language: English

10.1086/516842

ISSN

1537-6605

Authors

Michael P. Epstein, Andrew S. Allen, Glen A. Satten

Topic(s)

Genetic Associations and Epidemiology

Abstract

Population stratification remains an important issue in case-control studies of disease-marker association, even within populations considered to be genetically homogeneous. Campbell et al. (Nature Genetics 2005;37:868–872) illustrated this by showing that stratification induced a spurious association between the lactase gene (LCT) and tall/short status in a European American sample. Furthermore, existing approaches for controlling stratification by use of substructure-informative loci (e.g., genomic control, structured association, and principal components) could not resolve this confounding. To address this problem, we propose a simple two-step procedure. In the first step, we model the odds of disease, given data on substructure-informative loci (excluding the test locus). For each participant, we use this model to calculate a stratification score, which is that participant’s estimated odds of disease calculated using his or her substructure-informative–loci data in the disease-odds model. In the second step, we assign subjects to strata defined by stratification score and then test for association between the disease and the test locus within these strata. The resulting association test is valid even in the presence of population stratification. Our approach is computationally simple and less model dependent than are existing approaches for controlling stratification. To illustrate these properties, we apply our approach to the data from Campbell et al. and find no association between the LCT locus and tall/short status. Using simulated data, we show that our approach yields a more appropriate correction for stratification than does principal components or genomic control.

Case-control studies of disease-marker association are susceptible to the confounding effects of population stratification, which originate from the coupling of allele-frequency heterogeneity to disease-risk heterogeneity within a population.
To avoid stratification, studies often use data from individuals from a single race or ethnicity group (or, at the very least, they analyze data stratified on the basis of participants’ race or ethnicity) in the hope of achieving a genetically homogeneous population. Recent results (Campbell et al., Nat Genet 2005;37:868–872) disputed this perception by demonstrating the existence of stratification in a case-control sample of Americans of European origin who were selected for extreme values of height; in these data, both tall/short status and allele frequencies at a SNP located within the lactase gene (LCT [MIM 603202]) (involved in lactase persistence) varied considerably from northwestern to southeastern Europe. A naive association analysis between this LCT SNP and height resulted in a strongly significant finding (P=3.6×10⁻⁷). In efforts to determine whether this result was spurious, the association analyses were repeated by conditioning on grandparental ancestry, and a much weaker signal was observed (P=.0074) (Campbell et al. 2005). Furthermore, additional association analyses in a case-control study from Poland (P=.92) and a case-parent trio study from Scandinavia (P=.93) failed to confirm the initial significant association.
These results led to the conclusion that the initial association result between the LCT SNP and height within the European American sample was largely or completely due to population stratification (Campbell et al. 2005).

Although the demonstration of stratification in subjects of European American ancestry is of concern, conventional wisdom suggests that such stratification can be corrected by applying appropriate statistical methods that use panels of genetic markers that provide information on population structure. However, neither genomic control (Devlin and Roeder, Biometrics 1999;55:997–1004; Devlin et al., Theor Popul Biol 2001;60:155–166) nor structured association (Pritchard and Rosenberg, Am J Hum Genet 1999;65:220–228; Pritchard et al., Genetics 2000;155:945–959) could properly correct for the confounding effects of stratification with the use of a collection of 111 missense and noncoding SNPs and 67 ancestry-informative SNPs (Campbell et al. 2005). More recently, an approach based on principal components (Zhu et al., Genet Epidemiol 2002;23:181–196; Zhang et al., Genet Epidemiol 2003;24:44–56; Chen et al., Ann Hum Genet 2003;67:250–264; Price et al., Nat Genet 2006;38:904–909) also failed to resolve this stratification (Price et al. 2006). These results suggest that improved statistical methods for correcting population stratification in genetic association studies of complex disease are needed.

We describe here a novel statistical approach for controlling population stratification in case-control studies of disease. Our approach consists of two steps. In the first step, we model the odds of disease, given data on substructure-informative loci (excluding the test locus). For each participant, we use this model to calculate a stratification score, which is that participant’s estimated odds of disease calculated using his or her substructure-informative–loci data in the disease-odds model.
In the second step, we assign subjects to strata defined by stratification score and then test for association between the disease and the test locus within these strata. The resulting association test is valid even in the presence of population stratification. Our stratification-score approach circumvents many of the modeling assumptions and analytical limitations inherent in existing procedures, such as genomic control, structured association, and principal components. Using the height data described above, as well as simulated data, we show that subclassification based on the stratification score provides an appropriate and powerful correction for confounding due to population stratification in situations where other approaches fail.

Assume a retrospective study design that collects marker data from unrelated case and control subjects. For a given subject, let D denote a disease indicator (1=case; 0=control). Let G denote the genotype at a SNP of interest. Let Z denote a vector of genotype data for a set of substructure-informative loci. Finally, let $\theta_V = P[D=1 \mid V] / P[D=0 \mid V]$ denote the odds of disease for a given set of variables V. We assume that we can account for population stratification by an unmeasured (possibly vector-valued) variable U. We assume that U is not an effect modifier, so, if U were observed, we would have $\theta_{G,U} = \exp[\alpha + \beta(G) + \gamma(U)]$, where β(·) and γ(·) are known functions (up to parameters to be estimated). As a result, stratification on values of γ(U) yields the true association between D and G. Because U is unmeasured, we instead use the substructure-informative loci Z as a surrogate for this stratification variable (note that Z can also be generalized to include additional environmental covariates that provide information on U). We assume that Z provides enough information on substructure that G provides no additional information on U in the presence of Z within controls; that is, $P[U \mid G, Z, D=0] = P[U \mid Z, D=0]$.
In this situation, we write (Satten and Kupper, J Am Stat Assoc 1993;88:200–208) the odds of disease given G and Z as $\theta_{G,Z} = e^{\alpha+\beta(G)} \sum_U e^{\gamma(U)} P[U \mid Z, D=0] \equiv e^{\alpha+\beta(G)+\psi(Z)}$. As a result, stratification on the unknown function ψ(Z) yields the true association between D and G (Miettinen, Am J Epidemiol 1976;104:609–620). The null hypothesis of no association between G and D implies that β(G)=0, and hence $\psi(Z) = \ln\{\theta_Z\} - \alpha$. Thus, under the null hypothesis, stratification on values of $\ln\{\theta_Z\}$ (or $\theta_Z$) is equivalent to stratifying on ψ(Z). This result implies that, when the null hypothesis is true, stratification on $\theta_Z$ appropriately estimates the true (null) association between D and G. We conclude that a test of β(G)=0 in strata with constant values of the score $\ln\{\theta_Z\}$ is valid in the presence of population stratification. A more detailed demonstration of the above result can be found in appendix A.

These results motivate the application of our two-step procedure for controlling population stratification in case-control studies. In the first step, we compute $\theta_Z$ by applying a user-defined model that can range from the simple (e.g., logistic regression) to the complex (e.g., machine-learning algorithms). For all calculations in this article, we compute $\theta_Z$ by first using generalized partial least squares (PLS) (Marx, Technometrics 1996;38:374–381) to identify new variables that are linear combinations of marker genotypes and then using these new variables in a logistic-regression model for disease. Like principal components, PLS finds orthogonal linear combinations of the marker genotypes that explain variability in the data.
However, unlike principal components, PLS attempts to simultaneously explain variability in both the marker data and the trait data; hence, the linear combinations found by PLS are always correlated with the trait. Generalized PLS extends the PLS model, which was originally formulated for quantitative data, to categorical outcomes. We chose the number of PLS variables by selecting the model that minimized the Bayesian information criterion (BIC) (Schwarz, Ann Stat 1978;6:461–464).

In the second step of our two-step approach, we use the quintiles of the stratification scores based on $\theta_Z$ to assign each subject to one of five strata (of approximately equal size), and then we test for association between G and D in the stratified data (e.g., using stratified logistic regression). Use of five strata is motivated by studies that show that this choice accounts for at least 90% of bias when a continuous variable is categorized, for a variety of distributions (Cochran, Biometrics 1968;24:295–313; Rosenbaum and Rubin, J Am Stat Assoc 1984;79:516–524).

Using data from Campbell et al. (2005), we compared our stratification-score approach to genomic control, structured association, principal components, and a naive approach that ignores stratification.
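The second step described above (rank the scores, cut the ranking into five strata, test within strata) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `assign_strata` and `stratified_trend_test` are hypothetical, the stratification scores are assumed to have been estimated already, and a Mantel-extension trend test (a score-type test pooled over strata) stands in for the stratified logistic regression used in the article. The counts are the stratum-specific genotype counts reported in Table 1 of this article.

```python
import math

def assign_strata(scores, n_strata=5):
    """Rank subjects by score and cut the ranking into n_strata groups of near-equal size."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    strata = [0] * len(scores)
    for rank, i in enumerate(order):
        strata[i] = rank * n_strata // len(scores)
    return strata

def stratified_trend_test(tables):
    """Mantel-extension trend test pooled over strata (assumed stand-in test).

    tables: list of (case_counts, control_counts); each is a 3-tuple of
    genotype counts ordered by copies of the scored allele (0, 1, 2).
    Returns (chi2, p) for a chi-square statistic with 1 df.
    """
    num, var = 0.0, 0.0
    for cases, controls in tables:
        r, s = sum(cases), sum(controls)
        n = r + s
        if r == 0 or s == 0:
            continue  # a stratum without both groups carries no information
        totals = [a + b for a, b in zip(cases, controls)]
        sx = sum(x * t for x, t in enumerate(totals))
        sxx = sum(x * x * t for x, t in enumerate(totals))
        observed = sum(x * a for x, a in enumerate(cases))
        num += observed - r * sx / n                      # observed minus expected score
        var += r * s * (n * sxx - sx * sx) / (n * n * (n - 1))
    chi2 = num * num / var
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))         # upper tail of chi-square, 1 df

# (CC, CT, TT) counts for tall and short subjects in strata 1 to 5, from Table 1
table1 = [
    ((0, 2, 2), (14, 31, 23)),
    ((3, 5, 3), (17, 25, 21)),
    ((5, 23, 13), (8, 12, 13)),
    ((5, 30, 30), (3, 2, 3)),
    ((4, 35, 32), (0, 1, 2)),
]
chi2, p = stratified_trend_test(table1)
print(f"stratified trend chi2 = {chi2:.2f}, P = {p:.2f}")
```

On the Table 1 counts this stand-in test gives a P value of about .44, in line with the stratified logistic-regression result reported in the article. `assign_strata` uses ranks rather than fixed cut points so that ties in the score do not produce badly unbalanced strata; that is a design choice of this sketch.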
We used data from 192 tall and 176 short participants who were genotyped at a SNP of interest (rs4988235) in the LCT gene, as well as at a panel of substructure-informative loci consisting of 111 missense or noncoding SNPs and 67 ancestry-informative markers (AIMs). We first conducted a naive Armitage trend test between the LCT SNP and height. Using the substructure-informative loci, we then attempted to resolve the stratification in the sample, using genomic control and principal components. For genomic control, we estimated the inflation factor $\hat{\lambda}$ by dividing the median of the Armitage trend tests for the substructure-informative loci by the median of the $\chi^2_1$ distribution (Devlin and Roeder 1999) and then by taking $\hat{\lambda} = \max(1, \hat{\lambda})$ (Setakis et al., Genome Res 2006;16:290–296). We used this estimate to scale down the naive Armitage trend test of the LCT SNP. For principal components, we used the eigenvectors of the variance-covariance matrix of the substructure-informative loci as covariates in a linear-regression model that examines the relationship between height and the LCT SNP. As recently recommended (Price et al. 2006), we included 10 covariates corresponding to the first 10 principal components of the variance-covariance matrix in the model. We used the likelihood-ratio statistic to test the coefficient of genotype at the test locus (coded as an additive model); significance was assessed by comparing the test statistic to the appropriate quantile of the $\chi^2$ distribution with 1 df.
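The naive Armitage trend test and the genomic-control scaling described above can be written out as a short sketch. The function names are hypothetical, the genotype counts are the pooled ("strata ignored") counts from Table 1 of this article, and the list of marker statistics is made up purely to show the `max(1, λ)` truncation; it is not the article's marker panel.

```python
import math
import statistics

CHI2_1_MEDIAN = 0.4549364  # median of the chi-square distribution with 1 df

def armitage_trend_chi2(cases, controls):
    """Armitage trend statistic for genotype counts ordered by allele copies (0, 1, 2)."""
    r, s = sum(cases), sum(controls)
    n = r + s
    totals = [a + b for a, b in zip(cases, controls)]
    sx = sum(x * t for x, t in enumerate(totals))
    sxx = sum(x * x * t for x, t in enumerate(totals))
    sxr = sum(x * a for x, a in enumerate(cases))
    return n * (n * sxr - r * sx) ** 2 / (r * s * (n * sxx - sx * sx))

def chi2_1_pvalue(chi2):
    """Upper-tail probability of a chi-square statistic with 1 df."""
    return math.erfc(math.sqrt(chi2 / 2.0))

def genomic_control(test_chi2, marker_chi2s):
    """Scale a test statistic by the inflation factor estimated from marker statistics."""
    lam = max(1.0, statistics.median(marker_chi2s) / CHI2_1_MEDIAN)
    return test_chi2 / lam, lam

tall = (17, 95, 80)    # CC, CT, TT among tall subjects (Table 1, strata ignored)
short = (42, 71, 62)   # CC, CT, TT among short subjects
naive = armitage_trend_chi2(tall, short)
print(f"naive trend chi2 = {naive:.2f}, P = {chi2_1_pvalue(naive):.4f}")

# Hypothetical marker trend statistics whose median lies below the chi-square median
marker_chi2s = [0.05, 0.21, 0.38, 0.77, 1.90]
adjusted, lam = genomic_control(naive, marker_chi2s)
print(f"lambda = {lam:.2f}, adjusted chi2 = {adjusted:.2f}")
```

With the pooled Table 1 counts the naive statistic matches the "strata ignored" row (8.43, P of about .0037). When the median marker statistic falls below the median of the $\chi^2_1$ distribution, the truncation sets the inflation factor to 1 and the statistic is left unchanged.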
Results for these data calculated by use of STRUCTURE have been reported elsewhere (Campbell et al. 2005). Finally, we calculated the stratification score for each participant, using generalized PLS variables in logistic regression, as described above. We then divided the data into five strata that have equal numbers of observations in each stratum, on the basis of the quintiles of the stratification scores. Using these strata, we tested for association between height and the LCT SNP, using stratified logistic regression.

We conducted additional simulations to compare our proposed approach for correcting stratification to genomic control and principal components. We simulated data sets with 500 cases and 500 controls that were sampled retrospectively from a population consisting of three equally frequent latent subpopulations. Within the population, we simulated a test SNP, assuming different values for the inbreeding coefficient FST (0.03 or 0.15, with the latter value corresponding to the estimated inbreeding coefficient in the height data [Campbell et al. 2005]) and the minor-allele frequency (MAF). For a test SNP with FST=0.03, we considered the models P=(0.159, 0.113, 0.037), P=(0.340, 0.290, 0.125), and P=(0.50, 0.40, 0.30), where P=(p1, p2, p3) and pj denotes the MAF of the locus in latent subpopulation j. These values correspond to pooled population MAFs of ∼0.10, 0.25, and 0.40, respectively.
For a test SNP with FST=0.15, we considered the models P=(0.28, 0.03, 0.03), P=(0.52, 0.18, 0.05), and P=(0.70, 0.40, 0.17), which again correspond to pooled population MAFs of ∼0.10, 0.25, and 0.40, respectively.

We assumed that control participants have the same allele-frequency distribution as the overall population (a rare-disease approximation). Case participants were sampled in different proportions from the three subpopulations. To induce severe stratification, we sampled cases in the proportions 0.45, 0.33, and 0.22 from subpopulations 1, 2, and 3, respectively. To induce more moderate stratification, we sampled cases in the proportions 0.40, 0.33, and 0.27. In addition, we also considered a situation of no confounding by sampling cases in the same proportions (0.33, 0.33, and 0.33) as the controls. We implemented this last sampling scheme to assess the performance of our stratification-score approach in situations where it is not actually required for valid analysis, since there is no difference in baseline disease risk (a requirement for confounding to occur) when cases and controls are sampled in the same proportion. Further, the substructure-informative loci are unrelated to disease risk, resulting in a stratification based entirely on noise.

All simulations assumed Hardy-Weinberg equilibrium (HWE) within each subpopulation and thus among controls in each subpopulation. We assumed a multiplicative model of allele effect for the tested locus, such that the case samples in each subpopulation were also in HWE with risk-allele frequency in subpopulation j given by $e^{\beta}p_j / (e^{\beta}p_j + 1 - p_j)$, where β is the log-odds of disease per copy of the risk allele. We considered simulations under both a null model (β=0) and an alternative model (β=ln(1.4)). We assumed that the value of β was constant across strata.

We generated panels of 100 substructure-informative markers under two different scenarios.
The first scenario assumed the marker data consisted of AIMs with large FST values in the population, whereas the second scenario assumed that the marker data consisted of random SNPs, all with FST=0.03. Under both scenarios, we generated appropriate SNP data, using a large list (Akey et al., Genome Res 2002;12:1805–1814) of candidate-gene SNPs with variable allele-frequency differences among three subpopulations consisting of East Asians, African Americans, and European Americans. For sampling AIMs, we chose the 100 most informative SNPs (i.e., those with the highest FST values) from this list that were polymorphic in each subpopulation. The FST values of these candidate-gene SNPs ranged from 0.55 to 0.84. For simulation of random SNPs, we chose 100 markers from the list with an FST value of 0.03.

Ignoring stratification, we found a significant association between the LCT SNP and height, using a naive Armitage trend test (P=.0038). This P value differs from that reported elsewhere (Campbell et al. 2005) (P=3.6×10⁻⁷), because the latter result is from the analysis of a much larger sample (1,057 short and 1,132 tall subjects, also including participants who were not genotyped at the AIMs) that further assumed HWE in both case and control participants (Sasieni, Biometrics 1997;53:1253–1261). We found that neither genomic control (Devlin and Roeder 1999; Devlin et al. 2001) nor principal components (Zhu et al. 2002; Zhang et al. 2003; Chen et al. 2003; Price et al. 2006) resolved the confounding in the sample. For genomic control, the scaled-down Armitage trend test remained significant (P=.0038), regardless of whether we used the 111 missense and noncoding SNPs alone, the 67 ancestry-informative SNPs alone, or all 178 loci together, because, in each case, the median trend test for marker SNPs was less than the median of the $\chi^2_1$ distribution. For principal components, we duplicated results published elsewhere (Price et al. 2006): the first 10 principal components of the variance-covariance matrix for the substructure-informative loci failed to resolve the confounding between height and the LCT SNP (P=.003).
Campbell et al. (2005) reported that the structured-association package STRUCTURE (Pritchard et al. 2000) found only one population in the height data by use of the entire panel of 178 substructure-informative loci. Hence, the association test based on structured association is the naive (unstratified) test, which is significant (P=.0038).

Unlike genomic control, structured association, and principal components, our stratification-score approach resolved the confounding in the height data from Campbell et al. (2005). We calculated the stratification score for each subject, using the first six PLS components (based on minimization of the BIC). We then ranked the stratification scores of all subjects and used the ranking to divide the subjects into five strata of approximately equal size. Using stratified logistic regression, we found no association between the LCT SNP and tall/short status (P=.44). Table 1 shows the genotype counts of tall or short subjects within each stratum formed using the stratification score, as well as the accompanying trend test result. Results show little association between genotype and disease within each stratum.

Table 1. LCT SNP Genotype Distribution among Strata (genotype columns give the no. of subjects with each LCT genotype)

| Stratum and Height Status | CC | CT | TT | Armitage $\chi^2_1$ | P |
|---|---|---|---|---|---|
| Stratum 1 | | | | .99 | .32 |
| Tall | 0 | 2 | 2 | | |
| Short | 14 | 31 | 23 | | |
| Stratum 2 | | | | .06 | .80 |
| Tall | 3 | 5 | 3 | | |
| Short | 17 | 25 | 21 | | |
| Stratum 3 | | | | .07 | .79 |
| Tall | 5 | 23 | 13 | | |
| Short | 8 | 12 | 13 | | |
| Stratum 4 | | | | 2.37 | .12 |
| Tall | 5 | 30 | 30 | | |
| Short | 3 | 2 | 3 | | |
| Stratum 5 | | | | .61 | .43 |
| Tall | 4 | 35 | 32 | | |
| Short | 0 | 1 | 2 | | |
| Strata ignored | | | | 8.43 | .0037 |
| Tall | 17 | 95 | 80 | | |
| Short | 42 | 71 | 62 | | |

To ensure that our null finding was not because of insufficient power resulting from the pattern of tall/short subjects within each stratum, we conducted additional simulations of stratified data with the same row marginal totals as in table 1. Short participants were assumed to be in HWE and to have T allele frequency p=39/70, the observed frequency of the T allele among short participants. Tall participants were assumed to be in HWE and have T allele frequency $e^{\beta}p / (e^{\beta}p + 1 - p)$; in this expression, β is the log relative risk of being tall per copy of the T allele. We found that this pattern provides 85% power to detect a twofold increase in risk per allele in a multiplicative model, which suggests that our null finding is not because of low power.

Table 2 provides type I error results for simulated data sets that assume a test locus with a moderate FST of 0.03 under substantial stratification (see the “Simulation Design” section).
We show empirical type I error rates for five statistics that test for association between the genotype at a SNP of interest and disease: a naive $\chi^2_1$ association test that ignores stratification, a $\chi^2_1$ association test stratified by the true yet unknown subpopulation status (the gold standard when stratification exists), a $\chi^2_1$ association test based on our proposed stratification-score approach, a $\chi^2_1$ association test based on principal components, and a $\chi^2_1$ association test based on genomic control.

Table 2. Type I Error Rates under Substantial Stratification

| Marker Type | Test Locus MAF | No Adjustment | Known Strata | Stratification Score | Principal Components | Genomic Control |
|---|---|---|---|---|---|---|
| AIM | .10 | .121 | .055 | .043 | .054 | .017 |
| AIM | .25 | .195 | .057 | .048 | .062 | .026 |
| AIM | .40 | .132 | .058 | .051 | .064 | .022 |
| Random | .10 | .126 | .049 | .049 | .049 | .023 |
| Random | .25 | .169 | .039 | .050 | .041 | .031 |
| Random | .40 | .139 | .048 | .049 | .054 | .028 |

Note. Empirical type I error results at nominal α=0.05 for 500 cases and 500 controls under the assumption of a test-locus FST of 0.03. The simulation design is described in the “Material and Methods” section. Stratification score, principal components, and genomic control tests use 100 substructure-informative loci to correct for population stratification.

Table 2 shows that, as anticipated, naive association tests that ignore stratification have inflated type I error (∼0.12–0.20 when the nominal significance is α=0.05, depending on the MAF of the test locus), whereas association tests stratified by known subpopulation have appropriate type I error.
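The contrast between the naive test and the test stratified on known subpopulation can be reproduced in miniature. The sketch below is an illustration under stated assumptions, not the article's simulation code: it uses one of the FST=0.03 models from the simulation design (pooled MAF of about 0.25) with the severe case-sampling proportions, a Mantel-extension trend test in place of the tests used in the article, a fixed seed, and only 300 replicates. All function names are hypothetical, and the stratification-score, principal-components, and genomic-control columns of Table 2 are not reproduced.

```python
import math
import random

def stratified_trend_chi2(tables):
    """Mantel-extension trend statistic (1 df) summed over (cases, controls) tables."""
    num = var = 0.0
    for cases, controls in tables:
        r, s = sum(cases), sum(controls)
        n = r + s
        if r == 0 or s == 0:
            continue  # a stratum without both groups carries no information
        totals = [a + b for a, b in zip(cases, controls)]
        sx = sum(x * t for x, t in enumerate(totals))
        sxx = sum(x * x * t for x, t in enumerate(totals))
        num += sum(x * a for x, a in enumerate(cases)) - r * sx / n
        var += r * s * (n * sxx - sx * sx) / (n * n * (n - 1))
    return num * num / var if var > 0 else 0.0

def case_allele_freq(p, beta):
    """Risk-allele frequency among cases under the multiplicative model."""
    return math.exp(beta) * p / (math.exp(beta) * p + 1.0 - p)

def simulate_rejection_rates(maf, case_props, beta=0.0, n_cases=500,
                             n_controls=500, reps=300, seed=1):
    """Return (naive, known-strata) rejection rates at nominal alpha = 0.05."""
    rng = random.Random(seed)
    crit = 3.841  # approximate 0.95 quantile of the chi-square distribution, 1 df
    pops = list(range(len(maf)))
    control_props = [1.0 / len(maf)] * len(maf)  # controls mirror the population
    naive_hits = known_hits = 0
    for _ in range(reps):
        # counts[j][d]: genotype counts (0, 1, 2 minor alleles) in subpopulation j
        # for controls (d = 0) and cases (d = 1)
        counts = [[[0, 0, 0], [0, 0, 0]] for _ in pops]
        for d, n, props in ((1, n_cases, case_props), (0, n_controls, control_props)):
            for j in rng.choices(pops, weights=props, k=n):
                p = case_allele_freq(maf[j], beta) if d == 1 else maf[j]
                g = (rng.random() < p) + (rng.random() < p)  # genotype under HWE
                counts[j][d][g] += 1
        by_stratum = [(counts[j][1], counts[j][0]) for j in pops]
        pooled = [([sum(counts[j][1][g] for j in pops) for g in range(3)],
                   [sum(counts[j][0][g] for j in pops) for g in range(3)])]
        naive_hits += stratified_trend_chi2(pooled) > crit
        known_hits += stratified_trend_chi2(by_stratum) > crit
    return naive_hits / reps, known_hits / reps

naive_rate, known_rate = simulate_rejection_rates(
    maf=(0.340, 0.290, 0.125), case_props=(0.45, 0.33, 0.22))
print(f"naive: {naive_rate:.3f}  known strata: {known_rate:.3f}")
```

Setting `beta=math.log(1.4)` switches the same sketch from the null model to the alternative model described in the simulation design, so that the returned rates estimate power instead of type I error.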
We found that both our proposed stratification-score procedure and principal components yielded appropriate type I error regardless of the control MAF and the nature of the substructure-informative loci used (AIMs with large FST values or random markers with the same FST=0.03 as the locus of interest). On the other hand, we observed that genomic control can overcorrect for stratification, particularly when AIMs are used. This result is anticipated, because genomic control implicitly assumes that the FST value (or λ) of the substructure-informative loci is the same as the FST value (or λ) of the tested locus. The use of AIMs would lead to an estimate of λ that
