Regression-Based Association Analysis with Clustered Haplotypes through Use of Genotypes

Article; Open access; Peer-reviewed

2006; Elsevier BV; Volume: 78; Issue: 2; Language: English

10.1086/500025

ISSN

1537-6605

Authors

Jung‐Ying Tzeng, Chih‐Hao Wang, Jau‐Tsuen Kao, Chuhsing Kate Hsiao

Topic(s)

Genetic Mapping and Diversity in Plants and Animals

Abstract

Haplotype-based association analysis has been recognized as a tool with high resolution and potentially great power for identifying modest etiological effects of genes. However, in practice, its efficacy has not been as successfully reproduced as expected in theory. One primary cause is that such analysis tends to require a large number of parameters to capture the abundant haplotype varieties, and many of those are expended on rare haplotypes for which studies would have insufficient power to detect association even if it existed. To concentrate statistical power on more-relevant inferences, in this study, we developed a regression-based approach using clustered haplotypes to assess haplotype-phenotype association. Specifically, we generalized the probabilistic clustering methods of Tzeng to the generalized linear model (GLM) framework established by Schaid et al. The proposed method uses unphased genotypes and incorporates both phase uncertainty and clustering uncertainty. Its GLM framework allows adjustment of covariates and can model qualitative and quantitative traits. It can also evaluate the overall haplotype association or the individual haplotype effects. We applied the proposed approach to study the association between hypertriglyceridemia and the apolipoprotein A5 gene. Through simulation studies, we assessed the performance of the proposed approach and demonstrated its validity and power in testing for haplotype-trait association.

In the search for genes underlying human complex diseases, one crucial step is to detect the association between the genetic variants and the disease phenotypes. Since a high density of SNPs is being identified and used in genetic studies, jointly analyzing all variants within a gene or chromosomal region for association can be more informative and effective (Stephens et al. 2001).
The haplotype, the ordered allele sequence on a chromosome, provides a natural framework for performing joint analysis of multiple markers and is predominantly considered the unit of analysis in association studies. Haplotype analyses are believed to provide high resolution and potentially great power for identifying modest etiological effects of genes (International HapMap Consortium 2003). Following this viewpoint, many statistical methods have been proposed to evaluate haplotype-disease association for case-control samples, including likelihood ratio tests for testing equality of haplotype frequencies between cases and controls (e.g., Sham 1998), tests and inferences for specific haplotype effects under a variety of regression models (e.g., Schaid et al. 2002; Zaykin et al. 2002; Epstein and Satten 2003; Lake et al. 2003; Stram et al. 2003; Zhao et al. 2003; Lin 2004; Zeng and Lin 2005), haplotype-similarity approaches that detect association via excessive haplotype sharing in cases (e.g., Van der Meulen and te Meerman 1997; McPeek and Strahs 1999; Bourgain et al. 2000, 2001, 2002; Tzeng et al. 2003a, 2003b; Yu et al. 2004), and clustering methods that group homogeneous haplotypes and perform analysis on the unit of haplotype groups (e.g., Seltman et al. 2001, 2003; Molitor et al. 2003a, 2003b; Durrant et al. 2004; Tzeng 2005).

Whereas the progress in both data availability and data analyses increases the feasibility of haplotype-based association studies, practical implementation indicates that the findings of such studies are not consistently reproducible (Lohmueller et al. 2003; Neale and Sham 2004).
Lohmueller et al. (2003) concluded that the inconsistency could be explained largely by a high rate of false-negative results or, equivalently, lack of power. Recently, Chapman and colleagues (Chapman et al. 2003; Clayton et al. 2004) further revealed that analyses based on locus models, which regress phenotypes on multiple SNP loci, can sometimes be more powerful than haplotype analyses, such as when tag SNPs are used. The main reason is that the locus model uses fewer parameters than does a haplotype model; by modeling only the main effects and low-order interactions of SNPs, the locus model does not spend degrees of freedom on rare haplotypes for which studies would have insufficient power to detect association even if it were present (Clayton et al. 2004). In contrast to a locus model, haplotype analysis requires a larger number of parameters to capture the abundant haplotype varieties, and the test power is limited by the many degrees of freedom that they use. The power is reduced further by the need to adjust for multiple testing when many genes are evaluated.
Further difficulties emerge from the fact that complex diseases are derived from intricate genetic and environmental factors (see, e.g., Peltonen and McKusick 2001). Understanding the genetic etiology of complex diseases requires a joint consideration of all potential attributes and sometimes even other auxiliary covariates. The vast quantities of covariates from environmental effects and gene-gene and gene-environment interactions further exacerbate the degrees-of-freedom problem. Model-based association methods, which incorporate covariate information in association analysis, play an increasingly important role in modern association studies. They facilitate the study of complex gene-disease association. Besides the ability to accommodate polygenic effects, environmental covariates, and interactions among them, model-based analyses can evaluate haplotype effects at either the global level (i.e., evaluating overall haplotype association) or the individual level (i.e., evaluating haplotype-specific association). They also allow modeling of diseases through a variety of clinical phenotypes, from dichotomous to ordinal to quantitative traits. These flexibilities and advantages again reflect the need for efficient usage of haplotype information in a model-based framework for studying association.

Haplotype grouping offers one promising avenue for controlling the degrees-of-freedom problem that is encountered in haplotype-based multiple-marker analysis. It enhances the efficiency of haplotype analysis by using a small number of degrees of freedom to study haplotypes and concentrates statistical power on more-relevant inference. In an earlier study (Tzeng 2005), we introduced an algorithm to cluster related haplotypes to improve the power of association tests. This algorithm adapts the same evolutionary concepts of cladistic analyses and groups rare haplotypes with their closest major haplotypes according to the evolutionary relationships summarized in a haplotype tree. Since many haplotype trees are often nearly equally likely given the observed data, one key feature of the proposed algorithm is the incorporation of the tree uncertainty in association testing. The algorithm is motivated by and relies on the common disease/common variants assumption (Collins et al. 1997), which conjectures that common modest-risk variants may contribute more to the development of common complex disease than do rare high-risk variants. The algorithm is also built on the recent discovery of the human genome structure that the majority of haplotype diversity is concentrated in a few major categories because of the correlations among proximate SNPs (e.g., Daly et al. 2001; Johnson et al. 2001).
Therefore, instead of spending degrees of freedom on rare haplotypes that would result in unstable statistical inference and insufficient testing power, the algorithm reduces the observed haplotype space, in a probabilistic manner, to a core haplotype set that contains fewer polymorphisms but possesses the essential information for studying haplotype-disease association. Such core haplotype diversity presumably mimics the diversity before the occurrence of other events that are not directly related to the evolution of disease mutation, for example, recent marker mutation, gene conversion, genotyping error, and even missing data.

The grouping analysis of Tzeng (2005) is limited to assessing global association between haplotypes and traits. It cannot evaluate the effect of individual haplotypes or accommodate covariates. Its implementation requires phased haplotypes and empirical evaluation of the significance level. In the present study, we generalized the clustering approach of Tzeng (2005) to a generalized linear model framework and allowed for unphased genotypes. We constructed tests that are based on clustered haplotypes, for assessing association at both global and haplotype-specific levels. The test incorporates two major sources of uncertainty in haplotype analysis: clustering uncertainty and phase uncertainty. Among the many promising regression-based approaches that evaluate individual effects of haplotypes through use of genotypes, we established our work on the score tests developed by Schaid et al. (2002). Their method has been shown to be robust to departure from the Hardy-Weinberg equilibrium and to possess comparable power with retrospective approaches for case-control data that are sampled retrospectively (Satten and Epstein 2004). Through simulation studies, we assessed the performance of the proposed approach and demonstrated its validity and power in testing for haplotype-trait association. We also illustrated the proposed approach through an application to a hypertriglyceridemia study, in which we tested the apolipoprotein A5 gene (APOA5), a confirmed risk factor for hypertriglyceridemia.

We begin this section by reviewing the clustering methods of Tzeng (2005). We then integrate the clustering algorithm into a regression framework. Finally, we construct the score test for association that incorporates phase ambiguity and clustering uncertainty on the basis of the work of Schaid et al. (2002) and Tzeng (2005).
The fundamental purpose of the clustering algorithm is to group rare haplotypes with their corresponding ancestral haplotypes. Given an evolutionary tree of haplotypes, the algorithm sequentially combines "rare" haplotypes into their one-step neighboring haplotypes, from the tips of the tree toward the major nodes. Each of the resulting clusters is represented by its most common haplotype, and haplotypes within a cluster are assumed to have the same effect on the disease trait. Determining "rare" haplotypes requires a trade-off between information and dimensionality, and the algorithm uses an information criterion to find the optimal balance between the two. The information criterion is defined as "the cumulative Shannon information content" (Shannon 1948), with a penalty function determined by the number of dimensions and the sample size involved. Denote $H_F$ as the full set of observed haplotypes and $H_C$ as the set of clustered haplotypes. The algorithm obtains $H_C$ by preserving high-frequency haplotypes; that is, it sets $H_C$ as the $\ell$ most frequent haplotypes, where $\ell$ maximizes the information criterion.

In reality, the evolutionary tree is often unknown and needs to be inferred. Instead of inferring the most likely tree relationship and performing grouping accordingly, the algorithm assigns each relationship branch a probability. It then clusters haplotypes by considering all relationships according to the probability weights. The branch probability is determined by two factors that are commonly considered in reconstructing a haplotype tree (Crandall and Templeton 1993; Slatkin and Rannala 1997): (1) the relatedness of haplotypes and (2) the age of haplotypes. The algorithm uses haplotype frequencies to indicate the haplotype age. To measure the relatedness of haplotypes, a certain metric of haplotype similarity is used, such as counting the number of matching loci between two haplotypes. When the evolutionary relationships are known, the branch probability is reduced to an indicator function of whether two haplotypes $u$ and $v$ are one-step related. For further detail, see Tzeng (2005).

The general algorithm can be described as follows: first, partition the list $H_F$ into (1) $H_{(0)} = H_C$, the core category, (2) $H_{(1)}$, the one-step neighbors of $H_{(0)}$ that consist of haplotypes different from the core haplotypes by one step of mutation, and (3) $H_{(2)}$, the two-step neighbors of $H_{(0)}$ that consist of haplotypes different from the core haplotypes by two steps of mutation, and continue until the entire space of $H_F$ is exhausted. Let $\Pi_F$ denote the haplotype frequencies of $H_F$; correspondingly, $\Pi_F$ is also decomposed into $\Pi_{(0)}, \Pi_{(1)}, \ldots, \Pi_{(j)}, \ldots, \Pi_{(J)}$. Starting from $j=J$ to $j=1$, group each element of $H_{(j)}$ to its one-step ancestor in $H_{(j-1)}$ and combine the frequencies. The grouping rule is specified according to the branch probabilities that are stored in the allocation matrix $B_{(j)}$; each row of $B_{(j)}$ describes to whom and how a certain haplotype of $H_{(j)}$ is allocated among $H_{(j-1)}$. As illustrated by Tzeng (2005), this one-step grouping process is equivalent to the matrix operation $\Pi_{(j)}' B_{(j)}$, and the overall process can be described as

$$\Pi_C' \,\bigl(= \Pi_{(0)}^{*\prime}\bigr) = \Pi_{(0)}' + \Pi_{(1)}' B_{(1)} + \Pi_{(2)}' B_{(2)} B_{(1)} + \cdots + \Pi_{(J)}' B_{(J)} B_{(J-1)} \cdots B_{(2)} B_{(1)} .$$

Or, equivalently,

$$\Pi_C' = \Pi_F' B , \tag{1}$$

where

$$\Pi_F = \begin{bmatrix} \Pi_{(0)} \\ \Pi_{(1)} \\ \Pi_{(2)} \\ \vdots \end{bmatrix}, \qquad B = \begin{bmatrix} I \\ B_{(1)} \\ B_{(2)} B_{(1)} \\ \vdots \end{bmatrix} .$$

Suppose there are $(L+1)$ distinct haplotypes in the population and they are clustered into $(L^*+1)$ groups. The dimension of $B$ is $(L+1) \times (L^*+1)$.

Given that the clustering procedure can be implemented via the matrix multiplication in equation (1), it is straightforward to integrate this dimension reduction procedure into a regression framework. Under the regression model, probabilistic clustering of haplotypes can be done by replacing the vector of haplotype frequencies $\Pi_F$ in equation (1) with the data matrix of haplotypes. That is, denote $X_F$ as the haplotype matrix of the full dimension with use of a certain scoring rule; its $(h,i)$ entry, for example, can be the number of copies of haplotype $h$ that individual $i$ possesses. The matrix $X_F$ has dimension $(L+1) \times n$, where $n$ is the sample size. Then the data matrix of clustered haplotypes, $X_C$, can be obtained by

$$X_C' = X_F' B(\Pi) . \tag{2}$$

Here, we rewrite the allocation matrix $B$ as $B(\Pi)$ to emphasize the fact that the allocation matrix $B$ is a function of the haplotype frequency $\Pi$.

Let $Y$ denote an $n \times 1$ vector of the disease trait values, and let $Z$ denote a $P \times n$ matrix of the $P$ environmental covariates. With the original haplotype data of full dimension, the effects of the genetic and environmental covariates can be modeled by the generalized linear model (GLM):

$$g(EY) \equiv \eta = X_F' \beta_F + Z' \gamma ,$$

where $\beta_F' = (\beta_{F(0)}, \beta_{F(1)}, \ldots, \beta_{F(L)})$, so that $\beta_F$ is an $(L+1) \times 1$ vector. The association of haplotypes with the disease traits can be detected by testing $H_0: \beta_{F(0)} = \beta_{F(1)} = \cdots = \beta_{F(L)}$.
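As a rough numerical illustration of equations (1) and (2), the following Python sketch stacks hypothetical allocation matrices into the overall matrix B and applies it to a frequency vector and a small haplotype-count matrix. All numbers (frequencies, branch probabilities, counts) are made-up values chosen for readability, and the helper name `stack_allocation` is an assumption of this sketch rather than part of the authors' software; the estimation of branch probabilities and the choice of the core set are not shown.

```python
import numpy as np

def stack_allocation(B_steps, n_core):
    """Stack I, B(1), B(2)B(1), ... into the overall allocation matrix B.

    B_steps = [B(1), ..., B(J)]; row k of B(j) holds the branch
    probabilities that allocate haplotype k of H(j) among H(j-1).
    """
    blocks = [np.eye(n_core)]             # core haplotypes map to themselves
    for B_j in B_steps:
        blocks.append(B_j @ blocks[-1])   # B(j) B(j-1) ... B(1)
    return np.vstack(blocks)

# Illustrative values only: 2 core, 2 one-step and 1 two-step haplotypes.
pi_F = np.array([0.50, 0.30, 0.10, 0.06, 0.04])  # ordered [Pi(0); Pi(1); Pi(2)]
B1 = np.array([[1.0, 0.0],
               [0.4, 0.6]])                       # allocates H(1) among H(0)
B2 = np.array([[0.5, 0.5]])                       # allocates H(2) among H(1)

B = stack_allocation([B1, B2], n_core=2)          # shape (L+1) x (L*+1) = 5 x 2
pi_C = pi_F @ B                                   # equation (1): Pi_C' = Pi_F' B

# X_F: (L+1) x n matrix of haplotype counts for n = 3 individuals
# (each column sums to 2, one count per chromosome).
X_F = np.array([[2, 0, 0],
                [0, 1, 0],
                [0, 0, 1],
                [0, 1, 0],
                [0, 0, 1]], dtype=float)
X_C_t = X_F.T @ B                                 # equation (2): X_C' = X_F' B

print(pi_C)
print(X_C_t)
```

Because every row of B sums to 1, the clustered frequencies still sum to 1 and each individual's clustered counts still sum to 2; the counts simply become fractional when a rare haplotype is shared among several core haplotypes.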
To reduce the degrees of freedom, we performed an analysis on groups of homogeneous haplotypes, using the following model:

$$g(EY) \equiv \eta = X_C' \beta_C + Z' \gamma ,$$

where $X_C'$ is obtained by the clustering algorithm of equation (2) and $\beta_C' = (\beta_{C(0)}, \beta_{C(1)}, \ldots, \beta_{C(L^*)})$ with $L^* \le L$. The association test is now performed through the $(L^*+1)$ parameters of the clustered haplotypes,

$$H_0: \beta_{C(0)} = \beta_{C(1)} = \cdots = \beta_{C(L^*)} . \tag{3}$$

Here, we derive the score test for association in the clustered haplotype space. We first calculate the score function, which is the partial derivative of the log likelihood function, and then use it to construct the score test. To facilitate derivation, we reparameterize $\beta_C$ via a linear transformation

$$\beta_C \equiv \begin{bmatrix} \mu \\ \mu + \alpha_1 \\ \vdots \\ \mu + \alpha_{L^*} \end{bmatrix} = A \begin{bmatrix} \mu \\ \alpha \end{bmatrix} \quad \text{with} \quad A = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 1 & & & \\ \vdots & & I_{L^* \times L^*} & \\ 1 & & & \end{bmatrix} .$$

Consequently, the global null hypothesis (3) is equivalent to $H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_{L^*} = 0$, and the effect of haplotype $h$ can be examined by $H_0: \alpha_h = 0$.

Consider observed data $(Y, G, Z)$ in which $G$ is the data matrix of unphased genotypes. For each individual $i$, we treat the observed genotype $g_i$ as an incomplete version of haplotype count $x_{F,i}$, which is the $i$th column of the design matrix $X_F$. Without loss of generality, here we assume that the vector $x_{F,i}$ is normed so that its entries sum to 1. Under the assumption of Hardy-Weinberg equilibrium, $x_{F,i} \sim \tfrac{1}{2} \times \text{multinomial}(2, \Pi_F)$. The GLM density of trait $y_i$, given covariates $x_{F,i}$ and $z_i$, is

$$f(y_i \mid x_{F,i}, z_i; \alpha, \mu, \phi, \gamma, \Pi) = \exp\left[ \frac{y_i \eta_i - b(\eta_i)}{a(\phi)} + c(y_i, \phi) \right] ,$$

where

$$\eta_i = x_{C,i}' \beta_C + z_i' \gamma = x_{F,i}' \, B(\Pi) \, A \begin{bmatrix} \mu \\ \alpha \end{bmatrix} + z_i' \gamma ,$$

and $\phi$ is the dispersion parameter (see table 1 of Schaid et al. [2002]). Let $\zeta$ denote the vector of the nuisance parameters $(\mu, \gamma, \phi, \Pi)$.
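The reparameterization and the linear predictor above can be sketched numerically as follows. This is only an illustration under assumed values: the allocation matrix, haplotype vector, covariate, and coefficients are invented, and the function names `reparam_matrix` and `linear_predictor` are hypothetical. The sketch also shows why the null hypothesis reduces the haplotype term to the constant μ when the entries of x sum to 1.

```python
import numpy as np

def reparam_matrix(n_clusters):
    """Build A so that A @ [mu, alpha_1, ..., alpha_L*] = beta_C."""
    A = np.zeros((n_clusters, n_clusters))
    A[:, 0] = 1.0                          # every cluster effect contains mu
    A[1:, 1:] = np.eye(n_clusters - 1)     # cluster h >= 1 adds alpha_h
    return A

def linear_predictor(x_F, B, mu, alpha, z, gamma):
    """eta_i = x_F' B A [mu, alpha]' + z' gamma for one individual."""
    theta = np.concatenate(([mu], alpha))
    beta_C = reparam_matrix(B.shape[1]) @ theta
    return x_F @ B @ beta_C + z @ gamma

# Assumed allocation matrix: 5 observed haplotypes clustered into 2 groups.
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [0.4, 0.6],
              [0.7, 0.3]])

# One individual carrying haplotypes 1 and 3, normed so the entries sum to 1.
x_F = np.array([0.0, 0.5, 0.0, 0.5, 0.0])
z = np.array([1.0])                        # a single covariate (illustrative)
gamma = np.array([0.1])

beta_C = reparam_matrix(2) @ np.array([0.2, 0.5])            # (mu, mu + alpha_1)
eta = linear_predictor(x_F, B, 0.2, np.array([0.5]), z, gamma)
eta_null = linear_predictor(x_F, B, 0.2, np.array([0.0]), z, gamma)

print(beta_C, eta, eta_null)
```

With α = 0, every cluster has effect μ, so the haplotype contribution no longer depends on phase or on the allocation matrix; this is the simplification that makes evaluation under the null convenient for a score test.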
The likelihood function for $(\alpha, \zeta)$ on the basis of the data $(Y, G, Z)$ is

$$L(\alpha, \zeta; Y, G, Z) = \prod_{i=1}^{n} \Bigl\{ \sum_{x_{F,i}} f(y_i, x_{F,i}, g_i \mid z_i; \alpha, \zeta) \Bigr\} = \prod_{i=1}^{n} \Bigl\{ \sum_{x_{F,i}} f(y_i \mid x_{F,i}, z_i; \alpha, \zeta) \times P(g_i \mid x_{F,i}) \times P(x_{F,i}; \Pi) \Bigr\} . \tag{4}$$

Because $P(g_i \mid x_{F,i})$ is an indicator function of whether the haplotype count $x_{F,i}$ is compatible with the observed genotype $g_i$, likelihood (4) can be further simplified as

$$L(\alpha, \zeta; Y, G, Z) = \prod_{i=1}^{n} \Bigl\{ \sum_{x_{F,i} \in g_i} f(y_i \mid x_{F,i}, z_i; \alpha, \zeta) \times P(x_{F,i}; \Pi) \Bigr\} . \tag{5}$$

The score function for $\alpha$ is the partial derivative of the logarithm of likelihood (5) with respect to $\alpha$. The resulting score statistic, denoted by $S_\alpha$, is the score function evaluated at the restricted maximum-likelihood estimates under the null hypothesis. $S_\alpha$ is the statistic we use to test haplotype effect; in appendix A, we show the following result:

$$S_\alpha = \sum_{i=1}^{n} \frac{y_i - \bar{y}}{a(\phi)} \, B(\Pi)'_{-0} \, E(X_i \mid g_i) \,\Big|_{\alpha = \tilde{\alpha} = 0,\ \zeta = \tilde{\zeta}} ,$$

where $\tilde{\alpha}$ and $\tilde{\zeta}$ are the restricted maximum-likelihood estimates under the null hypothesis, $B(\Pi)'_{-0}$ is the transpose of the matrix $B(\Pi)$ with the first column (i.e., the baseline haplotype) removed, and $E(X_i \mid g_i)$ is the same as that defined by Schaid et al. (2002), the expected haplotype counts given the observed genotypes. We see that the proposed score statistic that accounts for phase and clustering ambiguities is the original score test of Schaid et al. (2002) …
