A Testing Framework for Identifying Susceptibility Genes in the Presence of Epistasis

Article, open access, peer reviewed

2005; Elsevier BV; Volume: 78; Issue: 1; Language: English

10.1086/498850

ISSN

1537-6605

Authors

Joshua Millstein, David V. Conti, Frank D. Gilliland, W. James Gauderman

Topic(s)

Gene expression and cancer classification

Abstract

An efficient testing strategy called the “focused interaction testing framework” (FITF) was developed to identify susceptibility genes involved in epistatic interactions for case-control studies of candidate genes. In the FITF approach, likelihood-ratio tests are performed in stages that increase in the order of interaction considered. Joint tests of main effects and interactions are performed conditional on significant lower-order effects. A reduction in the number of tests performed is achieved by prescreening gene combinations with a goodness-of-fit $\chi^2$ statistic that depends on association among candidate genes in the pooled case-control group. Multiple testing is accounted for by controlling false-discovery rates. Simulation analysis demonstrated that the FITF approach is more powerful than marginal tests of candidate genes. FITF also outperformed multifactor dimensionality reduction when interactions involved additive, dominant, or recessive genes. In an application to asthma case-control data from the Children’s Health Study, FITF identified a significant multilocus effect between the nicotinamide adenine dinucleotide (phosphate) reduced:quinone oxidoreductase gene (NQO1), myeloperoxidase gene (MPO), and catalase gene (CAT) (unadjusted P=.00026), three genes that are involved in the oxidative stress pathway. In an independent data set consisting primarily of African American and Asian American children, these three genes also showed a significant association with asthma status (P=.0008).

Failure to account for gene-gene interactions in the search for susceptibility genes for complex diseases has been widely suggested as an explanation for difficulties in replicating significant findings. Recent human and animal studies of complex diseases have identified susceptibility genes that, considered marginally, contribute to a common trait only to a minor extent or not at all but that interact significantly in combined analyses (Kuida and Beier 2000; Naber et al. 2000; Williams et al. 2000; Hsueh et al. 2001; Kim et al. 2001; Tripodis et al. 2001; Ukkola et al. 2001; Barlassina et al. 2002; De Miglio et al. 2004; Yanchina et al. 2004; Yang et al. 2004; Aston et al. 2005; Dong et al. 2005; Roldan et al. 2005). Several investigators have found alleles that have opposite effects depending on the genetic background (Balmain and Harris 2000; Staessen et al. 2001), which further raises the likelihood of overlooking epistatic susceptibility genes in single-gene analyses (Culverhouse 2002).

Accounting for interactions is not a trivial task, because of the serious multiple-testing problem created by the large number of possible interactions for even a relatively small set of candidate genes. For example, in the Children’s Health Study (CHS), a prospective study of children’s respiratory health, we are studying ∼20 candidate genes related to oxidative stress and inflammatory pathways (Gilliland et al. 1999). These 20 genes yield 20×19/2 = 190 possible two-gene interactions and 20×19×18/6 = 1,140 possible three-gene interactions. If the multiple-testing problem is ignored, type I error rates will be greatly inflated, leading to false conclusions and to studies that are difficult to replicate.

Foulkes et al. (2005) applied a combined dimension-reduction and mixed-modeling approach to four SNPs in three lipase genes to assess risk of cardiovascular disease.
Although their approach accounts for possible interactions and allows controlling for possible confounders, it is unclear what the performance or proper implementation would be for a larger set of candidates. Devlin et al. (2003) showed that type I error rates were extremely inflated for model-selection methods such as the Lasso (Tibshirani 1996). Another multilocus approach is the set-association approach (Hoh et al. 2001), which uses sums of statistics based on locus-specific association and Hardy-Weinberg disequilibrium to test a global null hypothesis. This approach may be powerful for finding many small effects that combine to have an important effect on the phenotype but does not explicitly account for possible epistatic interactions.

Several data-mining approaches have been developed to address the problem of identifying susceptibility genes involved in epistatic interactions (Ritchie et al. 2001, 2003b; Moore and Hahn 2002; Bastone et al. 2004; Cook et al. 2004; Culverhouse et al. 2004; Foulkes et al. 2004); however, their performance may be limited in the presence of main effects or genetic heterogeneity. Also, properties related to type I error rates and power have not been thoroughly compared with more traditional approaches.

A data-mining approach that has generated some recent interest is multifactor dimensionality reduction (MDR) (Ritchie et al. 2001, 2003a; Hahn et al. 2003; Bastone et al. 2004; Cho et al. 2004; Coffey et al. 2004; Hahn and Moore 2004; Moore 2004; Tsai et al. 2004; Williams et al. 2004; Qin et al. 2005; Soares et al. 2005). MDR is a nonparametric method designed to detect genes involved in high-order interactions in case-control studies (Ritchie et al. 2001). To implement this method, the investigator must first specify the number of interacting genes, k, to consider (throughout this article, we will consider three genes).
The data are divided into 10 equal parts, and the phenotypes of subjects in each 1/10 of the data are predicted by the MDR model derived from the remaining 9/10 of the data. For each 9/10 of the data, several steps are performed. For every set of k genes, MDR classifies each multilocus genotype as “high risk” or “low risk,” depending on the ratio of cases to controls. The subjects in the “high risk” groups are then pooled. The k-gene set that maximizes the cases:controls ratio in the pooled “high risk” group is selected as the “best” gene set. Disease status for subjects in the remaining 1/10 of the data is then predicted on the basis of genotype risk for the “best” gene set. The overall “best” gene set is determined by the data split with the lowest prediction error. Prediction error is averaged over the 10 data splits and is used as a measure of predictive power. Another useful measure, termed “consistency,” is the number of data splits with the same “best” set of factors.

We developed a new search strategy designed to identify susceptibility genes among a group of candidate genes in the presence of gene-gene interactions. The candidate genes may be selected for their role in a specific biochemical pathway or from a prior genome scan for linkage. A powerful testing framework based on likelihood-ratio tests (LRTs) is presented here that simultaneously tests multilocus effects across various orders of interaction. Our search strategy also employs a screening statistic to reduce the total number of gene sets that are tested for multilocus effects. We present an assessment of power and type I error from simulation analysis and compare the method's performance with that of MDR. We then apply both our method and MDR to a case-control data set from the CHS that includes 12 candidate loci measured in asthmatic and nonasthmatic subjects.

Consider a disease phenotype, D, and a sample of cases (D=1) and controls (D=0) selected from some population.
We assume that genotypes are obtained for each subject for a set of diallelic, autosomal candidate loci. For each candidate locus, indexed by $i, j, k, \ldots$, we define a covariate, $G$, with possible values 0, 1, or 2, corresponding to genotypes aa, Aa, and AA, respectively. This defines a log-additive coding scheme, a robust approach when the specific genetic model is unknown (Schaid 1996). We note, however, that the methods presented here are readily adaptable to alternative risk models (e.g., dominant, recessive, or codominant). We adopt a logistic model to relate genes to D. For example, the fully saturated model for a set of three candidate genes has the form

$$\operatorname{logit}[P(D=1)] = \beta_0 + \beta_i G_i + \beta_j G_j + \beta_k G_k + \beta_{ij} G_i G_j + \beta_{ik} G_i G_k + \beta_{jk} G_j G_k + \beta_{ijk} G_i G_j G_k . \tag{1}$$

The model contains three main effects, three two-way interactions, and one three-way interaction. An analogous saturated model for two genes would be

$$\operatorname{logit}[P(D=1)] = \beta_0 + \beta_i G_i + \beta_j G_j + \beta_{ij} G_i G_j , \tag{2}$$

whereas a model for a single gene would be

$$\operatorname{logit}[P(D=1)] = \beta_0 + \beta_i G_i . \tag{3}$$

LRTs can be used to identify susceptibility genes by testing the parameters in the above models. An LRT statistic is computed as $\chi^2 = 2(L_{\text{full}} - L_{\text{reduced}})$, where $L_{\text{full}}$ is the log-likelihood of the data computed under a fully specified model and $L_{\text{reduced}}$ is the log-likelihood computed under the constraint that one or more parameters equal zero. Under the null hypothesis, this statistic has a $\chi^2$ distribution with df equal to the difference in the number of unconstrained parameters between the full and reduced models. Three LRT testing strategies for identification of genes will be considered.

The simple model in equation (3) is used to test the null hypothesis $\beta_i = 0$ for each candidate gene.
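To make the single-gene likelihood-ratio test concrete, here is a minimal Python sketch, not the authors' implementation. It fits the single-gene logistic model and the intercept-only model by Newton-Raphson and compares their log-likelihoods. The function names `fit_logistic` and `marginal_lrt` and the toy genotype counts are illustrative assumptions. The 1-df P value relies on the identity that the upper tail of a 1-df chi-square at x equals erfc(sqrt(x/2)).

```python
import math

def fit_logistic(g, d, with_gene=True, iters=50):
    """Fit logit[P(D=1)] = b0 + b1*G by Newton-Raphson (hypothetical helper).

    With with_gene=False, b1 is held at zero (the reduced model).
    Returns (b0, b1, log_likelihood).
    """
    b0 = b1 = 0.0
    for _ in range(iters):
        s0 = s1 = h00 = h01 = h11 = 0.0  # score vector and information matrix
        for x, y in zip(g, d):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            s0 += y - p
            s1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        if with_gene:
            # Solve the 2x2 Newton system explicitly.
            det = h00 * h11 - h01 * h01
            b0 += (h11 * s0 - h01 * s1) / det
            b1 += (h00 * s1 - h01 * s0) / det
        else:
            b0 += s0 / h00
    loglik = sum(y * (b0 + b1 * x) - math.log1p(math.exp(b0 + b1 * x))
                 for x, y in zip(g, d))
    return b0, b1, loglik

def marginal_lrt(g, d):
    """Return (beta_i, LRT statistic, P value) for the 1-df test of beta_i = 0."""
    _, _, l_reduced = fit_logistic(g, d, with_gene=False)
    _, b1, l_full = fit_logistic(g, d, with_gene=True)
    stat = max(2.0 * (l_full - l_reduced), 0.0)
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-square upper tail, 1 df
    return b1, stat, p_value

# Toy data (made up): genotype codes for 50 controls and 50 cases.
controls = [0] * 30 + [1] * 15 + [2] * 5
cases = [0] * 15 + [1] * 20 + [2] * 15
g = controls + cases
d = [0] * len(controls) + [1] * len(cases)

beta, stat, p_value = marginal_lrt(g, d)
print(f"beta_i = {beta:.3f}, LRT = {stat:.2f}, P = {p_value:.4f}")
```

The sketch assumes the data are not perfectly separated by genotype; a production analysis would more likely rely on an established statistics package and would add convergence checks.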
We refer to this test as the marginal test of $G_i$, since the estimated effect from this model, $\beta_i$, represents an average of the main effect of $G_i$ and any interactive effects with other loci. With a total of K candidate genes, there are K marginal tests. The threshold for significance is adjusted for multiple testing by controlling false-discovery rates (FDRs) (Benjamini and Hochberg 1995), although other approaches (e.g., Bonferroni adjustment) could be adopted. In brief, Benjamini and Hochberg (1995) defined FDR as the ratio of the number of falsely rejected null hypotheses to the total number of rejected null hypotheses. They showed that the expected FDR can be controlled by a procedure that applies a cutoff to the unadjusted ordered P values, $P_{(1)}, P_{(2)}, \ldots, P_{(i)}, \ldots, P_{(m)}$. All null hypotheses with P values at or below cutoff $t$ are rejected; specifically,

$$t = \max\left\{ P_{(i)} : P_{(i)} \le \frac{i\alpha}{m} \right\} .$$

In the second strategy, the interaction testing framework (ITF), tests are performed in a series of stages, with an incremental increase in the highest-order interaction parameter considered at each subsequent stage. The first stage tests the main effect of each gene, the second stage tests all possible two-way interactions, the third stage tests all three-way interactions, and so forth. To avoid retesting the same effects, a test in a higher stage (e.g., test of a specific two-way interaction in stage 2) is conditioned on any component factors (e.g., either of the two genes involved in that two-way interaction) that were already declared significant in a lower stage (e.g., stage 1).
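The Benjamini-Hochberg cutoff described earlier, the largest ordered P value P(i) that satisfies P(i) ≤ iα/m, can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code; the function names `bh_cutoff` and `bh_reject` and the example P values are assumptions made for the demonstration.

```python
def bh_cutoff(p_values, alpha):
    """Return t = max{P(i) : P(i) <= i*alpha/m}, or None if no P value qualifies."""
    m = len(p_values)
    t = None
    for i, p in enumerate(sorted(p_values), start=1):
        if p <= i * alpha / m:
            t = p  # sorted ascending, so the last qualifying value is the maximum
    return t

def bh_reject(p_values, alpha):
    """Flag each hypothesis whose P value is at or below the cutoff t."""
    t = bh_cutoff(p_values, alpha)
    return [t is not None and p <= t for p in p_values]

# Made-up P values for three tests.
p_values = [0.04, 0.02, 0.03]
print(bh_cutoff(p_values, alpha=0.05))
print(bh_reject(p_values, alpha=0.05))
```

In this toy example the smallest P value, 0.02, exceeds its own rank threshold of 0.05/3, yet it is still rejected because the cutoff is set by a P value of higher rank. This step-up behavior is what distinguishes the procedure from a rank-by-rank comparison.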
Gene sets are tested for multilocus effects, whether or not marginal effects were found. Type I error is controlled by dividing the overall $\alpha$ level by the number of stages and allocating this adjusted $\alpha$ level, $\alpha^*$, to each stage. Within each stage, the threshold for significance is adjusted by controlling FDR. The specific stages are as follows.

1. First stage. Perform marginal LRTs of $\beta_i$ for each of the K candidate genes. Declare a test significant if $P_i < \alpha^*_1$, where $P_i$ is the P value that corresponds to the $i$th LRT and $\alpha^*_1$ denotes the significance threshold for first-stage tests corrected to control FDR. A total of K tests are conducted in this stage.

2. Second stage. For all possible two-gene sets (K(K−1)/2), the full model (eq. [2]) is tested against the reduced model,

$$\operatorname{logit}[P(D=1)] = \beta_0 + \beta_i G_i I() + \beta_j G_j I() ,$$

where $I()$ is an indicator function that assumes the value 1 if the corresponding term was statistically significant in a first-stage test and 0 otherwise. Thus, if both $\beta_i$ and $\beta_j$ were statistically significant in the first stage, the reduced model would be $\beta_0 + \beta_i G_i + \beta_j G_j$, and the interaction between $G_i$ and $G_j$ would be tested in a 1-df test in this second stage. On the other hand, if neither $\beta_i$ nor $\beta_j$ was statistically significant in the first stage, then a 3-df test of $\beta_i$, $\beta_j$, and $\beta_{ij}$ would be conducted in the second stage. This selective conditioning is done to avoid retesting effects that have already been declared significant. Significance is declared if $P_{ij} < \alpha^*_2$, where $P_{ij}$ is the P value that corresponds to the $ij$th LRT and $\alpha^*_2$ denotes the significance threshold for second-stage tests corrected to control FDR.

3. Third stage. All three-gene sets are tested (the number of tests is K(K−1)(K−2)/6) in a fashion similar to the method in stage 2. The saturated model (eq. [1]) is tested against the reduced model,

$$\operatorname{logit}[P(D=1)] = \beta_0 + \beta_i G_i I() + \beta_j G_j I() + \beta_k G_k I() + \beta_{ij} G_i G_j I() + \beta_{ik} G_i G_k I() + \beta_{jk} G_j G_k I() ,$$

where, again, the indicator function $I()$ assumes the value 1 if the term was in a model that achieved statistical significance in a previous stage and 0 otherwise. It should be stated explicitly that a model that includes higher-order terms would always include the component lower-order terms.

The ITF approach described thus far can be directly generalized to multilocus effects involving four or more genes.

It is clear that the number of tests conducted in the ITF method can be quite large when K is large. Adjusting the type I error for so many tests may cause an unacceptable loss in power. We developed a method for prescreening all possible gene sets, to focus attention on those that are most likely to be informative in the ITF. Let $G_{ijk}$ denote a multilocus genotype over a set of three candidate genes $i$, $j$, and $k$. Then, by the Bayes theorem, the probability that a case possesses the particular genotype $G_{ijk}$ is

$$P(G_{ijk} \mid D=1) = \frac{P(D=1 \mid G_{ijk})\, P(G_{ijk})}{P(D=1)} .$$

The factor $P(G_{ijk})$ describes the population distribution of $G_{ijk}$, which, under our assumption of locus independence, is simply a product of the corresponding genotype frequencies. If the three loci combine to affect disease risk, $P(G_{ijk} \mid D=1)$ will differ from $P(G_{ijk})$ by an amount that depends on the magnitude of risk that $G_{ijk}$ confers. One might compute a measure of difference between the observed distribution of $G_{ijk}$ in cases and that expected on the basis of the product of genotype frequencies and then focus the third stage of the ITF on only those sets with a difference that exceeds some threshold. However, the use of only cases in this screening step will induce a bias into the ITF because of the explicit use of disease status. Rather, we propose to compute this difference measure with the pooled sample of cases and controls, to avoid this bias.
A deviation from the expected prevalence of $G_{ijk}$ in the entire case-control sample could be the result of a deviation from the expected prevalence of $G_{ijk}$ in cases and could thus indicate association with disease. The measure of difference we propose to use is a $\chi^2$ goodness-of-fit statistic that compares the observed with the expected distribution of $G_{ijk}$ in the combined case-control sample. The $\chi^2$ statistic is then used as the criterion by which to choose gene combinations for inclusion in ITF; that is, only gene sets with a calculated $\chi^2$ statistic above a selected cutoff value are analyzed. The form of the $\chi^2$ statistic should match the underlying assumptions of risk; in other words, for the risk model in equation (1), the genotype groups would be chosen to match risk levels associated with each interaction term. For instance, there would be four genotype groups for two-gene sets, corresponding to $G_i \times G_j = 0$, 1, 2, or 4, and five genotype groups for three-gene sets, corresponding to $G_i \times G_j \times G_k = 0$, 1, 2, 4, or 8. The $\chi^2$ statistic, henceforth referred to as the “CSS” (chi-squared subset) statistic, would then take the form

$$\mathrm{CSS} = \sum_{i=1}^{r} \frac{[n_i - E(n_i)]^2}{E(n_i)} .$$

Here, $n_i$ is the observed number of subjects, irrespective of case status, in the $i$th genotype group, and $r$ is the total number of genotype groups. The expected $n_i$, $E(n_i)$, is estimated on the basis of the sample marginal genotype frequencies of each gene. For example, let $n_4$ equal the observed number of subjects with $G_i \times G_j = 4$ (in other words, genotype AA at locus $i$ and BB at locus $j$); then, for two-gene sets, the expected count is $E(n_4) = N (n_{AA}/N)(n_{BB}/N) = n_{AA} n_{BB}/N$, where $n_{AA}$ and $n_{BB}$ are the marginal genotype counts and $N$ denotes the total sample size.

We emphasize the point that use of the CSS statistic to limit the number of gene sets considered does not bias subsequent tests. Under the global null hypothesis of independence between genotype G and phenotype D, any variable that is strictly a function of G will also be independent of D.
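As an illustration of the CSS computation for a two-gene set, the following Python sketch groups subjects by the product of their genotype codes and compares observed group counts with those expected from the marginal genotype counts. It is a sketch under stated assumptions, not the authors' code: the function name `css_two_gene` is hypothetical, the data are made up, and expected counts are taken as N times the product of the marginal genotype frequencies, which is our reading of the text.

```python
from collections import Counter

def css_two_gene(gi, gj):
    """CSS for a two-gene set, with groups defined by the product Gi*Gj (0, 1, 2, or 4).

    gi, gj: genotype codes (0, 1, 2) for the pooled case-control sample.
    Case-control status is deliberately not an argument.
    """
    n = len(gi)
    observed = Counter(a * b for a, b in zip(gi, gj))
    marg_i, marg_j = Counter(gi), Counter(gj)
    expected = Counter()
    for a in (0, 1, 2):
        for b in (0, 1, 2):
            # Expected count of genotype pair (a, b) if the two loci are independent.
            expected[a * b] += marg_i[a] * marg_j[b] / n
    # Groups with zero expected count are skipped to avoid division by zero.
    return sum((observed[k] - e) ** 2 / e for k, e in expected.items() if e > 0)

# Made-up pooled samples of nine subjects each.
independent = css_two_gene([0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2])
correlated = css_two_gene([0, 1, 2] * 3, [0, 1, 2] * 3)
print(independent, correlated)
```

Because the function never sees disease status, any screening decision based on its output depends on genotypes alone, which is the property the unbiasedness argument relies on.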
Specifically, the reduced set of gene combinations (G*) that results from screening with the CSS statistic is strictly a function of G, since case-control status is not used in computing CSS. Therefore, the reduced set is also statistically independent of D under the global null.

As an initial proof of concept, we first provide evidence to show that accounting for interactions leads to increased efficiency in tests of candidate genes. We assume a model with no main effects and a two-gene interaction with an odds ratio (OR) of 2.0; that is, under equation (2), $\beta_{ij}$ is set to log(2), and $\beta_i$ and $\beta_j$ are set to zero. Phenotype prevalence was set to 10%, allele frequencies were set to 0.3, power was set to 80%, and the significance level was assumed to be 0.05 with a two-sided alternative hypothesis. Conditional on all of these parameter settings, the method of Longmate (2001) was used to estimate required sample sizes for a variety of LRTs derived from equation (2). Test 1 (see table 1) shows the sample size required (N=130) to detect $G_1$ (or $G_2$) by use of a standard marginal test. A 2-df test of $\beta_1$ and $\beta_2$ (test 2) requires only N=90, a 44% increase in efficiency. A 3-df test of the saturated model
