Robust Genomic Control for Association Studies
2006; Elsevier BV; Volume: 78; Issue: 2; Language: English
10.1086/500054
ISSN: 1537-6605
Authors: Gang Zheng, Boris Freidlin, Joseph L. Gastwirth
Topic(s): Genetic and phenotypic traits in livestock
Abstract: Population-based case-control studies are a useful method to test for a genetic association between a trait and a marker. However, the analysis of the resulting data can be affected by population stratification or cryptic relatedness, which may inflate the variance of the usual statistics, resulting in a higher-than-nominal rate of false-positive results. One approach to preserving the nominal type I error is to apply genomic control, which adjusts the variance of the Cochran-Armitage trend test by calculating the statistic on data from null loci. This enables one to estimate any additional variance in the null distribution of statistics. When the underlying genetic model (e.g., recessive, additive, or dominant) is known, genomic control can be applied to the corresponding optimal trend tests. In practice, however, the mode of inheritance is unknown. The genotype-based $\chi^2$ test for a general association between the trait and the marker does not depend on the underlying genetic model. Since this general association test has 2 degrees of freedom (df), the existing formulas for estimating the variance factor by use of genomic control are not directly applicable. By expressing the general association test in terms of two Cochran-Armitage trend tests, one can apply genomic control to each of the two trend tests separately, thereby adjusting the $\chi^2$ statistic. The properties of this robust genomic control test with 2 df are examined by simulation. This genomic control-adjusted 2-df test has control of type I error and achieves reasonable power, relative to the optimal tests for each model.

For mapping disease-susceptibility genes for complex human diseases, case-control studies testing linkage disequilibrium or association are useful approaches for detecting markers with small-to-moderate genetic effects on traits (Risch and Merikangas 1996; Khoury and Yang 1998).
However, because of population stratification or cryptic relatedness, case-control studies may produce spurious associations. On the other hand, case-control studies are easier to conduct than family-based association studies, because they use population controls and do not require genetic data from family members. Statistical methods have been developed to adjust for population stratification and/or cryptic relatedness in case-control studies. One is based on inferring the number of strata in a population and estimating the probability that each sample member belongs to each of these strata (Pritchard and Rosenberg 1999; Pritchard et al. 2000; Satten et al. 2001; Zhu et al. 2002). Another approach is genomic control (GC) (Devlin and Roeder 1999; Bacanu et al. 2000; Devlin et al. 2001; Reich and Goldstein 2001; Zheng et al. 2005), which adjusts the variance of the Cochran-Armitage trend test by use of data from null loci. Here, we focus on developing a robust GC test.

In case-control studies, the Cochran-Armitage (CA) trend tests are preferred to the allele-based test, as they are valid when Hardy-Weinberg equilibrium (HWE) does not hold. Furthermore, the two types of tests are asymptotically equivalent under HWE (Sasieni 1997). To apply the CA trend test, increasing scores are assigned a priori to the genotypes. Thus, the trend statistic is a function of the scores. The choice of scores depends on the underlying genetic model, for example, recessive, additive, or dominant (Sasieni 1997; Zheng et al. 2003), which is a typical problem in the application of trend tests (Graubard and Korn 1987). The GC developed by Devlin and Roeder (1999) was based on the trend test with scores optimal for the additive model. Zheng et al. (2005) studied GC for recessive and dominant models.

For many complex diseases, the underlying genetic models are usually unknown, and a single trend test for case-control studies may lose substantial power when the model is misspecified (Freidlin et al. 2002). Thus, an efficiency-robust test (Gastwirth 1966) having fairly high power across a set of models should be useful. Here, we show that the usual $\chi^2$ test of general association (GA) between the disease and the marker is robust and can be modified to account for population stratification. This test is widely used in genetic data analysis and is also supported by many existing software packages (Weir 1996; Sham 1998; Gibson and Muse 2004).

The GC method adjusts the variance of a trend test by using the null loci to estimate the variance inflation caused by population stratification. It is not directly applicable to the GA test statistic, which has a complicated variance-covariance matrix. A direct adjustment at the scale level may not be applicable. To circumvent this problem, we express the GA test in terms of two CA trend tests. Then, adjusting each CA trend test by the usual GC method provides the adjustment of the GA test.

Consider a genetic marker with two alleles $M$ and $N$ with frequencies $p$ and $q=1-p$, respectively, where $M$ is a disease-associated allele, referred to as the "risk allele." The genotype distributions of case-control data are displayed in table 1, where $(r_0,r_1,r_2)$ and $(s_0,s_1,s_2)$ are the genotype counts of cases and controls. They are independent and follow multinomial distributions $(r_0,r_1,r_2)\sim\mathrm{Mul}(r;p_0,p_1,p_2)$ and $(s_0,s_1,s_2)\sim\mathrm{Mul}(s;q_0,q_1,q_2)$. Denote the disease prevalence in the population as $K=\Pr(\text{disease})$; the genotypes as $G_0=NN$, $G_1=NM$, and $G_2=MM$; and their frequencies by $g_i=\Pr(G_i)$, $i=0,1,2$. The penetrances are defined as the conditional probabilities of disease given each of the three genotypes, $f_i=\Pr(\text{disease}\mid G_i)$, $i=0,1,2$. The genotype frequencies in cases and controls can be written as $p_i=\Pr(G_i\mid\text{disease})=g_if_i/K$ and $q_i=\Pr(G_i\mid\text{control})=g_i(1-f_i)/(1-K)$ for $i=0,1,2$. Under the null hypothesis of no association, $H_0: p_i=q_i=g_i$ for $i=0,1,2$; that is, $H_0: f_0=f_1=f_2=K$. As $M$ is a risk allele, the alternative hypothesis is $H_1: f_0\le f_1\le f_2$ with at least one inequality holding strictly. A genetic model is recessive, additive, or dominant when the penetrances satisfy $f_1=(1-\lambda)f_0+\lambda f_2$ with $\lambda=0$, $1/2$, or $1$, respectively. For local alternatives ($f_0\approx f_1\approx f_2$), the multiplicative model $f_1^2=f_0f_2$ is equivalent to the additive model.
To see this, write $\gamma_1=f_1/f_0=1+\varepsilon_1>1$ and $\gamma_2=f_2/f_0=1+\varepsilon_2>1$, with $\varepsilon_i\approx 0$ for $i=1,2$. Then, the additive model implies that $\varepsilon_2=2\varepsilon_1$. Thus, $\gamma_1^2=(1+\varepsilon_1)^2\approx 1+2\varepsilon_1=1+\varepsilon_2=\gamma_2$, which is the multiplicative model.

Table 1. Genotype Distributions for Case-Control Data

| Data | NN | NM | MM | Total |
|---|---|---|---|---|
| Case | $r_0$ | $r_1$ | $r_2$ | $r$ |
| Control | $s_0$ | $s_1$ | $s_2$ | $s$ |
| Total | $n_0$ | $n_1$ | $n_2$ | $n$ |

When the genetic model is known, a more powerful and directed test is the CA trend test (Agresti 1990). To apply this CA trend test, increasing scores $(0,x,2)$ are assigned to the three genotypes $(NN,NM,MM)$, respectively, where $0\le x\le 2$. The trend test can be written (Sasieni 1997) as

$$Z(x)=\frac{n^{1/2}\sum_{i=0}^{2}x_i(sr_i-rs_i)}{\left\{rs\left[n\sum_{i=0}^{2}x_i^2n_i-\left(\sum_{i=0}^{2}x_in_i\right)^2\right]\right\}^{1/2}},\qquad(1)$$

where $(x_0,x_1,x_2)=(0,x,2)$. For a given $x$, $Z(x)$ asymptotically follows a standard normal distribution under $H_0$. Thus, the null hypothesis is rejected when $|Z(x)|>Z_{1-\alpha/2}$. The trend test $Z(x)$ is optimal when $x$ is properly specified a priori. For recessive, additive (multiplicative), and dominant models, the respective optimal values of $x$ are 0, 1, and 2. From equation (1), it follows that the trend test $Z(x)$ is invariant to a linear transformation of the scores; that is, the scores $(0,x,2)$ and $(0,x/2,1)$ yield the same trend test. Thus, a general model can be expressed as $f_1=(1-\lambda)f_0+\lambda f_2$, where $\lambda\in[0,1]$, and the optimal scores are $(0,\lambda,1)$, which corresponds to $x=2\lambda$ in $Z(x)$ (Zheng et al. 2003). Unfortunately, $Z(x)$ is not robust to a misspecification of the genetic model.

The GC of Devlin and Roeder (1999) is based on $Z(1)$, the optimal test for the additive model. When the population is stratified, they considered the test statistic $Z^{*2}(1)=Z^2(1)/\hat\lambda(1)$, which follows a $\chi^2$ distribution with 1 df ($\chi^2_1$), where $\lambda(1)$ is the variance inflation factor, which can be estimated using the null loci. Let the trend tests $Z(1)$ calculated on $c$ null loci be denoted as $Z_1(1),\ldots,Z_c(1)$; these are realizations of a random variable $Z_0(1)$, where the subscript 0 indicates the null loci, which are not associated with the disease and are not in linkage disequilibrium with the disease loci. Since $Z^{*2}(1)$ follows $\chi^2_1$ and $\lambda(1)$ is a constant, $\lambda(1)=E[Z_0^2(1)]$. Therefore, $\lambda(1)$ can be estimated from the realizations $Z_1^2(1),\ldots,Z_c^2(1)$. Devlin and Roeder (1999) studied both Bayesian and frequentist approaches for estimating $\lambda(1)$. Here, we use the latter, that is, $\hat\lambda(1)=\mathrm{median}[Z_1^2(1),\ldots,Z_c^2(1)]/0.456$. Zheng et al. (2005) showed that the idea can be applied to the optimal tests for the recessive and dominant models. Also, we assume that the minor-allele frequencies of the null loci are close to that of the marker (Reich and Goldstein 2001).
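As an illustration of equation (1) and the frequentist GC estimate described above, the following is a minimal Python sketch, not the authors' SAS macro. The function names `ca_trend_z` and `gc_lambda`, the example genotype counts, and the three null-locus values are hypothetical and chosen only for demonstration.

```python
import math
import statistics

def ca_trend_z(cases, controls, scores):
    """Equation (1); counts are ordered (NN, NM, MM), scores are (x0, x1, x2)."""
    r, s = sum(cases), sum(controls)
    n = r + s
    totals = [a + b for a, b in zip(cases, controls)]   # n0, n1, n2
    numerator = math.sqrt(n) * sum(
        x * (s * a - r * b) for x, a, b in zip(scores, cases, controls))
    sum_x2n = sum(x * x * t for x, t in zip(scores, totals))
    sum_xn = sum(x * t for x, t in zip(scores, totals))
    return numerator / math.sqrt(r * s * (n * sum_x2n - sum_xn ** 2))

def gc_lambda(null_z):
    """Median of the squared null-locus statistics divided by 0.456."""
    return statistics.median([z * z for z in null_z]) / 0.456

# Hypothetical data (not from the paper).
cases, controls = (40, 80, 80), (60, 90, 50)
z_add = ca_trend_z(cases, controls, (0, 1, 2))      # Z(1), additive scores
z_half = ca_trend_z(cases, controls, (0, 0.5, 1))   # same test, rescaled scores
null_z = [0.5, -1.0, 1.5]                           # made-up null statistics
lam = gc_lambda(null_z)
z_star = z_add / math.sqrt(lam)                     # GC-adjusted Z*(1)
print(z_add, z_half, lam, z_star)
```

The rescaled scores are included only to show the invariance to linear transformation noted in the text; in practice many more than three null loci would be used.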
A robust test that does not depend on the underlying genetic model is a test of the general association for the 2×3 table (table 1), which is given by

$$T_{GA}=\sum_{i=0}^{2}\left[\frac{(r_i-n_ir/n)^2}{n_ir/n}+\frac{(s_i-n_is/n)^2}{n_is/n}\right].\qquad(2)$$

Under the null hypothesis of no association between disease status and genotypes, $T_{GA}$ asymptotically follows the $\chi^2$ distribution with 2 df ($\chi^2_2$). Note that GC has been applied to test statistics that have a $\chi^2_1$ distribution, whereas $T_{GA}$ has a $\chi^2_2$ distribution. Thus, direct application of GC to equation (2) is inappropriate. However, the general association test, $T_{GA}$, is asymptotically equivalent to the 2-df score test obtained from the logistic regression model. Define two indicator variables $(x_1,x_2)$ as $(0,0)$, $(0,1)$, and $(1,1)$ to designate the genotypes $NN$, $NM$, and $MM$, respectively. For the $j$th individual, the genotype is denoted by the two indicator variables $(x_{1j},x_{2j})$, and the disease status is denoted as $y_j=1$ for a case and $y_j=0$ for a control. Then, applying

$$\Pr(y_j=1\mid x_{1j},x_{2j})=\frac{\exp(\alpha+\beta_1x_{1j}+\beta_2x_{2j})}{1+\exp(\alpha+\beta_1x_{1j}+\beta_2x_{2j})},$$

the likelihood function is proportional to

$$L(\alpha,\beta_1,\beta_2)=\prod_{j=1}^{n}\left\{[\Pr(y_j=1\mid x_{1j},x_{2j})]^{y_j}\times[1-\Pr(y_j=1\mid x_{1j},x_{2j})]^{1-y_j}\right\}.$$

The null hypothesis of no association is $H_0:\beta_1=\beta_2=0$. The score function evaluated under $H_0$ can be written (see appendix A) as

$$U_1=\left.\frac{\partial\log L}{\partial\beta_1}\right|_{H_0,\hat\alpha}=\frac{1}{n}(sr_2-rs_2),\qquad U_2=\left.\frac{\partial\log L}{\partial\beta_2}\right|_{H_0,\hat\alpha}=\frac{1}{n}[s(r_1+r_2)-r(s_1+s_2)],$$

where $\hat\alpha$ is the maximum-likelihood estimate of the nuisance parameter $\alpha$ under $H_0$. Denote $U$ as $(U_1,U_2)$ and the observed Fisher information matrix evaluated under $H_0$ and $\alpha=\hat\alpha$ as $I(\alpha,\beta_1,\beta_2)$. The submatrix of $I^{-1}(\alpha,\beta_1,\beta_2)$ corresponding to $(\beta_1,\beta_2)$ is denoted by $\Sigma^{-1}$ and is a consistent estimate of the inverse of the covariance matrix of $U$. Thus, under $H_0$,

$$T_2=U^T\Sigma^{-1}U=\frac{1}{1-\hat\rho^2}\left[Z^2(0)+Z^2(2)-2\hat\rho Z(0)Z(2)\right]\qquad(3)$$

has an asymptotic $\chi^2_2$ distribution, where

$$\hat\rho=\left(\frac{n_0n_2}{(n_1+n_2)(n_0+n_1)}\right)^{1/2}\qquad(4)$$

is a consistent estimator of the null correlation between $Z(0)$ and $Z(2)$ (appendix A). Note that $T_2$ is approximately $\chi^2_2$ when there is no population stratification.
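The relationship between equations (2), (3), and (4) can be checked numerically. The sketch below, with hypothetical function names and made-up genotype counts, computes $T_{GA}$ from the 2×3 table and $T_2$ from the recessive and dominant trend statistics $Z(0)$ and $Z(2)$. For this example table the two values agree to rounding error, which is consistent with the equivalence stated above.

```python
import math

def trend_z(cases, controls, x):
    """Equation (1) with scores (0, x, 2); counts are ordered (NN, NM, MM)."""
    scores = (0.0, float(x), 2.0)
    r, s = sum(cases), sum(controls)
    n = r + s
    totals = [a + b for a, b in zip(cases, controls)]
    num = math.sqrt(n) * sum(
        w * (s * a - r * b) for w, a, b in zip(scores, cases, controls))
    spread = (n * sum(w * w * t for w, t in zip(scores, totals))
              - sum(w * t for w, t in zip(scores, totals)) ** 2)
    return num / math.sqrt(r * s * spread)

def general_association(cases, controls):
    """Equation (2): chi-square statistic for the 2x3 genotype table."""
    r, s = sum(cases), sum(controls)
    n = r + s
    total = 0.0
    for a, b in zip(cases, controls):
        n_i = a + b
        total += (a - n_i * r / n) ** 2 / (n_i * r / n)
        total += (b - n_i * s / n) ** 2 / (n_i * s / n)
    return total

def score_test_t2(cases, controls):
    """Equations (3) and (4): T2 built from Z(0), Z(2), and rho-hat."""
    n0, n1, n2 = (a + b for a, b in zip(cases, controls))
    rho = math.sqrt(n0 * n2 / ((n1 + n2) * (n0 + n1)))
    z0 = trend_z(cases, controls, 0)   # recessive scores
    z2 = trend_z(cases, controls, 2)   # dominant scores
    t2 = (z0 ** 2 + z2 ** 2 - 2 * rho * z0 * z2) / (1 - rho ** 2)
    return t2, rho

# Hypothetical data (not from the paper).
cases, controls = (40, 80, 80), (60, 90, 50)
t_ga = general_association(cases, controls)
t2, rho = score_test_t2(cases, controls)
print(t_ga, t2, rho)
```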
To adjust for possible population stratification, we can apply GC to equation (3) by replacing $Z(0)$ and $Z(2)$ by $Z^*(0)$ and $Z^*(2)$, respectively, and $\hat\rho$ by $\rho^*$, which is estimated using null loci. The resulting test statistic will be referred to as the "robust genomic control" (RGC) test and is denoted as

$$T_2^*=\frac{Z^{*2}(0)+Z^{*2}(2)-2\rho^*Z^*(0)Z^*(2)}{1-\rho^{*2}},$$

which has an asymptotic $\chi^2_2$ distribution under $H_0$ and population stratification. The fact that the RGC test is a function of the adjusted optimal test statistics for the two extreme genetic models, recessive and dominant, is not surprising, since the optimal tests for the "extreme" models are components of nearly all efficiency-robust tests (Gastwirth 1966, 1985).

To evaluate the performance of the proposed genotype-based $\chi^2$ test, we conducted simulation studies and estimated empirical power and type I error for the three trend tests $Z^*(2)$, $Z^*(1)$, and $Z^*(0)$ and the RGC test $T_2^*$ under a range of underlying conditions and genetic models. For comparison, we also applied the GC adjustment directly to the 2-df $\chi^2$ test statistic (eq. [3]). This modified GA test is denoted as $T_2^{**}$. The SAS macro running the simulations is available on request. In the simulations, we assumed that the candidate gene and the null loci have the same minor-allele frequency.

Our simulations follow an algorithm similar to that of Devlin and Roeder (1999), Bacanu et al. (2000), and Zheng et al. (2005), which assumes that each subpopulation is in HWE. We specified the minor-allele frequency $p$; Wright's coefficient of inbreeding $F$; the penetrances $f_0$, $f_1$, and $f_2$ under various genetic models; the sample sizes of cases $a_k$ and controls $b_k$ for the $k$th subpopulation, $k=1,\ldots,m$; and the number of null loci $c$ used to estimate the variance inflation factors. In step 1, the allele frequency $p_k$ was generated for the $k$th subpopulation from the beta distribution $\mathrm{Beta}[(1-F)p/F,(1-F)q/F]$, for $k=1,\ldots,m$. In step 2, for individuals from the $k$th subpopulation, two alleles were drawn at random from the binomial distribution $\mathrm{Bin}(2,p_k)$ to create a genotype at the candidate locus. Disease status was randomly generated conditional on the number, $i$, of candidate alleles in the genotype by use of the Bernoulli distribution with parameter $f_i$. The process continued until $a_k$ cases and $b_k$ controls were obtained. In step 3, genotypes for each of the $c$ null loci were generated using the same beta-binomial algorithm as above. The statistics $Z_k(j)$ ($j=0,1,2$) at the $k$th null locus ($k=1,\ldots,c$) were calculated, and the variance inflation factors, $\lambda(j)$, were estimated as $\hat\lambda(j)=\mathrm{median}[Z_1^2(j),\ldots,Z_c^2(j)]/0.456$. Then, the GC trend test statistics $Z^*(j)$ were obtained by $Z^*(j)=Z(j)/\hat\lambda^{1/2}(j)$. The RGC test $T_2^*$ was calculated using equations (3) and (4), with $\rho^*$ estimated as the average of $\hat\rho$ over the $c$ null loci. For the 2-df $\chi^2$ test with direct GC adjustment, $T_2^{**}$, we calculated $T_{2,k}$ for the $k$th null locus ($k=1,\ldots,c$) and the variance inflation factor as $\hat\lambda(T_2)=\mathrm{median}(T_{2,1},\ldots,T_{2,c})/1.386$, where 1.386 is the median of the $\chi^2_2$ distribution. Then, $T_2^{**}=T_2/\hat\lambda(T_2)$, where $\rho^*$ was used in place of $\hat\rho$ in equation (3).
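The three simulation steps and the RGC statistic can be sketched for a single replicate as follows. This is an illustrative Python sketch under stated assumptions, not the authors' SAS macro: the function names, seed, sample sizes, penetrances, and number of null loci are hypothetical. Null loci are simulated with equal penetrances, surplus cases or controls are discarded once a quota is filled, and $\hat\lambda(j)$ is used as estimated, with no lower bound applied.

```python
import math
import random
import statistics

def trend_z(cases, controls, x):
    """Equation (1) with scores (0, x, 2); counts are ordered (NN, NM, MM)."""
    scores = (0.0, float(x), 2.0)
    r, s = sum(cases), sum(controls)
    n = r + s
    totals = [a + b for a, b in zip(cases, controls)]
    num = math.sqrt(n) * sum(
        w * (s * a - r * b) for w, a, b in zip(scores, cases, controls))
    spread = (n * sum(w * w * t for w, t in zip(scores, totals))
              - sum(w * t for w, t in zip(scores, totals)) ** 2)
    return num / math.sqrt(r * s * spread)

def rho_hat(cases, controls):
    """Equation (4)."""
    n0, n1, n2 = (a + b for a, b in zip(cases, controls))
    return math.sqrt(n0 * n2 / ((n1 + n2) * (n0 + n1)))

def draw_table(p, F, penetrance, n_cases, n_controls, rng):
    """Steps 1 and 2: one 2x3 table pooled over the subpopulations."""
    cases, controls = [0, 0, 0], [0, 0, 0]
    for a_k, b_k in zip(n_cases, n_controls):
        # Step 1: subpopulation allele frequency (fixed at p when F == 0).
        p_k = p if F == 0 else rng.betavariate(
            (1 - F) * p / F, (1 - F) * (1 - p) / F)
        got_cases = got_controls = 0
        while got_cases < a_k or got_controls < b_k:
            # Step 2: genotype ~ Bin(2, p_k), then disease ~ Bernoulli(f_g).
            g = (rng.random() < p_k) + (rng.random() < p_k)
            if rng.random() < penetrance[g]:
                if got_cases < a_k:
                    cases[g] += 1
                    got_cases += 1
            elif got_controls < b_k:
                controls[g] += 1
                got_controls += 1
    return cases, controls

def rgc_statistic(candidate, null_tables):
    """Step 3: GC-adjust Z(0) and Z(2), then combine them as in T2*."""
    lam = {}
    for x in (0, 2):
        squared = [trend_z(c, d, x) ** 2 for c, d in null_tables]
        lam[x] = statistics.median(squared) / 0.456
    rho_star = statistics.mean(rho_hat(c, d) for c, d in null_tables)
    z0 = trend_z(candidate[0], candidate[1], 0) / math.sqrt(lam[0])
    z2 = trend_z(candidate[0], candidate[1], 2) / math.sqrt(lam[2])
    t_star = (z0 ** 2 + z2 ** 2 - 2 * rho_star * z0 * z2) / (1 - rho_star ** 2)
    return t_star, lam, rho_star

rng = random.Random(2006)                          # arbitrary seed
p, F = 0.3, 0.01                                   # hypothetical settings
n_cases, n_controls = (150, 50), (50, 150)         # a_k and b_k, m = 2
null_tables = [draw_table(p, F, (0.1, 0.1, 0.1), n_cases, n_controls, rng)
               for _ in range(50)]                 # c = 50 null loci
candidate = draw_table(p, F, (0.1, 0.15, 0.2), n_cases, n_controls, rng)
t_star, lam, rho_star = rgc_statistic(candidate, null_tables)
print(t_star, lam, rho_star)
```

Repeating the replicate many times and comparing `t_star` with the upper 5% point of $\chi^2_2$ (about 5.99) would give an empirical rejection rate of the kind reported in the tables; the sketch stops at one replicate to stay short.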
Table 2 reports the type I error rates and empirical power of the three trend tests and the two 2-df $\chi^2$ tests after GC corrections when there is no population stratification. Of the two 2-df tests, power is reported only for $T_2^*$. When there was no population stratification, the GC-adjusted type I error rates for the three trend tests and the directly GC-adjusted $\chi^2$ test $T_2^{**}$ were slightly greater than those of the corresponding unadjusted tests, because of the variation that the GC method adds by estimating the variance inflation factor from $c$ null loci. For the RGC statistic $T_2^*$, the type I error rates were α<0.05 because of the estimation of the null correlation. For the empirical power comparison, when the genetic model is unknown, a test statistic is highly efficiency robust if it has high minimum power across the genetic models, that is, if it has high power when the model is misspecified. From table 2, the RGC test $T_2^*$ was efficiency robust, relative to each of the three trend tests optimal for a specific genetic model. Across the three genetic models, $Z^*(0)$ or $Z^*(2)$ had power <20% when the dominant or recessive model was true, respectively. The trend test optimal for the additive model, $Z^*(1)$, was the most efficiency robust among the three trend tests. However, $T_2^*$ had greater minimum power than $Z^*(1)$. When there was population stratification (tables 3 and 4), type I error rates for all unadjusted tests were inflated. When GC corrections were applied, the type I error for the three trend tests and RGC was near the nominal 0.05 level. Use of direct GC adjustment of the 2-df $\chi^2$ statistic (eq. [3]), however, failed to fully adjust for population stratification.
The pattern of power performance among the three GC trend tests and the RGC was similar to that shown in table 2 when there is no population stratification.

Table 2. Type I Error and Empirical Power Performance of Three GC Trend Tests With No Population Stratification

| p | Model | $Z^*(2)$ | $Z^*(1)$ | $Z^*(0)$ | $T_2^*$ | $T_2^{**}$ |
|---|---|---|---|---|---|---|
| .1 | Null, with GC | .063 | .062 | .052 | .043 | .041 |
| .1 | Null, without GC | .051 | .050 | .040 | .028 | .028 |
| .1 | DOM (f1=f2=0.18) | .795 | .772 | .079 | .681 | |
| .1 | ADD (f1=0.175, f2=0.25) | .790 | .800 | .161 | .702 | |
| .1 | REC (f1=0.1, f2=0.552) | .176 | .424 | .795 | .678 | |
| .5 | Null, with GC | .063 | .062 | .062 | .056 | .068 |
| .5 | Null, without GC | .051 | .051 | .049 | .037 | .037 |
| .5 | DOM (f1=f2=0.187) | .811 | .622 | .158 | .703 | |
| .5 | ADD (f1=0.15, f2=0.2) | .651 | .774 | .572 | .701 | |
| .5 | REC (f1=0.1, f2=0.175) | .193 | .677 | .818 | .715 | |

Note: Type I error and empirical power performance are shown for the three GC trend tests $Z^*(2)$, $Z^*(1)$, and $Z^*(0)$ under the dominant (DOM), additive (ADD), and recessive (REC) models and for the RGC $T_2^*$ and the 2-df $\chi^2$ test with direct GC adjustment, $T_2^{**}$, by use of two subpopulations of sizes a1=200, a2=0 for cases and b1=0, b2=200 for controls, with no population stratification (F=0), and two-sided α=0.05 with 10,000 replications for power and 100,000 for type I error. In all models, the baseline penetrance f0=0.1.
Table 3. Type I Error and Empirical Power Performance of Three GC Trend Tests With Population Stratification

| F | p | Model | $Z^*(2)$ | $Z^*(1)$ | $Z^*(0)$ | $T_2^*$ | $T_2^{**}$ |
|---|---|---|---|---|---|---|---|
| .005 | .1 | Null, with GC | .063 | .061 | .031 | .041 | .081 |
| .005 | .1 | Null, without GC | .254 | .260 | .072 | .192 | .192 |
| .005 | .1 | DOM (f1=f2=0.264) | .807 | .769 | .087 | .683 | |
| .005 | .1 | ADD (f1=0.254, f2=0.408) | .803 | .798 | .239 | .721 | |
| .005 | .1 | REC (f1=0.1, f2=0.717) | .153 | .282 | .790 | .631 | |
| .005 | .2 | Null, with GC | .062 | .061 | .054 | .053 | .091 |
| .005 | .2 | Null, without GC | .240 | .258 | .125 | .201 | .201 |
| .005 | .2 | DOM (f1=f2=0.232) | .823 | .739 | .156 | .703 | |
| .005 | .2 | ADD (f1=0.213, f2=0.326) | .790 | .778 | .464 | .733 | |
| .005 | .2 | REC (f1=0.1, f2=0.366) | .160 | .353 | .811 | .685 | |
| .005 | .5 | Null, with GC | .062 | .061 | .061 | .055 | .097 |
| .005 | .5 | Null, without GC | .199 | .260 | .200 | .206 | .206 |
| .005 | .5 | DOM (f1=f2=0.254) | .795 | .481 | .140 | .654 | |
| .005 | .5 | ADD (f1=0.215, f2=0.330) | .789 | .799 | .673 | .786 | |
| .005 | .5 | REC (f1=0.1, f2=0.225) | .198 | .569 | .800 | .673 | |
| .05 | .2 | Null, with GC | .052 | .043 | .045 | .045 | .141 |
| .05 | .2 | Null, without GC | .657 | .674 | .458 | .628 | .628 |
| .05 | .2 | DOM (f1=f2=0.65) | .795 | .689 | .131 | .623 | |
| .05 | .2 | ADD (f1=0.545, f2=0.99) | .734 | .717 | .505 | .677 | |
| .05 | .2 | REC (f1=0.1, f2=0.98) | .168 | .281 | .779 | .654 | |
| .05 | .5 | Null, with GC | .052 | .048 | .053 | .047 | .142 |
| .05 | .5 | Null, without GC | .613 | .679 | .613 | .637 | .637 |
| .05 | .5 | DOM (f1=f2=0.715) | .804 | .449 | .130 | .633 | |
| .05 | .5 | ADD (f1=0.52, f2=0.94) | .741 | .796 | .776 | .777 | |
| .05 | .5 | REC (f1=0.1, f2=0.539) | .149 | .443 | .735 | .558 | |

Note: Type I error and empirical power performance of the three GC trend tests $Z^*(2)$, $Z^*(1)$, and $Z^*(0)$ under the dominant (DOM), additive (ADD), and recessive (REC) models and for the RGC $T_2^*$ and the 2-df $\chi^2$ test with direct GC adjustment, $T_2^{**}$, by use of two subpopulations of sizes a1=200, a2=0 for cases and b1=0, b2=200 for controls, with population stratification, and two-sided α=0.05 with the same replications as in table 2. In all models, the baseline penetrance f0=0.1.
Table 4. Type I Error and Empirical Power Performance of Three GC Trend Tests With Larger Population Sizes and Population Stratification

| F | p | Model | $Z^*(2)$ | $Z^*(1)$ | $Z^*(0)$ | $T_2^*$ | $T_2^{**}$ |
|---|---|---|---|---|---|---|---|
| 0 | .1 | Null, with GC | .062 | .063 | .058 | .053 | .059 |
| 0 | .1 | Null, without GC | .050 | .049 | .052 | .038 | .038 |
| 0 | .1 | DOM (f1=f2=0.132) | .793 | .768 | .101 | .680 | |
| 0 | .1 | ADD (f1=0.13, f2=0.16) | .788 | .797 | .213 | .710 | |
| 0 | .1 | REC (f1=0.1, f2=0.246) | .124 | .310 | .788 | .672 | |
| .005 | .1 | Null, with GC | .061 | .061 | .060 | .055 | .096 |
| .005 | .1 | Null, without GC | .284 | .293 | .103 | .233 | .233 |
| .005 | .1 | DOM (f1=f2=0.166) | .813 | .774 | .146 | .696 | |
| .005 | .1 | ADD (f1=0.161, f2=0.222) | .788 | .781 | .404 | .731 | |
| .005 | .1 | REC (f1=0.1, f2=0.301) | .105 | .202 | .819 | .697 | |
| .05 | .1 | Null, with GC | .044 | .038 | .060 | .055 | .141 |
| .05 | .1 | Null, without GC | .691 | .696 | .336 | .652 | .652 |
| .05 | .1 | DOM (f1=f2=0.374) | .800 | .767 | .278 | .686 | |
| .05 | .1 | ADD (f1=0.355, f2=0.61) | .777 | .770 | .607 | .747 | |
| .05 | .1 | REC (f1=0.1, f2=0.630) | .115 | .189 | .784 | .680 | |

Note: Type I error and empirical power performance of the three GC trend tests $Z^*(2)$, $Z^*(1)$, and $Z^*(0)$ under the dominant (DOM), additive (ADD), and recessive (REC) models and for the RGC $T_2^*$ and the 2-df $\chi^2$ test with direct GC adjustment, $T_2^{**}$, by use of two subpopulations of sizes a1=750, a2=250 for cases and b1=250, b2=750 for controls, with population stratification, and two-sided α=0.05 with the same replications as in table 2. In all models, the baseline penetrance f0=0.1.

For genetic case-control association studies when the genetic model is unknown and there is no population stratification, the $\chi^2$ test with 2 df testing the general association is highly efficient, relative to the optimal trend tests (Zheng et al., in press). Moreover, this genotype-based 2-df $\chi^2$ test has been applied more often in genetic association studies than the trend tests have. When there is population stratification, the type I error rates of either the trend tests or the 2-df $\chi^2$ test may be inflated because of the inflation of the variances of the test statistics. GC is a useful method for adjusting the variance of the three trend tests to ensure the desired type I error rate.
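However, GC cannot be directly applied to the 2-df $\chi^2$ test statistic, which has a complicated variance-covariance matrix. After expressing the 2-df score test from the logistic regression model as a function of two trend tests, we apply the GC approach to the 2-df $\chi^2$ test by adjusting the variance of each trend test. Simulation results show that this 2-df $\chi^2$ test is efficiency robust across the recessive, additive, and dominant models, compared with the three GC trend tests.

The research of J.L.G. was partially supported by grant SES-0317956 from the National Science Foundation.

Appendix A. The log-likelihood can be written as

$$\log L(\alpha,\beta_1,\beta_2)=r\alpha+\beta_1r_2+\beta_2(r_1+r_2)-n_0\log[1+\exp(\alpha)]-n_1\log[1+\exp(\alpha+\beta_2)]-n_2\log[1+\exp(\alpha+\beta_1+\beta_2)].$$

From $\partial\log L/\partial\alpha|_{H_0}=0$, $\hat\alpha=\log(r/s)$ is the maximum-likelihood estimate of $\alpha$. The score functions are given by

$$\frac{\partial\log L}{\partial\beta_1}=r_2-\frac{n_2\exp(\alpha+\beta_1+\beta_2)}{1+\exp(\alpha+\beta_1+\beta_2)}$$

and

$$\frac{\partial\log L}{\partial\beta_2}=(r_1+r_2)-\frac{n_1\exp(\alpha+\beta_2)}{1+\exp(\alpha+\beta_2)}-\frac{n_2\exp(\alpha+\beta_1+\beta_2)}{1+\exp(\alpha+\beta_1+\beta_2)}.$$

Hence, $U_1$ and $U_2$ are obtained by substituting $\beta_1=\beta_2=0$ and $\exp(\hat\alpha)=r/s$. Let $I(\alpha,\beta_1,\beta_2)$ be the observed Fisher information matrix evaluated under $H_0$ and $\alpha=\hat\alpha$. Then, under $H_0$ and $\alpha=\hat\alpha$,

$$-\frac{\partial^2\log L}{\partial\alpha^2}=n\phi(1-\phi),$$

$$-\frac{\partial^2\log L}{\partial\alpha\,\partial\beta_1}=-\frac{\partial^2\log L}{\partial\beta_1^2}=-\frac{\partial^2\log L}{\partial\beta_1\,\partial\beta_2}=n_2\phi(1-\phi),$$

and

$$-\frac{\partial^2\log L}{\partial\alpha\,\partial\beta_2}=-\frac{\partial^2\log L}{\partial\beta_2^2}=(n_1+n_2)\phi(1-\phi),$$

where $\phi=r/n$. Thus,

$$I^{-1}(\alpha,\beta_1,\beta_2)=\frac{1}{\phi(1-\phi)}\begin{pmatrix}\frac{1}{n_0}&0&-\frac{1}{n_0}\\0&\frac{n_1+n_2}{n_1n_2}&-\frac{1}{n_1}\\-\frac{1}{n_0}&-\frac{1}{n_1}&\frac{n_0+n_1}{n_0n_1}\end{pmatrix},$$

$$\Sigma^{-1}=\frac{1}{\phi(1-\phi)}\begin{pmatrix}\frac{n_1+n_2}{n_1n_2}&-\frac{1}{n_1}\\-\frac{1}{n_1}&\frac{n_0+n_1}{n_0n_1}\end{pmatrix},$$

and

$$\Sigma=\frac{\phi(1-\phi)}{n}\begin{pmatrix}(n_0+n_1)n_2&n_0n_2\\n_0n_2&n_0(n_1+n_2)\end{pmatrix},$$

where $\Sigma$ is a consistent estimate of the covariance matrix of $(U_1,U_2)$. Note that $Z^2(0)=U_1^2/\widehat{\mathrm{Var}}(U_1)$ and $Z^2(2)=U_2^2/\widehat{\mathrm{Var}}(U_2)$, where $\widehat{\mathrm{Var}}(U_1)=\phi(1-\phi)(n_0+n_1)n_2/n$ and $\widehat{\mathrm{Var}}(U_2)=\phi(1-\phi)n_0(n_1+n_2)/n$. Hence, a consistent estimate of the asymptotic null correlation $\rho=\mathrm{Corr}[Z(0),Z(2)]=\mathrm{Corr}(U_1,U_2)$ is

$$\hat\rho=\left\{\frac{n_0n_2}{(n_1+n_2)(n_0+n_1)}\right\}^{1/2}.$$

Thus, $T_2=U^T\Sigma^{-1}U$ can be written as

$$T_2=\frac{1}{\phi(1-\phi)n_0n_1n_2}\left[U_1^2n_0(n_1+n_2)+U_2^2(n_0+n_1)n_2-2U_1U_2n_0n_2\right]=\frac{(n_0+n_1)(n_1+n_2)}{n_1n}\left[Z^2(0)+Z^2(2)-2Z(0)Z(2)\hat\rho\right],$$

where $(n_1n)/[(n_0+n_1)(n_1+n_2)]=1-\hat\rho^2$.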
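The matrix algebra of appendix A can be verified with exact rational arithmetic. The following Python sketch (the function names and the genotype totals are hypothetical) builds $\Sigma$ and $\Sigma^{-1}$ from their closed forms, confirms that their product is the identity, and confirms the identity $1-\hat\rho^2=n_1n/[(n_0+n_1)(n_1+n_2)]$ for one set of counts.

```python
from fractions import Fraction

def appendix_matrices(n0, n1, n2, r):
    """Closed forms of Sigma and Sigma^{-1} from appendix A, with phi = r/n."""
    n = n0 + n1 + n2
    phi = Fraction(r, n)
    v = phi * (1 - phi)
    sigma = [[v * (n0 + n1) * n2 / n, v * n0 * n2 / n],
             [v * n0 * n2 / n, v * n0 * (n1 + n2) / n]]
    sigma_inv = [[Fraction(n1 + n2, n1 * n2) / v, Fraction(-1, n1) / v],
                 [Fraction(-1, n1) / v, Fraction(n0 + n1, n0 * n1) / v]]
    return sigma, sigma_inv

def matmul2(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# Hypothetical genotype totals and number of cases (not from the paper).
n0, n1, n2, r = 100, 170, 130, 200
n = n0 + n1 + n2
sigma, sigma_inv = appendix_matrices(n0, n1, n2, r)
product = matmul2(sigma, sigma_inv)
rho_sq = Fraction(n0 * n2, (n1 + n2) * (n0 + n1))          # rho-hat squared
one_minus_rho_sq = Fraction(n1 * n, (n0 + n1) * (n1 + n2))
print(product, rho_sq, one_minus_rho_sq)
```

Exact fractions are used here only so that the checks are free of rounding error; they play no role in the test statistics themselves.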
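Reference(s)

Agresti A. Categorical data analysis. John Wiley & Sons, New York; 1990.
Bacanu SA, Devlin B, Roeder K. The power of genomic control. Am J Hum Genet. 2000; 66: 1933-1944.
Devlin B, Roeder K. Genomic control for association studies. Biometrics. 1999; 55: 997-1004.
Devlin B, Roeder K, Wasserman L. Genomic control, a new approach to genetic-based association studies. Theor Popul Biol. 2001; 60: 155-166.
Freidlin B, Zheng G, Li Z, Gastwirth JL. Trend tests for case-control studies of genetic markers: power, sample size and robustness. Hum Hered. 2002; 53: 146-152.
Gastwirth JL. On robust procedures. J Am Stat Assoc. 1966; 61: 929-948.
Gastwirth JL. The use of maximin efficiency robust tests in combining contingency tables and survival analysis. J Am Stat Assoc. 1985; 80: 380-384.
Gibson G, Muse SV. A primer of genome science. 2nd ed. Sinauer Associations, Sunderland, MA; 2004.
Graubard BI, Korn EL. Choice of column scores for testing independence in ordered 2×K contingency tables. Biometrics. 1987; 43: 471-476.
Khoury MJ, Yang Q. The future of genetic studies of complex human diseases: an epidemiologic perspective. Epidemiology. 1998; 9: 350-354.
Pritchard JK, Rosenberg NA. Use of unlinked genetic markers to detect population stratification in association studies. Am J Hum Genet. 1999; 65: 220-228.
Pritchard JK, Stephens M, Rosenberg NA, Donnelly P. Association mapping in structured populations. Am J Hum Genet. 2000; 67: 170-181.
Reich DE, Goldstein DB. Detecting association in a case-control study while correcting for population stratification. Genet Epidemiol. 2001; 20: 4-16.
Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science. 1996; 273: 1516-1517.
Sasieni PD. From genotypes to genes: doubling the sample size. Biometrics. 1997; 53: 1253-1261.
Satten GA, Flanders WD, Yang Q. Account for unmeasured population substructure in case-control studies of genetic association using a novel latent-class model. Am J Hum Genet. 2001; 68: 466-477.
Sham P. Statistics in human genetics. Arnold Publishers, London; 1998.
Weir BS. Genetic data analysis II: methods for discrete population genetic data. Sinauer Associations, Sunderland, MA; 1996.
Zheng G, Freidlin B, Gastwirth JL. Comparison of robust tests for genetic association using case-control studies. Institute of Mathematical Statistics, Lecture Notes and Monograph Series (The 2nd special issue in honor of E. L. Lehmann). 2006 (in press).
Zheng G, Freidlin B, Li Z, Gastwirth JL. Choice of scores in trend tests for case-control studies of candidate-gene associations. Biom J. 2003; 45: 335-348.
Zheng G, Freidlin B, Li Z, Gastwirth JL. Genomic control for association studies under various genetic models. Biometrics. 2005; 61: 186-192.
Zhu X, Zhang SL, Zhao H, Cooper RS. Association mapping, using a mixture model for complex traits. Genet Epidemiol. 2002; 23: 181-196.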