Letter Open access Peer reviewed

The Use of Inferred Haplotypes in Downstream Analyses

2007; Elsevier BV; Volume: 80; Issue: 3; Language: English

10.1086/512201

ISSN

1537-6605

Authors

D.Y. Lin, Bevan E. Huang

Topic(s)

Genetic and phenotypic traits in livestock

Abstract

To the Editor: In the March 2006 issue of the Journal, Marchini et al. [1] (Marchini J, Cutler D, Patterson N, Stephens M, Eskin E, Halperin E, Lin S, Qin ZS, Munro HM, Abecasis GR, et al. A comparison of phasing algorithms for trios and unrelated individuals. Am J Hum Genet. 2006;78:437-450) provided a comprehensive description of phasing algorithms for inference of individual haplotypes from unphased genotype data. They stated that an unresolved question is “whether and, if so, how best to use inferred haplotypes in downstream analyses” [1] (p. 448). The question is important because knowledge of individual haplotypes is rarely an end in itself. We offer our perspective on this issue, particularly in the context of (case-control) association studies.

Phase ambiguity is a kind of missing data, and use of inferred haplotypes in downstream analyses is a form of imputation. The voluminous statistical literature on missing data casts light on the potential pitfalls of imputation. In the words of Dempster and Rubin [2] (p. 8; Dempster AP, Rubin DB. Introduction. In: Madow WG, Olkin I, Rubin DB, eds. Incomplete data in sample surveys, volume 2: theory and bibliography. New York: Academic Press; 1983:3-10):

“The idea of imputation is both seductive and dangerous. It is seductive because it can lull the user into the pleasurable state of believing that the data are complete after all, and it is dangerous because it lumps together situations where the problem is sufficiently minor that it can be legitimately handled in this way and situations where standard estimators applied to the real and imputed data have substantial bias.”

As pointed out by Marchini et al. [1], all the phasing algorithms assume Hardy-Weinberg equilibrium (HWE). Even when the general population is in HWE, the case sample and the pooled case-control sample may not be [3] (Epstein MP, Satten GA. Inference on haplotype effects in case-control studies using unphased genotype data. Am J Hum Genet. 2003;73:1316-1329). Thus, the phasing algorithms may produce biased estimation of haplotype distributions with case-control data. The influence of departures from HWE on estimation accuracy depends on the directionality of the disequilibrium [4] (Fallin D, Schork NJ. Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data. Am J Hum Genet. 2000;67:947-959). The phasing algorithms do not acknowledge the selective-sampling feature of the case-control design and thus may produce biased results. Also, the phasing algorithms do not take into account phenotype, which is potentially informative about phase.

The common practice of assigning the most likely diplotype (i.e., the pair of haplotypes with the highest posterior probability) to each individual is intrinsically biased because the most likely diplotype is not necessarily the true diplotype. Consider the simple situation of two SNPs, with the minor and major alleles coded as 1 and 0, respectively, at each SNP site. The genotype is defined as the number of minor alleles at each of the two SNP sites. Haplotype ambiguity arises if and only if an individual is doubly heterozygous, that is, has the 11 genotype.
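The two-SNP case can be checked by brute-force enumeration. The following Python sketch lists every unordered pair of the four possible haplotypes, groups the pairs by unphased genotype, and confirms that only the doubly heterozygous genotype is compatible with more than one diplotype. It then computes the posterior probability of each compatible diplotype under HWE. The haplotype frequencies and the function names are hypothetical, chosen only for illustration; they are not taken from the letter.

```python
from collections import defaultdict
from itertools import combinations_with_replacement

# Two SNPs; at each site 1 = minor allele, 0 = major allele.
HAPLOTYPES = ["00", "01", "10", "11"]


def genotype(diplotype):
    """Unphased genotype: number of minor alleles at each SNP site."""
    h1, h2 = diplotype
    return tuple(int(a) + int(b) for a, b in zip(h1, h2))


def diplotypes_by_genotype():
    """Group all unordered haplotype pairs by the genotype they produce."""
    groups = defaultdict(list)
    for dip in combinations_with_replacement(HAPLOTYPES, 2):
        groups[genotype(dip)].append(dip)
    return dict(groups)


def posterior_under_hwe(diplotypes, freq):
    """Posterior probability of each compatible diplotype, assuming HWE."""
    weights = {}
    for h1, h2 in diplotypes:
        w = freq[h1] * freq[h2]
        if h1 != h2:
            w *= 2  # two orderings of a heterozygous haplotype pair
        weights[(h1, h2)] = w
    total = sum(weights.values())
    return {dip: w / total for dip, w in weights.items()}


groups = diplotypes_by_genotype()
ambiguous = {g: d for g, d in groups.items() if len(d) > 1}
print("ambiguous genotypes:", ambiguous)

# Hypothetical haplotype frequencies (illustration only, not from the letter).
freq = {"00": 0.30, "01": 0.25, "10": 0.25, "11": 0.20}
post = posterior_under_hwe(ambiguous[(1, 1)], freq)
for dip, p in post.items():
    print(dip, round(p, 3))
```

With these illustrative frequencies the two compatible diplotypes are almost equally likely, so assigning the more likely one to every doubly heterozygous individual would misassign nearly half of them.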
Both the (10,01) and (00,11) diplotypes produce the 11 genotype. There is obviously a problem if all doubly heterozygous individuals are assigned the more likely (i.e., the more common) of the two diplotypes, especially when the frequency of the less common diplotype is similar to (although lower than) that of the more common diplotype.

When causal haplotypes exist, the phasing algorithms may incorrectly assign causal haplotypes to individuals without causal haplotypes or may reconstruct causal haplotypes as noncausal haplotypes. Consequently, treatment of inferred haplotypes as true haplotypes in downstream association analyses tends to attenuate the estimated haplotype effects and to reduce the power for detecting causal variants. Incorrect haplotype assignments may also induce spurious association for noncausal haplotypes and thus increase false-positive results.

For illustration, we consider the diplotype distribution from a hypothetical case-control study shown in the top part of table 1. With diplotype D as the reference, the estimated odds ratios (ORs) for diplotypes A, B, and C are 3, 1, and 1, respectively. Assume that, for both cases and controls, 20% of the individuals who truly have diplotype A are incorrectly assigned diplotype B, and another 20% are incorrectly assigned diplotype D, yielding the misclassified distribution shown in the bottom part of table 1. Then, the estimated OR for diplotype A is reduced from 3 to 2.3; for diplotype B, it is increased from 1 to 1.2; and, for diplotype C, it is reduced from 1 to 0.8. This example demonstrates that treatment of inferred haplotypes as true haplotypes may bias the estimated effects of causal haplotypes downward and may also bias the estimated effects of noncausal haplotypes away from the null value in either direction.
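The arithmetic behind this example can be reproduced in a few lines. The Python sketch below starts from the true diplotype counts in table 1, moves 20% of the diplotype A individuals to B and another 20% to D in both cases and controls, and recomputes the ORs with diplotype D as the reference. The function and variable names are illustrative choices, not part of the letter.

```python
# True diplotype counts from table 1.
TRUE_CASES = {"A": 500, "B": 100, "C": 200, "D": 200}
TRUE_CONTROLS = {"A": 250, "B": 150, "C": 300, "D": 300}

# Fractions of true diplotype A individuals reassigned to other diplotypes.
MOVES = {"B": 0.20, "D": 0.20}


def odds_ratios(cases, controls, reference="D"):
    """OR of each diplotype relative to the reference diplotype."""
    ref_odds = cases[reference] / controls[reference]
    return {
        d: (cases[d] / controls[d]) / ref_odds
        for d in cases
        if d != reference
    }


def misclassify(counts, source, moves):
    """Move the given fractions of `source` individuals to other diplotypes."""
    out = dict(counts)
    for target, fraction in moves.items():
        moved = counts[source] * fraction
        out[source] -= moved
        out[target] += moved
    return out


inferred_cases = misclassify(TRUE_CASES, "A", MOVES)
inferred_controls = misclassify(TRUE_CONTROLS, "A", MOVES)

true_or = odds_ratios(TRUE_CASES, TRUE_CONTROLS)
inferred_or = odds_ratios(inferred_cases, inferred_controls)

for d in "ABC":
    print(f"{d}: true OR = {true_or[d]:.1f}, inferred OR = {inferred_or[d]:.1f}")
```

Rounded to one decimal place, the sketch gives ORs of 3.0, 1.0, and 1.0 for the true counts and 2.3, 1.2, and 0.8 after misclassification, matching the values in table 1.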
The distortions may be more profound if the misclassification rates differ between cases and controls.

Table 1. Effects of Incorrectly Assigned Haplotypes on Risk Estimates

Type of Haplotype                Diplotype
and Measure                A      B      C      D
True haplotypes:
  No. of cases           500    100    200    200
  No. of controls        250    150    300    300
  OR                     3.0    1.0    1.0      …
Inferred haplotypes:
  No. of cases           300    200    200    300
  No. of controls        150    200    300    350
  OR                     2.3    1.2    0.8      …

Several simulation studies [5] (Morris AP, Whittaker JC, Balding DJ. Little loss of information due to unknown phase for fine-scale linkage-disequilibrium mapping with single-nucleotide-polymorphism genotype data. Am J Hum Genet. 2004;74:945-953), [6] (Kraft P, Cox DG, Paynter RA, Hunter D, De Vivo I. Accounting for haplotype uncertainty in matched association studies: a comparison of simple and flexible techniques. Genet Epidemiol. 2005;28:261-272), [7] (Cordell HJ. Estimation and testing of genotype and haplotype effects in case-control studies: comparison of weighted regression and multiple imputation procedures. Genet Epidemiol. 2006;30:259-275), [8] (French B, Lumley T, Monks SA, Rice KM, Hindorff LA, Reiner AP, Psaty BM. Simple estimates of haplotype relative risks in case-control data. Genet Epidemiol. 2006;30:485-494) showed that imputation can yield substantial bias of estimated genetic effects, poor coverage of confidence intervals, and significant inflation of type I error, especially when the effects are large and the phase uncertainty is high. A recent article by French et al. [8] reported the bias of the estimated log ORs in the range of −0.49 to 0.22, an actual type I error of 18% at the 5% nominal significance level, and coverage of 0.75.

Our second simulation study mimicked the two-locus model Mul3 of Cordell [7]. We assumed that haplotypes 01 and 10 have ORs of 1.2 and 1.4 in reference to haplotypes 00 and 11, with additive mode of inheritance, and we tested whether locus 2 has an effect, while allowing an effect at locus 1. On the basis of 10,000 simulated data sets of 1,000 cases and 1,000 controls with 10% randomly missing genotypes, we obtained power values of 65%, 40%, and 17% at the nominal significance levels of 5%, 1%, and 0.1%, respectively, for the maximum-likelihood method, compared with 41%, 20%, and 6% power for the imputation method.

As pointed out by a referee, the phasing algorithms reviewed by Marchini et al. [1] are often used to phase large regions, so it would be interesting to assess the performance of the imputation method in testing for haplotype-disease association on a small set of SNPs that is phased within a larger genomic context. To this end, we generated 100 SNPs according to the allele frequencies and pairwise linkage disequilibrium (LD) coefficients of the first 100 SNPs on chromosome 18 of the CEU sample in the HapMap genomewide data, and we performed haplotype analysis on SNPs 60–64.
The most common haplotypes of the five SNPs are 00000, 00001, 00010, 00100, 00101, 01101, 10000, 10001, 10010, 10100, and 10101, with frequencies of 4.6%, 8.8%, 11.0%, 7.4%, 7.2%, 7.0%, 6.6%, 6.8%, 8.6%, 7.4%, and 8.4%, respectively. We assumed that the disease risk was influenced by haplotype 00000 only, with an OR of 3 under the additive mode of inheritance. We set the overall disease prevalence to ∼5% and selected 300 cases and 300 controls. We assessed the haplotype-disease association for those 5 SNPs, which were phased together with the other 95 SNPs by the PHASE algorithm. It was not computationally feasible to phase 600 subjects altogether for the 100 SNPs. Thus, we randomly divided the 600 subjects into six groups of 50 cases and 50 controls. (We found that phasing cases and controls together provided much better control of type I error than did phasing cases and controls separately.) We simulated 1,000 data sets with 2% randomly missing SNP values. We found that, at the nominal significance level of 1%, the imputation method had 60% power to detect the causal haplotype 00000 and had type I error of 5%, 3%, 4%, and 7% for null haplotypes 00001, 00010, 00100, and 10000, respectively, whereas the maximum-likelihood method had 72% power to detect the causal haplotype and had type I error close to the nominal level for all null haplotypes. The maximum-likelihood estimates had little bias, whereas the imputation method produced bias of −0.33, 0.27, 0.21, 0.26, and 0.30 for the log ORs of haplotypes 00000, 00001, 00010, 00100, and 10000, respectively. In the above study, the LD among the five SNPs was not particularly strong (table 2). In a related study, we considered SNPs 95–99, which had very high LD (table 3). The most common haplotypes of SNPs 95–99 are 00000, 00001, 01000, 01001, 01100, 01111, 10000, and 10001, with frequencies of 39.7%, 20.8%, 2%, 1.3%, 1.8%, 13.8%, 12.9%, and 5.4%, respectively. 
We assumed that 10001 is the causal haplotype with an OR of 2.5 under the additive mode of inheritance. The rest of the simulation setup was the same as in the previous simulation study. At the nominal significance level of 1%, the imputation method had 83% power to detect the causal haplotype and type I error of 2% and 4% for null haplotypes 00001 and 10000, respectively; it produced bias of −0.15, 0.12, and 0.14 for the log ORs of haplotypes 10001, 00001, and 10000, respectively. On the other hand, the maximum-likelihood method had 92% power to detect the causal haplotype and provided accurate control of type I error and unbiased estimates of haplotype effects.

Table 2. Standardized LD Coefficients (D′) for SNPs 60–64 on Chromosome 18 of the HapMap CEU Sample

            D′ for SNP
SNP      61     62     63     64
60      1.0    .86    .28    .68
61        …    .86    1.0    .84
62        …      …    .55    .73
63        …      …      …    .51

Table 3. Standardized LD Coefficients (D′) for SNPs 95–99 on Chromosome 18 of the HapMap CEU Sample

            D′ for SNP
SNP      96     97     98     99
95      1.0    1.0    1.0    .96
96        …    .83    .95    .94
97        …      …    .95    .77
98        …      …      …    .94

Our studies obviously do not encompass all possible scenarios. Thus, the results do not imply that imputation is always bad, but rather that it can be considerably less powerful than maximum likelihood while providing biased estimates of genetic effects and poor control of type I error in practical situations. The problems tend to be more severe when there is greater uncertainty in reconstructed haplotypes.

Our simulation studies were focused on single imputation, which is the most common practice. Some alternative procedures have been proposed, including multiple imputation, expectation substitution, and weighted logistic regression [6-8]. These procedures are not theoretically valid either (for many of the reasons mentioned above) and may perform poorly. In particular, the versions of multiple imputation that have been proposed are improper because they fail to account for phenotype and case-control sampling. Proper multiple imputation would provide a good approximation to maximum likelihood. In short, no method can be more powerful than maximum likelihood while providing the same control of type I error, although some methods may approximate maximum likelihood well under certain circumstances. We recommend that maximum likelihood be generally adopted for analyses of haplotype-disease associations.

A major appeal of imputation is that standard statistical software can be used to perform the desired association analyses, once individual haplotypes are inferred by a phasing algorithm. The extent to which maximum likelihood can be used in association analyses depends critically on the availability of specialized software. Several groups have developed computer programs for maximum-likelihood methods. We recently posted a user-friendly software interface called “HAPSTAT.” This software provides maximum-likelihood procedures for estimating and testing haplotype effects and haplotype-environment interactions under a wide variety of disease models.

Reference(s)