Article Open access Peer reviewed

Genome Scanning by Composite Likelihood

2006; Elsevier BV; Volume: 80; Issue: 1 Language: English

10.1086/510401

ISSN

1537-6605

Authors

Newton E. Morton, Nikolas Maniatis, Weihua Zhang, Sarah Ennis, Andrew Collins

Topic(s)

Genetic Mapping and Diversity in Plants and Animals

Abstract

Ambitious programs have recently been advocated or launched to create genomewide databases for meta-analysis of association between DNA markers and phenotypes of medical and/or social concern. A necessary but not sufficient condition for success in association mapping is that the data give accurate estimates of both genomic location and its standard error, which are provided for multifactorial phenotypes by composite likelihood. That class includes the Malecot model, which we here apply with an illustrative example. This preliminary analysis leads to five inferences: permutation of cases and controls provides a test of association free of autocorrelation; two hypotheses give similar estimates, but one is consistently more accurate; estimation of the false-discovery rate is extended to causal genes in a small proportion of regions; the minimal data for successful meta-analysis are inferred; and power is robust for all genomic factors except minor-allele frequency. An extension to meta-analysis is proposed. Other approaches to genome scanning and meta-analysis should, if possible, be similarly extended so that their operating characteristics can be compared.

Like other sciences, genetic epidemiology is both limited and driven by the techniques at its command. For nearly a century, gene localization was dominated by linkage and cytogenetics, with little opportunity to map a gene through associated markers. Finally, short physical maps of regions identified by linkage provided a basis for localization of rare major genes in haplotypes [1: Kerem B, Rommens JM, Buchanan JA, Markiewicz D, Cox TK, Chakravarti A, Buchwald M, Tsui L-C. Identification of the cystic fibrosis gene: genetic analysis. Science. 1989;245:1073-1080]. The Malecot model was useful for this purpose [2: Collins A, Morton NE. Mapping a disease locus by allelic association. Proc Natl Acad Sci USA. 1998;95:1741-1745] [3: Morton NE, Zhang W, Taillon-Miller P, Ennis S, Kwok P-Y, Collins A. The optimal measure of allelic association. Proc Natl Acad Sci USA. 2001;98:5217-5221], with subsequent extension to oligogenes and diplotypes [4: Maniatis N, Collins A, Xu C-F, McCarthy LC, Hewett DR, Tapper W, Ennis S, Ke K, Morton NE. The first linkage disequilibrium maps: delineation of hot and cold blocks by diplotype analysis. Proc Natl Acad Sci USA. 2002;99:2228-2233] [5: Maniatis N, Morton NE, Gibson J, Xu C-F, Hosking LK, Collins A. The optimal measure of linkage disequilibrium reduces error in association mapping of affection status. Hum Mol Genet. 2005;14:145-153]. Maturity, if not completion, of the Human Genome Project accelerated this development by providing a physical map, which led, in turn, to genetic maps in linkage disequilibrium units (LDUs) [4]. Various efforts, including the HapMap Project, undertook to provide evidence on marker diversity, with the goal of using that information to localize disease genes and to investigate other aspects of the diversity revealed by genetic polymorphism [6: International HapMap Consortium. The International HapMap Project. Nature. 2003;426:789-796] [7: Morton NE. Fifty years of genetic epidemiology, with special reference to Japan. J Hum Genet. 2006;51:269-277] [8: Kaiser J. NIH goes after whole genome in search of disease genes. Science. 2006;311:933]. After 3 years of LDU development, map construction is now rapid and accurate. Theory and practice for allelic association in small regions appear stable, but further progress will come from experience with scans of much longer tracts. From their upper limit, they have been called “genome scans,” a misnomer, since a large tract, chromosome, or set of chromosomes delimited without regard to association is analyzed in the same way: by division into contiguous, nonoverlapping regions.
Whatever the terminology or method, such a scan is only stage 1 in a multistage design, since the later stages are concerned with regions selected for evidence of association. The number of markers in a dense scan is extremely large compared with the number of causal sites likely to be detected even in a large sample, whether or not supported by functional tests. This is a challenge to the false-discovery rate (FDR), originally introduced for localization of major loci by linkage [9: Morton NE. Sequential tests for the detection of linkage. Am J Hum Genet. 1955;7:277-318]. Unfortunately, it cannot be conventionally calculated when the probability of the null hypothesis approaches 1 [10: Storey JD, Tibshirani R. Statistical significance for genome-wide studies. Proc Natl Acad Sci USA. 2003;100:9440-9445] [11: Dalmasso C, Broet P, Moreau T. A simple procedure for estimating the false discovery rate. Bioinformatics. 2004;21:660-668]. In that situation, alternative corrections have been developed to determine the real significance corresponding to nominal significance in a large number of tests. Like their predecessors in small regions, genome scans use composite likelihood, which adds together individual component log likelihoods, each of which corresponds to a marginal or conditional event [12: Lindsay BG. Composite likelihood methods. Contemp Math. 1988;80:221-239]. Each component is a function of location on a map in LDUs or, less efficiently, on a correlated linkage or physical map [13: Maniatis N, Collins A, Gibson J, Zhang W, Tapper W, Morton NE. Positional cloning by linkage disequilibrium. Am J Hum Genet. 2004;74:846-855]. This use of composite likelihood combines three different statistical problems.
First, most statistical hypotheses in genetics are composite, in the sense that the hypothesis is “composed” of a group of simple hypotheses that may specify gene frequency, effect, or location, which results in a treble infinity of possible values. Composite likelihood can estimate location conditional on other variables or concurrently. Second, the number of markers varies even among regions of the same length, with corresponding variation in the estimate of composite likelihood. Third, markers in proximity are not independent but autocorrelated to an extent that is partially predictable from their LDU location but less accurately from their physical location. The effect of this autocorrelation is small (although perhaps not negligible) at low resolution but increases to an unknown extent with marker density. Until this limitation is removed, the genetic epidemiologist must choose between uncertain reliability at high density and loss of power by heavy selection of markers (commonly called “tag SNPs” because most of them are SNPs). We shall now show how these difficulties can be resolved, both in single studies and in meta-analyses of unbiased reports, recently advocated or launched without identification of the metrics that provide reliable combination of evidence [8] [14: Ioannidis JPA, Gwinn M, Little J, Higgins JPT, Bernstein JI, Boffetta P, Bondy M, Bray MS, Brenchley PE, Buffler PA, et al. A road map for efficient and reliable human genome epidemiology. Nat Genet. 2006;38:3-5].

All n/2 pairs of codominant diallelic diplotypes under random mating can be reduced from a 3×3 table to a 2×2 count of haplotype frequencies $\begin{vmatrix} a & b \\ c & d \end{vmatrix}$ that, by interchange of rows and/or columns, satisfy $ad - bc \ge 0$ and $b \le c$ [4]. The optimal measure of allelic association is
$$\hat{\rho} = \frac{ad - bc}{(a+b)(b+d)},$$
with information
$$K_\rho = \frac{n(a+b)(b+d)}{(a+c)(c+d)}$$
under the null hypothesis that $\rho = 0$, which is tested by $\chi^2_1 = \hat{\rho}^2 K_\rho$ [3]. The Malecot model predicts $\rho$ as $(1-L)Me^{-\sum \varepsilon_h d_h} + L$, where $d_h$ is the distance (in kilobases [kb]) between adjacent markers $h$ and $h+1$. This is an enhancement of the less reliable prediction from the physical map that approximates distance by $\varepsilon \sum d_h$ [2]. The composite likelihood over all markers in a region is $lk = e^{-\Lambda/2}$, where $\Lambda = \sum K_\rho (\hat{\rho} - \rho)^2$ [2]. Given a physical map, the parameters $M$, $L$, and $\varepsilon$ are estimated from pairs of diallelic markers. The estimate of $\varepsilon$ in the physical map is small and highly variable. The LDU map estimates $M$, $L$, and the $\varepsilon_h$ that represent the block-and-step pattern of LD. Optionally, the exponent can be taken as $\varepsilon \sum \varepsilon_h d_h$, where $\varepsilon$ is estimated for a candidate region, but $\varepsilon$ is usually too close to 1 to warrant refinement.
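The pairwise quantities defined above can be illustrated with a minimal Python sketch. It is not the LDMAP implementation: the function names (`orient`, `association`, `malecot_rho`) and the example counts are hypothetical, and the exponent is assumed to be the sum of $\varepsilon_h d_h$ over the intervals separating the two markers.

```python
import math

def orient(a, b, c, d):
    """Interchange rows and/or columns so that ad - bc >= 0 and b <= c."""
    if a * d - b * c < 0:
        a, b, c, d = b, a, d, c   # swap columns: flips the sign of ad - bc
    if b > c:
        a, b, c, d = d, c, b, a   # swap rows and columns: sign kept, b and c exchanged
    return a, b, c, d

def association(a, b, c, d):
    """Return (rho_hat, K_rho, chi2_1) for a 2x2 table of haplotype counts."""
    a, b, c, d = orient(a, b, c, d)
    n = a + b + c + d
    if (a + b) * (b + d) == 0 or (a + c) * (c + d) == 0:
        raise ValueError("uninformative table: a marginal product is zero")
    rho = (a * d - b * c) / ((a + b) * (b + d))
    k = n * (a + b) * (b + d) / ((a + c) * (c + d))
    return rho, k, rho * rho * k   # chi2_1 = rho_hat^2 * K_rho

def malecot_rho(M, L, eps, d_kb):
    """Malecot prediction (1 - L) * M * exp(-sum(eps_h * d_h)) + L."""
    exponent = sum(e * dist for e, dist in zip(eps, d_kb))
    return (1.0 - L) * M * math.exp(-exponent) + L

# Hypothetical counts, for illustration only.
rho, k, chi2 = association(40, 10, 20, 30)
```

With these made-up counts the product $\hat{\rho}^2 K_\rho$ equals the familiar Pearson statistic $n(ad-bc)^2 / [(a+b)(c+d)(a+c)(b+d)]$, which is a convenient check on the two formulas.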
These calculations are performed by the LDMAP program [5], the current version of which exploits grid technology to accommodate the high SNP density in HapMap.

Quantitative traits can be studied by regression, giving scope under some sampling schemes to valid covariance analysis and inference of gene-environment interaction. Quantitative traits are especially frequent for anthropometrics, behavior, and response to drugs or other pharmaceutical agents but can occur in any branch of human genetics. Commonly, however, phenotypes are dichotomous; the two classes are “affected” and “normal” if sampled at random or “cases” and “controls” if sampled selectively. Controls may be random, matched with cases for age and other variables, or hypernormal. The last is more powerful but difficult to analyze by regression, because the sample-enrichment factor is poorly specified and environmental covariates may be distorted. For example, hypernormal controls selected to be older than cases do not imply that affection decreases with age. In the simplest case, affection is determined by a rare dominant or recessive gene and is studied in families so that the disease-associated allele can be recognized [2]. Alternatively, inheritance may be more complex, and then alleles are assumed to be additive. This is not restrictive, because Maclaurin’s calculus theorem predicts that causal markers of small effect tend to approach additivity, which is enhanced for predictive but noncausal SNPs by recombination.
This simplifying assumption reduces diplotype data to a 2×2 table of allele counts by affection status [4], deferring more elaborate analysis until causal SNPs are identified. In this way, n haplotypes are scored from n/2 diplotypes. The association parameter is $z = \gamma\rho$, where $\gamma = Qw/f$ is the attributable risk under additivity, $Q$ is the frequency of an allele predisposing to affection, $f$ is the frequency of affected diplotypes, $w$ is the penetrance in GG homozygotes with corresponding penetrance $w/2$ in Gg heterozygotes, and $\rho$ is the association probability when $\gamma = 1$ (table 1) [2] [5]. The score for $z$ is
$$U_z = \frac{\partial \ln lk}{\partial z} = \frac{(ad - bc)\,n}{(a+c)(c+d)},$$
with information
$$K_z = \frac{n(a+b)(b+d)}{(a+c)(c+d)},$$
which is formally the same as that for $K_\rho$ in the previous section, but the counts are different. Then,
$$\hat{z} = \frac{U_z}{K_z} = \frac{ad - bc}{(a+b)(b+d)}$$
and $\chi^2_1 = \hat{z}^2 K_z$. Since SNPs with $(a+c)(c+d) = 0$ or $(a+b)(b+d) = 0$ are omitted because their information $K_z$ is 0 or indeterminate, $K_z$ must satisfy
$$0 < K_z = \frac{\chi^2_1}{\hat{z}^2} \le n.$$
At present, we do not consider an enrichment factor, which combines with affection status to create complications not met in construction of LD maps or mapping of rare genes with high penetrance [2]. Omission of an enrichment factor does not violate constraints on observed counts or K, but it may not be optimal.

Table 1. Frequencies Observed and Expected in Random Samples of n Haplotypes (ad-bc≥0, b≤c) (a)

Status and Sample       Allele G               Allele g                Total
Affected or normal:
  Count                 a                      b                       a+b
  Probability           f[z+(1-z)R]            f(1-z)(1-R)             f
Normal or affected:
  Count                 c                      d                       c+d
  Probability           (R-f)z+R(1-f)(1-z)     (1-R)[z+(1-f)(1-z)]     1-f
Total:
  Count                 a+c                    b+d                     n
  Probability           R                      1-R                     1

(a) See the work of Maniatis et al. [5].

However the enrichment issue is resolved, association mapping accepts kb and LD maps and estimates parameters $M$ and $L$ appropriate to the relationship of affection status with diallelic markers and, in addition, the estimated location $\hat{S}$ and SE of a causal marker, regardless of whether that marker was included in the data set. For this analysis, $\varepsilon \sum d_h$ in the physical map or $\varepsilon \sum \varepsilon_h d_h$ in the LDU map is replaced by $\varepsilon \Delta (S_h - S)$. Since $lk$ is a composite likelihood, the parameters but not the error variance can be estimated by minimizing $\Lambda$ as for true likelihood. Experience has shown that $\varepsilon$ cannot be estimated reliably during association mapping of $S$, whereas the accuracy with which $L$ can be estimated increases with the length of the LDU region [5]. We therefore examined three analyses that do not attempt to modify $\varepsilon$ or a standard LD map derived from pairs of markers (fig. 1).
The simplest one compares A and C, where A assumes no causal marker in the region, with $L$ predicted and $M = 0$; the alternative C takes the same value of $L$ but estimates $M$ and $S$. The second analysis compares A with D, which estimates $L$, $M$, and $S$. The third analysis takes B under the null hypothesis that $M = 0$ with $L$ estimated and compares it with D. This has the least power, because B confounds H1 with $L$, and it will not be considered here.

Under the null hypothesis H0 that there is no causal marker in a region $j$ with $m_j$ markers, composite likelihood analyses provide an estimate of $x_{ij} = \Lambda_{Aij} - \Lambda_{Cij}$ or $\Lambda_{Aij} - \Lambda_{Dij}$ for the $i$th replicate in the $j$th region ($i = 1, \ldots, r_j$), designated as $AC_{ij}$ or $AD_{ij}$, respectively. Each replicate is obtained by shuffling, so that each individual is assigned randomly and without replacement to an observed phenotype, whether case/control, affected/normal, or quantitative. If SNPs were independent, the error variance in a region would be estimated for each replicate by $V_C = \Lambda_C/(m-2)$ and $V_D$ by $\Lambda_D/(m-3)$, and this would provide an estimate of $\chi^2_2$ or $\chi^2_3$ from which the significance level $P$ would be derived. Since autocorrelation violates the independence assumption to an extent that increases with SNP resolution, we must use different estimates of these quantities to recover $P_{ij}$ for replicates beginning with the fractional rank $r_{ij}$ within the $j$th region, which equals the fractional rank of $1/x_{ij}$, reversing the $x_{ij}$ order. If nearly every $P_{ij}$ is unique within region $j$, the mean approaches 1/2, and the variance approaches 1/12 as $r_j \to \infty$, corresponding to the uniform distribution that is a special case of a beta distribution. Then, a conventional estimate [15: Tukey JW. The future of data analysis. Ann Math Stat. 1962;33:1-67] of $P_{ij}$ is
$$\frac{r_{ij} - 1/3}{r_j + 1/3},$$
from which $\chi^2_{ij}$ (with 2 or 3 df) is calculated by g01fcc of the Numerical Algorithms Group (NAG), and the variance $V_{ij}$ is estimated as
$$\frac{x_{ij}}{\chi^2_{ij}},$$
which is not monotonic on $x_{ij}$, $\chi^2_{ij}$, or $P_{ij}$.
There is no loss under H0, which relies exclusively on $P_{ij}$. The situation is different for a single sample in a region, $j$, where H1 may be true, which gives only a single value $x_j$ for each analysis in that region. We estimate its error variance from replicates under H0 by fitting the regression model $\ln V_{ij} = b_1 + b_2 \ln x_{ij}$ in a subregion centered as far as possible on the H1 value of $x_j$, with up to 20 values of $x_{ij}$ on each side. $V_j$ under H1 is estimated as $\exp(b_1 + b_2 \ln x_j)$, which gives $\chi^2_j = x_j/V_j$ for $\chi^2_2$ or $\chi^2_3$, with corresponding $P_j$ from the g01ecc NAG subroutine. By estimation of $V_j$ from replicates under H0, any effect of autocorrelation is avoided.

As the first step toward genome scanning, a program called CHROMSCAN was constructed to perform the operations described above for tracts, chromosomes, or a whole genome (A.C., unpublished data). The region can be any length, with a 10 LDU default. The minimal number of SNPs is also specified, with a default of 30. Any number of replicates within a region can be chosen for evaluation of real data that contribute a single estimate to each region. Significance tests for association are based entirely on $P$, but the error variance $V_j$ is used to compute an SE for $S_j$. Starting at converged values and with use of exact derivatives, simultaneous estimates of $S_j$ and nuisance parameters that may include $M_j$ and/or $L_j$ provide an information matrix that is inverted to give the nominal variance $K^{-1}_{SS}$. Then, the information $K_j$ about $S_j$ is
$$K_j = \frac{1}{K^{-1}_{SS}\, V_j/\mathrm{df}},$$
where df = 2 or 3 according to the model, and the corresponding SE is [2]
$$\mathrm{SE}(S_j) = \sqrt{1/K_j}.$$
Its reliability under autocorrelation may be confirmed when the true value of $S$ is known simply by fitting the nuisance parameters $M$ and/or $L$ at that value. Then the difference in $\chi^2$ evaluated at $V_j$ is $\chi^2_{1j}$, and
$$K_j \approx \frac{\chi^2_{1j}}{(\hat{S}_j - S_j)^2}.$$
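The rank-based steps described above can be sketched in Python under simplifying assumptions. This is not CHROMSCAN: the function names are hypothetical, only the 2-df case (analysis A versus C) is covered so that the chi-square deviate has the closed form $\chi^2 = -2\ln P$ instead of a call to the NAG routines, every $x_{ij}$ is assumed positive, and the window simply takes up to 20 replicates on each side of the observed value. The replicate statistics are simulated here as if SNPs were independent; a real analysis would obtain them by shuffling phenotypes and refitting the composite likelihood.

```python
import math
import random

def tukey_p(x_reps):
    """P_ij = (r_ij - 1/3) / (r_j + 1/3); rank 1 goes to the largest x_ij."""
    r = len(x_reps)
    order = sorted(range(r), key=lambda i: -x_reps[i])
    p = [0.0] * r
    for rank, i in enumerate(order, start=1):
        p[i] = (rank - 1.0 / 3.0) / (r + 1.0 / 3.0)
    return p

def replicate_variances(x_reps):
    """V_ij = x_ij / chi2_ij, with chi2_ij = -2 ln P_ij (2 df only)."""
    return [x / (-2.0 * math.log(p)) for x, p in zip(x_reps, tukey_p(x_reps))]

def observed_p(x_obs, x_reps, window=20):
    """Fit ln V = b1 + b2 ln x near x_obs; return (V_j, chi2_j, P_j)."""
    pairs = sorted(zip(x_reps, replicate_variances(x_reps)))
    below = [pr for pr in pairs if pr[0] <= x_obs][-window:]
    above = [pr for pr in pairs if pr[0] > x_obs][:window]
    pts = [(math.log(x), math.log(v)) for x, v in below + above]
    # Ordinary least squares for the two regression coefficients.
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    b2 = sum((x - mx) * (y - my) for x, y in pts) / sxx
    b1 = my - b2 * mx
    v_obs = math.exp(b1 + b2 * math.log(x_obs))
    chi2_obs = x_obs / v_obs
    return v_obs, chi2_obs, math.exp(-chi2_obs / 2.0)  # 2-df upper tail

# Simulated null replicates: a 2-df chi-square is exponential with mean 2.
random.seed(1)
x_null = [random.expovariate(0.5) for _ in range(1000)]
v_j, chi2_j, p_j = observed_p(6.0, x_null)
```

Because the simulated replicates here are independent by construction, the estimated variance should sit near 1; with autocorrelated SNPs the same procedure would let the replicates, not the independence assumption, set the scale.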
The concept of an FDR was introduced to map major genes by polymorphisms at single loci, which led to the conclusion that a LOD of at least 3 is required to assure an FDR <0.05 [9]. Forty years later, an analogy with Brownian motion extended this argument to a genome scan by linkage, with the conclusion that only a modest increase of the critical LOD (to 3.3) gives the same FDR with a major locus, increasing for complex inheritance to 3.6 for affected sib pairs and to 3.8 for affected second cousins [16: Lander E, Kruglyak L. Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results. Nat Genet. 1995;11:241-247]. Association mapping presents a different problem, since relatives of probands are usually excluded to generate large samples without requiring pedigree information. This sampling strategy is efficient for either a cohort or cases and controls and is especially useful for diseases of late onset, where DNA from one or both parents is often not available. Here we extend the FDR theory to association mapping and apply it to composite likelihood for unrelated individuals. In principle, our approach is applicable to any sampling strategy, even if standard FDR methods fail because the distribution of the nominal significance under the null hypothesis H0 is not uniform and/or the prior probability of H0 may approach 1. We assume that simulation as shown in the “Control of the Type I Error” section creates a uniform distribution under H0 and that the prior probability of H0 approaches 1. The material to be analyzed may be a tract, a chromosome, a set of chromosomes, or a genome. However defined, it is divided into nonoverlapping regions.
When values of $\chi^2_1$ are pooled under the assumption of a Poisson distribution of extreme significance levels, the critical significance level $P = \alpha$ satisfies $\alpha N = 0.05$, where $N$, in the absence of compelling evidence of a causal locus in a defined subset of regions, is the number of regions in a genome scan ($\alpha \to 0$, $N \to \infty$). The FDR corresponding to $\alpha$ is
$$\frac{\alpha(1-\phi)}{\alpha(1-\phi) + \Omega\phi} = \frac{\alpha}{\alpha + \Omega\phi/(1-\phi)},$$
where $\Omega$ is the power to reach significance under H1 and $\phi$ is the probability that H1 is true whether significant or not. In genome scans, $\phi$ and $\Omega\phi$ are of order $1/N$ or less, which makes this estimate impractical. Fortunately, we have independent estimates of the numerator and denominator, as illustrated in the “Results” section. Their ratio is a good estimate of the FDR if a uniform distribution holds under H0 and the number of H1 regions is large enough to determine $\Omega$.

Stage 1 in association mapping is currently defined as “a scan of one or more nonoverlapping regions, including but not limited to regions suggested by linkage, function, or cytogenetics.” The evidence presented for stage 1 sets a pattern and limit for subsequent meta-analysis, so its presentation is critical. Location would ideally be expressed in LDUs, which, in the few published trials, have been shown to be more efficient for association mapping than physical units (kb) [5] [13]. However, the latter have approached stability with revisions that are increasingly minor, infrequent, and generally accepted.
LD maps, on the contrary, are more recent, with rapidly increasing density and potential changes in types and allele frequencies of markers. Although public (Genetic Epidemiology Group), these maps are not part of the HapMap database [17: International HapMap Consortium. A haplotype map of the human genome. Nature. 2005;437:1299-1320] and are not exploited by all investigators. Therefore, until consensus is reached, it is necessary to convert evidence from LD maps into more stable but reportedly less efficient locations and SEs (in kb) that the CHROMSCAN program also provides. In addition to extensive detail about reference maps (kb and LDU), number of regions studied, the basis for their choice, and the markers and methods of analysis, we assume a file with one record for each region within a given chromosome and the critical variables, where location is expressed as v.u for v kb and u additional bp (table 2). The primary and derived locations in both maps must be clearly indicated. The SE (expected to be consistently smaller than the estimate from a kb map) is the product of its estimate from the covariance matrix for composite likelihood and kb/LDU, where kb denotes the physical interval corresponding to an appropriate interval in LDU. Tentatively, we assume that a 95% confidence interval of ±1.96 SE is appropriate as a compromise between an interval so small that LDU=0 and one so large that blocks and steps outside the confidence interval are included. This structure allows regions to be sorted by any variable. For publication, only the smallest P values for composite likelihood would usually be given, together with larger values that relate to other claims. However, selective reporting of a region biases meta-analysis.

Table 2. Essential Data for Meta-Analysis of a Region

Variable                        Value             Kb and LDU
Source                          ID                …
Chromosome                      1…22, X, or Y     …
No. of SNPs                     m_j               …
First location in region        …                 *
Last location in region         …                 *
Composite likelihood (a):       …                 *
  P                             …                 *
  Estimated location (S)        …                 *
  SE                            …                 *
  Information (K)               …                 *
Most significant marker:        *                 *
  Nominal χ²_1                  *                 …

Note: An asterisk (*) indicates that the data are often incomplete.
(a) Linkage data should be expressed in this form, but SEs are less reliable and often cover multiple LDU regions; usually S, SE, and K are not estimated.

Given $s$ independent and unbiased samples for a given region, and with the assumption that the same kb map was used for all samples, the simplest meta-analysis gives
$$\bar{S} = \frac{\sum_{k=1}^{s} S_k K_k}{\sum K_k},$$
with nominal variance $1/\sum K_k$. Variation in allele frequencies, ascertainment, or other factors may inflate
$$\chi^2_{s-1} = \sum K_k S_k^2 - \frac{\left(\sum K_k S_k\right)^2}{\sum K_k}.$$
Interpretation becomes difficult if $s$ is small, unless the sources of variation are controlled. If $s$ were very large, the SE would be
$$\mathrm{SE} = \sqrt{\left(\frac{1}{\sum K_k}\right)\left(\frac{\chi^2_{s-1}}{s-1}\right)}.$$
In general, $\chi^2_{s-1}/(s-1)$ may be taken as 1 if nonsignificant and, otherwise, as $t$ with $s-1$ df in computing a confidence interval. For example, a nominal 95% confidence limit is $\bar{S} \pm t\sqrt{1/\sum K}$, where $t$ is 12.706 for df=1 and 1.960 for df=∞. However, the block-and-step structure of LD makes any interval in kb approximate. Inclusion of LD maps in the HapMap database would increase its use for association mapping.

To apply this logic to a genome scan of $N$ regions, suppose that a subset with $P < \alpha$ is selected for further study in an independent sample of greater size and with denser markers. For each of the selected regions, pooling the information from the two samples with point estimates $S_1$ and $S_2$ and information $K_1$ and $K_2$, respectively, and with the assumption of homogeneity, the FDR can be calculated as for a single stage 1 sample from $N$ regions.
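The information-weighted pooling of location estimates described above can be written as a short Python sketch. The function name `pool_locations` and the two-sample inputs are hypothetical, and the heterogeneity statistic is computed in the algebraically equivalent form $\sum K_k (S_k - \bar{S})^2$, an implementation choice that avoids subtracting two large numbers when locations are expressed in kb.

```python
import math

def pool_locations(S, K):
    """Pool per-sample locations S_k using their information K_k as weights.

    Returns (S_bar, chi2, nominal_se, inflated_se); inflated_se is None
    when there is a single sample and hence no heterogeneity df.
    """
    sum_k = sum(K)
    s_bar = sum(k * s for s, k in zip(S, K)) / sum_k
    # Same quantity as sum(K*S^2) - (sum(K*S))^2 / sum(K), on s - 1 df.
    chi2 = sum(k * (s - s_bar) ** 2 for s, k in zip(S, K))
    nominal_se = math.sqrt(1.0 / sum_k)
    df = len(S) - 1
    inflated_se = math.sqrt((1.0 / sum_k) * (chi2 / df)) if df > 0 else None
    return s_bar, chi2, nominal_se, inflated_se

# Hypothetical locations (kb) and information from two samples.
s_bar, chi2, nominal_se, inflated_se = pool_locations([100.0, 110.0], [4.0, 1.0])
```

In this made-up case the pooled location is pulled toward the better-informed sample, and the heterogeneity-inflated SE is much larger than the nominal one, which is the situation in which the text advises caution when the number of samples is small.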
Within stage 1, the principal uncertainties involve regional definition and number of replicates (r). We tentatively assume that r=1,000 is adequate to estimate P for H1. For greater precision with extreme H1 probabilities, the corresponding regions may be simulated under H0 with at least 10/P replicates.

The CHROMSCAN program is being extended and refined as it is applied to several association studies, the publication of which is necessarily delayed by consortium agreements. We have therefore taken as an illustrative example the U.K. case/control sample of the International Type 2 Diabetes (T2D) 1q Consortium [18: Zeggini E, Raynor W, Morris AP, Hattersley AT, Walker M, Hitman GA, Deloukas P, Cardon LR, McCarthy MI. An evaluation of HapMap sample size and tagging SNP performance in large-scale empirical and simulated datasets. Nat Genet. 2005;37:1320-1322], with affection status replaced by a random SNP for each region, deleting all information about the actual disease and reducing by 1 the number of predictive SNPs in that region. Here, we analyze the 39 regions in chromosome 1q21-24 over a 21,347-kb interval on the National Center for Biotechnology Information build 35/University of California–Santa Cruz March 2004 sequence, which was used to create a genomewide LDU map from the northwestern European (CEU) sample in International HapMap phase 2, with the exclusion of SNPs with minor-allele frequencies (MAFs) 10. The data consist of 447 controls and 443 cases. Whereas this CEU sample of 60 is smaller than the U.K. control sample, the number of SNPs is vastly greater and poss

Reference(s)