Bayesian Graphical Models for Genomewide Association Studies
2006; Elsevier BV; Volume: 79; Issue: 1; Language: English
DOI: 10.1086/505313
ISSN: 1537-6605
Authors: Claudio Verzilli, Nigel Stallard, John C. Whittaker
Topic(s): Genetic Mapping and Diversity in Plants and Animals
Abstract: As the extent of human genetic variation becomes more fully characterized, the research community is faced with the challenging task of using this information to dissect the heritable components of complex traits. Genomewide association studies offer great promise in this respect, but their analysis poses formidable difficulties. In this article, we describe a computationally efficient approach to mining genotype-phenotype associations that scales to the size of the data sets currently being collected in such studies. We use discrete graphical models as a data-mining tool, searching for single- or multilocus patterns of association around a causative site. The approach is fully Bayesian, allowing us to incorporate prior knowledge on the spatial dependencies around each marker due to linkage disequilibrium, which reduces considerably the number of possible graphical structures. A Markov chain–Monte Carlo scheme is developed that yields samples from the posterior distribution of graphs conditional on the data from which probabilistic statements about the strength of any genotype-phenotype association can be made. Using data simulated under scenarios that vary in marker density, genotype relative risk of a causative allele, and mode of inheritance, we show that the proposed approach has better localization properties and leads to lower false-positive rates than do single-locus analyses. Finally, we present an application of our method to a quasi-synthetic data set in which data from the CYP2D6 region are embedded within simulated data on 100K single-nucleotide polymorphisms. Analysis is quick (<5 min), and we are able to localize the causative site to a very short interval.
Recent advances in high-throughput technologies and the decrease in genotyping costs have made genomewide association (GWA) studies a feasible tool in the search for the genetic determinants of complex diseases. 
Several such studies are under way, and more are being planned; some results from studies involving large panels of markers have already been published [1: Cheung VG, Spielman RS, Ewens KG, Weber TM, Morley M, Burdick JT. Mapping determinants of human gene expression by regional and genomewide association. Nature. 2005;437:1365-1369] [2: Farrall M, Morris AP. Gearing up for genomewide gene-association studies. Hum Mol Genet. 2005;14:R157-R162] [3: Maraganore DM, de Andrade M, Lesnick TG, Strain KJ, Farrer MJ, Rocca WA, Pant PVK, Frazer KA, Cox DR, Ballinger DG. High-resolution whole-genome association study of Parkinson disease. Am J Hum Genet. 2005;77:685-693] [4: Lawrence RW, Evans DM, Cardon LR. Prospects and pitfalls in whole genome association studies. Philos Trans R Soc Lond B Biol Sci. 2005;360:1589-1595] [5: Thomas DC, Haile RW, Duggan D. Recent developments in genomewide association scans: a workshop summary and review. Am J Hum Genet. 2005;77:337-345]. The rationale behind this approach is as follows. Genetic variants that affect a trait of interest arose sometime in the past on a unique stretch of the genome, which was then transmitted to subsequent generations together with flanking variants. Subjects who show variability in a trait of interest, say cases and controls in a study of a dichotomous trait, may be genotyped at a large number of marker positions, mostly SNPs. We expect the genotypes of the two groups to be different around the causative mutation(s) because cases share the ancestral disease-bearing segment(s). We are thus able to map indirectly the (unobserved) disease-susceptibility variants without making any assumptions about their genomic location. 
The approach provides a more precise localization of disease-susceptibility loci (on the order of a few thousand base pairs) than do linkage studies (millions of base pairs), because chromosomes from unrelated individuals have undergone more recombination events than can be found in any realistically sized pedigree [1] [6: Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science. 1996;273:1516-1517]. Since routine complete resequencing of the genome is still not economically feasible, the success of this strategy relies, to a large extent, on exploitation of the linkage disequilibrium (LD) structure in human populations. Several coordinated efforts have therefore been initiated to characterize the patterns of human variation along the genome [7: Gibbs RA, Belmont JW, Hardenbol P, Willis TD, Yu F, Yang H, Ch’ang LY, et al. The International HapMap Project. Nature. 2003;426:789-796] [8: Gibbs RA, Belmont JW, Boudreau A, Leal SM, Hardenbol P, Pasternak S, Wheeler DA, et al. A haplotype map of the human genome. Nature. 2005;437:1299-1320] [9: International Map Working Group. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature. 2001;409:928-933] [10: Hinds DA, Stuve LL, Nilsen GB, Halperin E, Eskin E, Ballinger DG, Frazer KA, Cox DR. Whole-genome patterns of common DNA variation in three human populations. Science. 2005;307:1072-1079]. The goal is to refine our understanding of the extent of genetic variation, both within and across populations, and to inform the selection of markers that capture most of the genetic variability with minimal loss of information to detect disease-susceptibility loci in GWA studies. In practice, the complexity of human evolution introduces many uncertainties [11: Wang WYS, Barratt BJ, Clayton DG, Todd JA. Genomewide association studies: theoretical and practical considerations. Nat Rev Genet. 2005;6:109-118]. For instance, the correlation between adjacent markers, or LD, along the genome is characterized by considerable spatial heterogeneity, as reported by several recent studies [10] [12: Dawson E, Abecasis GR, Bumpstead S, Chen Y, Hunt S, Beare DM, Pabial J, et al. A first-generation linkage disequilibrium map of human chromosome 22. Nature. 2002;418:544-548] [13: McVean GA, Myers SR, Hunt S, Deloukas P, Bentley DR, Donnelly P. The fine-scale structure of recombination rate variation in the human genome. Science. 2004;304:581-584] [14: Evans DM, Cardon LR. A comparison of linkage disequilibrium patterns and estimated population recombination rates across multiple populations. Am J Hum Genet. 2005;76:681-687] [15: Mueller JC, Lohmussaar E, Magi R, Remm M, Bettecken T, Lichtner P, Biskup S, Illig T, Pfeufer A, Luedemann J, Schreiber S, Pramstaller P, Pichler I, Romeo G, Gaddi A, Testa A, Wichmann HE, Metspalu A, Meitinger T. Linkage disequilibrium patterns and tagSNP transferability among European populations. Am J Hum Genet. 2005;76:387-398]. Regions of tightly linked markers corresponding to haplotype blocks of limited diversity are interspersed with uncorrelated markers, whereas long-range correlations are not uncommon. This has important consequences for our ability to find disease-causing variants. Earlier estimates of the marker density necessary to capture enough genomic variation to be useful in GWA studies appear to have been optimistic, with phase II of the HapMap project now aiming to identify a panel of variants with an average spacing of 1 kb [8]. Thus, a typical GWA study is now expected to contain data on ≥500K assayed SNPs for several thousand individuals. Increasing the marker density is not a panacea, however. If the disease allele has a very low penetrance or is very rare, then the chance of detecting an association even in reasonably sized and well-designed studies is low, independent of the marker density [16: Zondervan KT, Cardon LR. The complex interplay among factors that influence allelic association. Nat Rev Genet. 2004;5:89-100]. It is nevertheless clear that the potential of GWA studies cannot be fully assessed until statistical methods are available that are able to cope with the size and complexity of the data sets currently being collected. It is desirable that these methods be able to account for the network of local dependencies between markers that are due to LD. On a smaller genomic scale, it may be more powerful to consider multi-SNP interaction terms, since these may capture better the pattern of alleles present around the causative mutation (and therefore better discriminate affected individuals and unaffected ones), as opposed to considering each SNP on its own. 
The latter strategy, in fact, though scalable to large data sets, fails, by definition, to fully exploit the LD information around a causative locus, if present. In the literature, there are several approaches for multilocus SNP haplotype analysis that exploit the excess haplotype sharing among cases around a causative locus [17: Morris AP, Whittaker JC, Xu CF, Hosking LK, Balding DJ. Multipoint linkage-disequilibrium mapping narrows location interval and identifies mutation heterogeneity. Proc Natl Acad Sci USA. 2003;100:13442-13446] [18: Morris AP. Direct analysis of unphased SNP genotype data in population-based association studies via Bayesian partition modelling of haplotypes. Genet Epidemiol. 2005;29:91-107] [19: Thomas DC, Stram DO, Conti D, Molitor J, Marjoram P. Bayesian spatial modeling of haplotype associations. Hum Hered. 2003;56:32-40] [20: Durrant C, Zondervan KT, Cardon LR, Hunt S, Deloukas P, Morris AP. Linkage disequilibrium mapping via cladistic analysis of single-nucleotide polymorphism haplotypes. Am J Hum Genet. 2004;75:35-43] [21: Zöllner S, Pritchard JK. Coalescent-based association mapping of complex trait loci. Genetics. 2005;169:1071-1092]. However, because of their computational complexity, these approaches are best suited for candidate-gene studies or studies in small candidate regions and will not scale to the size of the data sets discussed above. The aim of this article is to provide methods that allow for multilocus local dependencies while addressing the scalability problem. Because genotype data collected in GWA studies lack phase information, we focus on case-control studies with unphased genotype data and model the joint distribution of markers and the disease-status indicator as a discrete graphical model. 
The nodes in the graph correspond to the genotype data and the case-control indicator. The structure of dependencies both between markers (due to LD) and between markers and disease status is then learned from the data by use of a fully Bayesian approach. We are thus able to make probabilistic statements about the presence of certain edges or associations, with primary interest in those involving marker nodes and the disease-status indicator. The Bayesian approach has several advantages. For example, we are able to incorporate useful prior knowledge of the domain by restricting, for each marker node, the network of dependencies to nodes within a suitable physical distance. This reduces considerably the space of possible graphs and, in turn, the computational complexity, making the approach feasible in large candidate-gene studies or GWA studies. The performance of the proposed method is evaluated using data simulated under different scenarios that vary in marker density, disease-allele frequency, genotype relative risks (GRRs), and mode of inheritance. The results are compared with single-locus χ² tests for association of each SNP marker with disease, an approach frequently used with large data sets. Considering a single disease variant, we show that our approach leads to a smaller localization error and fewer false-positive results than does a single-locus analysis. Scalability to GWA studies is investigated by applying the approach to a quasi-synthetic data set of 100K simulated markers with embedded real SNP genotype data from an 890-kb region flanking the CYP2D6 gene, which is recessively associated with drug metabolism [22: Hosking LK, Boyd PR, Xu CF, Nissum M, Cantone K, Purvis IJ, Khakhar R, Barnes MR, Liberwirth U, Hagen-Mann K, Ehm MG, Riley JH. Linkage disequilibrium mapping identifies a 390 kb region associated with CYP2D6 poor drug metabolising activity. Pharmacogenomics J. 2002;2:165-175].
In the next section, we describe the use of discrete graphical models for mining genotype-phenotype associations, while introducing the notation used throughout. A brief overview of Bayesian learning of discrete graphs relevant to this work is also given. That section is followed by the results from the simulation studies and the application to the synthetic CYP2D6 data. We end with a discussion of the advantages and disadvantages of the proposed method.
We assume that genotype data are available from a sample of $N_d$ cases and $N_c$ controls at a set of $M$ marker loci, where usually $N_d = N_c$. The binary variable $D_i \in \{0,1\}$ is a disease-status indicator for individual $i$, with observed value $d_i = 0$ for a control and $d_i = 1$ otherwise, $i = 1, \ldots, N = N_d + N_c$. The genotype $G_{im}$ of subject $i$ at locus $m$ takes value 0 if homozygous wild-type, 1 if heterozygous, and 2 if homozygous mutant, $m = 1, \ldots, M$. For large $M$, a convenient and powerful framework for representing the joint distribution of $(G, D)$ over such a complex discrete domain is given by discrete graphical models [23: Dawid A, Lauritzen S. Hyper-Markov laws in the statistical analysis of decomposable graphical models. Ann Stat. 1993;21:1272-1317] [24: Madigan D, York J. Bayesian graphical models for discrete data. Int Stat Rev. 1995;63:215-232] [25: Giudici P, Green P, Tarantola C. Efficient model determination for discrete graphical models, technical report. University of Pavia, Pavia, Italy; 1999]. A discrete graph $\mathcal{G}$ is a mathematical object composed of a set $V$ of vertices and a set $E$ of edges comprising ordered pairs of elements from $V$. In particular, a graph $\mathcal{G} = (V, E)$ is called “undirected” if the edges present have no orientation, that is, if $(a,b) \in E$ implies $(b,a) \in E$. 
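The notation above can be made concrete with a small data-layout sketch. It only illustrates the 0/1/2 genotype coding and one way of storing an undirected edge set; the function names (`encode_genotype`, `make_undirected`, `has_edge`) and the allele and vertex labels are hypothetical and are not taken from the article.

```python
def encode_genotype(allele_1, allele_2, mutant_allele):
    """Code an unphased genotype as the number of mutant alleles:
    0 = homozygous wild-type, 1 = heterozygous, 2 = homozygous mutant."""
    return int(allele_1 == mutant_allele) + int(allele_2 == mutant_allele)


def make_undirected(pairs):
    """Store each edge once as an unordered pair, so that (a, b) in E
    automatically implies (b, a) in E."""
    return {frozenset(pair) for pair in pairs}


def has_edge(edges, a, b):
    """True if vertices a and b are joined, whichever order they are given in."""
    return frozenset((a, b)) in edges


# Hypothetical example: two marker vertices and the disease-status vertex D.
edges = make_undirected([("G1", "G2"), ("G2", "D")])
genotype = encode_genotype("A", "G", mutant_allele="G")  # heterozygous
```

Storing edges as unordered pairs is one design choice among several; a symmetric adjacency matrix would express the same undirectedness condition.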
The vertices in the graph correspond to discrete random variables, and edges in the graph describe the dependencies and conditional independencies that hold for the joint distribution of variables corresponding to its vertices. In practice, the set of dependencies $E$ is seldom known in advance, and the objective is then to learn it from the data exploiting the graphical formalism. In so-called decomposable graphs, the joint distribution over the vertices $V$ can be factorized on lower-dimensional subspaces, thus simplifying considerably the task of evaluating and comparing different dependence structures or models. Thus, if the set $V$ and the number of possible graphical models are large, as is the case here, it is desirable to restrict the class of graphs to decomposable ones only. To ensure this, any graph considered should satisfy the running intersection property and admit a junction tree representation. For $M = 9$ marker loci and the disease-status indicator $D$, an example of a decomposable graph is given in figure 1. Also shown are the two disconnected junction trees corresponding to the graph, or its junction forest. The graph is composed of five cliques (that is, complete subgraphs with all edges present) given by the sets of vertices $C_1, \ldots, C_5$. The running intersection property is satisfied if, given the set of cliques of a graph $\mathcal{C} = \{C_1, \ldots, C_L\}$, there is an ordering of the cliques such that, for each $C_l$, the set of vertices in common with previous cliques, $S_l = C_l \cap (C_1 \cup \cdots \cup C_{l-1})$, is contained in at least one previous clique. This is trivially satisfied for the graph in figure 1, which also shows the separator sets $S_l$, with $S_1$ and $S_5$ empty sets. The edges in the graph and the running intersection property imply the conditional independence between cliques given separators between them. 
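The running intersection property can be checked mechanically for a given clique ordering. The sketch below is a minimal illustration under stated assumptions: cliques are given as Python sets of vertex labels, the function names are hypothetical, and the example cliques are made up for illustration rather than read off figure 1.

```python
def separators(cliques):
    """For an ordered list of cliques, return S_l = C_l ∩ (C_1 ∪ ... ∪ C_{l-1}).
    The first separator is empty by construction."""
    seen = set()
    seps = []
    for clique in cliques:
        seps.append(set(clique) & seen)
        seen |= set(clique)
    return seps


def has_running_intersection(cliques):
    """Check the property for THIS ordering only: every separator S_l must be
    contained in at least one earlier clique. A graph is decomposable if some
    ordering passes; this function does not search over orderings."""
    seps = separators(cliques)
    for l in range(1, len(cliques)):
        if not any(seps[l] <= set(previous) for previous in cliques[:l]):
            return False
    return True


# Hypothetical cliques over six markers and the disease indicator D.
example = [{"G1", "G3", "G4"}, {"G2", "G5", "G6"}, {"G5", "G6", "D"}]
example_separators = separators(example)
example_ok = has_running_intersection(example)
```

An empty separator is a subset of every clique, so disconnected components (a junction forest) pass the check without special handling.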
Then, if $R_l = C_l \setminus S_l$ defines the residue of each clique, the joint probability distribution of the vertices factorizes into
$$f(G,D) = \prod_{l=1}^{L} f(R_l \mid S_l)$$
or, equivalently,
$$f(G,D) = \frac{\prod_{l=1}^{L} f(C_l)}{\prod_{r=1}^{R} f(S_r)}, \tag{1}$$
where $R$ is the number of nonempty separator sets. Thus, the joint distribution associated with a decomposable graph factorizes conveniently into local terms corresponding to marginal densities of cliques and separators [26: Cowell RG, Dawid AP, Lauritzen SL, Spiegelhalter DJ. Probabilistic networks and expert systems. Springer-Verlag, New York; 1999] [27: Borgelt C, Kruse R. Graphical models: methods for data analysis and mining. John Wiley, Chichester, United Kingdom; 2002]. It should be noted that decomposable graphs are limited in the number of dependencies and conditional independencies that they can represent, compared with, say, hierarchical log-linear models [28: Dellaportas P, Forster JJ. Markov chain Monte Carlo model determination for hierarchical and graphical log-linear models. Biometrika. 1999;86:615-633]. However, if the number of vertices is very large, as is the case here, the process of probabilistically learning the graph from data is only possible by acceptance of restrictions of this sort. Because all variables are discrete, we adopt a multinomial likelihood for the cell entries of the multiway contingency tables obtained by cross-classifying the genotype and disease-status data $(G,D)$ according to the variables in cliques and separators. Namely, by indicating with $n_{lg}$ and $n_{rg}$ the vectors of cell entries of the contingency tables corresponding to generic clique $C_l$ and separator $S_r$ for graph $g$, with corresponding vectors of cell probabilities $\theta_{lg}$ and $\theta_{rg}$, respectively, from equation (1), the multinomial likelihood is
$$f(G,D \mid \theta, g) = \frac{\prod_{l=1}^{L} \prod_{j} \theta_{jlg}^{\,n_{jlg}}}{\prod_{r=1}^{R} \prod_{k} \theta_{krg}^{\,n_{krg}}}.$$
The subscript $g$ on $\theta$ in the previous expression highlights the dependence of the factorization on the current graph $g$. 
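Equation (1) can be verified numerically on a toy case. The sketch below assumes three hypothetical binary variables A, B, C in which A and C are conditionally independent given B, so the cliques are {A,B} and {B,C} and the single separator is {B}; all probability values are invented for illustration.

```python
from itertools import product

# Hypothetical building blocks of a joint distribution with A independent of C given B.
p_b = {0: 0.6, 1: 0.4}
p_a_given_b = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # indexed [b][a]
p_c_given_b = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}  # indexed [b][c]

# Full joint table over cells (a, b, c).
joint = {
    (a, b, c): p_b[b] * p_a_given_b[b][a] * p_c_given_b[b][c]
    for a, b, c in product((0, 1), repeat=3)
}


def marginal(table, keep):
    """Sum the joint table down to the variables at the index positions in `keep`."""
    out = {}
    for cell, prob in table.items():
        key = tuple(cell[i] for i in keep)
        out[key] = out.get(key, 0.0) + prob
    return out


f_ab = marginal(joint, (0, 1))  # clique C1 = {A, B}
f_bc = marginal(joint, (1, 2))  # clique C2 = {B, C}
f_b = marginal(joint, (1,))     # separator S = {B}

# Largest discrepancy between the joint and (clique marginals) / (separator marginal).
max_error = max(
    abs(prob - f_ab[(a, b)] * f_bc[(b, c)] / f_b[(b,)])
    for (a, b, c), prob in joint.items()
)
```

The discrepancy is zero up to floating-point rounding, which is the sense in which the joint distribution of a decomposable graph reduces to local clique and separator terms.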
Here, we develop a Markov chain–Monte Carlo (MCMC) algorithm to sample over the space of possible discrete graphs while exploiting our prior knowledge of the domain. The edges in the graph model the joint distribution of $G$ and $D$, with links between variables in $G$ reflecting the LD structure and those between $G$ and the case-control indicator $D$ suggesting the presence of a disease-susceptibility locus in the region. Our approach is fully Bayesian and yields a sample of graphical models from their posterior distribution conditional on the data, $f(g \mid G,D)$ [24] [25]. From the posterior sample of graphs, the frequency with which any two vertices are connected by an edge is then an estimate of the posterior probability of association. We therefore exploit the well-known Bayesian model-averaging paradigm, which has been shown to perform better than methods that rely on a single “best” model, in both classification and variable-selection tasks [29: Viallefont V, Raftery AE, Richardson S. Variable selection and Bayesian model averaging in case-control studies. Stat Med. 2001;20:3215-3230] [30: Denison DGT, Holmes CC, Mallick BK, Smith AFM. Bayesian methods for nonlinear classification and regression. John Wiley, Chichester, United Kingdom; 2002] [31: Verzilli CJ, Stallard N, Whittaker JC. Bayesian modelling of multivariate quantitative traits using seemingly unrelated regressions. Genet Epidemiol. 2005;28:313-325]. Note that we marginalize over the cell probabilities in $\theta$, with the aim of comparing the different structures (and, in particular, identifying which SNPs are associated with $D$) rather than making probabilistic statements about the distribution of the vector $\theta$. In the literature on graphical models, this is referred to as “qualitative learning.” Indeed, constructing MCMC schemes that sample over the space of both graphical structures and parameter values in large domains is extremely difficult and infeasible in data-mining applications [32: Murray I, Ghahramani Z. Bayesian learning in undirected graphical models: approximate MCMC algorithms. In: Chickering M, Halpern J. Proceedings of the 20th Annual Conference on Uncertainty in Artificial Intelligence, Banff, Canada. AUAI Press, Arlington, VA; 2004]. The MCMC scheme uses a Metropolis-Hastings (MH) algorithm as used by Madigan and York [24], with proposal distributions tuned to reflect the spatial features of the data at hand induced by the LD structure. Given the current graphical structure $g$, a new structure, $g'$, is proposed and accepted with probability
$$\min\left[1, \frac{f(g' \mid G,D)\, f(g \mid g')}{f(g \mid G,D)\, f(g' \mid g)}\right] = \min\left[1, \frac{f(G,D \mid g')\, f(g')\, f(g \mid g')}{f(G,D \mid g)\, f(g)\, f(g' \mid g)}\right], \tag{2}$$
where $f(g)$ is the prior distribution over structures, $f(\cdot \mid \cdot)$ is the proposal distribution, and $f(G,D \mid g)$ is the marginal likelihood or evidence for graph $g$. For computational efficiency, it is critical to be able to calculate quickly the latter quantity, which is given by the integral
$$f(G,D \mid g) = \int f(G,D \mid \theta, g)\, f(\theta \mid g)\, d\theta.$$
To this end, the Bayesian metric is particularly convenient since, under the assumption of a Dirichlet prior on the vector of parameters $\theta$, the integral above is available analytically (appendix A). 
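The closed-form evidence and the acceptance rule in equation (2) can be sketched as follows. Appendix A is not reproduced here, so the prior is an assumption: a symmetric Dirichlet on each table whose parameters sum to a hypothetical equivalent sample size `ess`, which keeps clique and separator priors mutually consistent. The function names and default arguments are likewise illustrative, not the authors' implementation.

```python
import math
import random


def log_dirichlet_multinomial(counts, ess=1.0):
    """Log of the integral of prod_j theta_j^{n_j} against a symmetric
    Dirichlet prior with total mass `ess` (assumed prior, see lead-in)."""
    alpha = ess / len(counts)
    total = sum(counts)
    out = math.lgamma(ess) - math.lgamma(ess + total)
    for n in counts:
        out += math.lgamma(alpha + n) - math.lgamma(alpha)
    return out


def log_marginal_likelihood(clique_tables, separator_tables, ess=1.0):
    """Evidence of a decomposable graph: clique terms minus separator terms,
    mirroring the ratio form of equation (1)."""
    cliques = sum(log_dirichlet_multinomial(t, ess) for t in clique_tables)
    seps = sum(log_dirichlet_multinomial(t, ess) for t in separator_tables)
    return cliques - seps


def mh_accept(log_ml_new, log_ml_old, log_prior_new=0.0, log_prior_old=0.0,
              log_q_reverse=0.0, log_q_forward=0.0, rng=random):
    """Accept g' with probability min(1, ratio) from equation (2), in log space.
    log_q_forward = log f(g'|g), log_q_reverse = log f(g|g')."""
    log_ratio = (log_ml_new + log_prior_new + log_q_reverse) \
        - (log_ml_old + log_prior_old + log_q_forward)
    return rng.random() < math.exp(min(0.0, log_ratio))
```

Working with `math.lgamma` on the log scale avoids overflow when cell counts reach the thousands, as they would in a case-control sample of the size described above.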
It is also important that any proposed graph is decomposable, because this facilitates considerably the computation of the marginal likelihood corresponding to any new model, as discussed in the previous section. To ensure this, the MCMC scheme allows only moves in the space of graphs admitting a junction tree representation, which are decomposable by construction. Thus, rather than modifying the current graph by deleting or adding a single edge at a time, our moves involve changes to the set of cliques and separators. To reduce the space of possible graphs, any clique contains vertices corresponding to a set of possibly noncontiguous markers within a prespecified maximum physical distance. That is, we restrict the set of markers forming each clique to the set of neighboring but not necessarily adjacent markers, where a neighborhood is defined in terms of physical distance. The rationale is to incorporate prior knowledge on the extent of LD likely to exist around a marker while allowing a degree of flexibility by considering cliques of noncontiguous markers. A clique is then assigned a dichotomous label, $T \in \{0,1\}$, depending on whether edges are present between any of its marker vertices and the disease-status indicator $D$ ($T = 1$ if one or more edges are present and 0 otherwise). An example of a possible graph is given in figure 2, in which, for clarity, we have omitted edges connecting all vertices within cliques. The graph contains three cliques, $C_1 = (G_1, G_3, G_4)$, $C_2 = (G_2, G_5, G_6)$, and $C_3 = (G_5, G_6, D)$, and a separator, $S_1 = (G_5, G_6)$. In this setting, the separator expresses the multilocus association of genotype variables 5 and 6 with disease status and determines the label $T = 1$ currently assigned to $C_2$ (shaded in the figure). Finally, we limit the maximum size of each clique and separator, to mitigate problems of sparsity in the corresponding multiway contingency tables. 
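The distance and size restrictions on cliques can be sketched with a short helper. This is one possible reading, stated as an assumption: a clique is admissible when its markers span at most `max_dist` and it holds at most `max_size` markers. The marker positions, thresholds, and function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def neighbours(positions, m, max_dist):
    """Indices of markers within max_dist of marker m.
    `positions` must be sorted in increasing physical order."""
    lo = bisect_left(positions, positions[m] - max_dist)
    hi = bisect_right(positions, positions[m] + max_dist)
    return [j for j in range(lo, hi) if j != m]


def clique_allowed(positions, clique, max_dist, max_size):
    """A clique may hold noncontiguous markers, provided its physical span
    and its size stay within the prespecified limits."""
    coords = [positions[j] for j in clique]
    return len(clique) <= max_size and max(coords) - min(coords) <= max_dist


# Hypothetical marker map, positions in kb.
positions = [0, 10, 25, 60, 65, 140]
near_marker_2 = neighbours(positions, 2, max_dist=40)
```

Using binary search on the sorted map keeps each neighborhood lookup logarithmic in the number of markers, which matters when the map holds hundreds of thousands of SNPs.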
Again, this is not a very restrictive assumption, considering recent results showing that clusters of densely connected common SNPs appear, in most cases, to be made up of few SNPs (≤10), with minor differences across ethnic groups [10]. Given the current set of cliques and separators, our sampler then iterates randomly among the following three steps:
Merge step: Propose to merge a randomly selected clique with another clique in the graph. The latter is chosen at random from the set of cliques containing neighboring markers of the former. If the size of the proposed clique exceeds the maximum size allowed, the move is rejected.
Split step: Propose to split, at random, a randomly selected clique into two. The move is not attempted if the selected clique is a singleton.
Switch-clique-label step: Propose to change the label $T$ of a randomly selected clique. If the chosen clique contains edges between any marker vertices and $D$ ($T = 1$), these are deleted in the proposed graph. Otherwise ($T = 0$), we select a set of separator markers at random from the vertices of the clique and propose edges between them and $D$. Note, in passing, that the correct retrospective likelihood for case-control ascertainment is used here when evidence is contrasted in favor of or against association, that is, $P(G \mid D)$ versus $P(G)$ for the chosen clique.
It is trivial to check that the resulting graph is decomposable, since all moves involve changes to the set of cliques and separators. Thus, in each case, the marginal likelihood needed to compute the MH ratio in equation (2) is available in closed form. Further details on the prior used over graphical structures, proposal distributions, and expressions for the acceptance probabilities are given in appendix A. 
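The three moves can be sketched as proposal functions. This is a rough illustration only: it assumes that marker cliques form a partition of the marker indices, it leaves out the proposal probabilities and acceptance step (those are in appendix A, which is not reproduced here), and every name, including the `are_neighbours` callback, is hypothetical.

```python
import random


def propose_merge(cliques, are_neighbours, max_size, rng):
    """Merge a random clique with a random neighbouring clique.
    Returns the new clique list, or None when the move is rejected."""
    i = rng.randrange(len(cliques))
    candidates = [j for j in range(len(cliques))
                  if j != i and are_neighbours(cliques[i], cliques[j])]
    if not candidates:
        return None
    j = rng.choice(candidates)
    merged = cliques[i] | cliques[j]
    if len(merged) > max_size:
        return None  # proposed clique exceeds the maximum size allowed
    return [c for k, c in enumerate(cliques) if k not in (i, j)] + [merged]


def propose_split(cliques, rng):
    """Split a random clique into two nonempty parts at random.
    Returns None (move not attempted) when the chosen clique is a singleton."""
    i = rng.randrange(len(cliques))
    members = sorted(cliques[i])
    if len(members) < 2:
        return None
    rng.shuffle(members)
    cut = rng.randrange(1, len(members))
    parts = [frozenset(members[:cut]), frozenset(members[cut:])]
    return [c for k, c in enumerate(cliques) if k != i] + parts


def propose_switch(clique, linked, rng):
    """Switch the clique label T. `linked` is the set of markers joined to D:
    nonempty means T = 1 (edges are deleted), empty means T = 0 (a random
    nonempty subset of the clique is proposed as the separator with D)."""
    if linked:
        return frozenset()
    size = rng.randint(1, len(clique))
    return frozenset(rng.sample(sorted(clique), size))
```

A full sampler would pick one of the three functions at random at each iteration and pass the proposed structure to the MH acceptance rule of equation (2).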
At the core of our approach is, therefore, the joint modeling of the dependence structure between genotype markers (the merge and split steps) and between these and the disease-status indicator (the switch-clique-label step). By allowing for single- and multilocus marker-disease associations within each clique, we are able to filter out false associations better than by using single-locus methods. This is because, with a high-density marker map, we expect single- and multilocus association to be more prevalent around a true disease-susceptibility locus than around a spurious association. Model averaging then captures this self-reinforcing process as, for each marker, the marginal posterior probability of association is calculated by combining the single- and multilocus evidence around that position. Specifically, we use Bayes factors to measure evidence in favor of association versus no association; this is given by the ratio of posterior to prior odds of association and can be interpreted as the amount by which the prior odds get updated by observation of the data [33: O’Hagan A, Forster J. Kendall’s advanced theory of statistics, volume 2B: Bayesian inference. Arnold, London; 2004].
In this section, we present results from simulation studies investigating the performance of the proposed method under various scenarios. In particular, we consider different marker densities, GRRs, minor-allele frequencies (MAFs) of the high-risk variant, and disease models. Data consist of unphased genotype data in a 1-Mb region for ∼1,000