Artigo Acesso aberto Revisado por pares

Online System for Faster Multipoint Linkage Analysis via Parallel Execution on Thousands of Personal Computers

2006; Elsevier BV; Volume: 78; Issue: 6 Linguagem: Inglês

10.1086/504158

ISSN

1537-6605

Autores

Mark Silberstein, Anna Tzemach, Nickolay Dovgolevsky, M. Fishelson, Assaf Schuster, Dan Geiger,

Tópico(s)

Face and Expression Recognition

Resumo

Computation of LOD scores is a valuable tool for mapping disease-susceptibility genes in the study of Mendelian and complex diseases. However, computation of exact multipoint likelihoods of large inbred pedigrees with extensive missing data is often beyond the capabilities of a single computer. We present a distributed system called “SUPERLINK-ONLINE,” for the computation of multipoint LOD scores of large inbred pedigrees. It achieves high performance via the efficient parallelization of the algorithms in SUPERLINK, a state-of-the-art serial program for these tasks, and through the use of the idle cycles of thousands of personal computers. The main algorithmic challenge has been to efficiently split a large task for distributed execution in a highly dynamic, nondedicated running environment. Notably, the system is available online, which allows computationally intensive analyses to be performed with no need for either the installation of software or the maintenance of a complicated distributed environment. As the system was being developed, it was extensively tested by collaborating medical centers worldwide on a variety of real data sets, some of which are presented in this article. Computation of LOD scores is a valuable tool for mapping disease-susceptibility genes in the study of Mendelian and complex diseases. However, computation of exact multipoint likelihoods of large inbred pedigrees with extensive missing data is often beyond the capabilities of a single computer. We present a distributed system called “SUPERLINK-ONLINE,” for the computation of multipoint LOD scores of large inbred pedigrees. It achieves high performance via the efficient parallelization of the algorithms in SUPERLINK, a state-of-the-art serial program for these tasks, and through the use of the idle cycles of thousands of personal computers. The main algorithmic challenge has been to efficiently split a large task for distributed execution in a highly dynamic, nondedicated running environment. Notably, the system is available online, which allows computationally intensive analyses to be performed with no need for either the installation of software or the maintenance of a complicated distributed environment. As the system was being developed, it was extensively tested by collaborating medical centers worldwide on a variety of real data sets, some of which are presented in this article. Computation of LOD is a valuable tool for mapping disease-susceptibility genes in the study of Mendelian and complex diseases. Computation of the LOD score—defined as log10(LHA/LH0), where LH0 is the likelihood under the null hypothesis of no linkage between markers and a disease locus and LHA is the likelihood of linkage—requires efficient methods, especially for large pedigrees and many markers. A linkage analysis is performed by placing a trait locus at various positions along a map of markers, to locate regions that show evidence of linkage and that merit further study. To extract full linkage information from pedigree data, it is desirable to perform multipoint likelihood computations with the joint use of all available relevant data. There are two main approaches for computing multipoint likelihoods: the Elston-Stewart1Elston R Stewart J A general model for the analysis of pedigree data.Hum Hered. 1971; 21: 523-542Crossref PubMed Scopus (1055) Google Scholar and the Lander-Green2Lander E Green P Construction of multilocus genetic maps in humans.Proc Natl Acad Sci. 1987; 84: 2363-2367Crossref PubMed Scopus (1163) Google Scholar algorithms. The complexity of the Elston-Stewart algorithm is linear in the number of individuals but exponential in the number of markers. On the other hand, the complexity of the Lander-Green algorithm increases linearly in the number of markers but exponentially in the number of individuals in the pedigree. A recently proposed approach is to combine and generalize the previous two methods by using the framework of Bayesian networks as the internal representation of linkage analysis problems.3Fishelson M Geiger D Optimizing exact genetic linkage computations.J Comput Biol. 2004; 11: 263-275Crossref PubMed Scopus (71) Google Scholar Using this representation enables the efficient handling of a wide variety of likelihood computations by automatic choice of computation order, according to the problem at hand. The computation of exact multipoint likelihoods of large inbred pedigrees with extensive missing data is often beyond the computational capabilities of a single computer. Two complementary approaches can facilitate more-demanding linkage computations: designing more-efficient algorithms and parallelizing computation to use multiple computers. Both approaches have been pursued over the years. Several algorithmic improvements of exact likelihood computations have been reported.3Fishelson M Geiger D Optimizing exact genetic linkage computations.J Comput Biol. 2004; 11: 263-275Crossref PubMed Scopus (71) Google Scholar, 4Cottingham RW Idury RM Schäffer AA Faster sequential genetic linkage computations.Am J Hum Genet. 1993; 53: 252-263PubMed Google Scholar, 5O’Connell J Weeks D The VITESSE algorithm for rapid exact multilocus linkage analysis via genotype set-recoding and fuzzy inheritance.Nat Genet. 1995; 11: 402-408Crossref PubMed Scopus (416) Google Scholar, 6Schäffer AA Faster linkage analysis computations for pedigrees with loops or unused alleles.Hum Hered. 1996; 46: 226-235Crossref PubMed Scopus (104) Google Scholar, 7Kruglyak L Lander E Faster multipoint linkage analysis using Fourier transform.J Comput Biol. 1998; 5: 1-7Crossref PubMed Scopus (185) Google Scholar, 8Gudbjartsson F Jonasson K Frigge ML Kong A Allegro, a new computer program for multipoint linkage analysis.Nat Genet. 2000; 25: 12-13Crossref PubMed Scopus (650) Google Scholar, 9O’Connell J Rapid multipoint linkage analysis via inheritance vectors in the Elston-Stewart algorithm.Hum Hered. 2001; 51: 226-240Crossref PubMed Scopus (30) Google Scholar, 10Abecasis GR Cherny SS Cookson WO Cardon LR Merlin—rapid analysis of dense genetic maps using sparse gene flow trees.Nat Genet. 2002; 30: 97-101Crossref PubMed Scopus (2693) Google Scholar For example, efficient implementation of the Lander-Green algorithm by the GENEHUNTER program allows multipoint analysis of medium-sized pedigrees with large numbers of markers7Kruglyak L Lander E Faster multipoint linkage analysis using Fourier transform.J Comput Biol. 1998; 5: 1-7Crossref PubMed Scopus (185) Google Scholar, 11Kruglyak L Daly M Reeve-Daly M Lander E Parametric and nonparametric linkage analysis: a unified multipoint approach.Am J Hum Genet. 1996; 58: 1347-1363PubMed Google Scholar; VITESSE v.2 implements optimization for the Elston-Stewart algorithm, extending its computational boundaries by orders of magnitude5O’Connell J Weeks D The VITESSE algorithm for rapid exact multilocus linkage analysis via genotype set-recoding and fuzzy inheritance.Nat Genet. 1995; 11: 402-408Crossref PubMed Scopus (416) Google Scholar, 9O’Connell J Rapid multipoint linkage analysis via inheritance vectors in the Elston-Stewart algorithm.Hum Hered. 2001; 51: 226-240Crossref PubMed Scopus (30) Google Scholar; and SUPERLINK applies enhanced optimization techniques for finding better orders of computation in Bayesian networks, making it possible to perform multipoint analysis of larger inbred families.3Fishelson M Geiger D Optimizing exact genetic linkage computations.J Comput Biol. 2004; 11: 263-275Crossref PubMed Scopus (71) Google Scholar, 12Fishelson M Geiger D Exact genetic linkage computations for general pedigrees.Bioinformatics. 2002; 18: S189-S198Crossref PubMed Scopus (205) Google Scholar, 13Fishelson M Dovgolevsky N Geiger D Maximum likelihood haplotyping for general pedigrees.Hum Hered. 2005; 59: 41-60Crossref PubMed Scopus (56) Google Scholar Several parallel algorithms for linkage analysis have been reported.14Miller P Nadkarni P Gelernter G Carriero N Pakstis A Kidd K Parallelizing genetic linkage analysis: a case study for applying parallel computation in molecular biology.Comput Biomed Res. 1991; 24: 234-248Abstract Full Text PDF PubMed Scopus (18) Google Scholar, 15Dwarkadas S Schäffer A Cottingham R Cox A Keleher P Zwaenepoel W Parallelization of general linkage analysis problems.Hum Hered. 1994; 44: 127-141Crossref PubMed Scopus (38) Google Scholar, 16Matise T Schroeder M Chiarulli D Weeks D Parallel computation of genetic likelihoods using CRI-MAP, PVM, and a network of distributed workstations.Hum Hered. 1995; 45: 103-116Crossref PubMed Scopus (11) Google Scholar, 17Gupta S Schäffer A Cox A Dwarkadas S Zwaenepoel W Integrating parallelization strategies for linkage analysis.Comput Biomed Res. 1995; 28: 116-139Abstract Full Text PDF PubMed Scopus (29) Google Scholar, 18Rai A Lopez-Benitez N Hargis J Poduslo S On the parallelization of Linkmap from the LINKAGE/FASTLINK package.Comput Biomed Res. 2000; 33: 350-364Abstract Full Text PDF PubMed Scopus (6) Google Scholar, 19Kothari K Lopez-Benitez N Poduslo S High-performance implementation and analysis of the Linkmap program.J Biomed Inform. 2001; 34: 406-414Abstract Full Text PDF PubMed Scopus (6) Google Scholar, 20Conant G Plimpton S Old W Wagner A Fain P Pacheco T Heffelfinger G Parallel Genehunter: implementation of a linkage analysis package for distributed-memory architectures.J Parallel Distrib Comput. 2003; 63: 674-682Crossref Scopus (10) Google Scholar, 21Dietter J Spiegel A an Mey D Pflug HJ al Kateb H Hoffmann K Wienker T Strauch K Efficient two-trait-locus linkage analysis through program optimization and parallelization: application to hypercholesterolemia.Eur J Hum Genet. 2004; 12: 542-550Crossref PubMed Scopus (15) Google Scholar Parallel computing was successfully applied to improve the performance of LINKAGE and FASTLINK packages, speeding up the computations with the use of a set of dedicated processors.14Miller P Nadkarni P Gelernter G Carriero N Pakstis A Kidd K Parallelizing genetic linkage analysis: a case study for applying parallel computation in molecular biology.Comput Biomed Res. 1991; 24: 234-248Abstract Full Text PDF PubMed Scopus (18) Google Scholar, 15Dwarkadas S Schäffer A Cottingham R Cox A Keleher P Zwaenepoel W Parallelization of general linkage analysis problems.Hum Hered. 1994; 44: 127-141Crossref PubMed Scopus (38) Google Scholar, 17Gupta S Schäffer A Cox A Dwarkadas S Zwaenepoel W Integrating parallelization strategies for linkage analysis.Comput Biomed Res. 1995; 28: 116-139Abstract Full Text PDF PubMed Scopus (29) Google Scholar, 18Rai A Lopez-Benitez N Hargis J Poduslo S On the parallelization of Linkmap from the LINKAGE/FASTLINK package.Comput Biomed Res. 2000; 33: 350-364Abstract Full Text PDF PubMed Scopus (6) Google Scholar, 19Kothari K Lopez-Benitez N Poduslo S High-performance implementation and analysis of the Linkmap program.J Biomed Inform. 2001; 34: 406-414Abstract Full Text PDF PubMed Scopus (6) Google Scholar Efficient parallel implementations of GENEHUNTER, which involve dividing the computations over high-performance processors, achieve significant speedups compared with the serial version and allow for the analysis of larger pedigrees.20Conant G Plimpton S Old W Wagner A Fain P Pacheco T Heffelfinger G Parallel Genehunter: implementation of a linkage analysis package for distributed-memory architectures.J Parallel Distrib Comput. 2003; 63: 674-682Crossref Scopus (10) Google Scholar Despite the advantages of parallel computations, the use of parallel programs for linkage analysis is quite limited by their dependency on the availability of high-performance execution environments, such as a cluster of high-performance, dedicated machines or a supercomputer. Such hardware can usually be found only in specialized research centers because of its high cost and operational complexity. In this article, we introduce a distributed system for exact linkage analysis called “SUPERLINK-ONLINE,” which is capable of analyzing inbred families of several hundreds of individuals with extended missing data, thereby outperforming all existing tools for exact linkage computations on such inputs. The system is based on the parallelization of SUPERLINK, which is a stateof-the-art program for parametric linkage analysis of dichotomous/binary traits with large inbred pedigrees. Our approach, made possible by recent advances in distributed computing,22Thain D Livny M Building reliable clients and servers.in: Foster I Kesselman C The grid: blueprint for a new computing infrastructure. Morgan Kaufmann, San Francisco2003: 285-318Google Scholar eliminates the need for expensive hardware by distributing the computations over thousands of nondedicated personal computers, using their idle cycles. The main algorithmic challenge has been to efficiently split a large linkage analysis task for execution in a dynamic, distributed running environment. Such environments are characterized by the presence of many computers with different capabilities and operating systems, frequent failures, and extreme fluctuations of the number of computers available for execution. SUPERLINK-ONLINE delivers the newest information technology to geneticists via a simple Internet interface that completely hides the complexity of the underlying distributed system. The system allows for the concurrent submission of linkage tasks by multiple users, dynamically adapting the parallelization strategy in accordance with the current load and the number of computers available to perform the computations. As the system was being developed, it was extensively tested by collaborating medical centers worldwide on a variety of real data sets, some of which are presented in this article. Our results show improved running times of >2 orders of magnitude versus the serial version of the program. This allows users to avoid the undesirable breakup of larger pedigrees into smaller pieces, which often weakens the linkage signal. Linkage analysis tests for cosegregation between alleles at markers in a chromosomal region and at a trait locus of interest. In parametric linkage analysis, the likelihood for evidence of linkage L(θ) is computed under a given model of disease-allele frequency, penetrances, recombination fraction between markers, and recombination fraction (θ) between a disease locus and a reference locus. Computations of L(θ) in this article assume Hardy-Weinberg equilibrium, linkage equilibrium, and no interference, which are common assumptions in most linkage analyses performed to date. It is common to consider LOD(θ1)=log10L(θ=θ1)/L(θ=0.5)>3.3 as indication of linkage.23Lander E Kruglyak L Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results.Nat Genet. 1995; 11: 241-247Crossref PubMed Scopus (4379) Google Scholar Consider a pedigree and let xi denote the observations that include affection status and marker information at one or multiple loci of the ith pedigree member. The likelihood is the probability of the observations defined by L(θ)=P(x1,x2,…,xm|θ). Elston and Stewart have shown that, for simple pedigrees without loops, the likelihood may be represented as the telescoping sum L(X|θ)=∑g1P(x1|g1)P(g1|·)…∑gm-1P(xm-1|gm-1)P(gm-1|·)∑gmP(xm|gm)P(gm|·),where the individuals are ordered such that parents precede their children and where P(gi|·) represents either the ith child’s multilocus genotype, given the parental multilocus genotypes, or the probability that a founder individual (with no parents in the pedigree) has a multilocus genotype gi.1Elston R Stewart J A general model for the analysis of pedigree data.Hum Hered. 1971; 21: 523-542Crossref PubMed Scopus (1055) Google Scholar The Elston-Stewart algorithm and its extensions have been employed in many linkage analysis programs. Whereas the early implementations (i.e., LIPED24Ott J A computer program for linkage analysis of general human pedigrees.Am J Hum Genet. 1976; 28: 528-529PubMed Google Scholar and LINKAGE25Lathrop G Lalouel J Julier C Ott J Strategies for multilocus linkage analysis in humans.Proc Natl Acad Sci USA. 1984; 81: 3443-3446Crossref PubMed Scopus (2173) Google Scholar, 26Lathrop GM Lalouel JM Easy calculations of Lod scores and genetic risks on small computers.Am J Hum Genet. 1984; 36: 460-465PubMed Google Scholar) allow for computation of two-point and multipoint LOD scores of small pedigrees with the use of few markers, their successors—such as later versions of LINKAGE and FASTLINK4Cottingham RW Idury RM Schäffer AA Faster sequential genetic linkage computations.Am J Hum Genet. 1993; 53: 252-263PubMed Google Scholar—extend their capabilities considerably. The current serial version of FASTLINK improves the analysis of complex pedigrees by efficient loop-breaking algorithms.27Becker A Geiger D Schäffer A Automatic selection of loop breakers for genetic linkage analysis.Hum Hered. 1998; 48: 49-60Crossref PubMed Scopus (42) Google Scholar Further versions of FASTLINK use parallel computing to achieve higher performance.15Dwarkadas S Schäffer A Cottingham R Cox A Keleher P Zwaenepoel W Parallelization of general linkage analysis problems.Hum Hered. 1994; 44: 127-141Crossref PubMed Scopus (38) Google Scholar, 17Gupta S Schäffer A Cox A Dwarkadas S Zwaenepoel W Integrating parallelization strategies for linkage analysis.Comput Biomed Res. 1995; 28: 116-139Abstract Full Text PDF PubMed Scopus (29) Google Scholar, 18Rai A Lopez-Benitez N Hargis J Poduslo S On the parallelization of Linkmap from the LINKAGE/FASTLINK package.Comput Biomed Res. 2000; 33: 350-364Abstract Full Text PDF PubMed Scopus (6) Google Scholar, 19Kothari K Lopez-Benitez N Poduslo S High-performance implementation and analysis of the Linkmap program.J Biomed Inform. 2001; 34: 406-414Abstract Full Text PDF PubMed Scopus (6) Google Scholar Another example of efficient optimizations is VITESSE v.1, which raises the computational boundaries of the Elston-Stewart algorithm and allows for computation of multipoint LOD scores for several polymorphic markers with many unknown genotypes via “set recoding.”5O’Connell J Weeks D The VITESSE algorithm for rapid exact multilocus linkage analysis via genotype set-recoding and fuzzy inheritance.Nat Genet. 1995; 11: 402-408Crossref PubMed Scopus (416) Google Scholar The complexity of the Elston-Stewart algorithm grows linearly with the size of the pedigree but exponentially with the number of markers in the analysis, since the algorithm essentially performs summation over all possible genotype vectors. Also, in the extension for the analysis of complex pedigrees, the loop-breaking method grows exponentially in the number of different loop breakers to be considered. Exact multipoint linkage computations involving many markers for small-to-medium–sized families was first made practical with the introduction of the GENEHUNTER software,11Kruglyak L Daly M Reeve-Daly M Lander E Parametric and nonparametric linkage analysis: a unified multipoint approach.Am J Hum Genet. 1996; 58: 1347-1363PubMed Google Scholar which uses the Lander-Green algorithm.2Lander E Green P Construction of multilocus genetic maps in humans.Proc Natl Acad Sci. 1987; 84: 2363-2367Crossref PubMed Scopus (1163) Google Scholar Whereas, in the Elston-Stewart algorithm, the unobserved quantity is the genotype, the Lander-Green algorithm proposes to condition observations on inheritance vectors for each locus, where the inheritance vector is a binary vector indicating the parental source of each allele for each nonfounder in the pedigree. The Lander-Green algorithm can potentially compute multipoint likelihoods on practically unbounded numbers of markers. However, the computational capabilities of the algorithm are restricted to the analysis of pedigrees of moderate size, since its computing time and memory requirements scale exponentially in the number of nonfounders, which is translated into the number of inheritance vectors to be considered. Later versions of GENEHUNTER optimized the performance by applying the Fast Fourier Transform for matrix multiplication7Kruglyak L Lander E Faster multipoint linkage analysis using Fourier transform.J Comput Biol. 1998; 5: 1-7Crossref PubMed Scopus (185) Google Scholar and by considering only the set of inheritance vectors compatible with the observed genotypes.28Markianos K Daly M Kruglyak L Efficient multipoint linkage analysis through reduction of inheritance space.Am J Hum Genet. 2001; 68: 963-977Abstract Full Text Full Text PDF PubMed Scopus (111) Google Scholar ALLEGRO8Gudbjartsson F Jonasson K Frigge ML Kong A Allegro, a new computer program for multipoint linkage analysis.Nat Genet. 2000; 25: 12-13Crossref PubMed Scopus (650) Google Scholar and MERLIN10Abecasis GR Cherny SS Cookson WO Cardon LR Merlin—rapid analysis of dense genetic maps using sparse gene flow trees.Nat Genet. 2002; 30: 97-101Crossref PubMed Scopus (2693) Google Scholar present alternative implementations of the Lander-Green algorithm that show speedups of up to 2 orders of magnitude versus GENEHUNTER, achieved through further reduction of the inheritance vector space and application of software optimization techniques. Advances in parallel computing allowed parallelization of GENEHUNTER for execution on clusters of high-performance workstations, enabling faster computations.20Conant G Plimpton S Old W Wagner A Fain P Pacheco T Heffelfinger G Parallel Genehunter: implementation of a linkage analysis package for distributed-memory architectures.J Parallel Distrib Comput. 2003; 63: 674-682Crossref Scopus (10) Google Scholar Also, GENEHUNTER-TWOLOCUS29Strauch K Fimmers R Kurz T Deichmann KA Wienker TF Baur MP Parametric and nonparametric multipoint linkage analysis with imprinting and two-locus-trait models: application to mite sensitization.Am J Hum Genet. 2000; 66: 1945-1957Abstract Full Text Full Text PDF PubMed Scopus (151) Google Scholar has been parallelized.21Dietter J Spiegel A an Mey D Pflug HJ al Kateb H Hoffmann K Wienker T Strauch K Efficient two-trait-locus linkage analysis through program optimization and parallelization: application to hypercholesterolemia.Eur J Hum Genet. 2004; 12: 542-550Crossref PubMed Scopus (15) Google Scholar Another approach is presented by SUPERLINK, in which the pedigree data are represented as a Bayesian network, allowing one to significantly improve the performance by optimizing the order in which variables are eliminated.3Fishelson M Geiger D Optimizing exact genetic linkage computations.J Comput Biol. 2004; 11: 263-275Crossref PubMed Scopus (71) Google Scholar, 12Fishelson M Geiger D Exact genetic linkage computations for general pedigrees.Bioinformatics. 2002; 18: S189-S198Crossref PubMed Scopus (205) Google Scholar As opposed to the Elston-Stewart and Lander-Green algorithms, in which variables are eliminated in a predetermined order, SUPERLINK finds the order of variable elimination at run time according to the problem at hand, enabling multipoint-likelihood computations of large inbred families that cannot be performed by other current linkage analysis software. Bayesian networks,30Pearl J Probabilistic reasoning in intelligent systems. Morgan Kaufmann, San Francisco1988Google Scholar, 31Lauritzen SL Graphical models. Oxford University Press, Oxford, United Kingdom1996Google Scholar also known as directed graphical models, are a knowledge representation formalism that offers a powerful framework to model complex multivariate problems, such as the ones posed by genetic analysis. A Bayesian network is defined via directed acyclic graphs (DAGs). A DAG is a directed graph with no directed cycles. Each node v has a set of parents Pav—namely, a set of vertices from which there are edges leading into v in the DAG. A Bayesian network is a DAG, where each vertex v corresponds to a discrete variable Xv∈X with a conditional probability distribution P(Xv = xv|Pav=pav), and the joint probability distribution P(X) is the product of the conditional probability distributions of all variables. In other words, P(X1=x1,…,Xn=xn)=ΠvP(Xv=xv|Pav=pav),(1) where pav is the joint assignment {Xi = xi|Xi∈Pav} to the variables in Pav. When Pav=∅, the respective term in equation (1) reduces to P(Xv = xv). Note that each missing edge represents a conditional independence assertion. This is the key factor for using Bayesian networks to efficiently handle joint probability distributions for large numbers of variables. An example is shown in figure 1. This example shows a Bayesian network for three-point analysis (two markers and one disease locus) of a nuclear family with two typed children. The genetic loci variables of individual i at locus j are denoted by Gmi,j and Gpi,j for maternal and paternal alleles, respectively. Variables Ei,j, Smi,j, and Spi,j denote the marker phenotypes (unordered pair of alleles) and the maternal and the paternal selector variables, respectively, of individual i at locus j. Variable Ei denotes the affection status of individual i. Individual 4 is affected, as indicated by node E4. Highlighted nodes represent evidence variables, including available marker phenotypes and affection status. Marker phenotypes of the parents are unknown in this example. The quantities P(Gmi,j) and P(Gpi,j) represent allele frequencies, P(Smi,j|Smi,j-1) and P(Spi,j|Spi,j-1) represent recombination probabilities, and P(Ei|Gmi,j,Gpi,j) represent penetrances. The joint distribution is the product of all probability tables. The assumptions of no interference and of Hardy-Weinberg and linkage equilibria are encoded by edges missing in the Bayesian network. More details are given elsewhere.32Friedman N Geiger D Lotner N Likelihood computation with value abstraction.in: Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence (UAI). Morgan Kaufmann, San Francisco2000: 192-200Google Scholar Bayesian networks are used in this article for computing the probability of evidence, which is represented as a joint assignment e = {Xe1 = e1,Xe2 = e2,…,Xem = em}, to a subset of variables E = {Xe1,Xe2,…,Xem}. It is computable from the joint probability distribution table by summing over all variables X1,…,Xk not in E—namely, P(e)=∑x1…∑xkΠvP*(Xv=xv|Pav=pav),(2) where P* is obtained from P by the assignment of the observed values {e1,e2,…,em} to the respective variables in E. In figure 1, evidence variables such as marker phenotypes and affection status are highlighted. The straightforward approach to compute P(e)—by first multiplying all conditional probability tables and then computing all sums—is infeasible because of the exponential size of the joint probability distribution. Instead, it is possible to interleave summations and multiplications, summing variables one after another at early stages of the computation by pushing the summation sign in equation (2) as far to the right as possible. When summing over all values of a variable X, it suffices to compute the product of only those probability tables that contain X, yielding an intermediate table over the variables of the tables being multiplied. The summation over variable X eliminates it from the product, reducing the dimensions of the intermediate table by a factor equal to the number of values of X. This technique of computing equation (2) is called “variable elimination” in the Bayesian network literature.33Dechter R Bucket elimination: a unifying framework for probabilistic inference.in: Jordan M Learning in graphical models. Kluwer Academic Press, Dordecht, the Netherlands1998: 75-104Crossref Google Scholar Variable elimination alone is often inapplicable because of the prohibitively large size of intermediate tables generated during the computations, which exceed the physical memory of contemporary computers. An alternative approach to computing equation (2) is to simplify a given problem by first assigning values to some subset of variables C⊆X and then performing the computation for every joint assignment c to the variables in C. Assigning a value to variables in C decreases the size of the corresponding probability tables and, consequently, the size of intermediate tables, reducing the original problem and fitting it for computation via variable elimination. Equation (2) can then be rewritten as P(e)=∑cɛc ,(3) where ɛc = Σy∈{X\C}∏vP*(Yv = yv|Pav = pav) represents the computation of the problem for specific joint assignment c to the variables in C. The variables in C are called the “conditioning variables,” and this method is called “conditioning.”30Pearl J Probabilistic reasoning in intelligent systems. Morgan Kaufmann, San Francisco1988Google Scholar It has been used to extend the Elston-Stewart algorithm to looped pedigrees.34Lange K Elston R Extentions to pedigree analysis. I. Likelihood calculations for simple and complex pedigrees.Hum Hered. 1975; 25: 95-105Crossref PubMed Scopus (109) Google Scholar Conditioning, as described above, is inefficient because of the repetitive evaluation of identical subexpressions when computing ɛc for different joint assignments c to the conditioning variables. A more efficient algorithm, called “constrained variable elimination,” significantly reduces the amount of redundant computations by interleaving conditioning and elimination. This algorithm is the basis of the genetic linkage software SUPERLINK.12Fishelson M Geiger D Exact genetic linkage computations for general pedigrees.Bioinformatics. 2002; 18: S189-S198Crossref PubMed Scopus (205) Google Scholar The constrained elimination algorithm applies variable elimination until no more variables can be eliminated without exceeding the specified memory constraints. Conditioning is then performed on the smallest subset of variables, which suffices to reduce the size of intermediate tables to meet the specified memory

Referência(s)