Machine learning approaches to explore digenic inheritance
2022; Elsevier BV; Volume: 38; Issue: 10; Language: English
DOI: 10.1016/j.tig.2022.04.009
ISSN 1362-4555
Topic(s): RNA and protein synthesis mechanisms
Abstract: While many genetic traits follow a dominant or recessive Mendelian mode of inheritance, non-Mendelian disease transmission may occur in the form of digenic inheritance (two mutant variants at different locations are required to confer disease) or as modifier genes affecting the expression of another gene. Machine learning methods are increasingly employed in the search for pairs of variants underlying digenic traits. Highly promising approaches are based on association rules, which originated in the analysis of consumer transaction patterns some 30 years ago and have blossomed into highly sophisticated computer-based methods. As applications of these methods are becoming more widespread, digenic disease transmission may well appear to be more common than Mendelian inheritance.

Some rare genetic disorders, such as retinitis pigmentosa or Alport syndrome, are caused by the co-inheritance of DNA variants at two different genetic loci (digenic inheritance). To capture the effects of these disease-causing variants and their possible interactive effects, various statistical methods have been developed in human genetics. Analogous developments have taken place in the field of machine learning, particularly for the field that is now called Big Data. In the past, these two areas have grown independently and have started to converge only in recent years. We give an overview of each of the two fields, paying special attention to machine learning methods for uncovering the combined effects of pairs of variants on human disease.

For heritable traits, classical human genetic approaches have investigated one DNA variant at a time for linkage or association with disease, which has been very fruitful for Mendelian traits (see Glossary), that is, heritable diseases that are due to a single variant. Powerful genome sequencing technologies, such as exome sequencing and whole-genome sequencing, have enabled us to identify many novel genes in Mendelian diseases caused by changes in a single gene or locus [1. Ng S.B. et al. Exome sequencing identifies the cause of a Mendelian disorder. Nat. Genet. 2010; 42: 30-35; 2. Turro E. et al. Whole-genome sequencing of patients with rare diseases in a national health system. Nature. 2020; 583: 96-102]. However, several large-scale genetic cohorts revealed that more than half of patients with rare genetic diseases are yet to be diagnosed [3. Boycott K.M. et al. International cooperation to enable the diagnosis of all rare genetic diseases. Am. J. Hum. Genet. 2017; 100: 695-705; 4. Smedley D. et al. 100,000 Genomes pilot on rare-disease diagnosis in health care - preliminary report. N. Engl. J. Med. 2021; 385: 1868-1880]. One of the underlying mechanisms of rare genetic diseases is digenic or oligogenic inheritance, in which two or more genes interact [5. Boycott K.M. et al. A diagnosis for all rare genetic diseases: the horizon and the next frontiers. Cell.
2019; 177: 32-37; 6. Zuk O. et al. The mystery of missing heritability: genetic interactions create phantom heritability. Proc. Natl. Acad. Sci. U. S. A. 2012; 109: 1193-1198; 7. Kuzmin E. et al. Systematic analysis of complex genetic interactions. Science. 2018; 360: eaao1729]. Digenic inheritance is the simplest form of such inheritance and has been identified in some rare genetic diseases, seemingly in a Mendelian fashion [8. Cerrone M. et al. Beyond the one gene-one disease paradigm: complex genetics and pleiotropy in inheritable cardiac disorders. Circulation. 2019; 140: 595-610; 9. Li M. et al. Digenic inheritance of mutations in EPHA2 and SLC26A4 in Pendred syndrome. Nat. Commun. 2020; 11: 1343]. Many efforts have been made to identify novel instances of digenic inheritance in rare genetic diseases. One accomplishment of such efforts is a novel database providing detailed information on genes and genetic variants seen in digenic diseases [10. Gazzo A.M. et al. DIDA: a curated and annotated digenic diseases database. Nucleic Acids Res. 2016; 44: D900-D907]. Using the database, a study succeeded in identifying digenic disease genes via machine learning [11. Mukherjee S. et al. Identifying digenic disease genes via machine learning in the Undiagnosed Diseases Network. Am. J. Hum. Genet. 2021; 108: 1946-1963]. Beyond the use of databases, it remains important to apply suitable methods for identifying novel interactions between variants. Statistical and machine learning methods have been developed to deal with the large numbers of variants obtained from genome sequencing analyses; they are discussed in subsequent sections.
Here, we mainly discuss machine learning approaches from a user's perspective for the detection of pairs of variants underlying digenic traits.

In human genetics, various specialized ways have been developed to combine information on disease association over multiple DNA variants, which we will briefly discuss here before embarking on the main topic of our outline. With genome-wide sequencing, analysts are faced with hundreds of thousands if not millions of variants available for analysis, for example, in case-control association analysis. One approach known as the collapsing method (Box 1) assigns an index variable K to each individual, where K = 1 if the individual carries a (rare) minor allele at any of the variants in a given gene and K = 0 otherwise [12. Li B. Leal S.M. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am. J. Hum. Genet. 2008; 83: 311-321]. Thus, analysis proceeds at the level of genes rather than variants, where the latter far outnumber the former. While such approaches appear to be economical, the effect of a single disease-associated variant or genotype might be diluted by nonassociated variants in the same gene. A telling example of such a situation in opioid dependency demonstrated a significant pair of genotypes, but that effect vanished in an analysis based on variants rather than genotypes [13. Okazaki A. et al. Genotype pattern mining for pairs of interacting variants underlying digenic traits. Genes. 2021; 12: 1160]. A collapsing method was recently applied in a large study on epilepsy [14. Epi C. Sub-genic intolerance, ClinVar, and the epilepsies: a whole-exome sequencing study of 29,165 individuals. Am. J. Hum. Genet. 2021; 108: 965-982].
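The collapsing index K described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of the cited method: genotypes are assumed to be coded as minor-allele counts (0, 1, or 2), all variants passed in are assumed to belong to one gene and to have already been filtered for rarity, and the function name `collapsing_index` is hypothetical.

```python
def collapsing_index(minor_allele_counts):
    """Return K for each individual in one gene.

    minor_allele_counts: one row per individual, one column per variant
    in the gene; each entry is the number of minor alleles (0, 1, or 2).
    Assumption: rows contain only variants already judged 'rare'.
    K = 1 if the individual carries a minor allele at any variant, else 0.
    """
    return [1 if any(count > 0 for count in row) else 0
            for row in minor_allele_counts]


# Toy data (made up for illustration): 4 individuals, 3 variants in one gene.
gene_genotypes = [
    [0, 0, 0],  # no minor allele anywhere
    [0, 1, 0],  # heterozygous at the second variant
    [2, 0, 0],  # homozygous minor at the first variant
    [0, 0, 0],
]

K = collapsing_index(gene_genotypes)
print(K)  # [0, 1, 1, 0]
```

The sketch also makes the dilution problem visible: once K is formed, the analysis can no longer tell which variant, or which genotype at that variant, put an individual into the K = 1 group.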
Collapsing methods have also been extended to work with two genes at a time for analysis of digenic traits [15. Kerner G. et al. A genome-wide case-only test for the detection of digenic inheritance in human exomes. Proc. Natl. Acad. Sci. U. S. A. 2020; 117: 19367-19375].

Box 1. Three degrees of resolution

Searching for genetic elements underlying digenic traits may be carried out at the level of genotypes, DNA variants, or genes. A given pair of variants comprises up to nine pairs of genotypes and a given gene may contain a large number of SNPs or variants. Each of these search strategies involves inherent advantages and shortfalls as follows:

1. Traditionally, disease association has been carried out on the level of alleles or genotypes. When applied to pairs of genotypes, the total of pairs can be prohibitively large so that shortcuts need to be taken, such as limiting attention to genotypes involving rare or common minor alleles, or genotypes with interpretable biological functions. While this level of analysis generally requires the most effort, it also entails the highest level of precision in the sense that disease-causing elements can be directly traced down to nucleotides.

2. Working with pairs of variants provides some economy of computational effort but may 'dilute' a signal from a single genotype pair when all nine genotype pairs in a pair of variants are analyzed jointly [13].

3. Finally, focusing on pairs of genes [15] represents the most economical approach but is also the most imprecise among the three strategies.
Perhaps more importantly, focusing on genes disregards susceptibility elements outside of genes, even though 'genetic variants associated with common diseases are usually located in noncoding parts of the human genome' [57. Elkon R. Agami R. Characterization of noncoding regulatory DNA in the human genome. Nat. Biotechnol. 2017; 35: 732-746]. Distant-acting transcriptional enhancers have been known for over 10 years to affect susceptibility to human disease [58. Alexanian M. et al. A transcriptional switch governs fibroblast activation in heart disease. Nature. 2021; 595: 438-443] and noncoding RNAs have been shown to be associated with many diseases [59. Wang Y. et al. LncRNA functional annotation with improved false discovery rate achieved by disease associations. Comput. Struct. Biotechnol. J. 2022; 20: 322-332], for example, cardiac hypertrophy [60. Wu S. et al. Circular RNAs in the regulation of cardiac hypertrophy. Mol. Ther. Nucleic Acids. 2022; 27: 484-490].

A combination of these search strategies or a sequential application of them may well be the best overall strategy.

So-called complex traits are under the influence of a large number of genes, with schizophrenia being a prime example. A common approach for such traits is to compute for each individual a polygenic risk score, that is, the sum of risk allele frequencies over all SNPs; alternatively, the risk allele frequency may be replaced by the number of risk alleles (minor alleles). This approach has been fruitful in traits like Alzheimer's disease and other complex traits [16. Chasioti D. et al. Progress in polygenic composite scores in Alzheimer's and other complex diseases. Trends Genet. 2019; 35: 371-382] and can at least demonstrate genetic effects, but there is still the task of pinpointing which of the potentially large numbers of SNPs are disease-causing.

More and more traits emerge as being digenic, that is, determined by two variants [17. Deltas C. Digenic inheritance and genetic modifiers. Clin. Genet. 2018; 93: 429-438]; for example, severe immunodeficiency and autoimmunity [18. Ameratunga R. et al. Clinical implications of digenic inheritance and epistasis in primary immunodeficiency disorders. Front. Immunol. 2017; 8: 1965], non-syndromic hearing impairment [19. Schrauwen I. et al. Novel digenic inheritance of PCDH15 and USH1G underlies profound non-syndromic hearing impairment. BMC Med. Genet. 2018; 19: 122], Noonan syndrome [20. Ferrari L. et al. Digenic inheritance of subclinical variants in Noonan syndrome patients: an alternative pathogenic model?. Eur. J. Hum. Genet. 2020; 28: 1432-1445], cancer [21. Schubert S.A.
et al. Digenic inheritance of MSH6 and MUTYH variants in familial colorectal cancer. Genes Chromosom. Cancer. 2020; 59: 697-701], and familial hypercholesterolemia [22. Kamar A. et al. The digenic causality in familial hypercholesterolemia: revising the genotype-phenotype correlations of the disease. Front. Genet. 2020; 11: 572045]. In fact, as we outline later, digenic traits may be more common than monogenic traits. While many digenic traits have been found in a fortuitous manner, a systematic search may be attempted by an exhaustive enumeration of all pairs of genotypes. Such analyses have been carried out for psoriasis [23. Lee K.-Y. et al. Discovering genetic factors for psoriasis through exhaustively searching for significant second order SNP-SNP interactions. Sci. Rep. 2018; 8: 15186] and schizophrenia [24. Lee K.-Y. et al. Genome-wide search for SNP interactions in GWAS data: algorithm, feasibility, replication using schizophrenia datasets. Front. Genet. 2020; 11: 1003], but these authors had to restrict attention to biologically interpretable variants. However, investigating all pairs of genes is certainly possible [15]. More sophisticated approaches are afforded by machine learning methods, as outlined next.

Machine learning [25. Schmidt J. et al. Recent advances and applications of machine learning in solid-state materials science. NPJ Comput. Mat. 2019; 5: 83] refers to algorithms in artificial intelligence (AI) that 'learn' patterns in data, gradually improving the accuracy of classifications or predictions.
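To give a sense of scale for the exhaustive enumeration of genotype pairs mentioned above, the following sketch counts the size of the search space. It assumes biallelic variants with three genotypes each, so that one pair of variants comprises up to nine genotype pairs; the function name and the example variant counts are illustrative choices, not figures from the cited studies.

```python
from math import comb


def pair_search_space(n_variants, genotypes_per_variant=3):
    """Count variant pairs and the upper bound on genotype pairs.

    Assumption: each variant is biallelic, giving 3 genotypes, so each
    pair of variants yields up to 3 * 3 = 9 genotype pairs.
    """
    variant_pairs = comb(n_variants, 2)  # n * (n - 1) / 2
    genotype_pairs = variant_pairs * genotypes_per_variant ** 2
    return variant_pairs, genotype_pairs


for n in (1_000, 100_000, 1_000_000):
    vp, gp = pair_search_space(n)
    print(f"{n:>9,} variants: {vp:,} variant pairs, up to {gp:,} genotype pairs")
```

Because the count grows quadratically with the number of variants, restricting the input set (for example, to biologically interpretable variants) reduces the workload far more than proportionally.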
Six machine learning methods (stochastic gradient boosting, random forest, neural network (NN), support vector machines, adaptive gradient boosting, and elastic-net penalized logistic regression) have recently been applied to predict clinical improvement in 442 patients after an invasive procedure in sports medicine [26. Kunze K.N. et al. Application of machine learning algorithms to predict clinically meaningful improvement after arthroscopic anterior cruciate ligament reconstruction. Orthop. J. Sports Med. 2021; 9: 23259671211046575]. The last of these methods, elastic-net penalized logistic regression, showed the best performance. Also, the relative performances of several data mining approaches have been compared for predicting coronary artery disease [27. Xu S. et al. The novel coronary artery disease risk gene JCAD/KIAA1462 promotes endothelial dysfunction and atherosclerosis. Eur. Heart J. 2019; 40: 2398-2408] and cancer [28. Gopi L.K. Kidder B.L. Integrative pan cancer analysis reveals epigenomic variation in cancer type and cell specific chromatin domains. Nat. Commun. 2021; 12: 1419]. Most recently, a random forest approach has been developed for identifying candidate digenic disease gene pairs based on biological networks, evolutionary history, and functional annotations [11]. Useful overviews of data mining algorithms have been published [29. Chee C.-H. et al. Algorithms for frequent itemset mining: a literature review. Artif. Intell. Rev. 2019; 52: 2603-2621].

In classical statistical analysis, the number of observations should be a multiple of the number of variables. Nowadays, however, the situation is often reversed.
For example, the number of DNA variants tends to greatly exceed the number of individuals, a situation known as the curse of dimensionality. While classical approaches like stepwise multiple regression can cope with this situation to some degree, modern machine learning methods do this in a much more sophisticated manner, by iteratively improving successive solutions to a prediction or classification problem.

Among the earliest examples of AI are artificial NNs, first proposed some 80 years ago [30. McCulloch W.S. Pitts W. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 1943; 5: 115-133]. An early application of NNs to genetic data, published 25 years ago [31. Lucek P.R. Ott J. Neural network analysis of complex traits. Genet. Epidemiol. 1997; 14: 1101-1106], worked with 367 variants and an artificial quantitative trait presented at Genetic Analysis Workshop 10. Today's NNs consist of many layers of 'neurons', hence the term deep learning [32. Cao Y. et al. Ensemble deep learning in bioinformatics. Nat. Mach. Intell. 2020; 2: 500-508]. NNs are used in many areas of human genetics, for example, to perform cancer type classification and gene identification [33. Zeng Z. et al. Deep learning for cancer type classification and driver gene identification. BMC Bioinforma. 2021; 22: 491]. Next, we focus on methods in human genetics for the genetic mapping of deleterious variants underlying digenic traits.

First published 20 years ago [34. Moore J.H. Hahn L.W. A cellular automata approach to detecting interactions among single-nucleotide polymorphisms in complex multifactorial diseases. Pac. Symp. Biocomput.
2002; 2002: 53-64], the multifactor dimensionality reduction (MDR) method carries out an exhaustive evaluation of all possible pairs of variants and then ranks variant pairs by balanced accuracy, that is, the mean of sensitivity and specificity [35. Winham S.J. Motsinger-Reif A.A. An R package implementation of multifactor dimensionality reduction. BioData Min. 2011; 4: 24]. For the best pair, MDR can compute an empirical significance level by permutation analysis [36. Moore J.H. Andrews P.C. Epistasis analysis using multifactor dimensionality reduction. In: Moore J.H. Williams S.M. Epistasis: Methods and Protocols. Springer, 2015: 301-314]. MDR has been successfully applied to a large number of diseases and is being used to this day, for example, in an investigation of gene–gene interactions in chronic obstructive pulmonary disease [37. Ganbold C. et al. The cumulative effect of gene-gene interactions between GSTM1, CHRNA3, CHRNA5 and SOD3 gene polymorphisms combined with smoking on COPD risk. Int. J. Chron. Obstruct. Pulmon. Dis. 2021; 16: 2857-2868].

Association rule methods originate in the Apriori algorithm [38. Agrawal R. Srikant R. Fast algorithms for mining association rules. Proceedings of the 20th VLDB Conference, Santiago, Chile. 1994: 487-499], which was developed to handle the ever-growing, very large databases of items that consumers pay for at a cash register. Each such set of items (an itemset) represents a transaction, and the frequency of a given itemset (e.g., X = 'bread–butter–milk') in the database is called its support, s. Another important notion is the probability that a consumer will purchase another item, for example, Y = wine, if they buy X. This conditional probability is called confidence, c = P(Y|X).
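The two quantities just defined can be computed directly from a list of transactions. The sketch below is a minimal illustration, not the Apriori algorithm itself (it performs no candidate generation or pruning). It assumes support is expressed as a relative frequency, and the toy transactions are invented for the example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, x, y):
    """Estimate c = P(Y|X) as count(X and Y together) / count(X)."""
    x, y = frozenset(x), frozenset(y)
    n_x = sum(1 for t in transactions if x <= t)
    n_xy = sum(1 for t in transactions if (x | y) <= t)
    if n_x == 0:
        raise ValueError("X never occurs, so P(Y|X) is undefined")
    return n_xy / n_x


# Toy shopping baskets (made up for illustration).
transactions = [
    frozenset({"bread", "butter", "milk", "wine"}),
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter", "milk", "wine"}),
    frozenset({"wine"}),
]

X = {"bread", "butter", "milk"}
Y = {"wine"}
s = support(transactions, X)        # X occurs in 3 of 5 baskets
c = confidence(transactions, X, Y)  # 2 of those 3 baskets also contain wine
print(f"support s = {s:.2f}, confidence c = {c:.2f}")
```

In this toy example, X appears in three of the five baskets and two of those three also contain wine, so s = 3/5 and c = 2/3.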
The Apriori algorithm has been adapted to work on human DNA variant genotypes, X, and digenic traits with Y = 2 for 'case' (affected with disease) and Y = 1 for 'control'. The resulting AprioriGWAS approach [39. Zhang Q. et al. AprioriGWAS, a new pattern mining strategy for detecting genetic variants associated with disease through interaction effects. PLoS Comput. Biol. 2014; 10: e1003627] has been applied genome-wide to all variants in the well-known age-related macular degeneration (AMD) study [40. Klein R.J. et al. Complement factor H polymorphism in age-related macular degeneration. Science. 2005; 308: 385-389] and in the Wellcome Trust bipolar disease data, which required an enormous computational effort. Memory usage in AprioriGWAS is kept low through the use of the HDF5 data format [41. Kyoda K. et al. BD5: an open HDF5-based data format to represent quantitative biological dynamics data. PLoS One. 2020; 15: e0237468]. In contrast, off-the-shelf implementations of Apriori and derivatives like fpgrowth [29] tend to be extremely memory-hungry, as observed with an implementation of fpgrowth for genetic association data, genotype pattern mining (GPM) [13]. Large values of observed confidence, such as 95% and 99%, for a genotype pattern are highly predictive of disease and suggest that the pattern represents a true pathogenic result [42. Papadimitriou S. et al. Predicting disease-causing variant combinations. Proc. Natl. Acad. Sci. U. S. A. 2019; 116: 11878-11887].
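Carrying the support and confidence notions over to genotypes, a pattern X becomes a pair of genotypes at two different variants, and confidence becomes the proportion of cases among carriers of that pattern. The sketch below is a brute-force illustration only. It is not AprioriGWAS or GPM (it enumerates every observed pair rather than pruning), and the variant names, genotype labels, `min_support` threshold, and toy data are all hypothetical.

```python
from collections import Counter
from itertools import combinations

CASE, CONTROL = 2, 1  # phenotype coding used in the text


def genotype_pair_confidence(genotypes, phenotypes, min_support=2):
    """Map each observed genotype pair to (carrier count, confidence).

    genotypes: one dict per individual, variant name -> genotype label.
    Confidence is estimated as P(Y = case | individual carries the pair).
    Pairs carried by fewer than `min_support` individuals are dropped.
    """
    carriers = Counter()
    case_carriers = Counter()
    for person, y in zip(genotypes, phenotypes):
        items = sorted(person.items())  # (variant, genotype) tuples
        # One genotype per variant, so every pair spans two variants.
        for pattern in combinations(items, 2):
            carriers[pattern] += 1
            if y == CASE:
                case_carriers[pattern] += 1
    return {p: (n, case_carriers[p] / n)
            for p, n in carriers.items() if n >= min_support}


# Toy data (made up): 6 individuals, 2 hypothetical variants.
genotypes = [
    {"rs1": "AG", "rs2": "CT"},
    {"rs1": "AG", "rs2": "CT"},
    {"rs1": "AG", "rs2": "CC"},
    {"rs1": "AA", "rs2": "CT"},
    {"rs1": "AA", "rs2": "CC"},
    {"rs1": "AG", "rs2": "CT"},
]
phenotypes = [CASE, CASE, CONTROL, CONTROL, CONTROL, CASE]

patterns = genotype_pair_confidence(genotypes, phenotypes)
for pattern, (n, conf) in patterns.items():
    print(pattern, f"carriers={n}, confidence={conf:.2f}")
```

In this toy dataset, the pair (rs1 = AG, rs2 = CT) is carried by three individuals, all of them cases, whereas each of the two genotypes on its own is carried by four individuals of whom three are cases. A real analysis would still need a significance assessment, for example by permutation, before reading anything into a high confidence value.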
A major difference between MDR and Apriori-like approaches is that the former is based on variants, whereas the latter focuses on genotypes, which can be expected to result in greater accuracy (Box 1) as one pair of variants represents up to nine pairs of genotypes [13]. This has been documented in a small case-control dataset with opioid dependence, for which GPM found one significant pair of genotypes (P = 0.03) that was missed by MDR (P > 0.78) [13]. Another difference from MDR is that Apriori and related algorithms avoid evaluating all possible pairs of items.

Bayesian approaches reduce the complexity of patterns among large numbers of items by postulating an underlying set of unobserved variables (Bayesian networks) whose parameters are estimated based on the given data. An early implementation of Bayesian methods, called Bayesian epistasis association mapping, demonstrated that genome-wide case-control epistasis mapping with many thousands of markers is both computationally and statistically feasible [43. Zhang Y. Liu J.S. Bayesian inference of epistatic interactions in case-control studies. Nat. Genet. 2007; 39: 1167-1173]. A more recent approach uses Bayesian networks and genetic algorithms in the search for epistatically interacting variants [44. Chen Y. et al. EpiMOGA: an epistasis detection method based on a multi-objective genetic algorithm. Genes (Basel). 2021; 12: 191]. Also, a recent study proposed a Bayesian network to model personalized gene–environment interactions [45. Li J. et al. Gene-environment interaction in the era of precision medicine. Cell.
2019; 177: 38-44]. Various pattern search methods have been applied to the AMD dataset. Comparisons between AprioriGWAS and Bayesian network methods show that the different approaches detect some of the same variants involved in interaction effects [13].

Many of the approaches discussed here are not well known in human genetics as they are often only published in conference proceedings. For example, as recently as 2020, it has been claimed that 'all previously successful digenic inheritance studies relied on candidate gene approaches to overcome the lack of appropriate statistical resources to search for digenic inheritance at the genome-wide level' [15].

Many published algorithms for detecting disease-associated genotype patterns are described as being able to detect epistasis, that is, genetic interactions between two variants that are causative for disease. However, the association effect of a pattern of two genotypes consists of two main effects, one each for the two participating variants, and an interaction effect (Table 1 in [13]). Thus, genotype patterns may appear prominent because of their main effects. If the aim is to detect interaction effects, then variants with large main effects (significant in single-variant association tests) should be disregarded in a pattern analysis procedure. However, researchers sometimes do the exact opposite.
For example, to reduce the burden of identifying genotype patterns underlying bipolar disease in a relatively large dataset of variants, only those variants with significant main effects were used [46. Breuer R. et al. Detecting significant genotype–phenotype association rules in bipolar disorder: market research meets complex genetics. Int. J. Bipolar Disord. 2018; 6: 24]. An analogous situation exists for patterns of length 3, for example, involving three genotypes, each from different genomic positions. If prominent patterns of length 2 exist, then patterns of length 3 will preferentially involve those strong patterns of length 2, and 'pure' length 3 patterns will only be obtained when strong length 2 patterns are disregarded in the search for length 3 patterns. However, such an approach appears unnecessarily complicated as patterns of length 2 may be combined to form longer chains of items [13].

Machine learning methods like the pattern detection approaches discussed here have opened many doors to previously unknown interactions among DNA variants and their contributions to disease. However, this comes at a certain price: many of the off-the-shelf pattern recognition methods like fpgrowth require large amounts of RAM. Even though AprioriGWAS does not use much memory, its execution on whole-genome data required weeks of runtime on a computer cluster of 1000 CPUs. Thus, many of the currently developed approaches can only handle relatively small datasets, so judicious selection of variants and individuals is crucial. However, computer power tends to progress rapidly, so that what is impossible today will be possible tomorrow.
Also, more sophisticated computational approaches utilizing vectorization and multicore parallelization are expected to make analysis of large datasets feasible [47. Titarenko S.S. et al. Fast implementation of pattern mining algorithms with time stamp uncertainties and temporal constraints. J. Big Data. 2019; 6: 37].

While machine learning approaches are extremely useful tools in genomics research, some of them, like Bayesian networks, rely on statistical models, which involve specific assumptions that must be met for obtaining unbiased results. Various pitfalls have been pointed out that could derail successful application of machine learning methods [48. Whalen S. et al. Navigating the pitfalls of applying machine learning in genomics. Nat. Rev. Genet. 2021; 23: 169-181]. However, detecting genotype patterns underlying digenic traits seems to be largely immune to any such pitfalls, particularly when statistical significance of results is based on permutation analysis [49. Llinares-López F. et al. CASMAP: detection of statistically significant combinations of SNPs in association mapping. Bioinformatics. 2018; 35: 2680-2682]. We should also be aware of the dark side of machine learning, that is, machine-learned relationships often lack an actual biological mechanism that explains the phenotype [50. Burgess D.J. Illuminating the dark side of machine learning. Nat. Rev. Genet. 2019; 20: 374-375]. To shed light on the dark side, it is essential to develop methods to validate in model organisms or to apply novel data analysis approaches to genetic regulatory variations [51. Costanzo M. et al. Global genetic networks and the genotype-to-phenotype relationship. Cell. 2019; 177: 85-100; 52. Costanzo M.
et al. A global genetic interaction network maps a wiring diagram of cellular function. Science. 2016; 353: aaf1420, 53. Yang J.H. et al. A white-box machine learning approach for revealing antibiotic mechanisms of action. Cell. 2019; 177: 1649-1661, 54. Mohammadi P. et al. Genetic regulatory variation in populations informs transcriptome analysis in rare disease. Science. 2019; 366: 351-356]. In addition, in an era of global collaboration, machine learning approaches should be applied to patients with rare genetic diseases worldwide. Machine learning methods could play an important role in international collaborative efforts to synthesize the outcomes of genetic analyses for rare diseases [55. Baynam G.S. et al. A call for global action for rare diseases in Africa. Nat. Genet. 2020; 52: 21-26]. Finally, using novel data analysis methods, we should explore interactions between nuclear DNA and mitochondrial DNA, which may explain as yet unidentified causes of mitochondrial diseases, one of the largest groups of inborn errors of metabolism [56. Imai-Okazaki A. et al. Long-term prognosis and genetic background of cardiomyopathy in 223 pediatric mitochondrial disease patients. Int. J. Cardiol.
2021; 341: 48-55] (see Outstanding questions).

Outstanding questions
What is the role of machine learning in analyzing interactions between variants responsible for diseases with digenic inheritance?
What are the major advantages and disadvantages of each data mining algorithm?
How can each data mining algorithm become more powerful by combining it with an appropriate statistical model?
Which pitfalls of machine learning algorithms should we watch for when combining them with statistical models?
How can we maximize the benefits of combining machine learning algorithms and statistical approaches?
Which is more efficient, a one-sided or a two-sided search for genotype patterns in digenic disease? One-sided testing refers to looking for patterns that are more frequent in cases than in controls, while a two-sided approach searches for patterns whose frequencies differ between the two phenotype classes.

This work was supported by JSPS KAKENHI Grant Numbers JP20K08497 and JP21H03137 (A.O.). J.O.
acknowledges support by his wife, Jian Ning, and daughter, Anna Ott, for freeing him up from most common obligations so he was able to devote time to this work.

No interests are declared.

genetic association studies try to establish an association between a DNA variant and a disease phenotype. Due to the very large number of variants, collapsing methods have been developed to 'collapse' variants into a single variate in certain regions such as genes. This has the effect that only these regions, rather than all variants, need to be tested, which reduces the multiple testing penalty in statistical analysis. However, a single 'signal' of one variant might be missed when that variant is combined with many other variants.

some traits are due to the co-inheritance of DNA variants at two different genetic loci. This mechanism is referred to as digenic inheritance and has been known for many decades. However, a systematic search for patterns of two or more variants or genotypes has become feasible only with modern machine learning methods.

the combined effect of multiple DNA variants on a phenotype, also called epistatic interaction. Until recently, only main effects of single variants were investigated for their association with disease. Owing to the ubiquitous nature of biological interactions, epistatic interaction is likely the rule rather than the exception, presumably leading to the discovery of many more digenic traits than are known today.

a mode of inheritance where a single DNA variant is causative for a heritable disease.

based on decision trees, random forest is a classification algorithm and one of the most common machine learning methods. It consists of multiple independent trees (a 'forest'), and a prediction based on all trees is more accurate than that of a single tree. Even for large data with many features, decision trees can be constructed relatively fast. The method is also known as group learning or ensemble learning.
a given DNA position is called polymorphic when multiple nucleotides occur at this position in a population and the site is then called a single nucleotide polymorphism or SNP (pronounced SNIP). With only a single nucleotide occurring, a site is denoted as monomorphic. SNPs are the most common types of DNA variation in certain populations.