Lessons Learned from Bugs in Models of Human History
2020; Elsevier BV; Volume: 107; Issue: 4 Linguagem: Inglês
10.1016/j.ajhg.2020.08.017
ISSN1537-6605
AutoresAaron P. Ragsdale, Dominic Nelson, Simon Gravel, Jerome Kelleher,
Tópico(s)Epigenetics and DNA Methylation
ResumoSimulation plays a central role in population genomics studies. Recent years have seen rapid improvements in software efficiency that make it possible to simulate large genomic regions for many individuals sampled from large numbers of populations. As the complexity of the demographic models we study grows, however, there is an ever-increasing opportunity to introduce bugs in their implementation. Here, we describe two errors made in defining population genetic models using the msprime coalescent simulator that have found their way into the published record. We discuss how these errors have affected downstream analyses and give recommendations for software developers and users to reduce the risk of such errors. Simulation plays a central role in population genomics studies. Recent years have seen rapid improvements in software efficiency that make it possible to simulate large genomic regions for many individuals sampled from large numbers of populations. As the complexity of the demographic models we study grows, however, there is an ever-increasing opportunity to introduce bugs in their implementation. Here, we describe two errors made in defining population genetic models using the msprime coalescent simulator that have found their way into the published record. We discuss how these errors have affected downstream analyses and give recommendations for software developers and users to reduce the risk of such errors. In the effort to build more realistic simulations of genetic diversity, scientific software developers often focus on computational speed and biological realism. As the models simulated become more realistic, however, they also become more complex and difficult to specify. The interface through which users define their models is therefore increasingly important. Without an intuitive and thoroughly documented interface, it is very difficult to simulate complex population models correctly. The msprime coalescent simulator1Kelleher J. Etheridge A.M. McVean G. Efficient coalescent simulation and genealogical analysis for large sample sizes.PLoS Comput. Biol. 2016; 12: e1004842Crossref PubMed Scopus (195) Google Scholar, 2Nelson D. Kelleher J. Ragsdale A.P. Moreau C. McVean G. Gravel S. Accounting for long-range correlations in genome-wide simulations of large cohorts.PLoS Genet. 2020; 16: e1008619Crossref Scopus (7) Google Scholar, 3Kelleher J. Lohse K. Coalescent simulation with msprime.in: Dutheil J.Y. Statistical Population Genomics. Springer US, New York, NY2020: 191-230Crossref Scopus (4) Google Scholar is now widely used in genetics studies. Much of its appeal is the large increase in efficiency over the classical ms program,4Hudson R.R. Generating samples under a Wright-Fisher neutral model of genetic variation.Bioinformatics. 2002; 18: 337-338Crossref PubMed Scopus (1724) Google Scholar which makes it feasible to simulate large samples of whole chromosomes for the first time. Another distinct advantage of msprime is its Python application programming interface (API), which greatly increases the flexibility and ease of use over the standard approach of text-based command line interfaces. In particular, programs like ms require users to specify cryptic command line options to describe demographic models. For example, the Gutenkunst et al.5Gutenkunst R.N. Hernandez R.D. Williamson S.H. Bustamante C.D. Inferring the joint demographic history of multiple populations from multidimensional snp frequency data.PLoS Genet. 2009; 5: e1000695Crossref PubMed Scopus (937) Google Scholar demographic model (which is the subject of this note), as written in ms syntax, is shown in Figure 1A. This model is relatively simple, and models with many more populations and parameters are increasingly common. Descriptions of such models in ms syntax are not easy to comprehend. The Python interface for msprime, by contrast, allows the user to state models in a more human-readable and programmatic manner and has many advantages over ms' command line interface. Even when using a high-level programming language like Python, however, implementing multi-population models of demographic history is difficult and prone to error. In this note, we discuss two implementation errors that arose through unfortunate design decisions in msprime's demography API and that then found their way into the scientific record. The first error has relatively mild effects on genetic diversity but was used in many publications, whereas the second error was used only once but had a large impact on the simulation results. In light of these implementation errors, we discuss improvements to msprime's API motivated by these discoveries and, more generally, best practices for implementing and simulating complex multi-population demography. To illustrate the demography API, msprime included a description of a widely used three-population Out-of-Africa model5Gutenkunst R.N. Hernandez R.D. Williamson S.H. Bustamante C.D. Inferring the joint demographic history of multiple populations from multidimensional snp frequency data.PLoS Genet. 2009; 5: e1000695Crossref PubMed Scopus (937) Google Scholar as part of its tutorial documentation. In this model (Figure 2A), Eurasian (CEU and CHB) and African (YRI) populations split from each other in the deep past, followed by a more recent split of European and Asian populations with variable rates of continuous migration between each of the populations. Regrettably, the implementation in the msprime tutorial was incorrect. Before the time of the split of African and Eurasian populations, when there should have been just a single randomly mating population, migration was allowed to occur between the ancestral population and a second population with size equal to the Eurasian bottleneck size for all time into the past (Figure 2B). This incorrect model was introduced into the tutorial for msprime version 0.3.0 and remained in the documentation for around 4 years. Fortunately, the effects of this error are subtle. Population sizes and structure since the time of the earliest split are unaffected, so differences in expected FST are negligible between the correct and incorrect models. However, the ancient structure distorts the distribution of early TMRCAs (times to most recent common ancestor) (Figures 2E–2G). The extraneous ancient population increases the long-term effective population size, resulting in roughly 4% excess heterozygosity in contemporary populations compared to the intended model, but the overall effects on patterns of diversity are minimal (Figures 2C and 2D). Even though the error has a limited effect on simulated data, the tutorial code has been copied many times and used in publications. By searching for some identifying strings from the model definition on GitHub, we found 32 repositories containing either direct copies of the erroneous model code or code that was obviously derived from it (we have opened issues on each of these repositories to alert the authors). In most cases, the publications used simulations to test a non-demographic inference method, such as inferring gene genealogies from sequencing data,7Kelleher J. Wong Y. Wohns A.W. Fadil C. Albers P.K. McVean G. Inferring whole-genome histories in large population datasets.Nat. Genet. 2019; 51: 1330-1338Crossref PubMed Scopus (59) Google Scholar determining the age of a genetic variant,8Albers P.K. McVean G. Dating genomic variants and shared ancestry in population-scale sequencing data.PLoS Biol. 2020; 18: e3000586Crossref Scopus (31) Google Scholar or studying the power of variant association tests.9Tong D.M.H. Hernandez R.D. Population genetic simulation study of power in association testing across genetic architectures and study designs.Genet. Epidemiol. 2020; 44: 90-103Crossref Scopus (3) Google Scholar In each of these studies, this model was used as an example of a complex population history. Zhou et al.10Zhou Y. Tian X. Browning B.L. Browning S.R. POPdemog: visualizing population demographic history from simulation scripts.Bioinformatics. 2018; 34: 2854-2855Crossref Scopus (1) Google Scholar used the incorrect model as an example of how their method for visualizing demographic models can support msprime input. Finally, Pfaffelhuber et al.11Pfaffelhuber P. Grundner-Culemann F. Lipphardt V. Baumdicker F. How to choose sets of ancestry informative markers: A supervised feature selection approach.Forensic Sci. Int. Genet. 2020; 46: 102259Abstract Full Text Full Text PDF PubMed Scopus (4) Google Scholar used simulations of the incorrect model demography to evaluate their method for choosing ancestry-informative markers. Given the very subtle effect of the incorrect model on demography (and the fact the method was evaluated with other simulations and real data), it seems unlikely that the model details had any qualitative effect on their conclusions. This long-standing error could have been prevented by better API design. To model a population split currently in msprime, a user must specify a "MassMigration" event that moves lineages from one population to another and then must also remember to turn off migration between those populations at the same time. The demographic events for the correct model are given in Figure 1B. Note that the migration rate change on line 17 was missing in the incorrect model. The release of msprime version 1.0 will introduce a "PopulationSplit" event, which more intuitively links the movement of lineages with appropriate changes in migration rates at the time of the split. In another publication using this model,6Martin A.R. Gignoux C.R. Walters R.K. Wojcik G.L. Neale B.M. Gravel S. Daly M.J. Bustamante C.D. Kenny E.E. Human demographic history impacts genetic risk prediction across diverse populations.Am. J. Hum. Genet. 2017; 100: 635-649Abstract Full Text Full Text PDF PubMed Scopus (485) Google Scholar a separate error was introduced: the model itself was defined as suggested in the documentation (with updated parameters from Gravel et al.12Gravel S. Henn B.M. Gutenkunst R.N. Indap A.R. Marth G.T. Clark A.G. Yu F. Gibbs R.A. Bustamante C.D. 1000 Genomes Project. Demographic history and rare allele sharing among human populations.Proc. Natl. Acad. Sci. USA. 2011; 108: 11983-11988Crossref PubMed Scopus (379) Google Scholar) and inspected with the msprime debugging tools. Despite these initial checks, the simulation was performed without passing the list of demographic events, such as historical changes in size or population splits, so that the three populations never merged and remained separated with low levels of migration (Figure 3A). This leads to a vast overestimate of the divergence across human populations. Although the correct model predicts a mean FST of 0.05−0.10 across the three populations, the simulated model generated FST ranging between 0.3−0.6, depending on the populations considered. Overall diversity was also strongly affected: heterozygosity was more than doubled in African populations and reduced by more than half in Eurasian populations relative to the correct model. This simulation was performed to assess the transferability of polygenic risk scores across human populations. In other words, it sought to explore how human demographic history and population structure affect our ability to predict genetic risk in diverse populations given the well-documented unequal representation in medical genetic studies.13Popejoy A.B. Fullerton S.M. Genomics is failing on diversity.Nature. 2016; 538: 161-164Crossref PubMed Scopus (641) Google Scholar The resulting publication has been influential in the discussion of health inequalities and genomics: it has been cited over 350 times since 2017. The large excess in divergence under the incorrect model here exaggerated the role of demography and genetic drift in limiting the transferability of genetic risk scores across populations. Difficulties in transferability remain in the corrected model (Figure 3), although risk prediction in each population is significantly improved (compare to Figure 5 in Martin et al.6Martin A.R. Gignoux C.R. Walters R.K. Wojcik G.L. Neale B.M. Gravel S. Daly M.J. Bustamante C.D. Kenny E.E. Human demographic history impacts genetic risk prediction across diverse populations.Am. J. Hum. Genet. 2017; 100: 635-649Abstract Full Text Full Text PDF PubMed Scopus (485) Google Scholar). Simulations under the correct model indicate that the accuracy of genetic risk scores is still substantially reduced in understudied populations (Figures 3E–3G), supporting one of the main conclusions of Martin et al.6Martin A.R. Gignoux C.R. Walters R.K. Wojcik G.L. Neale B.M. Gravel S. Daly M.J. Bustamante C.D. Kenny E.E. Human demographic history impacts genetic risk prediction across diverse populations.Am. J. Hum. Genet. 2017; 100: 635-649Abstract Full Text Full Text PDF PubMed Scopus (485) Google Scholar However, the reduction is much less pronounced than reported. In particular, we do not observe large differences in mean predicted risk across populations (Figures 3C and 3D) that were present in data and simulations from Martin et al.6Martin A.R. Gignoux C.R. Walters R.K. Wojcik G.L. Neale B.M. Gravel S. Daly M.J. Bustamante C.D. Kenny E.E. Human demographic history impacts genetic risk prediction across diverse populations.Am. J. Hum. Genet. 2017; 100: 635-649Abstract Full Text Full Text PDF PubMed Scopus (485) Google Scholar Thus, a model with continental-scale demographic structure, highly polygenic architecture, and neutral evolution does not appear to explain the large directional biases in mean predicted risk identified in Martin et al.6Martin A.R. Gignoux C.R. Walters R.K. Wojcik G.L. Neale B.M. Gravel S. Daly M.J. Bustamante C.D. Kenny E.E. Human demographic history impacts genetic risk prediction across diverse populations.Am. J. Hum. Genet. 2017; 100: 635-649Abstract Full Text Full Text PDF PubMed Scopus (485) Google Scholar From a software perspective, this error was easy to make and could have been prevented by better API design. The original msprime API requires the user to pass three separate parameters to specify a demographic model (see example in Figure 1C). Note that the same three parameters must be passed to both the debugger and the simulate function. To help prevent such errors, msprime version 1.0 will introduce a demography class. The above snippet would then be rewritten (approximately) as shown in Figure 1D. This simple change to the interface makes it much less likely that different models are passed to the debugger and simulator, reducing the potential for error. The implementation of complex demographic models is error prone, and such errors can have a large impact on downstream analyses and interpretation. The discovery and correction of the demographic models discussed here underscore how API design choice can lead to the propagation of mistakes that are difficult to notice. There are myriad other ways to get models slightly or very wrong. Migration rates can be set in the wrong direction, for example, and parameters can be rescaled incorrectly when a new mutation rate is used. We therefore recommend the following steps to ensure more robust simulations.
Referência(s)