Phylogenomics — principles, opportunities and pitfalls of big‐data phylogenetics
2019; Wiley; Volume: 45; Issue: 2
DOI: 10.1111/syen.12406
ISSN: 1365-3113
Authors: Andrew D. Young, Jéssica P. Gillung
Phylogenetics is the science of reconstructing the evolutionary history of life on Earth. Traditionally, phylogenies were constructed using morphological data only, but the introduction of Sanger sequencing in the late 1970s, followed by PCR in the 1980s, enabled genetic information to be incorporated into phylogenetic analyses. Early phylogenetic studies employing multilocus analyses contributed greatly to our knowledge of phylogenetic history and challenged some well-established views of the relationships among many groups of plants and animals. Since the publication of these pioneering studies, significant methodological advances in both sequencing and analytical techniques have been made, and molecular phylogenies are now broadly accepted to represent robust hypotheses of organismal relationships. Next-generation sequencing techniques, developed in the mid-2000s, revolutionized DNA sequencing and led to a dramatic reduction in sequencing cost per nucleotide and a sharp increase in data generation speed. As a result, the generation of unprecedented amounts of sequence data for both model and nonmodel organisms has become affordable. This development has transformed the field of molecular phylogenetics into phylogenomics—where genome-scale data are obtained from multiple samples at once at a much reduced cost (Mardis, 2011).

The phylogenomic pipeline can be very complex, presenting an overwhelming array of methodologies available for the acquisition, manipulation, analysis and interpretation of massive datasets. Researchers also have to overcome the challenges of sequencing strategy design, identification of orthologous loci, model selection and phylogeny estimation. This can be particularly daunting for researchers new to the field—both students and established scientists—who wish to delve into novel methods and data to reconstruct the evolution of their study group. Here we present an entry-level overview of the theory and tools that are central to phylogenomics, with an emphasis on the appropriate application of techniques useful for phylogenetic analysis of genomic data. We focus on the sequencing technologies and statistical methods for phylogeny estimation, and the software implementing these methods and their application to large molecular datasets. We also discuss the tools and tradeoffs for improving the accuracy of phylogenomic analyses, including the biological and methodological sources of systematic error in phylogeny estimation. Finally, we provide a glossary of commonly encountered terms used in phylogenomics that may be useful for those entering the field and hoping to sort through the multitude of methods, analytical tools and terminology inherent to this relatively new, but rapidly advancing field.

The word 'phylogenomics' was first introduced in the context of prediction of gene function for genome-scale data (Eisen, 1998), and soon after in the context of phylogenetic inference (O'Brien & Stanyon, 1999). The discipline of phylogenomics owes its existence to the advances made in DNA sequencing technology over the past two decades (Metzker, 2010). It comprises several areas of research at the interface between molecular and evolutionary biology and has two major goals: (i) to infer phylogenetic relationships between taxa and gain insights into the mechanisms of molecular evolution; and (ii) to use multispecies phylogenetic comparisons to infer putative functions for DNA or protein sequences.
Traditional Sanger sequencing studies include relatively few loci and are therefore limited by stochastic or sampling error. Because only a relatively small number of phylogenetically informative characters is available in one or a few genes, this random 'noise' influences the inference of backbone nodes, potentially leading to poorly resolved or poorly supported phylogenetic trees. This problem can be addressed successfully by using much larger amounts of sequence data. Modern phylogenomic analyses, which take advantage of hundreds to thousands of loci from across the genome, are, on average, orders of magnitude larger than traditional Sanger sequencing datasets. The size of these datasets therefore significantly reduces the impact of stochastic error and of data availability as a limiting factor, offering great promise for resolving historically recalcitrant nodes in the tree of life.

High-throughput sequencing technologies [also called next-generation sequencing (NGS)] (Fig. 1) have yielded genome-scale data in immense quantities. Next-generation sequencing technologies differ fundamentally from the Sanger method in that they allow for massively parallel DNA sequencing, providing extremely high throughput from multiple samples simultaneously and at a much reduced cost (Mardis, 2011). Millions to billions of DNA nucleotides can be sequenced in parallel, yielding orders of magnitude more data and minimizing the need for the fragment-cloning methods that are used with Sanger sequencing (Fig. 1). Recent progress in NGS technology and the rapid development of bioinformatics tools now allow research groups of any size to generate large amounts of genomic sequences for organisms of interest. High-throughput sequencing can be used for whole-genome sequencing (Lam, 2012), whole-transcriptome shotgun sequencing (also called RNA sequencing, RNA-seq, or transcriptomics; Wang, 2009), whole-exome sequencing (Rabbani, 2014), and reduced-representation genome sequencing (also called target enrichment) (e.g., Faircloth, 2012; Lemmon, 2012). Table 1 summarizes the most commonly used sequencing technologies in phylogenomics. For more details on these different technologies see the Beginner's Handbook of Next Generation Sequencing by Genohub (https://genohub.com/next-generation-sequencing-handbook/) (see also Ambardar, 2016; Besser et al., 2018, and references therein). Choosing the appropriate sequencing technology for a phylogenomic study has important effects on downstream workflows, especially in terms of read length, as library preparation in some phylogenomic techniques (e.g. ultraconserved elements and anchored hybrid enrichment, discussed later) requires a size-selection step.

Strict experimental reproducibility is an integral—albeit uncommon—aspect of the biological sciences, largely because of the technical challenges involved in implementing and curating experimental methods and procedures. Despite the importance of phylogenetic analyses to most fields of biology, the reproducibility of phylogenetic experiments can be very low, with an estimated 60% of published phylogenetic analyses being 'lost to science' due to the unavailability of the underlying data and methods (Magee, 2014). Published phylogenetic studies can be difficult or impossible to replicate or expand upon, as the analytical software, software versions, software parameters, dependencies and operating system versions used can be very challenging to uncover or recreate.
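One practical response to this problem, assuming a Python-based workflow, is to write the exact software versions, parameters and platform details of each analysis step to a machine-readable log as the analysis runs. The sketch below is only a minimal illustration; the file name, tool names, versions and parameter strings are hypothetical examples, not a prescribed standard.

```python
# Minimal provenance log: capture the software versions, parameters and
# platform details that published studies so often fail to report.
# The tool name, version and parameter string below are hypothetical examples.
import json
import platform
from datetime import datetime, timezone

record = {
    "step": "read_trimming",
    "software": {"trimmomatic": "0.39"},            # version actually run
    "parameters": "SLIDINGWINDOW:4:20 MINLEN:36",   # exact options used
    "operating_system": platform.platform(),
    "python_version": platform.python_version(),
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

# Append one JSON record per analysis step so the whole workflow can be replayed.
with open("analysis_provenance.jsonl", "a") as log:
    log.write(json.dumps(record) + "\n")
```

Kept alongside the raw data and scripts, such a log makes it far easier for later workers to recover the exact conditions under which a result was produced.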
The promotion of open science and reproducible research can create a more productive and responsible scientific culture in phylogenomics, enabling researchers to build upon previous studies and continuously address larger and more complex questions. This philosophy encompasses the sharing of the data and code used to produce the analysis, as well as open archiving of all raw data (Mork, 2015; Shade & Teal, 2015). Data provenance, the recording of the input and transformation of information used to generate a result, is a key issue in reproducibility. Several recommendations and guidelines to promote best practices in reproducibility and data management in phylogenomics and bioinformatics have been proposed (Cranston et al., 2014; Magee, 2014; Debiasse & Ryan, 2019), and many tools for ensuring provenance and curation of both data and methods have been developed (e.g. Dunn, 2013; Oakley, 2014; Szitenberg, 2015). To ensure best practices in phylogenomics and bioinformatics, it is vital that reproducibility checkpoints are enforced—places in a workflow devoted to scrutinizing its integrity, so that results are validated across multiple iterations and remain consistent. Additionally, adopting an iterative, branching workflow to systematically explore the methodological space is crucial. Linear methodology, with experimental and computational procedures lined up one after the other, as presented in most published studies, is rarely the reality of phylogenetic analysis. Instead, estimating phylogenetic trees is more often than not a messy enterprise, and a systematic exploration of the methods and data is recommended in order to select the best tools and pipelines to answer the question at hand. Finally, for good provenance of the experimental procedures and computational tools used in a particular study, it is highly recommended that comprehensive notes are kept throughout the process. In particular, keeping a 'readme' file at every step can be extremely helpful for tracking the software versions and parameter values used, the goal of each step and how it relates to the software used, as well as any changes in data format. All of these practices can greatly contribute to standardization and ease of downstream efforts (Shade & Teal, 2015).

Phylogenomic data are a precious scientific resource: molecular sequence alignments and phylogenies are expensive to generate, difficult to replicate and have seemingly infinite potential for synthesis and reuse. For most phylogenomic analyses, phylogeneticists are faced with a large combination of algorithms, models and data manipulation techniques. To address this issue, here we present a flowchart containing the major steps and tools utilized in phylogenomics (Fig. 2). The flowchart is not meant to be exhaustive, but merely a visualization of the commonly utilized methodologies and pipelines in recent phylogenomic studies.

Taxon sampling is of extreme importance for phylogenetic inference, and increased sampling of taxa—coupled with increased sampling of loci—is commonly advocated as a solution to resolving recalcitrant nodes of the tree of life. Ideally, sampling of both taxa and sequences should be increased at the same pace, but the advances in high-throughput sequencing have caused increases in gene sampling to far outpace taxon sampling.
As greater amounts of data are incorporated into phylogenetic studies, new evidence and hypotheses regarding relationships among taxa can emerge, and the placement of lineages within clades can change dramatically. Taxon sampling can thus greatly influence the hypotheses supported by phylogenetic inference (Rosenberg & Kumar, 2003; Nabhan & Sarkar, 2012). Taxon selection meant to address a specific research question should take place early in a phylogenomic study. 'Sufficient' taxon sampling is always dependent on the questions being addressed. Ideally, in order to unravel the phylogeny of an entire taxonomic unit, most, if not all, subordinate taxa in that unit should be sampled. Even though increasing the number of taxa results in a more complex computational problem for phylogenetic analysis, it has been demonstrated that denser taxon sampling improves phylogenetic accuracy (Heath et al., 2008). Taxon sampling, however, can be greatly limited by the phylogenomic method of choice. Transcriptomics, for instance, requires specimens collected and stored directly in liquid nitrogen or RNAlater, whereas other sequencing methods, including target enrichment, and shotgun and exome sequencing, require molecular-grade specimens, preferably preserved in high-grade ethanol and stored in a laboratory ultrafreezer. A notable exception to this is target enrichment of ultraconserved elements (UCEs), a method that can successfully generate phylogenomic data from old, pinned insect museum specimens (Blaimer, 2016).

Genome-scale projects may be particularly vulnerable to systematic error caused by nonproportional phylogenetic sampling. As dataset size increases, so does the accumulation of nonrandom systematic error and accompanying nonphylogenetic signal (Jeffroy, 2006). Bayesian analyses of macroevolutionary patterns—including divergence-time estimation, ancestral state reconstruction, and diversification rate estimation—assume proportional sampling of lineages within a clade, and deviations from it may potentially lead to biases (Stadler, 2009). However, some implementations enable 'corrections' for uneven taxon sampling (e.g. revbayes implements corrections for birth-death and various diversification rate models, except for the fossilized birth-death model).

Before sequencing new specimens, it is also worth evaluating previously sequenced resources. The National Center for Biotechnology Information's Sequence Read Archive (NCBI SRA) contains user-uploaded raw sequence data and alignment information from high-throughput sequencing projects (Leinonen, 2011). Other resources include FlyBase (Thurmond, 2019), a large database of Drosophila genes and genomes, WormBase (https://www.wormbase.org), containing genomic data of Caenorhabditis elegans and related nematodes, and the UCSC Genome Browser (Kent et al., 2002), a large repository of mostly vertebrate genomes. Utilizing sequences from these databases can save money and/or increase taxon sampling in ongoing phylogenomic projects.

For a comprehensive overview of insect DNA methods, see Moreau (2014), which offers a detailed description of DNA extraction methods using either commercial kits or phenol/chloroform protocols. After DNA extraction, specimens should be deposited in publicly accessible collections in association with their unique identifiers, and publications utilizing these data should always include the unique identifier, repository and specimen metadata (including specimen collector, date and method of collection, and geographic origin).
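As an illustration of the kind of specimen metadata worth capturing at this stage, the sketch below stores a single voucher record as a Python dictionary and writes it to a CSV file. The field names follow the recommendations above, but the identifier and all values are invented for illustration.

```python
# One hypothetical specimen/voucher record with the metadata fields recommended
# above (unique identifier, repository, collector, collection date and method,
# geographic origin). All values are invented for illustration.
import csv

voucher = {
    "unique_identifier": "EX-ENT-0012345",   # hypothetical alphanumeric database number
    "repository": "Example University Insect Collection",
    "taxon": "Drosophila sp.",
    "collector": "A. Collector",
    "collection_date": "2018-07-14",
    "collection_method": "Malaise trap",
    "geographic_origin": "California, USA",
    "extraction": "nondestructive",
}

with open("voucher_metadata.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=list(voucher))
    writer.writeheader()
    writer.writerow(voucher)
```

Exporting such records in a simple tabular format makes it straightforward to include them as supplementary material and to link sequence accessions back to physical vouchers.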
Vouchering specimens with unique identifiers (alphanumeric database numbers) is crucial for all phylogenetic projects. Nondestructive or partially destructive DNA extraction methods should therefore be used whenever possible; in these cases, the extracted specimen itself becomes the voucher. By contrast, when nondestructive DNA extraction is not possible, such as in transcriptomic projects or for small-bodied organisms, a photographic voucher can be associated with the sequence data. Moreover, when the destroyed specimen is part of a sample of conspecifics (e.g. in communal or social insects), another specimen from the same sample can serve as a voucher, provided it is made clear that it is not the extracted specimen. Properly vouchering specimens used for DNA extraction greatly increases reproducibility by alleviating issues related to sample identity and unstable taxonomy (Pleijel et al., 2008; Turney, 2015).

Although large phylogenomic datasets have become increasingly accessible and cost-efficient in recent years, it is now widely accepted that simply increasing the amount of sequence data will not unambiguously resolve some of the most difficult nodes in the tree of life, mainly due to systematic error from nonphylogenetic signal or model inadequacy. Appropriate locus selection is therefore crucial in phylogenomics, but knowledge of the best molecular markers for resolving difficult branches at various evolutionary depths is still incipient. Questions remain about whether to use coding or noncoding sequence data, conserved or highly variable loci, and long or short alignments (Betancur-R. et al., 2014; Edwards et al., 2016; Chen et al., 2017). One of the most critical decisions in a phylogenomic project is therefore the sequencing method to be utilized, a decision that must be made a priori, as each method yields a different type of genomic data. Different methods have their own characteristics, advantages and limitations, including cost-effectiveness, ease of use, sample quality required, and downstream data filtering and analysis workflow.

Phylogenomic sequencing methods (Table 2) can be broadly subdivided into shotgun sequencing and target enrichment sequencing. Shotgun sequencing is the process of sequencing the entire fragmented genome at random, returning part or all of the genome depending on the sequencing depth achieved, whereas target enrichment uses bidirectional probes (analogous to primers in Sanger sequencing) to recover only genomic regions of interest. Popular methods of shotgun sequencing include genome skimming, whole-genome shotgun sequencing and transcriptome sequencing (i.e. RNA-seq). Popular methods of target enrichment for phylogenetics include anchored hybrid enrichment (AHE) (Lemmon, 2012) and UCEs (McCormack et al., 2012; Faircloth et al., 2012) [see also Mandel (2014) for an alternative method developed for plants in the family Compositae]. These techniques are reviewed briefly in Table 2 and have been covered in more detail elsewhere (e.g. Lemmon & Lemmon, 2013; McCormack, 2013; Wen et al., 2015; Zhang et al., 2019).

Shotgun sequencing (Fig. 3) involves fragmenting template DNA into short pieces, which are then randomly sequenced to obtain reads. Next, various methods and software are used to overlap different reads and assemble them into a longer DNA sequence called a contig.
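Because shotgun reads are drawn from the genome at random, how much of the genome is recovered depends on sequencing depth, i.e. the average number of times each base is sequenced (discussed further below). The back-of-the-envelope sketch that follows uses entirely hypothetical numbers and simply illustrates why multi-copy regions such as mitochondrial DNA are recovered even at shallow depths where single-copy nuclear loci are not.

```python
# Expected sequencing depth (coverage) is roughly:
#   depth = (number of reads x read length) / size of the region being sampled.
# All numbers below are hypothetical and for illustration only.
def expected_depth(n_reads, read_length_bp, region_size_bp):
    return n_reads * read_length_bp / region_size_bp

nuclear_genome_bp = 300_000_000   # a hypothetical 300 Mb insect genome
mito_genome_bp = 16_000           # a typical ~16 kb mitochondrial genome
total_reads, read_len = 10_000_000, 150

# Suppose (hypothetically) that 0.5% of reads derive from mitochondrial DNA.
mito_reads = int(total_reads * 0.005)
nuclear_reads = total_reads - mito_reads

print(f"Single-copy nuclear depth: {expected_depth(nuclear_reads, read_len, nuclear_genome_bp):.1f}x")
print(f"Mitochondrial depth: {expected_depth(mito_reads, read_len, mito_genome_bp):.0f}x")
```

With these invented figures the nuclear genome is covered only a few times over, whereas the mitochondrial genome is covered hundreds of times, which is the logic behind genome skimming.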
RNA-seq can be considered a special form of shotgun sequencing, in which whole mRNA is first extracted and reverse-transcribed into complementary DNA (cDNA), which is then sequenced. Sequencing depth, or the average number of times an individual base in the genome is sequenced, is a key concept in shotgun sequencing. Because the genome is sequenced at random, multiple-copy regions of the genome (e.g. mitochondrial, ribosomal, and plastid DNA) are sequenced more frequently than single-copy regions. Therefore, when a genome is sequenced at a relatively shallow depth, only fragments from multiple-copy regions of the genome are sequenced in sufficient quantities to be successfully recovered. Shallow-depth whole-genome shotgun sequencing is also called genome skimming (Straub et al., 2012), a time- and cost-efficient method of sequencing mostly mitochondrial, ribosomal and plastid DNA. Conversely, when near-complete genomes are desired from whole-genome shotgun sequencing, a much greater sequencing depth is required in order to sequence sufficient numbers of fragments from single-copy regions of the genome.

By contrast, in RNA-seq (or transcriptomics) the extracted mRNA is used as a template to generate cDNA. This cDNA is then sequenced, resulting in data generated from only the genomic regions undergoing active transcription at the time of tissue preservation. This method is therefore not only a genome-reduction strategy, but also facilitates the comparison of transcription activity between individual tissues, life stages, rearing conditions, etc. One of the major drawbacks of transcriptomics is the high tissue quality required—specimens must be flash-frozen in liquid nitrogen or collected directly into RNAlater, thus precluding the utilization of specimens already available in tissue collections or museums.

Targeted sequence capture, or target enrichment (Fig. 4), is an umbrella term for multiple efficient, cost-effective methods for generating phylogenomic datasets for nonmodel organisms. These methods effectively reduce genomic DNA complexity through the use of short (60–120 bp), single-stranded nucleotide baits or probes that hybridize with template sequences, thus enabling the recovery of particular sequences of interest with high coverage. As a result, mostly genomic regions of interest are recovered, although nontarget DNA (including mitochondrial and symbiont sequences) can be present in the resulting set of reads. Multiple samples can be multiplexed and sequenced together, which enables the generation of DNA sequence data for hundreds of loci from over 100 samples simultaneously.

There are two main methods of target enrichment commonly used in animal phylogenomics: AHE (Lemmon et al., 2012) and UCEs (McCormack et al., 2012). Both methods are reduced-representation approaches that rely on the utilization of a phylogenetically informative subset of the study organisms' genomes. Other target enrichment methods have been proposed, with varied locus selection criteria, including Hyb-Seq (Weitemier et al., 2014), Compositae COS loci (Mandel et al., 2014) and RELEC (Karin, 2019), but AHE and UCEs have thus far been the most commonly used methods in phylogenomic studies of animals. In these methods, probes hybridize in-solution with targeted anchor sequences, which are subsequently enriched.
Both the genomic regions targeted by the probes and their flanking regions are sequenced, such that both conserved and more variable (and thus more phylogenetically informative) regions are sequenced at once. These reduced-representation methods enable researchers not only to capture the same sets of loci across all taxa of interest, but also to exclude repetitive or phylogenetically misleading regions of the genome, including pseudogenes and paralogues. A great benefit of consistently using the same sets of markers across studies is that it allows for meta-analyses as more phylogenomic data accumulate over time. By contrast, for RNA-seq datasets to be consistent, the transcriptomes must be gathered from the same tissue types, and it can be challenging to accurately assess orthologues and align different isoforms during analysis.

Anchored hybrid enrichment sequencing targets mainly protein-coding regions of the genome, meaning that enriched loci comprise mostly coding sequence, and in some cases intronic or other genomic elements (e.g. untranslated regions). This means that phylogenetic data can be more easily coded and analysed both as nucleotides and as amino acids. Candidate target loci are identified using genomes, transcriptomes or raw genomic reads from two or more reference species. Then, sequences for the target loci for each reference species to be included in the probe kit design are isolated, and an alignment for each locus is generated. Probes are developed based on these alignments, with a substantial amount of quality control applied such that target loci are single copy and have the appropriate amount of sequence variation to ensure both phylogenetic accuracy and efficient enrichment (Lemmon, 2012). Full probe kits target 500–800 protein-coding loci on average, and marker genes (also called traditional or legacy genes, e.g. COI) are often included in the target pool of loci, which facilitates integration with previous Sanger sequencing phylogenetic studies.

Ultraconserved elements, in turn, are highly conserved regions of the genome shared among evolutionarily distant taxa. As universal genetic markers for particular taxa of interest, UCE data are useful for reconstructing the evolutionary history and population-level relationships of many organisms. In short, UCEs are identified by aligning several genomes to each other, with subsequent detection and filtering of areas of very high (95–100%) sequence conservation across all taxa of interest. There are a number of different ways of identifying UCEs for use as genetic markers and designing baits to target them, but the most commonly used approach in animal phylogenetics was described in detail in Faircloth (2017). Ultraconserved elements have been shown to perform well when collecting data from museum samples (Blaimer et al., 2016; McCormack et al., 2016), which can greatly facilitate expansion of taxon sampling, as sequencing is no longer restricted to fresh specimens. Although older specimens generally yield fewer and shorter loci, it is still possible to retrieve hundreds of markers from relatively old specimens (McCormack et al., 2016). A major advantage of UCEs over AHE data is phyluce, a user-friendly, open-source software pipeline developed for the processing and analysis of UCE data (Faircloth, 2016). phyluce contains several software packages and tutorials that are extremely helpful and accessible, especially to beginners.
These resources, however, require some familiarity with working in the command-line environment of a Unix or Unix-like system. For a comprehensive and user-friendly guide to working on the command line see the Happy Belly Bioinformatics tutorials (https://astrobiomike.github.io/unix/). For a comprehensive and informative overview of the theory and practice of UCEs for arthropod phylogenomics, see Zhang et al. (2019).

Performing quality control on raw reads obtained from high-throughput sequencing is a crucial, yet sometimes overlooked step. Reads should ideally be inspected for sequence quality, guanine-cytosine (GC) content, presence of adapter sequences and read errors (i.e. base-calling errors and small insertions/deletions). Several programs have been proposed for quality control of NGS data, including fastqc for Illumina reads (Andrews, 2010) and ngsqc for reads from all other platforms (Dai et al., 2010). These two resources offer a means to detect and visualize potential errors, after which a complementary program, such as trimmomatic (Bolger et al., 2014), can be used to trim low-quality bases and remove adapter contamination. Although trimmomatic offers an all-in-one solution to sequence quality control and is the most commonly used tool in animal phylogenomics, other similar programs have been proposed. cutadapt, for instance, is used only for read trimming, but is especially useful for working with sequences obtained from Applied Biosystems' SOLiD sequencer (Martin, 2011). Likewise, scythe is a Bayesian approach for removing adapters from the 3′ end of sequences, where read quality is often degraded (https://github.com/vsbuffalo/scythe). sickle is a standalone program used only for read quality trimming (Joshi & Fass, 2011). Finally, htstream is a comprehensive quality-control pipeline that allows streaming from application to application (https://github.com/ibest/HTStream). The software can simultaneously handle both single-end and paired-end reads and enables process parallelization.

Once quality control has been performed on raw reads, the next step is the assembly of contigs. Sequence assembly refers to aligning and merging small DNA fragments obtained from a high-throughput sequencing platform in order to reconstruct longer DNA sequences. Sequence assembly is necessary because whole genomes cannot be sequenced in one go; rather, small pieces of DNA between 20 and 30 000 bases in length are sequenced at a time, depending on the technology used. These short DNA pieces are called reads, and these reads are then assembled into longer DNA sequences called contigs. There are two main techniques of genomic assembly: de novo and reference-based. De novo assembly methods consist of constructing, simplifying and resolving de Bruijn graphs to extract contigs, and no reference genome is needed [see Compeau (2011) for a general introduction to de Bruijn graphs]. In the case of reference-based methods, a previously assembled genome is used as a reference to which sequenced reads are independently aligned. Ultimately, almost every read is placed at its most likely position, and, in contrast to de novo assembly, information from overlapping reads is not combined. De novo assembly is often the preferred method in phylogenomics, as it does not require a fully assembled reference genome. A multitude of methods for de novo assembly have been developed, and the field is constantly evolving.
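To make the de Bruijn idea concrete, the sketch below is a toy example (far simpler than any real assembler such as velvet or spades): it breaks error-free reads into k-mers, links each k-mer's prefix to its suffix, and walks the resulting graph to spell out a contig. The reads and k value are invented.

```python
# Toy de Bruijn graph assembly: each read is decomposed into overlapping k-mers,
# nodes are (k-1)-mers, and edges connect a k-mer's prefix to its suffix.
# Real assemblers also handle sequencing errors, repeats and coverage; this does not.
from collections import defaultdict

def de_bruijn_edges(reads, k):
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])   # prefix -> suffix
    return graph

def walk(graph, start):
    """Greedily follow unused edges from 'start' to spell out a contig."""
    contig, node = start, start
    while graph[node]:
        node = graph[node].pop()
        contig += node[-1]
    return contig

reads = ["ATGCGT", "GCGTAC", "GTACGA"]   # hypothetical error-free reads
graph = de_bruijn_edges(reads, k=4)
print(walk(graph, "ATG"))                # prints: ATGCGTACGA
```

Real genomes produce graphs with millions of nodes, branches caused by repeats, and spurious edges from sequencing errors, which is why simplification and error-correction steps dominate practical assemblers.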
Performance of different methods depends greatly on the data type and species to be assembled, and each method has its own set of tradeoffs, including the computer memory needed and the computational complexity exhibited. The most commonly used programs for de novo assembly in phylogenomics include velvet (Zerbino & Birney, 2008), trinity (Grabherr, 2011), soapdenovo (Li et al., 2010), spades (Bankevich, 2012) and abyss 2.0 (Jackman et al., 2017). For a comprehensive comparison of these assemblers, see Narzisi & Mishra (2011) and Hölzer & Marz (2019). Finally, the software atram 2.0 (Allen et al., 2015) enables easy manipulation of whole-genome, genome-skimming and transcriptomic data. The program implements most of the aforementioned assemblers in a pipeline that enables automation and parallelization of assembly tasks. In addition, the program performs targeted assembly of contigs of interest, which can greatly reduce the computational scale of the assembly problem compared with full genome/transcriptome de novo assembly, as well as avoiding errors introduced during whole-genome/transcriptome assembly.

As assembly is a major step in any phylogenomic analysis, problems at this stage (incomplete assembly, assembly errors and redundancy) can lead to difficulties in downstream workflows, including orthologue and paralogue identification, alignment and matrix construction. These problems increase the amount of missing data in the final aligned matrix, ultimately limiting the amount of useful data. The best assemblies require high coverage, long read lengths and good read quality. However, current sequencing technologies do not inherently provide all three: for instance, Illumina sequencing produces good-quality but short reads, whereas pacbio sequencing generates very long reads of lower quality (see Table 1).

Phylogenetic relationships should always be estimated from sequences that are related by orthology, i.e. sequences that diverged from their common ancestor through speciation (orthologues) rather than through gene duplication (paralogues) (Fitch, 1970). Genes arising from duplication events complicate the inference of a species tree based on concatenated gene alignments, because the gene tree describing the relationships among paralogues may differ from the species tree. Orthology assessment has become a central problem for evolutionary and molecular biologists. In phylogenetic inference, it is assumed a priori that the loci used to infer relationships are orthologous.
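Many pipelines screen for orthologues using reciprocal best hits: two sequences from two genomes are treated as putative orthologues if each is the other's best-scoring match. The sketch below assumes pairwise similarity scores have already been computed (for example, with a search tool such as BLAST) and illustrates only the reciprocal-best-hit logic itself; the gene names and scores are invented.

```python
# Reciprocal best hits (RBH): sequences a (from genome A) and b (from genome B)
# are called putative orthologues if a's best hit in B is b AND b's best hit in A is a.
# The similarity scores below are invented; in practice they would come from a
# pairwise search tool such as BLAST.

# hits[query] = {subject: score, ...}
hits_A_vs_B = {"geneA1": {"geneB1": 95.0, "geneB2": 60.0},
               "geneA2": {"geneB2": 88.0, "geneB1": 55.0}}
hits_B_vs_A = {"geneB1": {"geneA1": 94.0, "geneA2": 50.0},
               "geneB2": {"geneA2": 87.0, "geneA1": 58.0}}

def best_hit(hit_table, query):
    subjects = hit_table.get(query, {})
    return max(subjects, key=subjects.get) if subjects else None

orthologue_pairs = [
    (a, best_hit(hits_A_vs_B, a))
    for a in hits_A_vs_B
    if best_hit(hits_A_vs_B, a) is not None
    and best_hit(hits_B_vs_A, best_hit(hits_A_vs_B, a)) == a
]
print(orthologue_pairs)   # prints: [('geneA1', 'geneB1'), ('geneA2', 'geneB2')]
```

Reciprocal best hits are only a heuristic; hidden paralogy, gene loss and incomplete assemblies can all produce misleading pairs, which is why dedicated orthology-assessment tools apply additional filters.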