Enchilada redux: how complete is your genome sequence?
2008; Wiley; Volume: 179; Issue: 2; Language: English
DOI: 10.1111/j.1469-8137.2008.02527.x
ISSN: 1469-8137
Authors: Renyi Liu, Jeffrey L. Bennetzen
Topic(s): RNA and protein synthesis mechanisms
Abstract: In 2004, Jorgensen made a pungent argument for sequencing entire plant genomes, entitled ‘Sequencing maize: just sample the salsa or go for the whole enchilada?’ (Jorgensen, 2004). For some of us, the genes in the genome are the enchilada, and the 80–90% or more of repeats in genomes such as those of maize and barley amount to eight to nine orders of refried beans. Although we like these refritos more than most, we think that scientists would digest the data better, and produce less gaseous reports, if they mostly concentrated on the genes (as most do).

Now, as multiple plant genomes proceed towards completed or draft full-genome status, it seems a timely moment to revisit this issue. Although the question of completeness is largely moot with current technology (no higher eukaryotic genome has been fully sequenced, although the nematode Caenorhabditis elegans and rice (IRGSP, 2005) come close), the argument has now shifted to consider what degree of incompleteness is tolerable. We think that this discussion, still somewhat a matter of personal philosophy, needs to continue. Most importantly, however, we feel that investigators need to know to what degree the sequence that has been generated and assembled actually approaches completion. We propose here a simple and low-cost method to determine the level of genome sequence completeness, using the Arabidopsis thaliana genome as an example.

Back in ancient genome-sequencing days (i.e. 2000), when Arabidopsis provided the first comprehensive plant genome analysis, most sequencing involved clone-by-clone (e.g. BAC-by-BAC) sequencing and assembly across a minimum tiling path (MTP). Some genome projects, like the ongoing maize genome-sequencing project, continue to follow this approach. However, it is now accepted that a certain level of shotgun genome sequence from whole-genomic DNA is a vital adjunct to clone-by-clone approaches, in order to account for sequences not fully represented in the MTP.
This idea had not yet been fully conceived when the Arabidopsis genome was published, so there were no such accompanying data. The participants in the Arabidopsis Genome Initiative (2000) knew that their c. 115 Mb assembly of largely contiguous sequence did not cover all of the genome, and they estimated that approx. 10 Mb of DNA had been missed (mostly in pericentromeric, centromeric and ribosomal DNA regions). This estimate appears to have under-represented the true case. By generating a detailed physical map of Arabidopsis centromeres, Hosouchi et al. (2002) predicted that > 30 Mb of the genome had not yet been sequenced. By using an unrelated approach (nuclear microfluorescence, with flanking genome size standards (c. 100 Mb of C. elegans and c. 175 Mb of Drosophila melanogaster)), Bennett et al. (2003) predicted that the Arabidopsis genome in the Columbia ecotype was c. 157 Mb, indicating an absence of > 40 Mb of data, or > 25% of the genome! These two techniques have their own issues, however, and might be inaccurate in either direction.

Another approach to determine genome sequence assembly coverage is to compare the results of a whole-genome shotgun sequence analysis with the assembled sequence. If the shotgun sequence data are truly random and present in sufficient quantity, then the percentage of DNA in each sequence type should closely match its percentage in the full-genome assembly, provided that the assembly is complete. For instance, if 25% of a genome comprises some specific repeat (as shown by the shotgun analysis), then this repeat should comprise 25% of the completed assembly. In order to test this idea, we sheared, cloned and sequenced DNA from the same source of the Columbia Arabidopsis ecotype that was used for the full-genome sequence, yielding 1583 high-quality random reads and c. 1.36 Mb of data (Liu, 2005). The sequences of these clones indicated that c. 18.3% of the genome consisted of identified repeats, compared with 7.7% in the Arabidopsis genome sequence assembly that was available in 2005. If one assumes that all of the missed DNA comprises known repeats, then the minimum Arabidopsis genome size can be calculated to be slightly larger than 134 Mb. Of course, some of the missed sequences are likely to be unknown (e.g. low-copy-number and/or centromere-specific) repeats, or even genes. Hence, our prediction is a minimum: if the shotgun sample is representative, the genome cannot be smaller than this size. We therefore predict that the original genome size estimate for Arabidopsis was not too inaccurate, with somewhat over 19 Mb of missed sequence.

As a side benefit, a small additional analysis of the shotgun genome sequence data also allows a prediction to be made of how many genes may have been missed by an assembly. Once again, using Arabidopsis as an example, we found no gene candidates in our data set that were not identified in the Arabidopsis genome sequence. From our reconstruction experiments with known genes, using data with the size distribution present in our 1583 reads, we would have a c. 66% likelihood of identifying genes that are present on these reads. Taking these factors into account, we predict, with a 95% confidence level, that < 250 genes were missed in the Arabidopsis genome-sequencing project, and we have a certainty of > 70% that 100 genes or fewer were missed (Liu, 2005).

Although we believe that our approach provides a robust minimum estimate of sequences and genes missing from the Arabidopsis genome assembly, we are also aware that specific biases against the successful cloning of some sequences could skew our analysis. For this reason, sequencing approaches that do not involve cloning (e.g. 454 or Solexa technologies) might provide the most appropriate route to pursue such confirmation.
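The arithmetic behind these two estimates can be sketched as follows. This is a minimal illustration, not the procedure of Liu (2005): the function names are hypothetical, and both models are assumptions. The first function assumes that all non-repeat DNA is already in the assembly, so that the whole shortfall is known repeats. The text does not state the size of the 2005 assembly, so the example uses the c. 115 Mb figure from the 2000 publication as a stand-in and lands near 130 Mb; a hypothetical input of 119 Mb, which is not a figure from the text, would give a value just over 134 Mb. The second pair of functions assumes a simple zero-observation model in which each missed gene has the same independent chance of turning up as a novel candidate in the shotgun reads. Under that assumption, the per-gene chance implied by "95% confidence of fewer than 250 genes" gives a confidence of roughly 70% for 100 genes or fewer, close to the figure quoted.

```python
def min_genome_size_mb(assembly_mb, repeat_frac_assembly, repeat_frac_shotgun):
    """Minimum genome size (Mb), ASSUMING every missed base is a known repeat.

    The non-repeat DNA is taken to be fully present in the assembly, and the
    shotgun repeat fraction is taken as the true genome-wide repeat fraction.
    """
    if not 0 <= repeat_frac_assembly <= repeat_frac_shotgun < 1:
        raise ValueError("expected 0 <= assembly fraction <= shotgun fraction < 1")
    non_repeat_mb = assembly_mb * (1 - repeat_frac_assembly)
    # The non-repeat DNA must make up (1 - shotgun repeat fraction) of the genome.
    return non_repeat_mb / (1 - repeat_frac_shotgun)


def implied_detection_prob(n_missed_bound, confidence):
    """Per-gene detection chance implied by a bound on the number of missed genes.

    ASSUMED model: no novel genes were seen, and each missed gene would have
    been seen independently with probability p, so (1 - p) ** n = 1 - confidence.
    """
    return 1 - (1 - confidence) ** (1 / n_missed_bound)


def confidence_at_most(n_missed, detection_prob):
    """Confidence that no more than n_missed genes exist, given zero were seen."""
    return 1 - (1 - detection_prob) ** n_missed


if __name__ == "__main__":
    # 115 Mb is the 2000 assembly size, used here only as a stand-in.
    size = min_genome_size_mb(115.0, 0.077, 0.183)
    print(f"minimum genome size: {size:.1f} Mb")  # roughly 130 Mb

    p = implied_detection_prob(250, 0.95)
    print(f"implied per-gene detection chance: {p:.4f}")
    print(f"confidence for <= 100 missed genes: {confidence_at_most(100, p):.2f}")
```

Keeping the assembly size as a parameter makes it easy to rerun the estimate against whichever assembly release the repeat fraction was actually measured on.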
The ease of generating such an analysis of genome sequence and assembly completion, particularly when most current sequencing projects already contain large dollops of random shotgun sequence data, is difficult to overstate. We propose that such an analysis should be an absolute publication requirement for all future genome-sequencing projects (including those in plants) that describe a ‘completed’ genome sequence. Although we can, and should, argue about whether we want our genome projects to concentrate on the salsa, enchilada or frijoles, we should all agree that it is necessary to know how much of the feast that we have ordered, and paid for, has been set on the table.