A Comparison of the Celera and Ensembl Predicted Gene Sets Reveals Little Overlap in Novel Genes
2001; Cell Press; Volume: 106; Issue: 4; Language: English
10.1016/s0092-8674(01)00467-6
ISSN 1097-4172
Authors: John B. Hogenesch, Keith A. Ching, Serge Batalov, Andrew I. Su, John R. Walker, Yingyao Zhou, Steve A. Kay, Peter G. Schultz, M. Cooke
Topic(s): Chromosomal and Genetic Variations
Abstract: The recent description of the human genome and the subsequent annotation of putative novel genes have ushered in a new era in biology. One of the revelations of the human genome project was the remarkably consistent prediction that the genome harbors around 30,000 genes. This observation was based on independent analyses done by a public genome consortium (29,691 transcripts, Ensembl v0.8) (Lander et al. 2001), by work done at Celera Genomics (39,114 transcripts) (Venter et al. 2001), and by Green and colleagues using expressed sequence tag (EST) clustering incorporating quality scores (35,000 genes) (Ewing and Green 2000). This conclusion was surprising for two reasons. First, less complex organisms like Arabidopsis (25,000) and C. elegans (19,000) have approximately the same number of genes (C. elegans Sequencing Consortium 1998, Arabidopsis Genome Initiative 2000).
Second, earlier estimates of gene number based on EST clustering and detailed chromosomal analysis were much higher, ranging from 45,000 to 140,000 (Dunham et al. 1999, Fields et al. 1994, Liang et al. 2000, Scott 1999). While the Celera and Ensembl annotation efforts predicted approximately the same number of genes, a direct comparison of the predicted transcript sets has not been made. If the predictions are accurate and complete, then one would expect them to be largely overlapping. To address this point, we compared the predicted transcript sequences from the two genome efforts with each other and with a well-curated set of 11,015 reference transcripts from Refseq using BLAST (Altschul et al. 1990, Pruitt et al. 2000).
Given the difficulty of precisely predicting genes, we chose a permissive clustering method that requires only a short (>100 bp) region with at least 98% identity to combine transcripts into a single cluster. Using this method, transcripts that share only a single average-size exon (∼140 bp; Lander et al. 2001, Venter et al. 2001) cluster together. We first compared the Celera and Ensembl transcripts with the known genes from Refseq. The combined Celera and Ensembl datasets contained a fragment (at least 100 bp) of nearly all known genes (Figure 1). More than 84% of Refseq transcripts contained a match in both datasets, with the remaining Refseq genes matching either Celera (7%) or Ensembl (5%) alone. Surprisingly, when we compared the novel gene predictions that are not represented in Refseq, we found little agreement between the two transcriptomes. Collectively, nearly 80% of the 31,098 novel transcripts were predicted by only one of the groups. Further breakdown of the Celera predicted transcripts shows that nearly all Celera transcripts supported by only a single line of evidence are unique to the Celera predictions. When these are removed from the analysis, 64% of the novel transcripts are predicted by only one group. Taken in sum, these data reveal that the predicted transcripts collectively contain partial nucleotide matches to nearly all known genes, but the novel genes predicted by both groups are largely nonoverlapping.
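As a minimal sketch of the permissive clustering rule described above (not the authors' actual pipeline), the following Python assumes that BLAST hits are available as tuples of (query, subject, percent identity, alignment length). The transcript IDs, the source labels, and the function names are hypothetical. Any two transcripts joined by a hit longer than 100 bp with at least 98% identity are merged by single linkage using union-find, and each cluster can then be labelled by the sources that contribute to it.

```python
from collections import defaultdict

MIN_LEN = 100        # alignment must be longer than 100 bp
MIN_IDENTITY = 98.0  # at least 98% identity


def cluster_transcripts(transcripts, hits):
    """Single-linkage clustering of transcripts linked by qualifying hits.

    transcripts: iterable of transcript IDs (hypothetical names).
    hits: iterable of (query, subject, pct_identity, aln_length) tuples.
    Returns a list of sets, one per cluster.
    """
    parent = {t: t for t in transcripts}

    def find(x):
        # Follow parent links to the cluster representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, pct_identity, aln_length in hits:
        if query not in parent or subject not in parent:
            continue  # ignore hits to transcripts outside the input set
        if aln_length > MIN_LEN and pct_identity >= MIN_IDENTITY:
            parent[find(query)] = find(subject)

    clusters = defaultdict(set)
    for t in parent:
        clusters[find(t)].add(t)
    return list(clusters.values())


def sources_in_cluster(cluster, source_of):
    """Return the set of sources (e.g. Celera, Ensembl, Refseq) in a cluster."""
    return frozenset(source_of[t] for t in cluster)


# Toy data, invented for illustration only.
source_of = {
    "cel1": "Celera", "ens1": "Ensembl", "ref1": "Refseq",
    "cel2": "Celera", "ens2": "Ensembl",
}
hits = [
    ("cel1", "ens1", 99.1, 140),  # qualifies
    ("ens1", "ref1", 98.0, 250),  # qualifies
    ("cel2", "ens2", 97.5, 300),  # identity below 98%, does not qualify
]
clusters = cluster_transcripts(source_of, hits)
for c in clusters:
    print(sorted(c), sorted(sources_in_cluster(c, source_of)))
```

In this toy input, cel1, ens1, and ref1 fall into one cluster that would count as a known gene found by both groups, while cel2 and ens2 remain separate single-source clusters, the situation the text reports for most novel predictions.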
To validate the existence of the transcript predictions, we used RNA expression profiling and a bank of 13 diverse human tissues. The commercial high-density oligonucleotide arrays used are based on Expressed Sequence Tags (ESTs) represented in Unigene (release 95). BLASTN was used to assign the transcript predictions to a Unigene cluster, and the RNA expression pattern was determined for the 8,000 known and 5,000 novel predicted genes with a corresponding Unigene cluster on the arrays (see legend to Figure 2 for details). Using these methods, we found evidence of expression for more than 80% of the known genes in at least one of the tissue samples analyzed (Figure 2A). Similarly, more than 80% of the novel predicted transcripts were detected as expressed in at least one of the 13 human tissues. Hierarchical clustering and visualization of these expression data revealed a similar fraction of tissue-restricted transcripts for both known and novel genes (Figure 2B). These data support the view that the novel transcripts predicted by both groups encode bona fide differentially expressed mRNAs. Since many of these verified transcripts were contained in only one of the two predicted transcriptomes, we conclude that the computational methods used for gene prediction by either group are inadequate and that the respective transcriptomes are individually incomplete.

What could explain the discrepancies in the predicted transcriptomes? The lack of overlap in the predicted transcripts may reflect differences in the underlying genome assembly, or the algorithms and types of evidence used for transcript prediction. While the draft nature of the human genome sequence may contribute to some of these discrepancies, similar findings from the annotation of the finished Drosophila genome support the view that the evidence-based gene prediction methods used may be too conservative (Gopal et al. 2001). An alternative approach would be to overpredict and use experimental methods such as RNA expression analysis to validate predicted transcripts (Reboul et al. 2001). Our initial results using high-density DNA arrays support such an approach. Collectively, these studies suggest caution in the use of the current predicted transcript sets and cast doubt on these latest estimates of human gene numbers. We conclude that an integrated approach combining computational predictions, human curation, and experimental validation will be required to complete a finished picture of the human transcriptome.
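The expression-validation figure reported above (the share of genes detected in at least one of the 13 tissues) amounts to a simple calculation, sketched below. The detection criterion is given only in the legend to Figure 2 and is not reproduced here, so this sketch assumes each gene already has one present/absent call per tissue. The gene names, the three-tissue toy data, and the function name are hypothetical.

```python
def fraction_detected(present_calls):
    """Fraction of genes with a 'present' call in at least one tissue.

    present_calls: dict mapping gene ID -> list of booleans, one per tissue.
    The present/absent calls are an assumed input; how they are derived
    from array intensities is not specified here.
    """
    if not present_calls:
        return 0.0
    detected = sum(1 for calls in present_calls.values() if any(calls))
    return detected / len(present_calls)


# Toy data with three tissues per gene, invented for illustration only.
calls = {
    "geneA": [False, True, False],
    "geneB": [False, False, False],
    "geneC": [True, True, True],
    "geneD": [False, False, True],
}
frac = fraction_detected(calls)
print(f"{frac:.0%} of genes detected in at least one tissue")
```

Running the same function separately on the known and the novel gene sets would give the two percentages that the text compares.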
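References
Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J. Basic local alignment search tool. J. Mol. Biol. 1990; 215: 403-410.
Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000; 408: 796-815.
C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science. 1998; 282: 2012-2018.
Dunham I., Shimizu N., Roe B.A., Chissoe S., Hunt A.R., Collins J.E., Bruskiewich R., Beare D.M., Clamp M., Smink L.J., et al. The DNA sequence of human chromosome 22. Nature. 1999; 402: 489-495.
Ewing B., Green P. Analysis of expressed sequence tags indicates 35,000 human genes. Nat. Genet. 2000; 25: 232-234.
Fields C., Adams M.D., White O., Venter J.C. How many genes in the human genome? Nat. Genet. 1994; 7: 345-346.
Gopal S., Schroeder M., Pieper U., Sczyrba A., Aytekin-Kurban G., Bekiranov S., Fajardo J.E., Eswar N., Sanchez R., Sali A., Gaasterland T. Homology-based annotation yields 1,042 new candidate genes in the Drosophila melanogaster genome. Nat. Genet. 2001; 27: 337-340.
Lander E.S., Linton L.M., Birren B., Nusbaum C., Zody M.C., Baldwin J., Devon K., Dewar K., Doyle M., FitzHugh W., et al. Initial sequencing and analysis of the human genome. International Human Genome Sequencing Consortium. Nature. 2001; 409: 860-921.
Liang F., Holt I., Pertea G., Karamycheva S., Salzberg S.L., Quackenbush J. Gene index analysis of the human genome estimates approximately 120,000 genes. Nat. Genet. 2000; 25: 239-240.
Pruitt K.D., Katz K.S., Sicotte H., Maglott D.R. Introducing RefSeq and LocusLink: curated human genome resources at the NCBI. Trends Genet. 2000; 16: 44-47.
Reboul J., Vaglio P., Tzellas N., Thierry-Mieg N., Moore T., Jackson C., Shin-i T., Kohara Y., Thierry-Mieg D., Thierry-Mieg J., et al. Open-reading-frame sequence tags (OSTs) support the existence of at least 17,300 genes in C. elegans. Nat. Genet. 2001; 27: 332-336.
Scott R. The future in understanding the molecular basis of life. In 11th International Genome Sequencing and Analysis Conference (Miami, FL). 1999.
Venter J.C., Adams M.D., Myers E.W., Li P.W., Mural R.J., Sutton G.G., Smith H.O., Yandell M., Evans C.A., Holt R.A., et al. The sequence of the human genome. Science. 2001; 291: 1304-1351.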