Improvements to the Rice Genome Annotation Through Large-Scale Analysis of RNA-Seq and Proteomics Data Sets
2018; Elsevier BV; Volume: 18; Issue: 1; Language: English
DOI: 10.1074/mcp.ra118.000832
ISSN: 1535-9484
Authors: Zhe Ren, Da Qi, Nina Pugh, Kai Li, Bo Wen, Ruo Zhou, Shaohang Xu, Siqi Liu, Andrew R. Jones
Topic(s): RNA and protein synthesis mechanisms
Abstract: Rice (Oryza sativa) is one of the most important crops worldwide. The genome has been available for over 10 years and has undergone several rounds of annotation. We created a comprehensive database of transcripts from 29 public RNA sequencing data sets, officially predicted genes from Ensembl Plants, and common contaminants in which to search for protein-level evidence. We re-analyzed nine publicly accessible rice proteomics data sets. In total, we identified 420K peptide spectrum matches from 47K peptides and 8,187 protein groups. Of these, 4,168 peptides were initially classed as putative novel peptides (not matching official genes). Following a strict filtration scheme to rule out other possible explanations, we discovered 1,584 high-confidence novel peptides. The novel peptides were clustered into 692 genomic loci where our results suggest annotation improvements. Of the novel peptides, 80% had an ortholog match in the curated protein sequence set from at least one other plant species. For the peptides clustering in intergenic regions (and thus potentially new genes), 101 loci were identified, of which 43 had a high-confidence hit for a protein domain. Our results can be displayed as tracks on the Ensembl genome or other browsers supporting Track Hubs, to support re-annotation of the rice genome.

Abbreviations used: RNA-Seq, RNA sequencing; MS/MS, tandem mass spectrometry; CDS, protein-coding sequences; ORF, open reading frame; PSM, peptide spectrum match; PTM, post-translational modification; FDR, false discovery rate; PEP, posterior error probability; PSI, Proteomics Standards Initiative.

The development of next-generation and third-generation sequencing technologies means that genome sequences are now being routinely generated for an ever-expanding range of species, strains, breeds, and even individuals within populations. For a genome to be useful for fundamental and applied research, high-quality annotation is required. Following genome assembly, annotation involves the discovery of the start codons for all genes, and their exon splicing patterns, which is a highly challenging task. Gene finding in most genome projects is performed via software that makes ab initio predictions of coding sequences and, where possible, uses homology to other annotated genomes. Experimental data in the form of large-scale RNA sequencing (RNA-Seq) is also commonly used to find mRNAs and align reads that cross intron junctions to infer splicing.
Undoubtedly, the use of large-scale RNA-Seq data vastly improves genome annotation; nevertheless, all genomes suffer from some proportion of mistaken annotation, such as incorrect translation initiation sites, incorrect splicing or pseudogenes called as protein-coding. It is now becoming widely recognized that inference of the protein-coding elements of the genome can be greatly improved using large-scale mass spectrometry (MS) data on peptide sequences, in so-called proteogenomics approaches (1. Nesvizhskii A.I. Proteogenomics: concepts, applications and computational strategies. Nat. Meth. 2014; 11: 1114-1125). In a typical proteogenomics pipeline, MS/MS spectra are searched against a customized protein sequence database, produced from curated gene predictions, as well as incorporating predicted possible sequences from ab initio gene finders and/or aligned RNA-Seq derived transcripts. Therefore, proteogenomics not only provides expression-level evidence of protein-coding genes but also has the potential to improve the protein-coding gene sets, i.e. it can provide protein-level evidence for novel transcripts or for alternative predictions of known genes.

There have been several proteogenomics studies on plants that have shown the ability to discover novel protein-coding genes and predict or improve splicing annotation. For instance, in 2008, Castellana et al. performed a proteogenomics analysis on Arabidopsis tissues (2. Castellana N.E. Payne S.H. Shen Z. Stanke M. Bafna V. Briggs S.P. Discovery and revision of Arabidopsis genes by proteogenomics. Proc. Natl. Acad. Sci. U.S.A. 2008; 105: 21034-21038).
They successfully identified 778 novel genes and made 695 gene model refinements. Later, in 2014, they developed an automatic method of proteogenomics and performed analysis on Zea mays, finding 165 novel protein-coding genes and proposing updated models for 741 additional genes (3. Castellana N.E. Shen Z. He Y. Walley J.W. Cassidy C.J. Briggs S.P. Bafna V. An automated proteogenomic method uses mass spectrometry to reveal novel genes in Zea mays. Mol. Cell Proteomics. 2014; 13: 157-167).

Rice (Oryza sativa) is the staple food for half the world's population. The completion of the rice genome sequencing, and several rounds of annotation, have provided a base for molecular and genetic studies (4. International Rice Genome Sequencing Project. The map-based sequence of the rice genome. Nature. 2005; 436: 793-800; 5. Goff S.A. Ricke D. Lan T.H. Presting G. Wang R. Dunn M. Glazebrook J. Sessions A. Oeller P. Varma H. Hadley D. Hutchison D. Martin C. Katagiri F. Lange B.M. Moughamer T. Xia Y. Budworth P. Zhong J. Miguel T. Paszkowski U. Zhang S. Colbert M. Sun W.L. Chen L. Cooper B. Park S. Wood T.C. Mao L. Quail P. Wing R. Dean R. Yu Y. Zharkikh A. Shen R. Sahasrabudhe S. Thomas A. Cannings R. Gutin A. Pruss D. Reid J. Tavtigian S. Mitchell J. Eldredge G. Scholl T. Miller R.M. Bhatnagar S. Adey N. Rubano T. Tusneem N. Robinson R. Feldhaus J. Macalma T. Oliphant A. Briggs S. A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science. 2002; 296: 92-100; 6. Yu J. Hu S. Wang J. Wong G.K. Li S. Liu B. Deng Y. Dai L. Zhou Y. Zhang X. Cao M. Liu J. Sun J. Tang J. Chen Y. Huang X. Lin W. Ye C. Tong W. Cong L. Geng J. Han Y. Li L. Li W. Hu G. Huang X. Li W. Li J. Liu Z. Li L. Liu J. Qi Q. Liu J. Li L. Li T. Wang X. Lu H. Wu T. Zhu M. Ni P. Han H. Dong W. Ren X. Feng X. Cui P. Li X. Wang H. Xu X. Zhai W. Xu Z. Zhang J.
He S. Zhang J. Xu J. Zhang K. Zheng X. Dong J. Zeng W. Tao L. Ye J. Tan J. Ren X. Chen X. He J. Liu D. Tian W. Tian C. Xia H. Bao Q. Li G. Gao H. Cao T. Wang J. Zhao W. Li P. Chen W. Wang X. Zhang Y. Hu J. Wang J. Liu S. Yang J. Zhang G. Xiong Y. Li Z. Mao L. Zhou C. Zhu Z. Chen R. Hao B. Zheng W. Chen S. Guo W. Li G. Liu S. Tao M. Wang J. Zhu L. Yuan L. Yang H. A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science. 2002; 296: 79-92). Comprehensive genomic and transcriptomic studies of rice have been conducted worldwide, serving as a base for research aimed at meeting the demand for increasing food supplies (7. Li J.Y. Wang J. Zeigler R.S. The 3,000 rice genomes project: new opportunities and challenges for future rice research. Gigascience. 2014; 3: 8; 8. 3,000 Rice Genomes Project. The 3,000 rice genomes project. Gigascience. 2014; 3: 7; 9. Rice Annotation Project. Tanaka T. Antonio B.A. Kikuchi S. Matsumoto T. Nagamura Y. Numa H. Sakai H. Wu J. Itoh T. Sasaki T. Aono R. Fujii Y. Habara T. Harada E. Kanno M. Kawahara Y. Kawashima H. Kubooka H. Matsuya A. Nakaoka H. Saichi N. Sanbonmatsu R. Sato Y. Shinso Y. Suzuki M. Takeda J. Tanino M. Todokoro F. Yamaguchi K. Yamamoto N. Yamasaki C. Imanishi T. Okido T. Tada M. Ikeo K. Tateno Y. Gojobori T. Lin Y.C. Wei F.J. Hsing Y.I. Zhao Q. Han B. Kramer M.R. McCombie R.W. Lonsdale D. O'Donovan C.C. Whitfield E.J. Apweiler R. Koyanagi K.O. Khurana J.P. Raghuvanshi S. Singh N.K. Tyagi A.K. Haberer G. Fujisawa M. Hosokawa S. Ito Y. Ikawa H. Shibata M. Yamamoto M. Bruskiewich R.M. Hoen D.R. Bureau T.E. Namiki N. Ohyanagi H. Sakai Y. Nobushima S. Sakata K. Barrero R.A. Sato Y. Souvorov A. Smith-White B. Tatusova T. An S. An G. S. O.O. Fuks G. Fuks G. Messing J. Christie K.R. Lieberherr D. Kim H. Zuccolo A. Wing R.A. Nobuta K. Green P.J. Lu C. Meyers B.C. Chaparro C. Piegu B. Panaud O.
Echeverria M. The Rice Annotation Project Database (RAP-DB): 2008 update. Nucleic Acids Res. 2008; 36: D1028-D1033; 10. Zhang G. Guo G. Hu X. Zhang Y. Li Q. Li R. Zhuang R. Lu Z. He Z. Fang X. Chen L. Tian W. Tao Y. Kristiansen K. Zhang X. Li S. Yang H. Wang J. Wang J. Deep RNA sequencing at single base-pair resolution reveals high complexity of the rice transcriptome. Genome Res. 2010; 20: 646-654; 11. Lu T. Lu G. Fan D. Zhu C. Li W. Zhao Q. Feng Q. Zhao Y. Guo Y. Li W. Huang X. Han B. Function annotation of the rice transcriptome at single-nucleotide resolution by RNA-seq. Genome Res. 2010; 20: 1238-1249; 12. Kawahara Y. de la Bastide M. Hamilton J.P. Kanamori H. McCombie W.R. Ouyang S. Schwartz D.C. Tanaka T. Wu J. Zhou S. Childs K.L. Davidson R.M. Lin H. Quesada-Ocampo L. Vaillancourt B. Sakai H. Lee S.S. Kim J. Numa H. Itoh T. Buell C.R. Matsumoto T. Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data. Rice. 2013; 6: 4). A previous proteogenomics analysis of rice has been performed, and a database produced (although it is no longer searchable online), in which LC-MS data sets were queried against gene predictions made from the relevant genome build at that time (13. Helmy M. Tomita M. Ishihama Y. OryzaPG-DB: Rice Proteome Database based on Shotgun Proteogenomics. BMC Plant Biol. 2011; 11: 63). Herein, we have performed a comprehensive proteogenomics analysis on rice by collecting public genomics, transcriptomics and proteomics data, to discover novel protein-coding genes and new splice sites.
With the development of genomics, transcriptomics and proteomics techniques, it has become possible to detect ever higher proportions of the transcribed genes, and evidence for translated proteins, via proteogenomics. In addition, tools and strategies, such as customProDB and SpliceDB (14. Burset M. Seledtsov I.A. Solovyev V.V. SpliceDB: database of canonical and noncanonical mammalian splice sites. Nucleic Acids Res. 2001; 29: 255-259), have effectively improved the performance of proteogenomics by facilitating improved design of the search database. The construction of "novel event" candidates (i.e. new exons or splice junctions) is one of the most important steps in proteogenomics. Some studies aim to be comprehensive, such as by using six-frame translation of the whole genome sequence (15. Fermin D. Allen B. Blackwell T. Menon R. Adamski M. Xu Y. Ulintz P. Omenn G. States D. Novel gene and gene model detection using a whole genome open reading frame analysis in proteomics. Genome Biol. 2006; 7: R35; 16. Khatun J. Yu Y. Wrobel J.A. Risk B.A. Gunawardena H.P. Secrest A. Spitzer W.J. Xie L. Wang L. Chen X. Giddings M.C. Whole human genome proteogenomic mapping for ENCODE cell line data: identifying protein-coding regions. BMC Genomics. 2013; 14: 141). Although these approaches are likely to contain exonic sequences for all possible genes, they suffer from a lack of statistical power (because of overall database size) and a lack of splicing information. An alternative is to use ab initio gene predictions from gene finding software (17. Brosch M. Saunders G.I. Frankish A. Collins M.O. Yu L. Wright J. Verstraten R. Adams D.J. Harrow J. Choudhary J.S. Hubbard T. Shotgun proteomics aids discovery of novel protein-coding genes, alternative splicing, and "resurrected" pseudogenes in the mouse genome. Genome Res.
2011; 21: 756-767). In this case, the constructed candidates will contain predicted splice events, but they rely on the accuracy of gene finding software, which is not generally high, meaning that some possible splice sites will be missed. A third alternative is to create a search database by mapping RNA-Seq data onto the genome. The use of RNA-Seq results can balance these aspects, keeping a relatively comprehensive search space but without the size expansion of six-frame translations.

We designed our proteogenomics pipeline as follows. First, for database construction, we used transcriptomics data aligned onto the genome. For the translation step, we kept only the longest frame to control the overall database size. Second, for database searching we used multiple search engines via our previously published IPeak approach (18. Wen B. Du C. Li G. Ghali F. Jones A.R. Kall L. Xu S. Zhou R. Ren Z. Feng Q. Xu X. Wang J. IPeak: An open source tool to combine results from multiple MS/MS search engines. Proteomics. 2015; 15: 2916-2920). IPeak combines the machine learning approach of Percolator and the FDRScore algorithm for search engine integration, which has been demonstrated to improve sensitivity over using a single search engine (19. Jones A.R. Siepen J.A. Hubbard S.J. Paton N.W. Improving sensitivity in proteome studies by analysis of false discovery rates for multiple search engines. Proteomics. 2009; 9: 1220-1229). IPeak is available as part of the mzidLibrary and ProteoAnnotator projects (20. Ghali F. Krishna R. Lukasse P. Martinez-Bartolome S. Reisinger F. Hermjakob H. Vizcaino J.A. Jones A.R. A toolkit for the mzIdentML standard: the ProteoIDViewer, the mzidLibrary and the mzidValidator. Mol. Cell Proteomics. 2013; (mcp.O113.029777); 21. Ghali F. Krishna R. Perkins S. Collins A. Xia D. Wastling J. Jones A.R.
ProteoAnnotator – Open source proteogenomics annotation software supporting PSI standards. Proteomics. 2014; 14: 2731-2741). Third, we performed extensive filtration to ensure that identified peptides not matching the official annotation (novel peptides) were high confidence and that the corresponding spectra could not be explained by other causes. Fourth, to validate and annotate the resulting novel peptides and corresponding novel events, we aligned our novel peptides back onto the genome for visualization against other tracks of evidence. Last, to standardize the presentation of results, we used standard formats from the Proteomics Standards Initiative (PSI): mzIdentML (22. Jones A.R. Eisenacher M. Mayer G. Kohlbacher O. Siepen J. Hubbard S. Selley J. Searle B. Shofstahl J. Seymour S. Julian R. Binz P.-A. Deutsch E.W. Hermjakob H. Reisinger F. Griss J. Vizcaino J.A. Chambers M. Pizarro A. Creasy D. The mzIdentML data standard for mass spectrometry-based proteomics results. Mol. Cell. Proteomics. 2012; 11: M111.014381) and proBed (23. Menschaert G. Wang X. Jones A.R. Ghali F. Fenyö D. Olexiouk V. Zhang B. Deutsch E.W. Ternent T. Vizcaíno J.A. The proBAM and proBed standard formats: enabling a seamless integration of genomics and proteomics data. Genome Biol. 2018; 19: 12), which allow for rapid and automated visualization of the results via public genome browsers. Using 29 RNA-Seq data sets (789,141,453 reads) and 9 MS/MS data sets (2,051,418 spectra), this study represents one of the most comprehensive proteogenomics efforts undertaken on rice.

An overview of the pipeline used for rice proteogenomics is given in Fig. 1. The workflow is divided into two main parts, for the processing of the RNA-Seq and MS/MS data, as follows.
In this study, raw RNA-Seq data generated on the Illumina platform in paired-end mode was collected from the European Nucleotide Archive (ENA, http://www.ebi.ac.uk/ena) database. The data sets comprised a total of 29 runs (153,907,936,648 bases/789,141,453 reads), and the full details are listed in SupplementaryFile1.xlsx (tab: "RNA Data Collection"). Data sets from various sources were merged to provide a comprehensive database of possible transcripts to match against.

To search for peptide evidence, MS/MS data acquired from high-resolution mass spectrometers (LTQ Orbitrap XL, LTQ Orbitrap Velos, TripleTOF 5600 and Q Exactive) was used, regardless of whether the data was generated from profiling or enrichment studies. The raw MS/MS data for this study was collected from the ProteomeXchange (PX, http://www.proteomexchange.org/) database, including a total of nine data sets and 2,051,418 MS/MS spectra. Detailed information about the data sets is listed in Table I and further in SupplementaryFile1.xlsx (tab: "MSMS Data Collection"). As shown in SupplementaryFile1.xlsx, in most cases, MS/MS data sets were sourced from O. sativa Japonica, which is considered the reference genome. However, to further increase coverage, several proteomics data sets were sourced from O. sativa Indica, and one from O. sativa KDML105 (Thai Jasmine) rice. Results are only presented if they map 100% to the Japonica reference, and the source data set is reported for each novel peptide found (file SupplementaryFile1.xlsx, tab: "Putative Novel Peptides"), enabling filtration of those observed only in certain strains.

Table I. The raw MS/MS data collected for this study from the ProteomeXchange database. Detailed search parameters used are provided in supplementary File S1.

Data set Identifier | Title | Instrument | Publication | Announce Date | Spectra count
PXD000265 | Oryza sativa egg, sperm, callus, pollen and seedling proteome | LTQ Orbitrap XL; LTQ Orbitrap XL | Abiko et al. 2013 (32) | 2013/8/2 | 253743
PXD000313 | Quantitative proteomics study on rice embryo during embryogenesis by using isobaric tags for relative and absolute quantification (iTRAQ) | Q Exactive | Zi et al. 2013 (33) | 2013/8/6 | 976822
PXD000923 | Rice Pistil LC-MSMS | TripleTOF 5600 | Wang et al. 2014 (34) | 2014/7/31 | 87502
PXD001030 | Proteomic analysis of proteins related to rice grain chalkiness using iTRAQ based on a notched-belly mutant with white-belly | TripleTOF 5600 | Lin et al. 2014 (35) | 2014/6/17 | 301027
PXD001058 | Unravelling the proteomic profile of rice meiocytes during early meiosis | LTQ Orbitrap Velos | Collado-Romero, Alós & Prieto 2014 (36) | 2014/8/13 | 92162
PXD002291 | Rice (Oryza sativa) lysine-acetylation LC-MS/MS | Q Exactive | Xiong et al. 2016 (37) | 2016/2/24 | 94400
PXD002739 | Acetylome analyses in the germinating rice seed | Q Exactive | He et al. 2016 (38) | 2016/1/26 | 34344
PXD002740 | Succinylome analyses in the germinating rice seed | Q Exactive | He et al. 2016 (38) | 2016/1/26 | 25862
PXD003156 | Gel-free/label-free proteomic analysis of developing rice grains under heat stress | LTQ Orbitrap | Timabud et al. 2015 (39) | 2015/12/17 | 185556

References cited in Table I:
32. Abiko M. Furuta K. Yamauchi Y. Fujita C. Taoka M. Isobe T. Okamoto T. Identification of proteins enriched in rice egg or sperm cells by single-cell proteomics. PloS One. 2013; 8: e69578
33. Zi J. Zhang J. Wang Q. Zhou B. Zhong J. Zhang C. Qiu X. Wen B. Zhang S. Fu X. Lin L. Liu S. Stress responsive proteins are actively regulated during rice (Oryza sativa) embryogenesis as indicated by quantitative proteomics analysis. PLOS ONE. 2013; 8: e74229
34. Wang K. Zhao Y. Li M. Gao F. Yang M.K. Wang X. Li S. Yang P. Analysis of phosphoproteome in rice pistil. Proteomics. 2014; 14: 2319-2334
35. Lin Z. Zhang X. Yang X. Li G. Tang S. Wang S. Ding Y. Liu Z. Proteomic analysis of proteins related to rice grain chalkiness using iTRAQ and a novel comparison system based on a notched-belly mutant with white-belly. BMC Plant Biol. 2014; 14: 163
36. Collado-Romero M. Alós E. Prieto P. Unravelling the proteomic profile of rice meiocytes during early meiosis. Frontiers Plant Sci. 2014; 5: 356
37. Xiong Y. Peng X. Cheng Z. Liu W. Wang G.L. A comprehensive catalog of the lysine-acetylation targets in rice (Oryza sativa) based on proteomic analyses. J. Proteomics. 2016; 138: 20-29
38. He D. Wang Q. Li M. Damaris R.N. Yi X. Cheng Z. Yang P. Global Proteome Analyses of Lysine Acetylation and Succinylation Reveal the Widespread Involvement of both Modification in Metabolism in the Embryo of Germinating Rice Seed. J. Proteome Res. 2016; 15: 879-890
39. Timabud T. Yin X. Pongdontri P. Komatsu S. Gel-free/label-free proteomic analysis of developing rice grains under heat stress. J. Proteomics. 2016; 133: 1-19

The RNA-Seq reads from each run were individually aligned using TopHat (v2.0.12) against the Oryza sativa genome (IRGSP-1.0.30). TopHat produced the accepted matches in BAM format and the junctions in BED format. The TopHat mapping parameters were set as follows: alignment sensitivity "very sensitive," read mismatches 2, expected inner distance between mate pairs 150, library type fr-unstranded, and all other parameters at their defaults. All the accepted reads from each run were individually sent for assembly into transcript sequences by Cufflinks (v2.2.1).
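As an illustration of the alignment settings just listed, the following minimal sketch assembles a TopHat2 command line in Python without running it. The index prefix, FASTQ file names and output directory are hypothetical placeholders, and the authors' exact invocation is not given in the text, so this should be read as one plausible rendering of the reported parameters rather than the command actually used.

```python
import shlex


def build_tophat_command(index_prefix, reads_1, reads_2, out_dir):
    """Assemble a TopHat2 argument list reflecting the reported settings.

    All paths passed in are placeholders chosen by the caller.
    """
    return [
        "tophat",
        "--b2-very-sensitive",              # alignment sensitivity "very sensitive"
        "--read-mismatches", "2",           # read mismatches at 2
        "--mate-inner-dist", "150",         # expected inner distance between mates
        "--library-type", "fr-unstranded",  # library type
        "--output-dir", out_dir,
        index_prefix,                       # Bowtie2 index prefix (hypothetical)
        reads_1,
        reads_2,
    ]


# Hypothetical file names for a single paired-end run.
cmd = build_tophat_command("IRGSP-1.0", "run_1.fastq", "run_2.fastq", "tophat_out")
print(shlex.join(cmd))
```

Building the argument list first keeps the per-run settings in one place, so the same function could be called once for each of the 29 runs before handing the list to a process runner.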
Afterward, Cuffmerge was employed to combine the transcripts from each run to form longer transcripts in GTF format. The longer transcripts marked with class code "=" completely match the known exons and are termed known transcripts, whereas those with other class codes partially or totally mismatch the known exons (IRGSP-1.0.30) and are designated novel transcripts (NTs). All the novel transcripts in GTF format were taken for construction of the customized database. All the junctions were first de-duplicated and then compared, using custom scripts, against the "official junction sites" from the Oryza sativa annotation (IRGSP-1.0.30) to filter out the known junctions; the remaining junctions were considered novel junctions (NJs) for further construction of the customized database. The novel junctions and novel transcripts are collectively called novel events (NEs). All the NEs were matched back to their corresponding genome loci. Reads mapping to multiple locations were not filtered at this stage, but peptides mapping to multiple loci were handled explicitly (see below). The matched genomic fragments were translated in six reading frames. A novel translation product from six-frame translation was accepted according to two criteria: it had to be more than 5 amino acids (15 nucleotides) long, and only the longest product was taken for each transcript. All the accepted novel translation products were added to the list of Oryza sativa proteins annotated in IRGSP-1.0.30, as well as contaminants from cRAP (http://www.thegpm.org/crap/), to generate a new protein database for MS/MS searching.

IPeak, a Java-based open source software package, was employed for the peptide search. It uses Percolator to re-score peptide-spectrum matches (PSMs) from MS-GF+ (v9733), MyriMatch (v2.2.8634) and X! Tandem (v2009.10.01.1). IPeak incorporates the FDRScore algorithm to combine the results from different search engines.
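The six-frame translation step described above can be sketched as follows. This is a minimal illustration, not the authors' custom scripts: it assumes the standard genetic code, treats a "product" as any stop-to-stop stretch (the text does not say whether a start codon was required), and uses function names invented here for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_products(dna):
    """Yield every stop-free stretch from all six reading frames."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    for strand in (forward, reverse):
        for frame in range(3):
            for segment in translate(strand[frame:]).split("*"):
                if segment:
                    yield segment


def longest_product(dna, min_length=6):
    """Return the single longest product with more than 5 residues, else None."""
    candidates = [p for p in six_frame_products(dna) if len(p) >= min_length]
    return max(candidates, key=len) if candidates else None


# Hypothetical transcript fragment used only to exercise the functions.
fragment = "ATG" + "GCC" * 10 + "TAA"
print(longest_product(fragment))
```

Keeping only one product per transcript is what limits the growth of the search database relative to a full six-frame translation, at the cost of discarding shorter products from the same fragment.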
All the MS/MS data collected from the nine data sets were converted into MGF format using ProteoWizard (v3.0.4238) (24. Chambers M.C. Maclean B. Burke R. Amodei D. Ruderman D.L. Neumann S. Gatto L. Fischer B. Pratt B. Egertson J. Hoff K. Kessner D. Tasman N. Shulman N. Frewen B. Baker T.A. Brusniak M.-Y. Paulse C. Creasy D. Flashner L. Kani K. Moulding C. Seymour S.L. Nuwaysir L.M. Lefebvre B. Kuhlmann F. Roark J. Rainer P. Detlev S. Hemenway T. Huhmer A. Langridge J. Connolly B. Chadick T. Holly K. Eckels J. Deutsch E.W. Moritz R.L. Katz J.E. Agus D.B. MacCoss M. Tabb D.L. Mallick P. A cross-platform toolkit for mass spectrometry and proteomics. Nat. Biotech. 2012; 30: 918-920), and were then searched with IPeak against the customized database, in which all sequences were at least six amino acids long. The PSMs with FDRScores less than 0.01, corresponding to q-value (global FDR) < 0.01, were initially used to create the list of identified peptides. Most search parameters from the original publications associated with each of the nine data sets were reused in the IPeak search. The exact parameters we used for the search are in SupplementaryFile1.xlsx (tab: "Search Parameters"). All the peptides identified through IPeak that derived from the known rice proteins were marked as known peptides, whereas those not from those proteins were denoted putative novel peptides (PNPs). The PNPs were mapped back to the genome by custom scripts to locate their positions on the chromosomes and to generate GTF files, which record the genomic positions of the NEs corresponding to the PNPs. The positional information of the PNPs and NEs was imported into the original mzIdentML file using the proteogenomics encoding described in the mzIdentML version 1.2 specifications (25. Vizcaino J.A. Mayer G. Perkins S.R. Barsnes H. Vaudel M. Perez-Riverol Y. Ternent T. Uszkoreit J. Eisenacher M. Fischer L. Rappsilber J. Netz E. Walzer M. Kohlbacher O. Leitner A.
Chalkley R.J. Ghali F. Martinez-Bartolome S. Deutsch E.W. Jones A.R. The mzIdentML data standard version 1.2, supporting advances in proteome informatics. Mol. Cell Proteomics. 2017; 16: 1275-1285).

The PNPs obtained from the IPeak search were further filtered to remove those for which an alternative explanation of the corresponding spectra was more likely. PEAKS PTM (26. Han X. He L. Xin L. Shan B. Ma B. PeaksPTM: Mass Spectrometry-Based Identification of Peptides with Unspecified Modifications. J. Proteome Res. 2011; 10: 2930-2936) can identify many types of post-translational modifications (PTMs) and chemical modifications to peptides. To reduce the possibility of misidentification of novel peptides because of modifications (e.g. where a novel peptide has the same mass as a known peptide with a common modification), the MS/MS spectra that were matched to the novel peptides were re-searched with PEAKS PTM against the official annotation. Any novel peptide whose spectrum was confidently identified by PEAKS PTM as the modified or alternative form of a known peptide (q-value < 0.01) was flagged for removal from the novel peptide set. To further check that PNP sequences could not be explained by other types of biochemical events (missed cleavage, proteolysis or single amino acid substitutions) on known peptides, all the PNPs were aligned by BLASTp and custom scripts (using a regular expression) against the known protein sequences (IRGSP-1.0.30). The BLASTp search was conducted in two modes, default and short-sequence optimized, with results filtered to allow a maximum of one mismatch, zero gaps, and exact sequence length between query and hit. Any PNPs with a confident match by BLASTp (in either mode) were excluded from the result set. To further increase the confidence of the PNPs, peptides supported by only a single spectrum were also removed.
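The last two filters can be illustrated with a short sketch. It replaces BLASTp and the authors' regular-expression scripts with a plain sliding-window comparison, which is an assumption made for clarity: a candidate is treated as explained if it aligns at full length, without gaps, to any known protein with at most one substitution. The function names, the peptide sequences and the spectrum counts below are all invented for the example.

```python
def explained_by_known(peptide, known_proteins, max_mismatches=1):
    """True if the peptide matches a known protein window, ungapped and at
    full length, with no more than max_mismatches substitutions."""
    length = len(peptide)
    for protein in known_proteins:
        for start in range(len(protein) - length + 1):
            window = protein[start:start + length]
            mismatches = sum(a != b for a, b in zip(peptide, window))
            if mismatches <= max_mismatches:
                return True
    return False


def filter_candidates(spectrum_counts, known_proteins, min_spectra=2):
    """Keep peptides with at least min_spectra supporting spectra that are
    not explained by a known protein sequence."""
    return sorted(
        peptide
        for peptide, count in spectrum_counts.items()
        if count >= min_spectra
        and not explained_by_known(peptide, known_proteins)
    )


# Toy data: one known protein and four candidate peptides with PSM counts.
known = ["MKTAYIAKQRQISFVK"]
candidates = {
    "AYIAKQR": 5,    # exact substring of the known protein
    "AYLAKQR": 3,    # one substitution away from a known window
    "GGSWPELR": 4,   # no close match, several spectra
    "TTNQWELVK": 1,  # supported by a single spectrum only
}
print(filter_candidates(candidates, known))
```

A quadratic scan like this is only practical for small inputs; for a full proteome, an indexed search such as BLASTp, as used in the study, is the appropriate tool.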
After all these filtration steps, the remaining PNPs were called the final novel peptides (FNPs). FNPs were initially defined relative to IRGSP-1.0.30 (i.e. absent from it); however, we also mapped them to the updated IRGSP-1.0.38 and to a different set of gene predictions from MSU (RGAP version 7, http://rice.plantbiology.msu.edu/) to determine whether FNPs were present in those annotation sets. The FNPs and their related NEs were parsed from mzIdentML into GFF format by custom scripts and were input to BEDTools (v2.25) to cluster the NEs onto the corresponding genomic loci. During clustering, 100 bp was set as the maximum distance allowed between the NEs. To provide further evidence supporting the annotation of novel peptides, the sequences of FNPs were aligned u