Annotation of the Zebrafish Genome through an Integrated Transcriptomic and Proteomic Analysis
2014; Elsevier BV; Volume: 13; Issue: 11; Language: English
DOI: 10.1074/mcp.m114.038299
ISSN: 1535-9484
Authors: Dhanashree Kelkar, Elayne Provost, Raghothama Chaerkady, Babylakshmi Muthusamy, Srikanth S. Manda, Tejaswini Subbannayya, Lakshmi Dhevi N. Selvan, Chieh‐Huei Wang, Keshava K. Datta, Sunghee Woo, Sutopa B. Dwivedi, Santosh Renuse, Derese Getnet, Tai‐Chung Huang, Min‐Sik Kim, Sneha M. Pinto, Chris J. Mitchell, Anil K. Madugundu, Praveen Kumar, Jyoti Sharma, Jayshree Advani, Gourav Dey, Lavanya Balakrishnan, Nazia Syed, Vishalakshi Nanjappa, Yashwanth Subbannayya, Renu Goel, Thottethodi Subrahmanya Keshava Prasad, Vineet Bafna, Ravi Sirdeshmukh, Harsha Gowda, Charles Wang, Steven D. Leach, Akhilesh Pandey
Topic(s): Machine Learning in Bioinformatics
Abstract: Accurate annotation of protein-coding genes is one of the primary tasks upon the completion of whole genome sequencing of any organism. In this study, we used an integrated transcriptomic and proteomic strategy to validate and improve the existing zebrafish genome annotation. We undertook high-resolution mass-spectrometry-based proteomic profiling of 10 adult organs, whole adult fish body, and two developmental stages of zebrafish (SAT line), in addition to transcriptomic profiling of six organs. More than 7,000 proteins were identified from proteomic analyses, and ∼69,000 high-confidence transcripts were assembled from the RNA sequencing data. Approximately 15% of the transcripts mapped to intergenic regions, the majority of which are likely long non-coding RNAs. These high-quality transcriptomic and proteomic data were used to manually reannotate the zebrafish genome. We report the identification of 157 novel protein-coding genes. In addition, our data led to modification of existing gene structures including novel exons, changes in exon coordinates, changes in frame of translation, translation in annotated UTRs, and joining of genes. Finally, we discovered four instances of genome assembly errors that were supported by both proteomic and transcriptomic data. Our study shows how an integrative analysis of the transcriptome and the proteome can extend our understanding of even well-annotated genomes.

Zebrafish (Danio rerio) is an important vertebrate model organism that has been widely used in biomedical research in several areas, including developmental biology, disease biology, toxicology, and behavior. The latest genome assembly, Zv9, which was released in October 2011, combines the advantages of clone-by-clone sequencing and shotgun sequencing technologies. In this assembly, 83% of sequences were generated from capillary sequencing of clones, with the gaps filled by shotgun sequencing reads generated via next-generation sequencing (1Howe K. Clark M.D. Torroja C.F. Torrance J. Berthelot C. Muffato M. Collins J.E. Humphray S. McLaren K. Matthews L. McLaren S. Sealy I. Caccamo M. Churcher C. Scott C. Barrett J.C. Koch R. Rauch G.J. White S. Chow W. Kilian B. Quintais L.T. Guerra-Assuncao J.A. Zhou Y. Gu Y. Yen J. Vogel J.H. Eyre T. Redmond S. Banerjee R. Chi J. Fu B. Langley E. Maguire S.F. Laird G.K. Lloyd D. Kenyon E. Donaldson S. Sehra H. Almeida-King J. Loveland J. Trevanion S. Jones M. Quail M. Willey D. Hunt A. Burton J. Sims S. McLay K. Plumb B.
Davis J. Clee C. Oliver K. Clark R. Riddle C. Eliott D. Threadgold G. Harden G. Ware D. Mortimer B. Kerry G. Heath P. Phillimore B. Tracey A. Corby N. Dunn M. Johnson C. Wood J. Clark S. Pelan S. Griffiths G. Smith M. Glithero R. Howden P. Barker N. Stevens C. Harley J. Holt K. Panagiotidis G. Lovell J. Beasley H. Henderson C. Gordon D. Auger K. Wright D. Collins J. Raisen C. Dyer L. Leung K. Robertson L. Ambridge K. Leongamornlert D. McGuire S. Gilderthorp R. Griffiths C. Manthravadi D. Nichol S. Barker G. Whitehead S. Kay M. Brown J. Murnane C. Gray E. Humphries M. Sycamore N. Barker D. Saunders D. Wallis J. Babbage A. Hammond S. Mashreghi-Mohammadi M. Barr L. Martin S. Wray P. Ellington A. Matthews N. Ellwood M. Woodmansey R. Clark G. Cooper J. Tromans A. Grafham D. Skuce C. Pandian R. Andrews R. Harrison E. Kimberley A. Garnett J. Fosker N. Hall R. Garner P. Kelly D. Bird C. Palmer S. Gehring I. Berger A. Dooley C.M. Ersan-Urun Z. Eser C. Geiger H. Geisler M. Karotki L. Kirn A. Konantz J. Konantz M. Oberlander M. Rudolph-Geiger S. Teucke M. Osoegawa K. Zhu B. Rapp A. Widaa S. Langford C. Yang F. Carter N.P. Harrow J. Ning Z. Herrero J. Searle S.M. Enright A. Geisler R. Plasterk R.H. Lee C. Westerfield M. de Jong P.J. Zon L.I. Postlethwait J.H. Nusslein-Volhard C. Hubbard T.J. Roest Crollius H. Rogers J. Stemple D.L. The zebrafish reference genome sequence and its relationship to the human genome.Nature. 2013; 496: 498-503Crossref PubMed Scopus (2749) Google Scholar). The genome assembly includes 25 chromosomes along with 995 contigs from shotgun sequencing that could not be assembled into chromosomes. The extent and quality of the genome annotation ultimately determine the usefulness of the genome sequence itself. The current Ensembl genome annotation set (genebuild release 75) has 56,754 transcripts corresponding to 33,737 genes. 
This gene set includes annotations from an automated Ensembl annotation pipeline, a VEGA manual annotation pipeline (VEGA Release 55), and transcript models derived from RNA-Seq data from five adult tissues and seven developmental stages of zebrafish (2Collins J.E. White S. Searle S.M. Stemple D.L. Incorporating RNA-seq data into the zebrafish Ensembl genebuild. Genome Res. 2012; 22: 2067-2078). (The abbreviations used are: RNA-Seq, RNA sequencing; PSM, peptide spectrum match; CPAT, Coding-Potential Assessment Tool; FDR, false discovery rate.)

Shotgun proteomics and NextGen sequencing data have great potential to assist in genome annotation through automated as well as manual strategies. With advancements in the methods for transcriptome profiling and data processing, an increasing number of studies are being carried out in which transcriptomic and proteomic data are analyzed in an integrative manner (3Lundberg E. Fagerberg L. Klevebring D. Matic I. Geiger T. Cox J. Algenas C. Lundeberg J. Mann M. Uhlen M. Defining the transcriptome and proteome in three functionally different human cell lines. Mol. Syst. Biol. 2010; 6: 450, 4Evans V.C. Barker G. Heesom K.J. Fan J. Bessant C. Matthews D.A. De novo derivation of proteomes from transcriptomes for transcript and protein identification. Nat. Methods. 2012; 9: 1207-1211). There are also a number of reports of identification of novel coding loci using transcriptomic data alone or in combination with proteomic data (5Peterson E.S. McCue L.A. Schrimpe-Rutledge A.C. Jensen J.L. Walker H.
Kobold M.A. Webb S.R. Payne S.H. Ansong C. Adkins J.N. Cannon W.R. Webb-Robertson B.J. VESPA: software to facilitate genomic annotation of prokaryotic organisms through integration of proteomic and transcriptomic data.BMC Genomics. 2012; 13: 131Crossref PubMed Scopus (30) Google Scholar, 6Mohien C.U. Colquhoun D.R. Mathias D.K. Gibbons J.G. Armistead J.S. Rodriguez M.C. Rodriguez M.H. Edwards N.J. Hartler J. Thallinger G.G. Graham D.R. Martinez-Barnetche J. Rokas A. Dinglasan R.R. A bioinformatics approach for integrated transcriptomic and proteomic comparative analyses of model and non-sequenced anopheline vectors of human malaria parasites.Mol. Cell. Proteomics. 2013; 12: 120-131Abstract Full Text Full Text PDF PubMed Scopus (18) Google Scholar). Our previous efforts have successfully demonstrated the power of proteogenomic analyses in improving genome annotation, as exemplified by studies on Mycobacterium tuberculosis, Candida glabrata, Leishmania donovani, Anopheles gambiae, and Homo sapiens (7Chaerkady R. Kelkar D.S. Muthusamy B. Kandasamy K. Dwivedi S.B. Sahasrabuddhe N.A. Kim M.S. Renuse S. Pinto S.M. Sharma R. Pawar H. Sekhar N.R. Mohanty A.K. Getnet D. Yang Y. Zhong J. Dash A.P. MacCallum R.M. Delanghe B. Mlambo G. Kumar A. Keshava Prasad T.S. Okulate M. Kumar N. Pandey A. A proteogenomic analysis of Anopheles gambiae using high-resolution Fourier transform mass spectrometry.Genome Res. 2011; 21: 1872-1881Crossref PubMed Scopus (47) Google Scholar, 8Prasad T.S. Harsha H.C. Keerthikumar S. Sekhar N.R. Selvan L.D. Kumar P. Pinto S.M. Muthusamy B. Subbannayya Y. Renuse S. Chaerkady R. Mathur P.P. Ravikumar R. Pandey A. Proteogenomic analysis of Candida glabrata using high resolution mass spectrometry.J. Proteome Res. 2012; 11: 247-260Crossref PubMed Scopus (37) Google Scholar, 9Kelkar D.S. Kumar D. Kumar P. Balakrishnan L. Muthusamy B. Yadav A.K. Shrivastava P. Marimuthu A. Anand S. Sundaram H. Kingsbury R. Harsha H.C. Nair B. Prasad T.S. Chauhan D.S. 
Katoch K. Katoch V.M. Chaerkady R. Ramachandran S. Dash D. Pandey A. Proteogenomic analysis of Mycobacterium tuberculosis by high resolution mass spectrometry.Mol. Cell. Proteomics. 2011; 10 (M111.011627)Abstract Full Text Full Text PDF PubMed Google Scholar, 10Pawar H. Sahasrabuddhe N.A. Renuse S. Keerthikumar S. Sharma J. Kumar G.S. Venugopal A. Sekhar N.R. Kelkar D.S. Nemade H. Khobragade S.N. Muthusamy B. Kandasamy K. Harsha H.C. Chaerkady R. Patole M.S. Pandey A. A proteogenomic approach to map the proteome of an unsequenced pathogen - Leishmania donovani.Proteomics. 2012; 12: 832-844Crossref PubMed Scopus (40) Google Scholar, 11Kim M.S. Pinto S.M. Getnet D. Nirujogi R.S. Manda S.S. Chaerkady R. Madugundu A.K. Kelkar D.S. Isserlin R. Jain S. Thomas J.K. Muthusamy B. Leal-Rojas P. Kumar P. Sahasrabuddhe N.A. Balakrishnan L. Advani J. George B. Renuse S. Selvan L.D.N. Patil A.H. Nanjappa V. Radhakrishnan A. Prasad S. Subbannayya T. Raju R. Kumar M. Sreenivasamurthy S.K. Marimuthu A. Sathe G.J. Chavan S. Datta K.K. Subbannayya Y. Sahu A. Yelamanchi S.D. Jayaram S. Rajagopalan P. Sharma J. Murthy K.R. Syed N. Goel R. Khan A.K. Ahmad S. Dey G. Mudgal K. Chatterjee A. Huang T. Zhong J. Wu X. Shaw P.G. Freed D. Zahari M.S. Mukherjee K.K. Shankar S. Mahadevan A. Lam H. Mitchell C.J. Shankar S.K. Satishchandra P. Schroeder J.T. Sirdeshmukh R. Maitra A. Leach S.D. Drake C.G. Halushka M.K. Prasad T.S.K. Hruban R.H. Kerr C.L. Bader G.D. Iacobuzio-Donahue C.H. Gowda H. Pandey A. A draft map of the human proteome.Nature. 2014; 509: 575-581Crossref PubMed Scopus (1494) Google Scholar). Here, we report the use of in-depth transcriptomic and proteomic profiling to refine the genome annotation of zebrafish (Fig. 1). The transcriptomic (RNA-Seq) data were derived from six adult organs. We identified 69,206 high-confidence transcripts, including novel transcripts for 22,585 genes and 9,404 novel transcribed loci. 
In total, 6,975 proteins were identified via proteomic analysis of 10 different adult organs, whole adult fish body, and two developmental stages. We employed various proteogenomic strategies that included searching the mass spectra against a number of custom databases, including a six-frame translated genome database, a translated RNA-Seq transcript database, and a de novo gene prediction set. To reduce false positives (12Gupta N. Bandeira N. Keich U. Pevzner P.A. Target-decoy approach and false discovery rate: when things may go wrong. J. Am. Soc. Mass Spectrom. 2011; 22: 1111-1120, 13Blakeley P. Overton I.M. Hubbard S.J. Addressing statistical biases in nucleotide-derived protein databases for proteogenomic search strategies. J. Proteome Res. 2012; 11: 5221-5234), we manually verified the peptide spectrum matches (PSMs) identified from each of these searches. Only novel peptides supported by good-quality spectral matches were considered for genome annotation improvement. Apart from the identification of novel genes, significant findings of our study include the identification of genome assembly errors, novel exons, novel splice forms, and alternate translational start sites.

A genetically defined zebrafish SAT line (Sanger AB Tübingen) was procured and cultured in an in-house fish facility. Muscle, liver, intestine/pancreas, testis, eye, and spleen were dissected and collected in RNAlater on ice before RNA extraction. Total RNA was isolated from each organ using a Qiagen RNeasy Kit (Qiagen, Inc., Carlsbad, CA) according to the manufacturer's protocol. RNA-Seq of these six organs/tissues was performed according to the manufacturer's protocol using the Illumina TruSeq RNA Sample Preparation Kit and SBS Kit v3 (Illumina, San Diego, CA). Briefly, RNA quality was determined using an Agilent Bioanalyzer with an RNA Nano 6000 chip.
RNA-Seq library construction was started using 500 ng of total RNA that was then subjected to poly(A)+ selection and fragmentation. Following first- and second-strand synthesis, the cDNA was subjected to end repair, adenylation of 3′ ends, and adapter ligation. One of six unique indices was used in each individual sample. After AMPure XP magnetic bead (Beckman Coulter, Brea, CA) clean-up, each cDNA sample was subjected to 15 cycles of PCR amplification using an ABI 9700 thermal cycler. The cDNA library quality and size distribution were checked using an Agilent Bioanalyzer with a DNA 1000 chip. Our libraries showed a size between 200 and 500 bp with a peak at ∼260 bp. All libraries were carefully quantitated using a Qubit 2.0 fluorometer (Invitrogen, Grand Island, NY) and were stored in microfuge tubes (Invitrogen) in a −20 °C freezer. The cluster generation was done using an Illumina TruSeq V3 flow cell with six different cDNA libraries with different indices in each lane, repeated in three lanes, at a concentration of ∼8.6 pM. RNA-Seq was carried out on Illumina's HiScanSQ system (Illumina) using the Illumina TruSeq SBS V3 sequencing kit and 50 bp by 50 bp paired reads.

The reads were quality filtered for Phred-based base quality (Q > 20) using FastX tools. Of the reads, 99% passed the quality threshold and were used in downstream analysis steps. TopHat (version 1.4.1) with default parameters was used to align the reads against the Zv9 zebrafish genome assembly (14Trapnell C. Pachter L. Salzberg S.L. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009; 25: 1105-1111). Transcript assembly was carried out using Cufflinks (version 2.0). The RABT (Reference Annotation Based Transcript Assembly) option was used. An Ensembl transcript coordinate file (.gtf) was provided as a reference assembly file. Transcripts were assembled separately for each organ and were combined using Cuffcompare.
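The filtering applied to the combined Cuffcompare output, which the text goes on to describe, can be sketched as a small predicate. This is a minimal illustration and not the authors' pipeline: the `Transcript` record and its field names are hypothetical, the thresholds (FPKM ≥ 1, 250 bp, CPAT probability > 0.38) are taken from the text, and because class code o is listed both among the eliminated codes and among the length-filtered "multi-exonic" codes, the sketch assumes that single-exon o transcripts are dropped and multi-exon o transcripts are length-filtered. It also assumes the FPKM filter is applied before the peptide-evidence rule.

```python
from dataclasses import dataclass

DROPPED_CODES = {"e", "p", "c", "s"}          # eliminated outright
LENGTH_FILTERED_CODES = {"u", "i", "x", "o"}  # kept only if >= 250 bp
ALWAYS_KEPT_CODES = {"=", "j"}                # known and novel isoforms
MIN_FPKM = 1.0
MIN_LENGTH_BP = 250
CPAT_CUTOFF = 0.38


@dataclass
class Transcript:
    """Hypothetical record for one assembled transcript."""
    transcript_id: str
    class_code: str
    fpkm: float
    length_bp: int
    n_exons: int
    has_peptide_evidence: bool = False


def is_high_confidence(t: Transcript) -> bool:
    """Apply the filters in the order given in the text."""
    if t.fpkm < MIN_FPKM:
        return False
    # Peptide-supported transcripts are kept regardless of class code and size.
    if t.has_peptide_evidence:
        return True
    if t.class_code in ALWAYS_KEPT_CODES:
        return True
    # Assumption: single-exon 'o' transcripts belong to the eliminated group.
    if t.class_code == "o" and t.n_exons < 2:
        return False
    if t.class_code in DROPPED_CODES:
        return False
    if t.class_code in LENGTH_FILTERED_CODES:
        return t.length_bp >= MIN_LENGTH_BP
    # Assumption: any class code not mentioned in the text is discarded.
    return False


def is_potentially_coding(cpat_probability: float) -> bool:
    """CPAT coding probability strictly greater than the stated cutoff."""
    return cpat_probability > CPAT_CUTOFF
```

Keeping the thresholds as named constants makes it easy to see which numbers come from the text and which branches rest on the interpretation stated above.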
Transcripts were also categorized (class codes) into known isoforms, novel isoforms, and intergenic transcripts by Cuffcompare (15Trapnell C. Roberts A. Goff L. Pertea G. Kim D. Kelley D.R. Pimentel H. Salzberg S.L. Rinn J.L. Pachter L. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 2012; 7: 562-578). From the combined set of transcripts, a high-confidence set of transcripts was generated by filtering as shown in supplemental Fig. S1. Briefly, all the transcripts were filtered for fragments per kilobase of exon per million fragments mapped (FPKM) ≥ 1. From the remaining set, transcripts with Cufflinks class codes e, p, c, o, and s were eliminated. From transcripts with class codes u, i, x, and o (multi-exonic), transcripts smaller than 250 bp were eliminated. All transcripts of class codes = and j were retained. Transcripts for which peptide evidence was obtained were retained regardless of their class code and size. Protein coding potential was predicted for these high-confidence sets of transcripts using CPAT (16Wang L. Park H.J. Dasari S. Wang S. Kocher J.P. Li W. CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model. Nucleic Acids Res. 2013; 41: e74). Transcripts that had a coding probability greater than 0.38 were considered potentially protein-coding transcripts.

Different organs (eye, brain, liver, spleen, intestine/pancreas, ovary, testis, muscle, heart, and head) were collected from ∼100 adult fish (SAT strain). Zebrafish embryos were collected at 48 and 120 h post-fertilization. The samples were lysed in 2% SDS lysis buffer and in 8 M urea lysis buffer. Lysates were homogenized and sonicated, and protein estimation followed. Proteins from SDS lysates were separated on SDS-PAGE, and in-gel digestion was carried out as described previously (17Amanchy R. Kalume D.E.
Pandey A. Stable isotope labeling with amino acids in cell culture (SILAC) for studying dynamics of protein abundance and posttranslational modifications. Sci. STKE. 2005; 2005: l2). Briefly, the protein bands were destained, reduced and alkylated, and subjected to in-gel digestion using trypsin and Lys-C at an 8:1 ratio. The peptides were extracted, vacuum dried, and stored at −80 °C until further analysis. Protein samples (1 mg) from urea lysates were used for in-solution trypsin digestion. Samples were reduced, alkylated, and digested using trypsin and Lys-C (8:1) overnight at 37 °C. The peptide digests were then desalted using a C18 cartridge and lyophilized. The lyophilized samples were reconstituted in basic reverse-phase liquid chromatography solvent A (10 mM tetraethyl ammonium bicarbonate (TEABC), pH 8.5), loaded on an XBridge C18 5 μm 250 × 4.6 mm column (Waters, Milford, MA), and eluted with 0% to 100% solvent B (10 mM TEABC in acetonitrile, pH 8.5) with a 50-min gradient. The fractions collected were vacuum dried and then pooled by concatenation into 24 fractions.

Enrichment of N-terminal acetylated peptides was carried out using a slightly modified protocol described by Taouatas et al. (18Taouatas N. Altelaar A.F. Drugan M.M. Helbig A.O. Mohammed S. Heck A.J. Strong cation exchange-based fractionation of Lys-N-generated peptides facilitates the targeted analysis of post-translational modifications. Mol. Cell. Proteomics. 2009; 8: 190-200). Briefly, purified peptides from in-solution digestion of protein extracts from four organ pairs were fractionated on a polysulfoethyl A column (PolyLC, Columbia, MD; 200 × 2.1 mm, 5 μm, 200 Å) using a low-ionic-strength buffer system (solvent A, 5 mM KH2PO4, 30% acetonitrile at pH 2.7; solvent B, 350 mM KCl, 5 mM KH2PO4, 30% acetonitrile).
The first 15 fractions that were collected were pooled and re-fractionated into 24 fractions using basic reverse-phase liquid chromatography.

Peptide samples from in-gel digestion and basic reverse-phase liquid chromatography fractionation were analyzed on an Accurate Mass Q-TOF 6540 mass spectrometer interfaced with an HPLC Chip system (Agilent Technologies, Santa Clara, CA). The samples were reconstituted in solvent A (0.1% formic acid) and loaded onto the HPLC chip trap column using an Agilent 1200 series capillary liquid chromatography system. Both trap and analytical columns embedded in the HPLC chip were made up of Zorbax 300SB-C18 with a 5-μm particle size. The peptides were eluted using a gradient of 5% to 40% solvent B (0.1% formic acid in 90% acetonitrile) over 50 min. The Q-TOF was operated with a capillary voltage of 1800 V, a fragmentor voltage of 175 V, a medium isolation width of 4 m/z, and an energy slope of 3 V plus a 2-V offset. MS data were acquired using MassHunter data acquisition software (Version B.04.00, Agilent Technologies). MS spectra were acquired in the range of m/z 350–1,800, and this was followed by five MS/MS analyses with a scan range of m/z 50–2,000. The duty cycle was set to 2.1 s with one MS scan per second followed by five MS/MS scans per second. The precursor selection was based on preference to charge state in the order of 2+, 3+, and >3+ ions and a second level preference to abundance. Additionally, in-gel digested samples from zebrafish testis and spleen were analyzed on an Orbitrap Velos mass spectrometer. Enriched N-terminally modified peptide samples were analyzed on an Orbitrap Elite mass spectrometer. Both MS and MS/MS spectra were acquired in the Orbitrap analyzer at 60K and 15K resolution settings, respectively. Fragmentation was carried out in higher-energy collisional dissociation mode.

MS/MS spectral data were processed to generate Mascot generic format files using MassHunter (B.04.00) or Proteome Discoverer 1.3.
The data were searched against a protein database from Ensembl-HAVANA annotation (release 70) combined with common contaminants like trypsin, keratin, and BSA (42,200 total sequences). The data were analyzed using Proteome Discoverer 1.3 (Thermo Scientific, Bremen, Germany) using Sequest (SCM build 59) and Mascot (Version 2.2) search algorithms. The parameters used for data analysis included trypsin as protease with up to one missed cleavage allowed. Carbamidomethylation of cysteine was specified as a fixed modification, and oxidation of methionine was specified as a variable modification. The minimum peptide length was specified as six amino acids. The mass error of parent ions was set to 20 ppm, whereas for fragment ions it was set to 0.05 Da. LC-MS/MS data were searched against a reversed-sequence database to calculate a 1% false discovery rate (FDR) threshold score. The FDR at each PSM score was calculated as % FDR = (number of hits in the reverse database at or above the score / total number of hits in the target and reverse databases at or above the score) × 100 (19Kall L. Storey J.D. MacCoss M.J. Noble W.S. Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. J. Proteome Res. 2008; 7: 29-34). A parsimonious protein list was generated from the peptide list by grouping proteins in Proteome Discoverer.

For quantitative analysis, intensity- and PSM-number-based expression values for each identified gene were calculated along the lines of intensity-based absolute quantification values (20Schwanhausser B. Busse D. Li N. Dittmar G. Schuchhardt J. Wolf J. Chen W. Selbach M. Global quantification of mammalian gene expression control. Nature. 2011; 473: 337-342). The sum of the intensities of all the PSMs belonging to one gene was divided by the number of possible unique tryptic peptides from the gene.
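A minimal sketch of this per-gene expression value follows, extended with the spectrum-count normalization and log2 transformation that the text describes immediately afterwards. The function name, argument names, and input layout are hypothetical; only the arithmetic (summed PSM intensity, divided by the number of possible unique tryptic peptides, divided by the total spectra in the experiment, then log2) comes from the text.

```python
import math
from collections import defaultdict


def expression_values(psms, n_tryptic_peptides, total_spectra):
    """Return a log2 expression value per gene for one experiment.

    psms               -- iterable of (gene_id, psm_intensity) pairs
    n_tryptic_peptides -- dict: gene_id -> number of possible unique tryptic peptides
    total_spectra      -- total number of spectra acquired in the experiment
    """
    summed = defaultdict(float)
    for gene_id, intensity in psms:
        summed[gene_id] += intensity

    values = {}
    for gene_id, total_intensity in summed.items():
        per_peptide = total_intensity / n_tryptic_peptides[gene_id]
        values[gene_id] = math.log2(per_peptide / total_spectra)
    return values
```

Passing an intensity of 1.0 for every PSM would give the PSM-number-based variant under the same assumptions.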
This value was normalized across experiments by dividing it by the total number of spectra acquired in the experiment (e.g. the total number of spectra from brain in-gel fractions). Finally, the ratio was log2 transformed. Gene functional classification was done using the Web-based DAVID resource, and significantly enriched gene sets (p < 0.5) were selected (21Jiao X. Sherman B.T. Huang da W. Stephens R. Baseler M.W. Lane H.C. Lempicki R.A. DAVID-WS: a stateful web service to facilitate gene/protein list analysis. Bioinformatics. 2012; 28: 1805-1806).

Seven alternative databases were used for identifying novel peptides for proteogenomic analysis of the zebrafish genome. The seven databases used were (i) a six-frame translated genome database, (ii) a translated RNA-Seq transcript database (from this study), (iii) a translated Ensembl RNA-Seq transcript database, (iv) a splice graph database generated from RNA-Seq reads split across splice junctions, (v) ab initio prediction models from GENSCAN, (vi) a hypothetical N-terminal peptide database, and (vii) a three-frame translation of non-coding RNAs from VEGA annotation. Common contaminant sequences were added to each database prior to searching. A decoy database was created for each database by reversing the sequences in target databases. Peptide identification was carried out using X!Tandem (version CYCLONE, 2011.12.01) unless otherwise specified. Search parameters common to all searches were (a) an allowed precursor mass error of 20 ppm, (b) an allowed fragment mass error of 0.05 Da, (c) carbamidomethylation of cysteine as a fixed modification, (d) oxidation of methionine as a variable modification, and (e) consideration only of tryptic peptides with up to one missed cleavage. In addition to filtering for 1% FDR, all the novel peptides identified from each alternate database search were subjected to manual validation.
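The reversed-sequence decoy construction and the score-based FDR calculation described above can be sketched as follows. This is an illustration under stated assumptions rather than the behavior of any specific search engine: the `REV_` prefix and all function names are hypothetical, higher scores are assumed to be better, and the threshold search simply returns the lowest observed score at which the computed FDR is at or below the target.

```python
def reverse_decoys(targets):
    """Build decoy entries by reversing each target sequence.

    targets -- dict: protein name -> amino acid sequence
    """
    return {f"REV_{name}": seq[::-1] for name, seq in targets.items()}


def fdr_percent(target_scores, decoy_scores, threshold):
    """% FDR = decoy hits / (target + decoy hits) at or above the score, x 100."""
    n_decoy = sum(score >= threshold for score in decoy_scores)
    n_target = sum(score >= threshold for score in target_scores)
    total = n_target + n_decoy
    return 100.0 * n_decoy / total if total else 0.0


def score_threshold_for_fdr(target_scores, decoy_scores, max_fdr_percent=1.0):
    """Lowest observed score whose FDR does not exceed max_fdr_percent."""
    for score in sorted(set(target_scores) | set(decoy_scores)):
        if fdr_percent(target_scores, decoy_scores, score) <= max_fdr_percent:
            return score
    return None  # no observed score satisfies the requested FDR
```

For example, with target scores `[10, 9, 8, 7, 6, 5]` and decoy scores `[6, 4]`, no decoy hit scores 7 or higher, so `score_threshold_for_fdr` returns 7.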
Specific details about the generation of each database, search parameters, and post-processing of the search outcomes are described below.

A six-frame translated genome database was created using the Zv9 version of the genome sequence downloaded from the Ensembl FTP server. Similarly, translated transcriptome databases were created using assembled transcripts from JHU-IOB and Sanger RNA-Seq data. The Sanger RNA-Seq build was downloaded using the Perl API from the Ensembl "other features" database. (Transcripts falling in categories =, j, o, and c were translated in three frames; other transcripts were translated in six frames.) The protein databases thus created consisted of stop-codon-to-stop-codon translation of the template sequence. Sequences that were shorter than seven amino acids in length were not included.

A splice graph database was generated from RNA-Seq read alignments. A splice graph database is a non-redundant compact database of splice junction peptide sequences derived from RNA-Seq reads split across introns. The database is created by generating a graph in which genomic intervals (exonic regions) correspond to nodes, and edges correspond to pairs of exons that are putatively spliced together. A detailed method of splice graph creation and conversion to an MS search compatible FASTA database can be found in Ref. 22Woo S. Cha S.W. Merrihew G. He Y. Castellana N. Guest C. Maccoss M. Bafna V. Proteogenomic database construction driven from large scale RNA-seq data. J. Proteome Res. 2014; 13: 21-28. The splice graph database and the six-frame translated genome database were searched using the MS-GFDB (version 20120106) search algorithm. Search results from the MS-GFDB algorithm were filtered for 1% FDR calculated as explained in Ref. 22. Spectra that matched to multiple sequences with equal scores were not considered for further analysis.

Prediction models from the ab initio gene prediction algorithm GENSCAN were downloaded from the Ensembl FTP, and VEGA non-coding RNA sequences for processed pseudogenes, processed transcripts, pseudogenes, transcribed unprocessed pseudogenes, and unprocessed pseudogenes were downloaded using Ensembl BioMart; the sequences were translated in three frames. Additional variable modification of protein N-terminal acetylation was used for searching the GENSCAN prediction database.

The hypothetical N-terminal tryptic peptide database was created by fetching all the peptide sequences that began with methionine and ended with K/R from the six-frame translated genome database. Peptide sequences with up to one missed cleavage and lengths ranging from 6 to 25 amino acids were considered. Sequest and X!Tandem were used for peptide identification. Additional variable modification of peptide N-terminal acetylation was specified for these searches. For X!Tandem searches, the database without subpeptides was used, as the "quick acetyl" option is available in X!Tandem searches.

Peptide sequences identified from the alternate database searches were filtered for 1% FDR and compared with the protein database (Ensembl Genebuild 70) to identify novel peptides. These novel peptide PSMs were further checked via manual inspection for validity of the peptide identification.
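Two of the database constructions described above, the stop-codon-to-stop-codon six-frame translation and the hypothetical N-terminal tryptic peptide set, can be illustrated with a short sketch. This is a simplified illustration, not the authors' code: all function names are hypothetical, the standard genetic code is assumed, codons containing ambiguous bases are translated as X, and cleavage is assumed after every K or R without any proline rule. The length limits (at least seven residues per translated segment; 6 to 25 residues and up to one missed cleavage per N-terminal peptide) are taken from the text.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_segments(dna, min_len=7):
    """Stop-to-stop translated segments from all six reading frames."""
    dna = dna.upper()
    segments = []
    for strand in (dna, reverse_complement(dna)):
        for frame in range(3):
            for segment in translate(strand[frame:]).split("*"):
                if len(segment) >= min_len:
                    segments.append(segment)
    return segments


def n_terminal_peptides(segment, min_len=6, max_len=25, max_missed=1):
    """Peptides that start at a methionine and end at K or R."""
    peptides = set()
    for start, residue in enumerate(segment):
        if residue != "M":
            continue
        cleavage_sites_seen = 0
        for end in range(start + 1, len(segment)):
            length = end - start + 1
            if segment[end] in "KR":
                if min_len <= length <= max_len:
                    peptides.add(segment[start:end + 1])
                cleavage_sites_seen += 1
                if cleavage_sites_seen > max_missed:
                    break
            if length >= max_len:
                break
    return peptides
```

Because every methionine in a segment is treated as a possible start, the peptide set is deliberately over-inclusive; it is a list of hypotheses to be tested against spectra rather than a set of predicted start sites.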
The major criteria considered for manual evaluation included (a) assignment of all intense peaks (intense unassigned peaks were checked to see whether they were arising from internal fragment ions); (b) identification of the majority of the y series of ions; (c) low-m/z-range b ions, that is, b1, b2, and b3 ions and a2 and a4 ions also observed in a typical spectrum; (d) whether an immonium ion indicated the presence of an amino acid that was not present in the assigned peptide sequence (if so, the PSM was rejected); (e) whether a y1 ion was present that confirmed a peptide ending either with K (m/z 147.11) or with R (m/z 175.12); (f) whether any unassigned fragment was present, especially from the higher m/z range, that indicated the presence of an amino acid that was not a part of an assigned sequence (if so, the PSM was rejected); (g) whether missed cleavages were followed by acidic amino acids, that is, E and D; (h) whether many assigned peaks were from the noise level (if so, the PSM was rejected); and (i) whether neutral loss ions were observed for peptides