Assessing Protein Sequence Database Suitability Using De Novo Sequencing
2019; Elsevier BV; Volume: 19; Issue: 1; Language: English
DOI: 10.1074/mcp.tir119.001752
ISSN: 1535-9484
Authors: Richard S. Johnson, Brian C. Searle, Brook L. Nunn, Jason M. Gilmore, Molly Phillips, Chris T. Amemiya, Michelle Heck, Michael J. MacCoss
Topic(s): Machine Learning in Bioinformatics
Abstract: The analysis of samples from unsequenced and/or understudied species, as well as samples where the proteome is derived from multiple organisms, poses two key questions. The first is whether the proteomic data obtained from an unusual sample type even contains peptide tandem mass spectra. The second is whether an appropriate protein sequence database is available for proteomic searches. We describe the use of automated de novo sequencing for evaluating both the quality of a collection of tandem mass spectra and the suitability of a given protein sequence database for searching that data. Applications of this method include the proteome analysis of closely related species, metaproteomics, and proteomics of extinct organisms.
Matching tandem mass spectra to database-derived sequences is routine, and a variety of software pipelines are available (1. Eng J.K. Searle B.C. Clauser K.R. Tabb D.L. A face in the crowd: recognizing peptides through database search. Mol. Cell. Proteomics. 2011; 10: 1-9). However, atypical and new types of samples can be problematic.
For example, it can be difficult to evaluate sample preparation methods using LCMS/MS data from fossils, soil, or glacial meltwaters, when such samples are new to the research community, DNA is unattainable, or there is no obvious protein database to search. If the genome of the organism under study is unknown, or if the sample comes from multiple unknown species (i.e. metaproteomics) the suitability of a chosen sequence database can be difficult to evaluate (2Timmins-Schiffman E. May D.H. Mikan M. Riffle M. Frazar C. Harvey H.R. Noble W.S. Nunn B.L. Critical decisions in metaproteomics: Achieving high confidence protein annotations in a sea of unknowns.ISME J. 2017; 11: 309-314Crossref PubMed Scopus (53) Google Scholar). In the case of species whose genomes are unsequenced, data analysis typically employs a protein sequence database from a taxonomically related species, where the hope is that the sequences are mostly identical (3Cilia M. Tamborindeguy C. Rolland M. Howe K. Thannhauser T.W. Gray S. Tangible benefits of the aphid Acyrthosiphon pisum genome sequencing for aphid proteomics: Enhancements in protein identification and data validation for homology-based proteomics.J. Insect Physiol. 2011; 57: 179-190Crossref PubMed Scopus (15) Google Scholar). For metaproteomics, the standard approach is to sequence the DNA in the sample, assemble a metagenome, and translate it into protein sequences as a FASTA formatted file (4Ruggles K.V. Krug K. Wang X. Clauser K.R. Wang J. Payne S.H. Fenyo D. Zhang B. Mani D.R. Methods, tools and current perspectives in proteogenomics.Mol. Cell. Proteomics. 2017; 16: 959-981Abstract Full Text Full Text PDF PubMed Scopus (88) Google Scholar). The hope is that there are no mistakes in assembly and translation to protein sequence, and that the translated metagenome accurately represents the metaproteome under study. 
Once a FASTA file is created, a database search is performed and the number of identifications is reported, yet it is not clear how many high-quality tandem mass spectra failed to make a match because the peptide sequence was not represented in the FASTA file. If there is more than one sequence database to choose from (e.g. because of different gene assembly methods), the one with the largest number of identifications over a range of false discovery rates is the optimal choice. Although this is a valid way to proceed, one cannot know if a low number of identifications is because of bad tandem mass spectrometry (MS/MS) data, an inappropriate or insufficient sequence database, or both. Here we propose and evaluate a simple solution that uses automated de novo sequencing to evaluate both tandem mass spectrum quality and sequence database suitability.
Abbreviations used: MS/MS, tandem mass spectrometry; CID, collision induced dissociation; ABC, ammonium bicarbonate; DDA, data dependent acquisition; PSM, peptide-spectrum match; FDR, false discovery rate; TP, true positive; FP, false positive; FN, false negative; TIC, total ion current; DIA, data independent acquisition.
De novo sequencing is the concept of deriving a peptide sequence from a tandem mass spectrum without use of a sequence database (5Ma B. Johnson R. De novo sequencing and homology searching.Mol. Cell. Proteomics. 2012; 11: 1-16Abstract Full Text Full Text PDF Scopus (129) Google Scholar). Before the existence of sequence databases (or ready access to powerful computers), de novo sequencing was the only approach to interpret tandem mass spectra of peptides. The degree to which one can successfully derive a sequence is dependent on spectral quality. Specifically, spectra that can be sequenced possess a contiguous series of sequencing ions of the same type (e.g. b- and y-type ions), which turns out to be the case for low energy collision induced dissociation (CID) of tryptic peptides. In the past, automated de novo sequencing has had difficulty deriving accurate sequences in a timely manner; however, the accuracy of the sequences and associated scores improved markedly with the use of high mass accuracy analyzers such as time-of-flight or orbitrap instruments. The recently developed automated de novo sequencing program called Novor (6Ma B. Novor: Real-time peptide de novo sequencing software.J. Am. Soc. Mass Spectrom. 2015; 26: 1885-1894Crossref PubMed Scopus (127) Google Scholar) has eliminated the speed limitation in that the software can generate de novo sequences faster than it takes to acquire the data (and generally in less time than a database search of the same data). The question now is: what should we do with these de novo interpretations? Here we describe an approach that utilizes Novor to assess tandem mass spectra quality and the suitability of FASTA files for database searches. Finding many high scoring de novo sequences in a data set implies that many high-quality tandem mass spectra of peptides are present (7Taylor J.A. Johnson R.S. Implementation and uses of automated de novo peptide sequencing by tandem mass spectrometry.Anal. Chem. 
2001; 73: 2594-2604Crossref PubMed Scopus (248) Google Scholar). Likewise, comparison of de novo sequencing results with database search results (all from the same data file and scored with the same database search algorithm) can be used to determine the suitability of a chosen FASTA file. The strategy is to append de novo sequences to the FASTA file under study, run a standard database search of the modified FASTA file, and compare the ranking of the de novo sequences with the original FASTA file sequences. Returning a higher number of matches to the original FASTA file sequences, compared with the de novo sequences, implies that the FASTA file is suitable for use in a database search. Here we show that automated de novo sequencing using Novor provides a simple and fast means of evaluating both tandem mass spectra and FASTA file quality. A human K562 cell extract was obtained from Promega (Madison, WI). A C. elegans tryptic digest was prepared as previously described (8Merrihew G.E. Davis C. Ewing B. Williams G. Käll L. Frewen B.E. Noble W.S. Green P. Thomas J.H. MacCoss M.J. Use of shotgun proteomics for the identification, confirmation, and correction of C. elegans gene annotations.Genome Res. 2008; 18: 1660-1669Crossref PubMed Scopus (71) Google Scholar). Tryptic digests of whole Asian citrus psyllids (D. citri) were performed as described (9Ramsey J.S. Johnson R.S. Hoki J.S. Kruse A. Mahoney J. Hilf M.E. Hunter W.B. Hall D.G. Schroeder F.C. MacCoss M.J. Cilia M. Metabolic interplay between the asian citrus psyllid and its profftella symbiont: An achilles' heel of the citrus greening insect vector.PLoS ONE. 2015; 10: 1-21Crossref Scopus (48) Google Scholar). One hundred milligrams of powdered cave bear bone provided by Richard E. Green (University of California, Santa Cruz, CA) (10Bon C. Caudy N. de Dieuleveult M. Fosse P. Philippe M. Maksud F. Beraud-Colomb E. Bouzaid E. Kefi R. Laugier C. Rousseau B. Casane D. van der Plicht J. Elalouf J.M. 
Deciphering the complete mitochondrial genome and phylogeny of the extinct cave bear in the Paleolithic painted cave of Chauvet. Proc. Natl. Acad. Sci. 2008; 105: 17447-17452; 11. Noonan J.P. Hofreiter M. Smith D. Priest J.R. Rohland N. Rabeder G. Krause J. Detter J.C. Pääbo S. Rubin E.M. Genomic sequencing of Pleistocene cave bears. Science. 2005; 309: 597-599; 12. Dabney J. Knapp M. Glocke I. Gansauge M.-T. Weihmann A. Nickel B. Valdiosera C. Garcia N. Paabo S. Arsuaga J.-L. Meyer M. Complete mitochondrial genome sequence of a Middle Pleistocene cave bear reconstructed from ultrashort DNA fragments. Proc. Natl. Acad. Sci. 2013; 110: 15758-15763) was placed in 500 μl of 0.2% PPS Silent Surfactant (Expedeon, San Diego, CA) in 50 mM ammonium bicarbonate (ABC), and probe sonicated three times for 20 s while on ice. An additional 500 μl of 50 mM ABC brought the final extract to 1.0 ml in 0.1% PPS. The extract was centrifuged, and the supernatant was shown to contain 1.2 mg/ml total protein using a BCA assay. A spotted ratfish (Hydrolagus colliei) and a little skate (Leucoraja erinacea) were collected and euthanized in compliance with IACUC protocol numbers IACUC16–014 and AUP18–0001, respectively. The ratfish was collected from Puget Sound and experimentation was carried out at the Benaroya Research Institute in Seattle, WA. The skate was shipped from the Marine Biological Laboratory in Woods Hole, MA to the University of California, Merced, where studies were performed. For both specimens, viscous hydrogel found within the electrosensory ampullae of Lorenzini was extracted by applying pressure on the organs externally. Dilution of 100 μl of this material with 200 μl of 0.2% PPS and 50 mM Tris buffer at pH 8 reduced the sample viscosity before tryptic digestion. Glacial silt from Greenland meltwater was isolated at the field site.
Sample preparation, including DNA extraction, is described in detail in the supplemental section. Reduction and alkylation of disulfide bonds employed treatment with 5 mM tris(2-carboxyethyl)phosphine for an hour at 37 °C, cooling to room temperature, and adding 5.5 mM iodoacetamide at room temperature for 20 min. Tryptic digestion proceeded overnight using 1:50 by weight of Promega trypsin at 37 °C. Digestion was stopped and PPS cleaved by acidification with trifluoroacetic acid (TFA) (0.2% by volume and pH 2, as verified by pH paper). In most cases, the tryptic digestions were analyzed directly; however, the fish hydrogel samples were subjected to solid phase extraction using MCX cartridges (Waters) per manufacturer's instructions, dried on a vacuum centrifuge, and resolubilized in 100 μl of 2% acetonitrile containing 0.1% TFA. All mass spectrometry was performed on either a Fusion Orbitrap or Q-Exactive-HF (Thermo Fisher Scientific, San Jose, CA) mass spectrometer. Up to 1 μg of each sample digest was loaded from the autosampler onto a 150-μm inner diameter (ID) Kasil fritted trap packed with Reprosil-Pur C18-AQ (3-μm bead diameter, Dr. Maisch, Ammerbuch, Germany) to a bed length of 2 cm at a flow rate of 2 μl/min. After loading and desalting using a total volume of 10 μl of 0.1% formic acid plus 2% acetonitrile, the trap was brought on-line with a pulled fused-silica capillary tip (75-μm ID) packed with the same Reprosil C18-AQ that was mounted in a microspray source and placed in line with a Waters Nanoacquity binary UPLC pump plus autosampler. Peptides were eluted off the column using a gradient of 2–35% acetonitrile in 0.1% formic acid over 60 min, followed by 35–60% acetonitrile over 5 min at a flow rate of 250 nl/min. The mass spectrometers were operated using electrospray ionization (2 kV) with the heated transfer tube at 275 °C using data dependent acquisition (DDA) in the so-called "Top Speed" mode (Fusion), or "Top 20" mode (Q-Exactive).
The orbitrap resolution was 120,000 at m/z 200, and for tandem mass spectrometry (MS/MS) the linear ion trap provided unit resolution or the orbitrap was operated at a resolving power of 15,000. Unless otherwise specified, the MS/MS spectra were acquired using a quadrupole isolation width of 1.6 m/z and HCD normalized collision energy (NCE) of 30%, or CID collision energy of 35%. Dynamic exclusion (including all isotope peaks) was set for 30 s using monoisotopic precursor selection. Data file format conversions were made with Proteowizard version 3.0.19053 (13Chambers M.C. Maclean B. Burke R. Amodei D. Ruderman D.L. Neumann S. Gatto L. Fischer B. Pratt B. Egertson J. Hoff K. Kessner D. Tasman N. Shulman N. Frewen B. Baker T. a. Brusniak M.-Y. Paulse C. Creasy D. Flashner L. Kani K. Moulding C. Seymour S.L. Nuwaysir L.M. Lefebvre B. Kuhlmann F. Roark J. Rainer P. Detlev S. Hemenway T. Huhmer A. Langridge J. Connolly B. Chadick T. Holly K. Eckels J. Deutsch E.W. Moritz R.L. Katz J.E. Agus D.B. Maccoss M.J. Tabb D.L. Mallick P. A cross-platform toolkit for mass spectrometry and proteomics.Nat. Biotechnol. 2012; 30: 918-920Crossref PubMed Scopus (1775) Google Scholar). De novo sequences were generated using the program Novor (v1.5.573) (6Ma B. Novor: Real-time peptide de novo sequencing software.J. Am. Soc. Mass Spectrom. 2015; 26: 1885-1894Crossref PubMed Scopus (127) Google Scholar), and unless otherwise specified the following parameters were used - Enzyme: Trypsin, Instrument Type: HCD-FT, Precursor Error Tolerance: 10 ppm, Fragment Error Tolerance: 0.02 Da, and no variable modifications were considered, but the appropriate cysteine modification was used as a fixed modification. Novor produces a csv output file that includes the de novo sequence and a sequence score. Evaluation of the Novor scores was performed by comparing high confidence peptide sequence search results with the de novo sequences. The database search used Comet (version 2018.01 rev. 1) (14Eng J.K. 
Jahan T. a. Hoopmann M.R. Comet: an open source tandem mass spectrometry sequence database search tool.Proteomics. 2012; 13: 1-3Google Scholar), followed by PeptideProphet (15Keller A. Nesvizhskii A.I. Kolker E. Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search.Anal. Chem. 2002; 74: 5383-5392Crossref PubMed Scopus (3886) Google Scholar) using the Trans-Proteomic Pipeline (v5.0.0 Typhoon) (16Deutsch E.W. Mendoza L. Shteynberg D. Farrah T. Lam H. Sun Z. Nilsson E. Pratt B. Prazen B. Eng J.K. Daniel B. Nesvizhskii A.I. Aebersold R. A guided tour of the trans-proteomic pipeline Tasman8.Proteomics. 2010; 10: 1150-1159Crossref PubMed Scopus (601) Google Scholar). Novor was used to determine de novo sequences for spectra that in a database search gave peptide spectrum matches (PSMs) with false discovery rates (FDR) and Comet E-values less than 0.001. If the de novo sequence could account for at least 70% of the high confidence database sequence, then it was defined as having been correctly determined. Having made this determination, precision-recall curves were derived. Based on this analysis, de novo sequences with scores of 60 or higher were combined into a single protein sequence that was appended to the appropriate FASTA file (described next). For the analysis of a human cell line tryptic digest, Uniprot reference proteome FASTA files were used from a number of mostly chordate species - B. floridae (2/27/18; 28,542), C. brachyrhynchos (2/27/18; 13,621), F. catus (3/13/18; 20,447), G. aculeatus (3/13/18; 20,666), G. gorilla (5/26/18; 21,795), H. sapiens (6/27/18; 21,053), O. anatinus (3/13/18; 21,677), O. garnettii (3/13/18; 19,451), P. abelii (5/26/18; 21,999), P. troglodytes (6/23/18; 23,008), S. harrisii (3/13/18; 18,781), T. asiatica (3/13/18; 10,315), and X. laevis (2/27/18; 41,562). For the analysis of a C. 
elegans tryptic digest, Uniprot reference proteome FASTA files were used from a few nematode species - C. briggsae (2/27/18; 21,725), C. elegans (6/27/18; 19,998), D. viviparus (2/26/18; 14,161), H. bacteriophora (3/13/18; 20,833), N. americanus (2/27/18; 19,125), and O. dentatum (2/26/18; 25,133). The date in parentheses denotes the last time each was modified, followed by the number of protein entries. In addition, shuffled FASTA files were created from H. sapiens and C. elegans by maintaining the tryptic cleavage sites and scrambling the intervening amino acid sequences. Seawater data and FASTA files were downloaded from the Noble lab (17May D.H. Timmins-Schiffman E. Mikan M.P. Harvey H.R. Borenstein E. Nunn B.L. Noble W.S. An alignment-free "metapeptide" strategy for metaproteomic characterization of microbiome samples using shotgun metagenomic sequencing.J. Proteome Res. 2016; 15: 2697-2705Crossref PubMed Scopus (33) Google Scholar). The env_nr, metagenome, and metapeptide FASTA files contained 7,003,678, 459,004, and 15,911,893 entries, respectively. Cave bear (U. deningeri) bone data was searched against a Uniprot reference proteome FASTA file from A. melanoleuca (3/13/18; 19,344) and a NCBI FASTA file from Ursus arctos horribilis (11/15/18; 35,412). Spotted ratfish (H. colliei) data was searched against a Uniprot proteome from C. milii (9/25/18; 19,344), and little skate (L. erinacea) data was also searched against C. milii, as well as a combination of sequence data for all chondrichthyes found in Uniprot and RefSeq (111,444 entries). Data from whole Asian psyllids (D. citri), was searched against two fasta files derived from Gnmon gene predictions of the D. citri genome (version 1.1) (20150806Diaphorina_citri_GeneModel_MCOTprotein.ahrd.fasta and NCBI_Gnomon_MCOT_AHRD_and_endosymbionts.fasta), where the latter also contains sequences of known endosymbionts (Candidatus Carsonella ruddii, Candidatus Profftella armaturae, and Wolbachia). 
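The FASTA augmentation used throughout this work (de novo sequences scoring 60 or higher, combined into a single protein entry and appended to the FASTA file under study) can be illustrated with a minimal Python sketch. This is not the authors' script. The function names, the header text, the 60-character line wrapping, and the removal of duplicate sequences are all assumptions made for illustration, and the input is taken to be already-parsed (sequence, score) pairs rather than a Novor csv file, whose column layout is not described here.

```python
from typing import Iterable, Tuple


def build_de_novo_entry(
    candidates: Iterable[Tuple[str, float]],
    min_score: float = 60.0,
    header: str = "DENOVO_concatenated",  # hypothetical entry name
) -> str:
    """Join high-scoring de novo peptides into one FASTA protein entry.

    Tryptic de novo peptides normally end in K or R, so a later in-silico
    tryptic digest of the joined entry should recover the peptides.
    """
    kept = []
    seen = set()
    for sequence, score in candidates:
        sequence = sequence.strip().upper()
        # Dropping repeated sequences is an assumption, not stated in the text.
        if score >= min_score and sequence and sequence not in seen:
            seen.add(sequence)
            kept.append(sequence)
    body = "".join(kept)
    wrapped = [body[i:i + 60] for i in range(0, len(body), 60)]
    return ">" + header + "\n" + "\n".join(wrapped) + "\n"


def append_entry(fasta_text: str, entry: str) -> str:
    """Append one entry to existing FASTA text, keeping records on new lines."""
    if fasta_text and not fasta_text.endswith("\n"):
        fasta_text += "\n"
    return fasta_text + entry


if __name__ == "__main__":
    demo = [("PEPTIDEK", 75.0), ("LOWSCORER", 40.0), ("ANOTHERPEPR", 60.0)]
    print(append_entry(">P1 example protein\nMKTAYIAK", build_de_novo_entry(demo)))
```

Keeping the de novo sequences in one clearly named entry makes it straightforward to tell, after the search, whether a top-ranked match came from the original FASTA file or from the appended de novo material.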
These had 30,562 and 47,160 entries, respectively. DNA extracted from glacial sediments was used to make metagenome FASTA files; further details are in the supplemental section. Appended to each of these FASTA files was a list of common contaminants (https://www.thegpm.org/crap/index.html). All database searches were performed with Comet using FASTA files that had been modified by appending high scoring de novo sequences. In all cases, enzymatic cleavage was semi-tryptic, allowing for up to 2 missed cleavages per peptide. This could result in matches to a partial de novo sequence or to combinations of multiple de novo sequences; however, if such matches scored higher than any FASTA-derived sequence, the latter could still be poor matches. The precursor tolerance was set to 20 ppm, and the fragment tolerance in Comet was set via a fragment bin tolerance value of 0.02 m/z units. Either iodoacetamide or methyl methanethiosulfonate were used, as appropriate (only the psyllid proteome was modified using the latter), for static modifications of cysteine. Variable modifications included oxidized methionine, acetylation of the protein N terminus, and cleavage of the protein N-terminal methionine. The concatenated decoy search option was used. The resulting pep.xml file output from Comet was then modified using a custom Python script to deal with the observation that correct sequences derived from a FASTA file entry can sometimes have a slightly lower Comet cross-correlation score than a de novo sequence. This occurs when the two sequences are nearly identical, but a slight sequence variation allows the de novo sequence to account for one or two additional minor fragment ions. Hence, in these cases the first and second rankings need to be reversed. The differences between the first and second ranked decoy sequences were used to model this effect. 
The rationale for this approach is that we assume all decoy sequences are incorrect regardless of their rankings, and that the first ranked decoy has a slightly higher score because of some extra random matches. This is like when a de novo sequence score is slightly higher than a correct FASTA file sequence score. To account for the effect of peptide size on cross-correlation scores, these decoy differences were normalized to the peptide molecular weight. The suitability of this modeling is demonstrated in supplemental Fig. S1, which shows the close match between score differences. The algorithm sorts the decoy score difference (normalized by MW), and counts down through a specified percent of the list (1% is used in all cases described here) to extract a normalized score difference that can be used as a cutoff. Score differences less than that cutoff would capture 99% of the decoy score differences. Likewise, these score differences would capture about 99% of the cases when a de novo sequence scores slightly better than a target FASTA sequence. Once the Comet pep.xml file has been modified thus, PeptideProphet (15Keller A. Nesvizhskii A.I. Kolker E. Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search.Anal. Chem. 2002; 74: 5383-5392Crossref PubMed Scopus (3886) Google Scholar) was used to determine the FDR where the cutoff was 0.01 in conjunction with the Comet E-value cutoff of 0.01 (the only exception being the determination of high confidence sequences for Novor score evaluations as described above). Python 2.7 scripts were written to produce precision-recall tables for evaluating Novor results, extracting high scoring sequences produced by Novor and appending them to a FASTA file, and to manipulate Comet pep.xml files before analysis by PeptideProphet, and are available at bitbucket.org/rj8/fasta_quality. 
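The rank-reversal cutoff described above can be sketched in a few lines of Python. This is an illustrative reading of the text, not the code at the repository mentioned above. The function names are hypothetical, the inputs are assumed to be already-extracted cross-correlation scores and molecular weights (no pep.xml parsing is shown), and normalizing by the molecular weight of the first-ranked peptide is an assumption, since the text does not say which peptide's weight is used.

```python
from typing import Iterable, Tuple


def normalized_cutoff(
    decoy_pairs: Iterable[Tuple[float, float, float]],
    top_fraction: float = 0.01,
) -> float:
    """Derive a cutoff from decoy score differences.

    decoy_pairs holds (first_rank_xcorr, second_rank_xcorr, peptide_mw) for
    spectra whose first and second ranked matches are both decoys.
    """
    diffs = sorted(
        ((first - second) / mw for first, second, mw in decoy_pairs),
        reverse=True,
    )
    if not diffs:
        raise ValueError("no decoy score differences supplied")
    # Count down through the top fraction of the sorted list (1% in the text).
    index = min(int(len(diffs) * top_fraction), len(diffs) - 1)
    return diffs[index]


def should_swap(
    de_novo_xcorr: float, fasta_xcorr: float, peptide_mw: float, cutoff: float
) -> bool:
    """True when a first-ranked de novo match only narrowly beats the
    second-ranked FASTA match, so the two rankings would be reversed."""
    return (de_novo_xcorr - fasta_xcorr) / peptide_mw < cutoff


if __name__ == "__main__":
    decoys = [(2.0 + i, 2.0, 1000.0) for i in range(1, 201)]
    cutoff = normalized_cutoff(decoys)
    print(cutoff, should_swap(3.05, 3.0, 1000.0, cutoff))
```

With a 1% fraction, roughly 99% of the decoy differences fall below the returned value, which matches the stated intent of capturing about 99% of the cases where a de novo sequence scores slightly better than a correct FASTA sequence.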
The automatic de novo sequencing program Novor produces an overall sequence score for each sequence that ranges between 0 and 100, where the higher number indicates a better match. To understand how to interpret this score, human tryptic peptides were analyzed in four ways with combinations of beam CID versus resonance CID and high versus low mass accuracy MS/MS measurements on a Thermo Fusion orbitrap mass spectrometer using DDA. For each analysis, peptides were identified using a database search (Comet followed by PeptideProphet) of a human FASTA file. The FDR and Comet E-value limits were both set at 0.001. These PSMs were assumed to be correct and were compared with the Novor-derived sequences from the same MS/MS spectra. This sounds simple enough, however, matching de novo and database sequences is not straightforward. De novo sequences are usually not completely correct but are often partially correct. One obvious error would be because of the inability to differentiate isomeric amino acids (leucine and isoleucine). Sometimes a handful of low intensity or absent sequencing ions results in short regions of poorly defined sequence in an otherwise correct sequence. A good automated de novo sequencing program should be able to handle short regions of poorly defined sequence and still provide a partially correct sequence with a high score. In contrast, bad spectra containing insufficient fragment ions to delineate most of the sequence should receive a low score. Likewise, non-peptide tandem mass spectra (e.g. from detergent ions) generally do not result in high scoring de novo sequences. This tendency to produce partially correct de novo sequences is what makes it a challenge to compare them to database sequences, because a direct string-to-string comparison will be far too conservative. Fig. 1A illustrates a better way to make these comparisons using mass alignments (18Taylor J.A. Johnson R.S. 
Sequence database searches via de novo peptide sequencing by tandem mass spectrometry.Rapid Commun. Mass Spectrom. 1997; 11: 1067-1075Crossref PubMed Scopus (339) Google Scholar, 19Searle B.C. Dasari S. Turner M. Reddy A.P. Choi D. Wilmarth P.A. McCormack A.L. David L.L. Nagalla S.R. High-throughput identification of proteins and unanticipated sequence modifications using a mass-based alignment algorithm for MS/MS de novo sequencing results.Anal. Chem. 2004; 76: 2220-2230Crossref PubMed Scopus (129) Google Scholar). In this mock example, the two N-terminal amino acids are reversed, which is a common mistake because of the frequent lack of sequence-defining fragment ions between the first and second amino acids. Likewise, the absence of cleavage between Ala and Gly could be construed as being because of the presence of Gln. Another example is when an adventitious fragment ion results in Gly-Gly in the de novo sequence when it is just Asn. To further illustrate the idea of mass-based alignment, Figs. 1B–D show a few actual cases where the top de novo sequences (all with scores > 60) were manually aligned with the top database hit (MS/MS spectra are shown in supplemental Fig. S2). In Fig. 1B there was very little fragmentation in the middle of the peptide, yet 75% of the amino acid masses could be aligned at the two termini. Likewise, 50% of the amino acid masses could be aligned at the termini in Fig. 1C; however, in this case, the de novo sequence mistakenly jumps between y- and b-ion series. The database sequence contains a subsequence of LLEVE, whereas the de novo sequence has this reversed. There is really no alignment in Fig. 1D, but this was a case where two peptides were co-fragmented—the database search picked out one of the peptides and Novor sequenced the other one. In fact, a BLAST search of the de novo sequence showed an exact match to a human calmodulin tryptic peptide, where Novor assigned Ala-Ala instead of carbamylated-Val at the N terminus. 
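The idea of mass-based alignment can be made concrete with a small Python sketch. It is a simplified heuristic written for illustration, not the published alignment algorithms cited above and not the authors' implementation. It assumes both sequences explain the same precursor mass, uses unmodified monoisotopic residue masses, treats a stretch as aligned only when the two sequences share mass boundaries no more than `max_block` residues apart (a hypothetical parameter), and reuses the 0.02 Da fragment tolerance as the boundary tolerance.

```python
# Monoisotopic residue masses (Da) for unmodified amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def prefix_masses(sequence):
    """Cumulative residue masses, starting at 0.0 (the N terminus)."""
    masses = [0.0]
    for residue in sequence:
        masses.append(masses[-1] + RESIDUE_MASS[residue])
    return masses


def aligned_mass_fraction(de_novo, database, tol=0.02, max_block=2):
    """Fraction of the database peptide mass lying in short blocks that
    have the same mass in both sequences."""
    a = prefix_masses(de_novo)
    b = prefix_masses(database)
    shared = []  # index pairs where the two prefix masses agree
    i = j = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tol:
            shared.append((i, j))
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    aligned = 0.0
    for (i0, j0), (i1, j1) in zip(shared, shared[1:]):
        # Short equal-mass blocks cover swaps and cases like Gly-Gly vs Asn.
        if i1 - i0 <= max_block and j1 - j0 <= max_block:
            aligned += b[j1] - b[j0]
    return aligned / b[-1] if b[-1] else 0.0


def is_correct(de_novo, database, min_fraction=0.70):
    """Apply the 70% aligned-mass criterion described in the text."""
    return aligned_mass_fraction(de_novo, database) >= min_fraction


if __name__ == "__main__":
    print(is_correct("EPPTIDEK", "PEPTIDEK"))    # swapped N-terminal pair
    print(is_correct("AGEVELLFK", "AGLLEVEFK"))  # scrambled middle region
```

Under these assumptions a reversed N-terminal pair or a Gly-Gly versus Asn substitution still counts as aligned, whereas a longer scrambled region such as a reversed five-residue stretch does not.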
However, for the purposes described here, if a mass-based alignment shows that at least 70% of the peptide mass is correct, then the de novo sequence is considered "correct." Novor results were compared with database search results with very low false discovery rates (<0.001), which were all considered to be correct. Data was collected from a tryptic digest of a human K562 cell lysate on a Thermo Fusion Orbitrap mass spectrometer that employed either resonance CID (occurring in the linear ion trap) or beam CID (so-called HCD), and either high or low accuracy/resolution measurements (orbitrap or linear ion trap, respectively). True positives (TP) are the number of correct de novo sequences at or above a given Novor sequence score. False positives (FP) are the number of incorrect de novo sequences at or above a given score. False negatives (FN) are the number of correct de novo sequences below a given score. It should be noted again that a "correct" de novo sequence only needs to be able to mass align 70% of the peptide molecular weight (Fig. 1A). Fig. 2A shows the number of de novo sequences whose sequence score exceeds the score necessary to result in a specified precision, and Fig. 2B shows the precision-recall curves for the same data.
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$
Novor is most successful if fragment ion masses are measured with high accuracy and resolution. The effect of activation type (resonance versus beam CID) depended on the mass accuracy: beam CID is better for high accuracy fragment mass measurement, and resonance CID is better for low accuracy. This most likely reflects the types of data used to train Novor, and not anything intrinsic to the information content of beam versus resonance CID fragmentations. Supplemental Table S1 shows the Novor scores and number of de novo sequences obtained for beam and resonance CID with accurate mass MS/MS data for a few different precision values.
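The TP, FP, and FN definitions above translate directly into a short calculation. The following Python sketch is illustrative only: the function name and the input layout (a list of Novor score and correctness pairs) are assumptions, and the demonstration values are made up rather than taken from the study.

```python
import math
from typing import Iterable, Tuple


def precision_recall(
    scored: Iterable[Tuple[float, bool]], threshold: float
) -> Tuple[float, float]:
    """Precision and recall at a given Novor sequence score threshold.

    scored holds (novor_score, is_correct) pairs, where is_correct reflects
    the 70% mass-alignment criterion against a high-confidence database match.
    """
    tp = fp = fn = 0
    for score, correct in scored:
        if score >= threshold:
            if correct:
                tp += 1  # correct sequence at or above the threshold
            else:
                fp += 1  # incorrect sequence at or above the threshold
        elif correct:
            fn += 1      # correct sequence below the threshold
    precision = tp / (tp + fp) if (tp + fp) else math.nan
    recall = tp / (tp + fn) if (tp + fn) else math.nan
    return precision, recall


if __name__ == "__main__":
    # Made-up values for demonstration only.
    demo = [(90, True), (80, True), (70, False), (50, True), (30, False)]
    for threshold in (0, 60, 85):
        print(threshold, precision_recall(demo, threshold))
```

Sweeping the threshold from 0 to 100 and recording each pair of values yields the kind of precision-recall table from which a working score cutoff can be chosen.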
For example, if a precision of 0.95 is acceptable, a Novor sequence score threshold of 58 was able to recall over 91% of the correct de novo sequences when using beam CID. In this case, there were 12,342 de novo sequences out of a total of 16,062 spectra identified in a database search, of which 5% were incorrect (as defined above). Hence, a Novor sequence score threshold in the range of 60 to 70 would seem to be an appropriate choice for selecting de novo sequences for further consideration. Similarly, supplemental Table S2 shows results for linear trap MS/MS data. All subsequent data was acquired using beam CID with high mass accuracy orbitrap mass measurements of the fragment ions. It can be difficult to evaluate and troubleshoot a new sample preparation method or a new type of sample, especially when there is no obvious FASTA file to search. One ty