Article Open access Peer-reviewed

Accurate Estimation of Context-Dependent False Discovery Rates in Top-Down Proteomics

2019; Elsevier BV; Volume: 18; Issue: 4; Language: English

10.1074/mcp.ra118.000993

ISSN

1535-9484

Authors

Richard D. LeDuc, Ryan T. Fellers, Bryan P. Early, Joseph B. Greer, Daniel P. Shams, Paul M. Thomas, Neil L. Kelleher

Topic(s)

Gene expression and cancer classification

Abstract

Within the last several years, top-down proteomics has emerged as a high-throughput technique for protein and proteoform identification. This technique has the potential to identify and characterize thousands of proteoforms within a single study, but the absence of accurate false discovery rate (FDR) estimation could hinder the adoption and consistency of top-down proteomics in the future. In automated identification and characterization of proteoforms, FDR calculation strongly depends on the context of the search. The context includes MS data quality, the database being interrogated, the search engine, and the parameters of the search. Particular to top-down proteomics, there are four molecular levels of study: proteoform spectral match (PrSM), protein, isoform, and proteoform. Here, a context-dependent framework for calculating an accurate FDR at each level was designed, implemented, and validated against a manually curated training set with 546 confirmed proteoforms. We examined several search contexts and found that an FDR calculated at the PrSM level under-reported the true FDR at the protein level by an average of 24-fold. We present a new open-source tool, the TDCD_FDR_Calculator, which provides a scalable, context-dependent FDR calculation that can be applied post-search to enhance the quality of results in top-down proteomics from any search engine.

The abbreviations used are: CD FDR, context-dependent false discovery rate; CSV, comma separated value file; Decoy PrSM, proteoform spectral match to a decoy database; FDR, false discovery rate; mzML, an XML file format; PrSM, proteoform spectral match; PTM, post-translational modification; SNP, single nucleotide polymorphism.

Accurate and efficient false discovery rate (FDR) determination of protein and proteoform identifications is needed to improve top-down proteomics for large-scale, automated proteoform discovery (qualitative analysis) and relative quantification (quantitative analysis) (1. Ntai I. Toby T.K. LeDuc R.D. Kelleher N.L. A method for label-free, differential top-down proteomics. Methods Mol. Biol. 2016; 1410: 121-133; 2. Ntai I. LeDuc R.D. Fellers R.T. Erdmann-Gilmore P. Davies S.R. Rumsey J. Early B.P.
Thomas P.M. Li S. Compton P.D. Ellis M.J. Ruggles K.V. Fenyo D. Boja E.S. Rodriguez H. Townsend R.R. Kelleher N.L. Integrated bottom-up and top-down proteomics of patient-derived breast tumor xenografts. Mol. Cell. Proteomics. 2016; 15: 45-56). Discovery top-down proteomics uses LC-MS/MS to analyze complex samples and determine their proteoform composition without protease digestion, and employs various search algorithms to identify proteoforms.

Over time, the bottom-up proteomics community has developed a more accurate, global FDR solution that scales well (3. Serang O. Kall L. Solution to statistical challenges in proteomics is more statistics, not less. J. Proteome Res. 2015; 14: 4099-4103). A very large-scale 2014 study found 18,097 protein entries from multiple parts of the human body (4. Wilhelm M. Schlegl J. Hahne H. Moghaddas Gholami A. Lieberenz M. Savitski M.M. Ziegler E. Butzmann L. Gessulat S. Marx H. Mathieson T. Lemeer S. Schnatbaum K. Reimer U. Wenschuh H. Mollenhauer M. Slotta-Huspenina J. Boese J.H. Bantscheff M. Gerstmair A. Faerber F. Kuster B. Mass-spectrometry-based draft of the human proteome. Nature. 2014; 509: 582-587), whereas a subsequent reanalysis with a more accurate estimation of a 1% FDR at the protein level revised this number down to 15,375 protein entries (5. Savitski M.M. Wilhelm M. Hahne H. Kuster B. Bantscheff M.
A Scalable Approach for Protein False Discovery Rate Estimation in Large Proteomic Data Sets. Mol. Cell. Proteomics. 2015; 14: 2394-2404). A major complexity in bottom-up proteomics is the "protein inference problem," in which individual peptides can be shared among highly related proteins from different genes, or among isoforms and proteoforms of a single protein (6. Burger T. Gentle Introduction to the Statistical Foundations of False Discovery Rate in Quantitative Proteomics. J. Proteome Res. 2018; 17: 12-22). This gives rise to the need for bottom-up studies to report protein groups. Systems developed for bottom-up FDR estimation (7. Noble W.S. MacCoss M.J. Computational and statistical analysis of protein mass spectrometry data. PLoS Comput. Biol. 2012; 8: e1002296; 8. Benjamini Y. Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statistical Soc. 1995; 57: 289-300; 9. Hather G. Higdon R. Bauman A. von Haller P.D. Kolker E. Estimating false discovery rates for peptide and protein identification using randomized databases. Proteomics. 2010; 10: 2369-2376; 10. Higdon R. Hogan J.M. Kolker N. van Belle G. Kolker E. Experiment-specific estimation of peptide identification probabilities using a randomized database. Omics. 2007; 11: 351-365) do not consider the individual molecular levels discovered by top-down (Fig. 1) and thus cannot be used in this context. Likewise, the targeted top-down analysis tools that have been available for many years (11. LeDuc R.D. Taylor G.K. Kim Y.B. Januszyk T.E. Bynum L.H. Sola J.V. Garavelli J.S. Kelleher N.L. ProSight PTM: an integrated environment for protein identification and characterization by top-down mass spectrometry. Nucleic Acids Res.
2004; 32 (Web Server issue): W340-W345; 12. Meng F. Cargile B.J. Miller L.M. Forbes A.J. Johnson J.R. Kelleher N.L. Informatics and multiplexing of intact protein identification in bacteria and the archaea. Nat. Biotechnol. 2001; 19: 952-957) are optimized for expert-driven manual validation of the search results of one or a few Spectra. These tools provide several different search strategies (13. LeDuc R.D. Kelleher N.L. Using ProSight PTM and related tools for targeted protein identification and characterization with high mass accuracy tandem MS data. Current Protocols Bioinformatics. 2007; (Chapter 13, Unit 13.6); 14. Frank A.M. Pesavento J.J. Mizzen C.A. Kelleher N.L. Pevzner P.A. Interpreting top-down mass spectra using spectral alignment. Anal. Chem. 2008; 80: 2499-2505), but the accuracy of FDR determination remains understudied and has not been standardized across the community.

In high-throughput top-down proteomics, search algorithms function by scoring the match between a set of theoretical proteoforms and an observed set of MS1 and MS2 data. As in bottom-up proteomics, the MS1 and MS2 data object in top-down will here be referred to as a Spectrum, and it contains both intact and fragment masses. Likewise, a match between a Spectrum and a theoretical proteoform can be called a Proteoform Spectral Match (PrSM) (Fig. 1). Typically, spectral data are converted to the neutral mass regime (15. Horn D.M. Zubarev R.A. McLafferty F.W. Automated reduction and interpretation of high resolution electrospray mass spectra of large molecules. J. Am. Soc. for Mass Spectrom. 2000; 11: 320-332); this is true even for most spectral alignment approaches (16. Kou Q. Xun L. Liu X. TopPIC: a software tool for top-down mass spectrometry-based proteoform identification and characterization. Bioinformatics.
2016; 32: 3495-3497; 17. Liu X. Sirotkin Y. Shen Y. Anderson G. Tsai Y.S. Ting Y.S. Goodlett D.R. Smith R.D. Bafna V. Pevzner P.A. Protein identification using top-down spectra. Mol. Cell. Proteomics. 2012; 11 (M111.008524)). Matches between Spectra and theoretical proteoforms (PrSMs) can then be manually validated by mass spectrometrists to determine which proteoforms are present in the sample. Although this approach is very labor intensive and subject to human interpretation, it has been used successfully in the past (1, 2). Although automating this process will greatly accelerate the field of discovery top-down proteomics, it will also require calculating a reliable FDR.

In proteomics usage, proteoforms are not the same thing as proteins (18. Smith L.M. Kelleher N.L. Proteoform: a single term describing protein complexity. Nat. Methods. 2013; 10: 186-187); see Fig. 1. Proteins represent a collection of expressed proteoforms, and although the proteoform is a form of a given protein, a protein is usually expressed as multiple proteoforms. Each measured proteoform results from a series of molecular processing events starting with transcription and ending with post-translational modifications.
Each gene is typically associated with a canonical amino acid sequence. This sequence is often different from that observed in biological samples: there may be different alleles or coding SNPs creating sequence variants; transcripts from a given gene can be alternatively spliced to form multiple isoforms (19. Yang X. Coulombe-Huntington J. Kang S. Sheynkman G.M. Hao T. Richardson A. Sun S. Yang F. Shen Y.A. Murray R.R. Spirohn K. Begg B.E. Duran-Frigola M. MacWilliams A. Pevzner S.J. Zhong Q. Trigg S.A. Tam S. Ghamsari L. Sahni N. Yi S. Rodriguez M.D. Balcha D. Tan G. Costanzo M. Andrews B. Boone C. Zhou X.J. Salehi-Ashtiani K. Charloteaux B. Chen A.A. Calderwood M.A. Aloy P. Roth F.P. Hill D.E. Iakoucheva L.M. Xia Y. Vidal M. Widespread expansion of protein interaction capabilities by alternative splicing. Cell. 2016; 164: 805-817), which also translate into different amino acid sequences; and proteoforms can have covalently attached, site-specific features enzymatically added to form post-translational modifications (PTMs). The sum of these events leads to a population of multiple molecules, each being a unique proteoform (18). An expressed protein is the population, or family, of its expressed proteoforms (20. Shortreed M.R. Frey B.L. Scalf M. Knoener R.A. Cesnik A.J. Smith L.M. Elucidating proteoform families from proteoform intact-mass and lysine-count measurements. J. Proteome Res. 2016; 15: 1213-1221).

This is further complicated by isoforms. Some proteins, such as the human high mobility group protein (P17096-1 and P17096-2), come in multiple isoforms. These isoforms have differing amino acid sequences and different modifications (21. Tran J.C. Zamdborg L. Ahlf D.R. Lee J.E. Catherman A.D. Durbin K.R.
Tipton J.D. Vellaichamy A. Kellie J.F. Li M. Wu C. Sweet S.M. Early B.P. Siuti N. LeDuc R.D. Compton P.D. Thomas P.M. Kelleher N.L. Mapping intact protein isoforms in discovery mode using top-down proteomics. Nature. 2011; 480: 254-258). In systems where a confidence metric specifically at the isoform level is desired, FDRs can now be calculated for this purpose. However, we expect that most studies will focus mainly on protein- and proteoform-level FDR values.

The concept of an expressed protein exists to help us simplify the complexity associated with understanding biological function at the molecular level and has worked well for bottom-up proteomics. However, the presence of multiple proteoforms arising from a single gene makes the concept of an expressed protein more complex, as it is the proteoforms that are expressed. Although unmodified protein sequences, as directly encoded by a gene, are frequently expressed in bacteria, in eukaryotes this appears to occur less often than the expression of modified sequences. (For example, in one study (21) only 106 of 1046, or 10.1%, of the discovered proteoforms are unmodified, whereas the Catherman data set presented below has only 4.5% unmodified proteoforms.) This duality, that the protein entity encoded by a gene is rarely expressed while modified proteoforms more commonly occur in the physical world, complicates FDR determination in top-down proteomics.

Observing proteoforms is further complicated by issues of identification and characterization.
To avoid confusion, the term Protein Identification here means simply the act of assigning a gene product to a gene, or in practice to a protein accession in a gene-centric protein knowledgebase. It should be noted that there may or may not be data that allow the exact determination of proteoforms with complete molecular specificity. Thus, a protein may be identified as present in a sample even if no proteoforms are fully characterized. In contrast, full characterization only occurs when the molecular specificity of a proteoform can be determined within the context of the search. Proteoforms containing unknown mass shifts or incompletely localized PTMs are said to be partially characterized. In top-down proteomics, it is less useful to discuss characterizing an isoform or protein, except in the case of very simple systems where there is only a single proteoform produced by a given gene.

Identification can occur at three independent molecular levels (Fig. 1). The protein entry is a gene-level identification and is denoted with an accession number to a gene-centric database like UniProtKB (22. The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 2017; 45 (Database Issue): D158-D169; 23. Junker V.L. Apweiler R. Bairoch A. Representation of functional information in the SWISS-PROT data bank. Bioinformatics. 1999; 15: 1066-1067), whereas a proteoform identification refers to one combination of modifications with a single primary structure. A proteoform should be reported with an accession number from a proteoform-centric database like the one maintained by the Consortium for Top-Down Proteomics (24. Consortium for Top-Down Proteomics. TopDownProteomics.org (accessed 12/6/2017)). Between the protein entry and proteoform level, there exists an isoform entry level.
Identifying an isoform means that there is evidence for the presence of a given sequence arising from alternative splicing or translational start site in the sample, independent of which PTMs might be present.

Top-down search engines assign a numeric score to the degree of matching between a Spectrum and a candidate proteoform. ProSight (11) uses the P-score (12). This score is a nonlinear transformation of the number of fragment ions matching between the candidate proteoform and the PrSM, where the nonlinear response is governed by the search parameters. Well-suited for targeted studies, this score allows gene-level identification but cannot automatically distinguish partial characterizations, a shortcoming alleviated by the C-score (25. LeDuc R.D. Fellers R.T. Early B.P. Greer J.B. Thomas P.M. Kelleher N.L. The C-score: a Bayesian framework to sharply improve proteoform scoring in high-throughput top down proteomics. J. Proteome Res. 2014; 13: 3231-3240). Other search engines use other scoring approaches (e.g. TopPIC reports p values and E-values, Informed Proteomics reports Probabilities (sic) and E-values) (16, 17; 26. Park J. Piehowski P.D. Wilkins C. Zhou M. Mendoza J. Fujimoto G.M. Gibbons B.C. Shaw J.B. Shen Y. Shukla A.K. Moore R.J. Liu T. Petyuk V.A. Tolic N. Pasa-Tolic L. Smith R.D. Payne S.H. Kim S. Informed-Proteomics: open-source software package for top-down proteomics. Nat. Methods. 2017; 14: 909-914).

The fundamental problem, regardless of search engine, is determining when a continuous score (i.e. one that can have continuous values over a given range) is sufficiently good to allow the assertion that the Spectrum represents a specific proteoform. The search engine returns a set of putative discoveries which can be ranked by score from best to worst, and a cutoff can be found which determines the so-called "selected discoveries." Ideally, the cutoff separates true discoveries from false discoveries. The level of this cutoff must be determined by the needs of the individual experiment. For example, studies interested in biomarker discovery (27. Toby T.K. Abecassis M. Kim K. Thomas P.M. Fellers R.T. LeDuc R.D. Kelleher N.L. Demetris J. Levitsky J. Proteoforms in peripheral blood mononuclear cells as novel rejection biomarkers in liver transplant recipients. Am. J. Transplantation. 2017; 17: 2458-2467) may set a more permissive proteoform-level FDR, whereas studies looking for protein expression differences in specific brain regions (28. Davis R.G. Park H.-M. Kim K. Greer J.B. Fellers R.T. LeDuc R.D. Romanova E.V. Rubakhin S.S. Zombeck J.A. Wu C. Yau P.M. Gao P. van Nispen A.J. Patrie S.M. Thomas P.M. Sweedler J.V. Rhodes J.S. Kelleher N.L. Top-down proteomics enables comparative analysis of brain proteoforms between mouse strains. Anal. Chem.
2018; 90: 3802-3810) may set a more stringent 1% protein-level FDR. Burger has recently reviewed this process in the context of bottom-up proteomics (6).

The distribution of scores will vary with the context of the search. The context is defined here as the set of the Spectra searched, the database searched against, the search engine, and the parameters used. To determine whether a score is good enough to allow identification in a given context, we need to infer the null distribution of the score within that context (6). The null distribution is an unknown probability distribution associated with a context that reports the probability of a given score (or better) occurring by chance alone. One approach to inferring the null distribution is to reverse or scramble the searched database such that none of the candidate proteoforms in the new decoy database could represent real ions measured by the observed Spectra. Then, by duplicating the search context against this decoy database, any PrSMs returned are known to be false, and are called Decoy PrSMs. The distribution of these false scores can then be used as a surrogate for the null distribution (29. Aggarwal S. Yadav A.K. False Discovery Rate Estimation in Proteomics. In: Jung K. Statistical Analysis in Proteomics. Springer New York, New York, NY, 2016: 119-128).

Two approaches, parametric and non-parametric, have been used to determine the PrSM-level FDR from a decoy distribution.
With a non-parametric approach, for a given list of PrSMs, the FDR is a function of the number of Decoy PrSMs scoring equal to or better than the observed Forward PrSM (29; 30. North B.V. Curtis D. Sham P.C. A note on the calculation of empirical P values from Monte Carlo procedures. Am. J. Hum. Genet. 2002; 71: 439-441). A parametric FDR attempts to model the distribution of Decoy PrSMs. This can be done by itself (31. Tran J.C. Zamdborg L. Ahlf D.R. Lee J.E. Catherman A.D. Durbin K.R. Tipton J.D. Vellaichamy A. Kellie J.F. Li M. Wu C. Sweet S.M.M. Early B.P. Siuti N. LeDuc R.D. Compton P.D. Thomas P.M. Kelleher N.L. Mapping intact protein isoforms in discovery mode using top down proteomics. Nature. 2011; 480: 254-258), or coupled with expert-driven manual validation (21; 32. Catherman A.D. Durbin K.R. Ahlf D.R. Early B.P. Fellers R.T. Tran J.C. Thomas P.M. Kelleher N.L. Large-scale top-down proteomics of the human proteome: membrane proteins, mitochondria, and senescence. Mol. Cell. Proteomics. 2013; 12: 3465-3473). Both parametric and non-parametric approaches have their strengths and weaknesses.
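The non-parametric estimate described above can be illustrated with a minimal sketch. It is not the TDCD_FDR_Calculator implementation. It assumes that higher scores are better, that the forward and decoy databases are the same size, and that the FDR at a score is simply the ratio of Decoy PrSMs to Forward PrSMs scoring at least that well. The function names `nonparametric_fdr` and `q_values` are hypothetical.

```python
from bisect import bisect_left


def nonparametric_fdr(forward_scores, decoy_scores):
    """Return {score: estimated FDR} for each distinct forward score.

    Assumption (not from the paper): FDR(s) = (#decoys >= s) / (#forwards >= s),
    with higher scores being better and the ratio capped at 1.0.
    """
    fwd = sorted(forward_scores)
    dec = sorted(decoy_scores)
    fdr = {}
    for s in set(forward_scores):
        n_fwd = len(fwd) - bisect_left(fwd, s)  # forward PrSMs scoring >= s
        n_dec = len(dec) - bisect_left(dec, s)  # decoy PrSMs scoring >= s
        fdr[s] = min(1.0, n_dec / n_fwd)
    return fdr


def q_values(forward_scores, decoy_scores):
    """Make the estimate monotone: the lowest FDR at any cutoff that keeps s."""
    fdr = nonparametric_fdr(forward_scores, decoy_scores)
    q = {}
    running_min = 1.0
    for s in sorted(fdr):  # from worst score to best score
        running_min = min(running_min, fdr[s])
        q[s] = running_min
    return q


if __name__ == "__main__":
    forward = [50, 40, 30, 20, 10]  # toy scores, invented for illustration
    decoy = [25, 12, 5]
    for score, value in sorted(q_values(forward, decoy).items(), reverse=True):
        print(score, value)
```

In this toy input, every forward score above the best decoy score (25) receives an estimate of zero, because no decoy is left to count against it.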
Non-parametric approaches are more robust against irregularities in the decoy distribution but fail to provide any information about the FDR associated with Forward PrSMs scoring above the best Decoy PrSM. Conversely, parametric solutions suffer from errors in correctly modeling the null distribution but provide FDR information about all Forward PrSMs. Tools using non-parametric approaches are available (17, 26). These tools use a model-free approach to empirically estimate null distributions and control the FDR at the PrSM level. Such a non-parametric system has been used by early versions of the TDPortal, a customized Galaxy search portal (33. Afgan E. Baker D. van den Beek M. Blankenberg D. Bouvier D. Čech M. Chilton J. Clements D. Coraor N. Eberhard C. Grüning B. Guerler A. Hillman-Jackson J. Von Kuster G. Rasche E. Soranzo N. Turaga N. Taylor J. Nekrutenko A. Goecks J. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update. Nucleic Acids Res. 2016; 44 (Web Server issue): W3-W10) available through the National Resource for Translational and Developmental Proteomics (27; 34. Fornelli L. Durbin K.R. Fellers R.T. Early B.P. Greer J.B. LeDuc R.D. Compton P.D. Kelleher N.L. Advancing top-down analysis of the human proteome using a benchtop quadrupole-orbitrap mass spectrometer. J. Proteome Res. 2017; 16: 609-618; 35. Anderson L.C. DeHart C.J. Kaiser N.K. Fellers R.T. Smith D.F. Greer J.B. LeDuc R.D. Blakney G.T. Thomas P.M. Kelleher N.L. Hendrickson C.L. Identification and characterization of human proteoforms by top-down LC-21 Tesla FT-ICR mass spectrometry. J. Proteome Res. 2017; 16: 1087-1096). The TDPortal is the first top-down tool to control FDR at not just the PrSM level, but also at the proteoform, isoform, and protein levels.

Here we present a logical structure for calculating an identification FDR at the proteoform, isoform, and protein level using PrSMs from their given search context. We supply software for performing this calculation, and a large set of curated spectral results as training data. We show that the FDR functions and scales correctly on the training data and on previously published results.

An algorithm for calculating the context-dependent FDR (CD FDR) associated with PrSM, proteoform, isoform, and protein identification was developed. This algorithm uses a non-parametric FDR for identification but enhances this with a parametric FDR for all those molecular entities identified with a score better than the best decoy score. The algorithm calculates a separate decoy distribution at each molecular level encountered in top-down proteomics. A training dataset with 546 manually validated PrSMs was built from data from two representative publications from two different laboratories (25, 26). TDCD_FDR_Calculator, software to implement the algorithm on search engine output, was developed and tested against the training dataset. Both the training datasets and the TDCD_FDR_Calculator tool are made available and described in detail below.

Prior to FDR calculation, each spectral data file was deconvoluted and deisotoped. Here the algorithm was tested on Pr
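The idea of a separate decoy distribution at each molecular level can be sketched as follows. This is an illustration only, not the TDCD_FDR_Calculator code. It assumes that each entity (proteoform, isoform, or protein) is represented by its best-scoring PrSM, that higher scores are better, and that the FDR at a cutoff is the ratio of decoy to forward entities passing it. The `PrSM` record, its field names, and `fdr_at_cutoff` are hypothetical, and the parametric extension for scores above the best decoy is omitted because its model is not specified in this section.

```python
from collections import namedtuple

# Hypothetical record; the field names are assumptions made for this sketch.
PrSM = namedtuple("PrSM", "score protein isoform proteoform is_decoy")


def best_scores(prsms, level):
    """Collapse PrSMs to one best score per entity at the given level."""
    if level == "prsm":
        return [p.score for p in prsms]
    best = {}
    for p in prsms:
        key = getattr(p, level)
        best[key] = max(best.get(key, float("-inf")), p.score)
    return list(best.values())


def fdr_at_cutoff(prsms, level, cutoff):
    """Decoy/forward ratio among entities scoring >= cutoff, capped at 1.0."""
    fwd = best_scores([p for p in prsms if not p.is_decoy], level)
    dec = best_scores([p for p in prsms if p.is_decoy], level)
    n_fwd = sum(s >= cutoff for s in fwd)
    n_dec = sum(s >= cutoff for s in dec)
    return min(1.0, n_dec / n_fwd) if n_fwd else 0.0


# Toy data, invented for illustration: eight PrSMs pile onto one protein,
# while two weaker forward hits and two decoy hits are each on their own.
prsms = [PrSM(30 + i, "P1", "P1-1", f"PF{i % 2}", False) for i in range(8)]
prsms += [
    PrSM(12, "P2", "P2-1", "PF2", False),
    PrSM(11, "P3", "P3-1", "PF3", False),
    PrSM(13, "D1", "D1-1", "DPF1", True),
    PrSM(12, "D2", "D2-1", "DPF2", True),
]

if __name__ == "__main__":
    for level in ("prsm", "proteoform", "isoform", "protein"):
        print(level, round(fdr_at_cutoff(prsms, level, cutoff=10), 3))
```

In this toy set the same cutoff gives a lower estimate at the PrSM level than at the protein level, because the many PrSMs from one protein collapse to a single entity while each decoy hit stays separate. The numbers are invented and say nothing about the size of the effect in real data.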

Reference(s)