A Scalable Approach for Protein False Discovery Rate Estimation in Large Proteomic Data Sets
2015; Elsevier BV; Volume: 14; Issue: 9; Language: English
DOI: 10.1074/mcp.m114.046995
ISSN: 1535-9484
Authors: Mikhail M. Savitski, Mathias Wilhelm, Hannes Hahne, Bernhard Küster, Marcus Bantscheff
Topic(s): Metabolomics and Mass Spectrometry Studies
Abstract: Calculating the number of confidently identified proteins and estimating false discovery rate (FDR) is a challenge when analyzing very large proteomic data sets such as entire human proteomes. Biological and technical heterogeneity in proteomic experiments further adds to the challenge, and there are strong differences in opinion regarding the conceptual validity of a protein FDR and no consensus regarding the methodology for protein FDR determination. There are also limitations inherent to the widely used classic target–decoy strategy that particularly show when analyzing very large data sets and that lead to a strong over-representation of decoy identifications. In this study, we investigated the merits of the classic, as well as a novel target–decoy-based protein FDR estimation approach, taking advantage of a heterogeneous data collection comprising ∼19,000 LC-MS/MS runs deposited in ProteomicsDB (https://www.proteomicsdb.org). The "picked" protein FDR approach treats target and decoy sequences of the same protein as a pair rather than as individual entities and chooses either the target or the decoy sequence depending on which receives the higher score. We investigated the performance of this approach in combination with q-value based peptide scoring to normalize sample-, instrument-, and search engine-specific differences. The "picked" target–decoy strategy performed best when protein scoring was based on the best peptide q-value for each protein, yielding a stable number of true positive protein identifications over a wide range of q-value thresholds. We show that this simple and unbiased strategy eliminates a conceptual issue in the commonly used "classic" protein FDR approach that causes overprediction of false-positive protein identification in large data sets.
The approach scales from small to very large data sets without losing performance, consistently increases the number of true-positive protein identifications and is readily implemented in proteomics analysis software.
Shotgun proteomics is the most popular approach for large-scale identification and quantification of proteins. The rapid evolution of high-end mass spectrometers in recent years (1. Scheltema R.A. Hauschild J.P. Lange O. Hornburg D. Denisov E. Damoc E. Kuehn A. Makarov A. Mann M. The Q Exactive hf, a benchtop mass spectrometer with a prefilter, high performance Quadrupole, and an ultra-high field Orbitrap analyzer. Mol. Cell. Proteomics. 2014; 13: 3698-3708; 2. Kelstrup C.D. Jersie-Christensen R.R. Batth T.S. Arrey T.N. Kuehn A. Kellmann M. Olsen J.V. Rapid and deep proteomes by faster sequencing on a benchtop Quadrupole ultra-high-field Orbitrap mass spectrometer. J. Proteome Res. 2014; 3: 6187-6195; 3. Helm D. Vissers J.P. Hughes C.J. Hahne H. Ruprecht B. Pachl F. Grzyb A. Richardson K. Wildgoose J. Maier S.K. Marx H. Wilhelm M. Becher I. Lemeer S. Bantscheff M. Langridge J.I. Kuster B. Ion mobility tandem mass spectrometry enhances performance of bottom-up proteomics. Mol. Cell. Proteomics. 2014; 13: 3709-3715; 4. Yamana R. Iwasaki M. Wakabayashi M. Nakagawa M. Yamanaka S. Ishihama Y. Rapid and deep profiling of human induced pluripotent stem cell proteome by one-shot NanoLC-MS/MS analysis with meter-scale monolithic silica columns. J. Proteome Res. 2013; 12: 214-221; 5. Hebert A.S. Richards A.L. Bailey D.J. Ulbrich A. Coughlin E.E.
Westphall M.S. Coon J.J. The one hour yeast proteome. Mol. Cell. Proteomics. 2014; 13: 339-347) has made proteomic studies feasible that identify and quantify as many as 10,000 proteins in a sample (6. Moghaddas Gholami A. Hahne H. Wu Z. Auer F.J. Meng C. Wilhelm M. Kuster B. Global proteome analysis of the NCI-60 cell line panel. Cell Rep. 2013; 4: 609-620; 7. Kulak N.A. Pichler G. Paron I. Nagaraj N. Mann M. Minimal, encapsulated proteomic-sample processing applied to copy-number estimation in eukaryotic cells. Nat. Methods. 2014; 11: 319-324; 8. Ritorto M.S. Cook K. Tyagi K. Pedrioli P.G. Trost M. Hydrophilic strong anion exchange (hSAX) chromatography for highly orthogonal peptide separation of complex proteomes. J. Proteome Res. 2013; 12: 2449-2457) and enables many lines of new scientific research including, for example, the analysis of many human proteomes, and proteome-wide protein–drug interaction studies (9. Wilhelm M. Schlegl J. Hahne H. Moghaddas Gholami A. Lieberenz M. Savitski M.M. Ziegler E. Butzmann L. Gessulat S. Marx H. Mathieson T. Lemeer S. Schnatbaum K. Reimer U. Wenschuh H. Mollenhauer M. Slotta-Huspenina J. Boese J.H. Bantscheff M. Gerstmair A. Faerber F. Kuster B. Mass-spectrometry-based draft of the human proteome. Nature. 2014; 509: 582-587; 10. Kim M.S. Pinto S.M. Getnet D. Nirujogi R.S. Manda S.S. Chaerkady R. Madugundu A.K. Kelkar D.S. Isserlin R. Jain S. Thomas J.K. Muthusamy B. Leal-Rojas P. Kumar P. Sahasrabuddhe N.A. Balakrishnan L. Advani J. George B. Renuse S. Selvan L.D. Patil A.H. Nanjappa V. Radhakrishnan A. Prasad S. Subbannayya T. Raju R. Kumar M. Sreenivasamurthy S.K. Marimuthu A. Sathe G.J. Chavan S. Datta K.K. Subbannayya Y. Sahu A. Yelamanchi S.D. Jayaram S. Rajagopalan P.
Sharma J. Murthy K.R. Syed N. Goel R. Khan A.A. Ahmad S. Dey G. Mudgal K. Chatterjee A. Huang T.C. Zhong J. Wu X. Shaw P.G. Freed D. Zahari M.S. Mukherjee K.K. Shankar S. Mahadevan A. Lam H. Mitchell C.J. Shankar S.K. Satishchandra P. Schroeder J.T. Sirdeshmukh R. Maitra A. Leach S.D. Drake C.G. Halushka M.K. Prasad T.S. Hruban R.H. Kerr C.L. Bader G.D. Iacobuzio-Donahue C.A. Gowda H. Pandey A. A draft map of the human proteome. Nature. 2014; 509: 575-581; 11. Savitski M.M. Reinhard F.B. Franken H. Werner T. Savitski M.F. Eberhard D. Martinez Molina D. Jafari R. Dovega R.B. Klaeger S. Kuster B. Nordlund P. Bantscheff M. Drewes G. Proteomics. Tracking cancer drugs in living cells by thermal profiling of the proteome. Science. 2014; 346: 1255784). One fundamental step in most proteomic experiments is the identification of proteins in the biological system under investigation. To achieve this, proteins are digested into peptides, analyzed by LC-MS/MS, and tandem mass spectra are used to interrogate protein sequence databases using search engines that match experimental data to data generated in silico (12. Nesvizhskii A.I. Vitek O. Aebersold R. Analysis and validation of proteomic data generated by tandem mass spectrometry. Nat. Methods. 2007; 4: 787-797; 13. Nesvizhskii A.I. Aebersold R. Interpretation of shotgun proteomic data: the protein inference problem. Mol. Cell. Proteomics. 2005; 4: 1419-1440). Peptide spectrum matches (PSMs) are commonly assigned by a search engine using either a heuristic or a probabilistic scoring scheme (14. Eng J.K. McCormack A.L. Yates J.R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectr. 1994; 5: 976-989; 15. Cox J.
Neuhauser N. Michalski A. Scheltema R.A. Olsen J.V. Mann M. Andromeda: a peptide search engine integrated into the MaxQuant environment. J. Proteome Res. 2011; 10: 1794-1805; 16. Perkins D.N. Pappin D.J. Creasy D.M. Cottrell J.S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999; 20: 3551-3567; 17. Craig R. Beavis R.C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics. 2004; 20: 1466-1467; 18. Geer L.Y. Markey S.P. Kowalak J.A. Wagner L. Xu M. Maynard D.M. Yang X. Shi W. Bryant S.H. Open mass spectrometry search algorithm. J. Proteome Res. 2004; 3: 958-964). Proteins are then inferred from identified peptides and a protein score or a probability derived as a measure for the confidence in the identification (13; 19. Serang O. Noble W. A review of statistical methods for protein identification using tandem mass spectrometry. Stat. Interface. 2012; 5: 3-20). Estimating the proportion of false matches (false discovery rate; FDR) in an experiment is important to assess and maintain the quality of protein identifications. Owing to its conceptual and practical simplicity, the most widely used strategy to estimate FDR in proteomics is the target–decoy database search strategy (target–decoy strategy; TDS) (20. Elias J.E. Gygi S.P. Target–decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods. 2007; 4: 207-214).
The main assumption underlying this idea is that random matches (false positives) should occur with similar likelihood in the target database and the decoy (reversed, shuffled, or otherwise randomized) version of the same database (21. Jeong K. Kim S. Bandeira N. False discovery rates in spectral identification. BMC Bioinformatics. 2012; 16: S2; 22. Nesvizhskii A.I. A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. J. Proteomics. 2010; 73: 2092-2123). The number of matches to the decoy database, therefore, provides an estimate of the number of random matches one should expect to obtain in the target database. The number of target and decoy hits can then be used to calculate either a local or a global FDR for a given data set (21; 22; 23. Choi H. Nesvizhskii A.I. False discovery rates and related statistical concepts in mass spectrometry-based proteomics. J. Proteome Res. 2008; 7: 47-50; 24. Kall L. Storey J.D. MacCoss M.J. Noble W.S. Posterior error probabilities and false discovery rates: two sides of the same coin. J. Proteome Res. 2008; 7: 40-44; 25. Blanco L. Mead J.A. Bessant C. Comparison of novel decoy database designs for optimizing protein identification searches using ABRF sPRG2006 standard MS/MS data sets. J. Proteome Res. 2009; 8: 1782-1791; 26. Wang G. Wu W.W. Zhang Z. Masilamani S. Shen R.F.
Decoy methods for assessing false positives and false discovery rates in shotgun proteomics. Anal. Chem. 2009; 81: 146-159). This general idea can be applied to control the FDR at the level of PSMs, peptides, and proteins, typically by counting the number of target and decoy observations above a specified score. Despite the significant practical impact of the TDS, it has been observed that a peptide FDR that results in an acceptable protein FDR (of say 1%) for a small or medium sized data set turns into an unacceptably high protein FDR when the data set grows larger (22; 27. Reiter L. Claassen M. Schrimpf S.P. Jovanovic M. Schmidt A. Buhmann J.M. Hengartner M.O. Aebersold R. Protein identification false discovery rates for very large proteomics data sets generated by tandem mass spectrometry. Mol. Cell. Proteomics. 2009; 8: 2405-2417). This is because the basic assumption of the classical TDS is compromised when a large proportion of the true positive proteins have already been identified. In small data sets, containing say only a few hundred to a few thousand proteins, random peptide matches will be distributed roughly equally over all decoy and "leftover" target proteins, allowing for a reasonably accurate estimation of false positive target identifications by using the number of decoy identifications. However, in large experiments comprising hundreds to thousands of LC-MS/MS runs, 10,000 or more target proteins may be genuinely and repeatedly identified, leaving an ever smaller number of (target) proteins to be hit by new false positive peptide matches.
In contrast, decoy proteins are only hit by the occasional random peptide match but fully count toward the number of false positive protein identifications estimated from the decoy hits. The higher the number of genuinely identified target proteins gets, the larger this imbalance becomes. If this is not corrected for in the decoy space, an overestimation of false positives will occur. This problem has been recognized; for example, Reiter and colleagues suggested a correction for the overestimation of false positive protein hits, termed MAYU (27). Following the main assumption that protein identifications containing false positive PSMs are uniformly distributed over the target database, MAYU models the number of false positive protein identifications using a hypergeometric distribution. Its parameters are estimated from the number of protein database entries and the total number of target and decoy protein identifications. The protein FDR is then estimated by dividing the number of expected false positive identifications (expectation value of the hypergeometric distribution) by the total number of target identifications. Although this approach was specifically designed for large data sets (tested on ∼1300 LC-MS/MS runs from digests of C. elegans proteins), it is not clear how far the approach actually scales. Another correction strategy for overestimation of false positive rates, the R factor, was suggested initially for peptides (28. Shteynberg D. Deutsch E.W. Lam H. Eng J.K. Sun Z. Tasman N. Mendoza L. Moritz R.L. Aebersold R. Nesvizhskii A.I.
iProphet: multilevel integrative analysis of shotgun proteomic data improves peptide and protein identification rates and error estimates. Mol. Cell. Proteomics. 2011; 10: M111-007690) and more recently for proteins (29. Shanmugam A.K. Yocum A.K. Nesvizhskii A.I. Utility of RNA-seq and GPMDB protein observation frequency for improving the sensitivity of protein identification by tandem MS. J. Proteome Res. 2014; 13: 4113-4119). A ratio, R, of forward and decoy hits in the low probability range is calculated, where the number of true peptide or protein identifications is expected to be close to zero, and hence, R should approximate one. The number of decoy hits is then multiplied (corrected) by the R factor when performing FDR calculations. The approach is conceptually simpler than the MAYU strategy and easy to implement, but is also based on the assumption that the inflation of the decoy hits intrinsic in the classic target–decoy strategy occurs to the same extent in all probability ranges. In the context of the above, it is interesting to note that there is currently no consensus in the community regarding if and how protein FDRs should be calculated for data of any size. One perhaps extreme view is that, owing to issues and assumptions related to the peptide to protein inference step and ways of constructing decoy protein sequences, protein level FDRs cannot be meaningfully estimated at all (30. Cottrell J. Does protein FDR have any meaning? http://www.matrixscience.com/blog/does-protein-fdr-have-any-meaning.htm. 2013). This is somewhat unsatisfactory as an estimate of protein level error in proteomic experiments is highly desirable. Others have argued that target–decoy searches are not even needed when accurate p values of individual PSMs are available (31. Gupta N. Bandeira N. Keich U. Pevzner P.A.
Target–decoy approach and false discovery rate: when things may go wrong. J. Am. Soc. Mass Spectr. 2011; 22: 1111-1120) whereas others choose to tighten the PSM or peptide FDRs obtained from TDS analysis to whatever threshold necessary to obtain a desired protein FDR (32. Farrah T. Deutsch E.W. Omenn G.S. Sun Z. Watts J.D. Yamamoto T. Shteynberg D. Harris M.M. Moritz R.L. State of the human proteome in 2013 as viewed through PeptideAtlas: comparing the kidney, urine, and plasma proteomes for the biology- and disease-driven Human Proteome Project. J. Proteome Res. 2014; 13: 60-75). This is likely too conservative. We have recently proposed an alternative protein FDR approach termed "picked" target–decoy strategy (picked TDS) that indicated improved performance over the classical TDS in a very large proteomic data set (9) but a systematic investigation of the idea had not been performed at the time. In this study, we further characterized the picked TDS for protein FDR estimation and investigated its scalability compared with that of the classic TDS FDR method in data sets of increasing size up to ∼19,000 LC-MS/MS runs. The results show that the picked TDS is effective in preventing decoy protein over-representation, identifies more true positive hits, and works equally well for small and large proteomic data sets.
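The difference between the classic and the picked strategy can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `ProteinHit` record, the function names, the toy scores, and the tie rule are assumptions made for illustration, and a decoy is assumed to carry the same `protein_id` as the target sequence it was derived from.

```python
from collections import namedtuple

# Hypothetical record for illustration: one scored database entry.
# A decoy is assumed to share the protein_id of the target it was derived from.
ProteinHit = namedtuple("ProteinHit", ["protein_id", "is_decoy", "score"])


def classic_fdr(hits, threshold):
    """Classic TDS: decoys divided by targets at or above the score threshold."""
    targets = sum(1 for h in hits if not h.is_decoy and h.score >= threshold)
    decoys = sum(1 for h in hits if h.is_decoy and h.score >= threshold)
    return decoys / targets if targets else 0.0


def pick(hits):
    """Picked TDS: keep the higher-scoring member of each target/decoy pair."""
    best = {}
    for h in hits:
        kept = best.get(h.protein_id)
        # Ties go to the target here; the tie rule is an assumption of this sketch.
        if kept is None or h.score > kept.score or (
            h.score == kept.score and kept.is_decoy and not h.is_decoy
        ):
            best[h.protein_id] = h
    return list(best.values())


def picked_fdr(hits, threshold):
    """Apply the same decoy/target ratio to the picked entries only."""
    return classic_fdr(pick(hits), threshold)


# Toy data: P1, P2, and P4 are confidently identified, P3 is not.
# Every decoy collects an occasional low-scoring random match.
example_hits = [
    ProteinHit("P1", False, 90.0), ProteinHit("P1", True, 12.0),
    ProteinHit("P2", False, 80.0), ProteinHit("P2", True, 15.0),
    ProteinHit("P3", False, 8.0), ProteinHit("P3", True, 20.0),
    ProteinHit("P4", False, 70.0), ProteinHit("P4", True, 11.0),
]

if __name__ == "__main__":
    print("classic:", classic_fdr(example_hits, 10.0))
    print("picked: ", picked_fdr(example_hits, 10.0))
```

In this toy example the classic count sees four decoys against three targets at a threshold of 10, whereas picking leaves a single decoy (the one paired with the unidentified protein P3), which mirrors the decoy over-representation described above.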
The data basis for this study was a large collection of LC-MS/MS runs along with the derived human protein identification data deposited in ProteomicsDB (https://www.proteomicsdb.org). At the time of writing, this comprised 19,013 LC-MS/MS runs, the majority of which represent two recently published drafts of the human proteome (9, 10). In ProteomicsDB, biological samples are grouped into experiments of varying number of LC-MS/MS runs. Raw MS files from each experiment were searched in parallel using Mascot (Matrix Science, London, UK) (16) and MaxQuant/Andromeda (15; 33. Cox J. Mann M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 2008; 26: 1367-1372) against a concatenated protein sequence database containing the UniProtKB complete human proteome (download date: September 5, 2012; 86,725 sequences) and cRAP (common Repository of Adventitious Proteins; download date: September 5, 2012; 113 sequences) as described (9).
Briefly, in the Mascot workflow, MS files were processed with Mascot Distiller using peak picking, deisotoping, and charge deconvolution.
The resulting peaklist files were searched with the target–decoy option enabled (on-the-fly search against a decoy database with reversed protein sequences), a precursor tolerance of 10 ppm and a fragment tolerance of 0.5 Da for collision-induced dissociation (CID) spectra and 0.05 Da for higher energy collision-induced dissociation (HCD) spectra, an enzyme specificity of trypsin, LysC, GluC, or chymotrypsin (as appropriate), a maximum of two missed cleavage sites, the Mascot 13C option of 1, and oxidation of Met as well as acetylation of the protein amino terminus as variable modifications. Additional variable and fixed modifications were set as appropriate for individual experiments (e.g. stable isotope labeling with amino acids in cell culture, tandem mass tag, or phosphorylation). In the MaxQuant workflow, MS files were searched against the same target–decoy protein sequence database as described above but using the Andromeda search engine. Proteases, variable and fixed modifications were specified as above. Mass accuracy of the precursor ions was determined by the time-dependent recalibration algorithm of MaxQuant, and fragment ion mass tolerance was set to 0.6 Da and 20 ppm for CID and HCD, respectively. Further details regarding sample handling and data acquisition can be found in (9). All numerical data required to reproduce the figures in this manuscript as well as the associated protein lists are tabulated in supplemental Table S1. Mascot and Andromeda database search parameters for selected reference data sets detailed in Fig.
4B are listed in supplemental Table S2. Search engine-specific local peptide length-dependent score cutoffs as reported in Wilhelm et al. (9) were calculated as follows. All peptide spectrum matches (PSMs) of the same length were binned separately for Mascot and Andromeda in intervals of one score point and smoothed by a moving average with a window size of five to account for fluctuations likely introduced by the scoring algorithm. The local false discovery rates in each score bin were calculated by dividing the number of decoy PSMs by the number of target PSMs and the resulting distribution was smoothed using a moving average with a window size of five to account for small fluctuations. The minimum score over all bins with a local false discovery rate less than 0.05 was defined to be the local peptide length-dependent cutoff. Normalized scores of PSMs were calculated by dividing the Mascot ion score or Andromeda score by the corresponding search-engine specific local peptide length-dependent cutoff. For the purpose of this study, a q-value is defined to be the minimum FDR at which a PSM, peptide, or protein will appear in the filtered output list. Such q-values are commonly used to filter a list of observations to obtain a particular FDR. Instead of using all PSMs for this purpose, we chose the PSM with the highest normalized search engine score that represents one peptide sequence detected at one charge state and carrying a particular peptide modification (termed PCM).
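The cutoff, score normalization, and q-value definition given above can be sketched in Python as follows. This is a hedged illustration, not the published code: the function names are hypothetical, scores are assumed to be non-negative, the moving average is assumed to be centered with a window that shrinks at the edges, and bins without target PSMs are assumed to have an infinite local FDR. The cutoff follows the literal wording of the text (lowest bin whose smoothed local FDR is below 0.05).

```python
import math


def moving_average(values, window=5):
    """Centered moving average; the window shrinks at the edges (an assumption)."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def length_cutoff(target_scores, decoy_scores, max_local_fdr=0.05, window=5):
    """Score cutoff for PSMs of one peptide length from one search engine."""
    scores = list(target_scores) + list(decoy_scores)
    n_bins = int(math.floor(max(scores))) + 1
    targets = [0] * n_bins
    decoys = [0] * n_bins
    for s in target_scores:          # bins are one score point wide
        targets[int(s)] += 1
    for s in decoy_scores:
        decoys[int(s)] += 1
    targets = moving_average(targets, window)
    decoys = moving_average(decoys, window)
    # Bins without targets get an infinite local FDR so they never qualify.
    local_fdr = [d / t if t > 0 else math.inf for d, t in zip(decoys, targets)]
    local_fdr = moving_average(local_fdr, window)
    passing = [b for b, f in enumerate(local_fdr) if f < max_local_fdr]
    return min(passing) if passing else None


def normalized_score(score, cutoff):
    """Search engine score divided by the length-dependent cutoff."""
    return score / cutoff


def q_values(is_decoy_sorted):
    """q-values for entries already sorted by decreasing score.

    The q-value of an entry is the minimum decoy/target ratio over all
    list positions at or below it, i.e. the minimum FDR at which the
    entry still appears in the filtered list.
    """
    fdr, targets, decoys = [], 0, 0
    for is_decoy in is_decoy_sorted:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdr.append(decoys / max(targets, 1))
    for i in range(len(fdr) - 2, -1, -1):
        fdr[i] = min(fdr[i], fdr[i + 1])
    return fdr
```

A PSM with a normalized score of 1.0 sits exactly at the cutoff for its peptide length and search engine, which is what makes Mascot and Andromeda scores comparable on one scale.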
PCMs for each LC-MS/MS run were then sorted in decreasing order by their normalized Mascot or Andromeda scores. Empirical q-values were calculated by traversing the list from top to bottom and dividing the cumulative number of decoys by the cumulative number of targets. To ensure monotonicity, a second traversal from bottom to top replaced each empirical q-value from the first traversal with the minimum q-value observed so far. Next, the relationship between logarithmic q-values and normalized scores was modeled by a linear regression using the highest and lowest scoring PCMs with an empirical q-value below 0.01 as fulcrums. Then, all q-values were recalculated from the slope (a) and intercept (b) of the fitted model, −log10(q-value) = a × normalized score + b, that is, by multiplying the normalized score by a and adding b. Last, the resulting list of PCMs was filtered at 1% FDR. Peptides matching either one particular protein isoform (protein unique) or multiple protein isoforms originating from the same gene (gene unique) are classified as unique peptides. All other peptides are classified as shared (supplemental Fig. S1). Shared peptides were discarded from protein inference. For the purpose of this study, no distinction is made between the identification of a specific protein isoform and the identification of at least one protein isoform of a gene. For data presented in Fig. 1A, protein scores were calculated as the sum of Mascot ion scores of the best scoring peptide matches below 1% PSM FDR. For all other analyses, protein scores were