Article Open access Peer-reviewed

ModifiComb, a New Proteomic Tool for Mapping Substoichiometric Post-translational Modifications, Finding Novel Types of Modifications, and Fingerprinting Complex Protein Mixtures

2006; Elsevier BV; Volume: 5; Issue: 5 Language: English

10.1074/mcp.t500034-mcp200

ISSN

1535-9484

Authors

Mikhail M. Savitski, Michael L. Nielsen, Roman A. Zubarev

Topic(s)

Antimicrobial Peptides and Activities

Abstract

A major challenge in proteomics is to fully identify and characterize the post-translational modification (PTM) patterns present at any given time in cells, tissues, and organisms. Here we present a fast and reliable method ("ModifiComb") for mapping hundreds of types of PTMs at a time, including novel and unexpected PTMs. The high mass accuracy of Fourier transform mass spectrometry provides in many cases a unique elemental composition of the PTM through the difference ΔM between the molecular masses of the modified and unmodified peptides, whereas the retention time difference ΔRT between their elution in reversed-phase liquid chromatography provides an additional dimension for PTM identification. Abundant sequence information obtained with complementary fragmentation techniques using ion-neutral collisions and electron capture often locates the modification to a single residue. The (ΔM, ΔRT) maps are representative of the proteome and its overall modification state and may be used for database-independent organism identification, comparative proteomic studies, and biomarker discovery. Examples of newly found modifications include +12.000 Da (+C atom) incorporation into proline residues of peptides from proline-rich proteins found in human saliva. This modification is hypothesized to increase the known activity of the peptide.

Post-translational modifications (PTMs) are key regulators of protein function, localization, and interactions taking place inside the cell (1 [Mann M., Jensen O.N. Proteomic analysis of post-translational modifications. Nat. Biotechnol. 2003; 21: 255-261], 2 [Jensen O.N. Modification-specific proteomics: characterization of post-translational modifications by mass spectrometry. Curr. Opin. Chem. Biol. 2004; 8: 33-41]). PTMs are also required for proper folding of the protein. A major challenge in proteomics is therefore to fully identify and characterize the PTM patterns present at any given time in cells, tissues, and organisms (1, 2). Hundreds of modification sites can be identified in a single MS experiment, yielding valuable information on cellular processes (3 [Gruhler A., Olsen J.V., Mohammed S., Mortensen P., Faergeman N.J., Mann M., Jensen O.N. Quantitative phosphoproteomics applied to the yeast pheromone signaling pathway. Mol. Cell. Proteomics. 2005; 4: 310-327], 4 [Beausoleil S.A., Jedrychowski M., Schwartz D., Elias J.E., Villen J., Li J.X., Cohn M.A., Cantley L.C., Gygi S.P. Large-scale characterization of HeLa cell nuclear phosphoproteins. Proc. Natl. Acad. Sci. U. S. A. 2004; 101: 12130-12135]).

The abbreviations used are: PTM, post-translational modification; CAD, collisionally activated dissociation; ECD, electron capture dissociation; RT, retention time; ΔRT, retention time difference; ΔM, mass difference; PRP, proline-rich protein.

The main MS tool in PTM detection is tandem mass spectrometry combined with a database search engine, such as Sequest (5 [Eng J.K., McCormack A.L., Yates J.R. An approach to correlate tandem mass-spectral data of peptides with amino-acid-sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994; 5: 976-989]) or Mascot (6 [Perkins D.N., Pappin D.J., Creasy D.M., Cottrell J.S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999; 20: 3551-3567]). Although modern search engine-based proteomic approaches have been highly successful, they possess a number of significant drawbacks. Before the search, the operator specifies all expected modifications, often marked as "variable," i.e. not necessarily present.
Because the engine considers all peptide sequences with and without variable modifications, including all possible combinations of modifications, database searches with several variable modifications often take much longer than the acquisition of the experimental data set itself, creating a bottleneck in high throughput analysis. Allowing for many modifications in the database search also increases the rate of false positives (7 [Ong S.E., Mittler G., Mann M. Identifying and quantifying in vivo methylation sites by heavy methyl SILAC. Nat. Methods. 2004; 1: 119-126]). Eliminating these requires a much higher score threshold for peptide identification, which in turn increases the number of false negative results (proteins that are present but missed) (7). To reduce the analysis time and the false positive and negative rates, a typical database search focuses on a few types of modifications, far fewer than the broad variety that can potentially be present in the sample.

A database search strategy is limited by nature, and although major improvements have been made over the past couple of years (8 [Shevchenko A., Loboda A., Ens W., Standing K.G. MALDI quadrupole time-of-flight mass spectrometry: a powerful tool for proteomic research. Anal. Chem. 2000; 72: 2132-2141]), most of the acquired tandem mass spectra remain unidentified through these searches. In a typical LC/MS proteomic-type analysis, the identification success rate usually varies between 5 and 15% (9 [Keller A., Nesvizhskii A.I., Kolker E., Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 2002; 74: 5383-5392]).
Even with FTMS, which provides ppm mass accuracy and can use two complementary fragmentation techniques (collisionally activated dissociation (CAD) and electron capture dissociation (ECD) (10 [Zubarev R.A., Kelleher N.L., McLafferty F.W. Electron capture dissociation of multiply charged protein cations. A nonergodic process. J. Am. Chem. Soc. 1998; 120: 3265-3266])), no more than 30% of MS/MS datasets produce positive identifications (11 [Savitski M.M., Nielsen M.L., Zubarev R.A. New data base-independent, sequence tag-based scoring of peptide MS/MS data validates Mowse scores, recovers below threshold data, singles out modified peptides, and assesses the quality of MS/MS techniques. Mol. Cell. Proteomics. 2005; 4: 1180-1188]).

Part of the unidentified mass spectra may be due to unexpected modifications. Fig. 1 shows an example of an endogenous peptide from a human saliva sample whose sequence was suggested by Mascot as peptide WAPGGQQSSQ from an unnamed human protein. Although five identified fragments deviated from their theoretical values by less than 11 mDa (Fig. 1, C and D) and the data quality was good (the S-score value (11) was four, well above the threshold value of two), the dataset received a Mascot score (M-score) of 18, below the threshold value of 41. A database search using several common variable modifications did not provide a better answer.
Subsequently a ModifiComb search (see below) identified the peptide as a modified version of another peptide that eluted from the nano-LC column some 9 min earlier and was 12.000 Da lighter (the survey spectrum integrated over the 9-min time interval is depicted in Fig. 1A). Mascot identified this lighter peptide as GPPQQGGHQQ (Fig. 1B) with an M-score of 47. Note that the masses of all 12 identified fragments were internally consistent, with the experimentally measured masses deviating from the theoretical values by less than 5 mDa and with the deviation changing linearly with the fragment mass (Fig. 1B, inset). The 12.000-Da shift was observed in the y8 and y9 fragments as well as in all b fragments. Accurate mass analysis of the mass difference (109.055 Da) between the y8 and y7 fragments revealed the unique elemental composition of the third amino acid (C6H7NO), only 2 mDa away from the theoretical mass (109.053 Da) of the modified proline residue that has the same elemental composition. Thus the identity of the +12.000-Da modified proline was additionally confirmed. Such a proline modification is not reported for humans (12 [Creasy D.M., Cottrell J.S. Unimod: protein modifications for mass spectrometry. Proteomics. 2004; 4: 1534-1536]), although analogues of it can be found in the literature (see below). After the insertion of this modification into the Mascot search as a user-defined modification, the modification position was confirmed with an M-score of 48 and a nearly perfect fit of 12 fragment masses (Fig. 1E).

This example is typical, and it demonstrates both the problem and the solution. The problem is the presence of modified peptides, sometimes of an unknown type. The solution can be to utilize the fact that peptides within the same LC/MS run may be correlated.
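The accurate-mass check on the modified proline residue described above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not part of ModifiComb: the helper `mono_mass` and the variable names are illustrative assumptions, and the isotope masses are standard monoisotopic values rounded to six decimals.

```python
# Monoisotopic masses (Da) of the most abundant isotopes, rounded to 6 decimals.
MONO = {"C": 12.000000, "H": 1.007825, "N": 14.003074, "O": 15.994915}


def mono_mass(composition):
    """Monoisotopic mass of an elemental composition given as {element: count}."""
    return sum(MONO[element] * count for element, count in composition.items())


# Residue compositions: proline (C5H7NO) and the modified proline (C6H7NO).
proline = {"C": 5, "H": 7, "N": 1, "O": 1}
modified_proline = {"C": 6, "H": 7, "N": 1, "O": 1}

observed = 109.055  # y8 - y7 mass difference quoted in the text (Da)
theoretical = mono_mass(modified_proline)
shift = theoretical - mono_mass(proline)
deviation_mda = (observed - theoretical) * 1000.0

print(f"theoretical residue mass: {theoretical:.3f} Da")
print(f"shift relative to proline: {shift:.3f} Da")
print(f"observed - theoretical: {deviation_mda:.1f} mDa")
```

With these inputs the theoretical residue mass rounds to 109.053 Da and the observed value lies about 2 mDa above it, in line with the figures quoted in the text.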
Because peptides in the same LC/MS run originate from the same protein mixture (peptide separation prior to LC/MS is not assumed), heterogeneity of PTMs and mutation sites results in the presence within the sample of several closely related variants of the same peptide, a kind of peptide family. Most PTMs are present in substoichiometric amounts; therefore, each family includes one "base" (unmodified) peptide and one or several "dependent" peptides (modified and/or mutated). Many of the dependent peptides may be expected to elute within a limited time window before or after the base peptide.

Here we report on a software tool ("ModifiComb") that searches for such peptide families and reveals the PTM and mutation patterns of complex peptide mixtures. The tool "combs out" from large data arrays pairs of peptides with strong sequence similarities, one of which is a base peptide and the other of which is a dependent peptide (Fig. 2). The base peptide is usually identified either via de novo sequencing or database searching, whereas the dependent peptide should not give a database identification unless variable modifications are included. Identification of the base peptide is not critical for the analysis: based on sequence similarity, peptide pairs can be found in a "blind" search without knowing which peptide in the pair is the base one.

Comparing the molecular masses of the peptides within each pair, the program builds a ΔM histogram of the differences between them. This ΔM histogram, built for one LC/MS run or several related runs, represents the overall pattern of all mutations and PTMs present in the corresponding sample. For ΔM values below 100 Da, the mDa mass accuracy of FTMS reveals the corresponding elemental composition of the modification; for ΔM values above 100 Da, the high mass accuracy limits the number of possible elemental compositions. Inspection of the MS/MS data can often reveal the position of the modification.
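To make the ΔM histogram concrete, the sketch below bins the mass differences of already matched (base, dependent) peptide pairs. It is an illustration under stated assumptions rather than the ModifiComb implementation: the function name `delta_m_histogram`, the 0.01-Da bin width, and the example masses are all hypothetical.

```python
from collections import Counter


def delta_m_histogram(pairs, bin_width=0.01):
    """Histogram of dM = dependent mass - base mass over matched peptide pairs.

    pairs: iterable of (base_mass, dependent_mass) in Da.
    bin_width: histogram resolution in Da (an assumed value, not from the paper).
    Returns {bin centre in Da: number of pairs}, ordered by increasing dM.
    """
    counts = Counter()
    for base_mass, dependent_mass in pairs:
        delta_m = dependent_mass - base_mass
        counts[round(delta_m / bin_width)] += 1
    return {round(index * bin_width, 6): n for index, n in sorted(counts.items())}


# Hypothetical molecular masses (Da): two pairs differ by +12.000 Da,
# one pair by +15.9949 Da (the mass of one oxygen atom).
example_pairs = [
    (1000.5000, 1012.5000),
    (1500.7000, 1512.7000),
    (1200.6000, 1216.5949),
]

histogram = delta_m_histogram(example_pairs)
for delta_m, n in histogram.items():
    print(f"dM = {delta_m:+.2f} Da: {n} pair(s)")
```

Peaks in such a histogram correspond to modification or mutation types that recur across many peptide families in the run.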
Once all the base peptides are identified, the ΔM histogram takes seconds to build. This identification takes seconds if de novo sequencing is used or minutes if a database search is used (the search is typically run without variable modifications or with a few obvious ones, such as oxidation of methionine). Thus the overall data analysis using the ΔM histogram is much faster than the data acquisition, removing one of the throughput bottlenecks. The difference in the retention times, ΔRT, between the dependent and base peptides is used as complementary information, although the intrinsic resolution, precision, and accuracy of RT measurements are much lower than those of the mass measurements. The (ΔM, ΔRT) pair provides a two-dimensional map of the PTMs and mutations present.

Several earlier analogues of the ModifiComb approach can be found in the literature. Recently, approaches have been developed to minimize the computational cost of complete PTM identification by applying database filters (13 [Tanner S., Shu H.J., Frank A., Wang L.C., Zandi E., Mumby M., Pevzner P.A., Bafna V. InsPecT: identification of posttranslationally modified peptides from tandem mass spectra. Anal. Chem. 2005; 77: 4626-4639], 14 [Craig R., Beavis R.C. A method for reducing the time required to match protein sequences with tandem mass spectra. Rapid Commun. Mass Spectrom. 2003; 17: 2310-2316]). The filters are based on peptide sequence tags (15 [Mann M., Wilm M. Error tolerant identification of peptides in sequence databases by peptide sequence tags. Anal. Chem. 1994; 66: 4390-4399]) extracted from the acquired MS/MS data. The tags reduce the database to a much smaller set of sequence candidates that can be searched with multiple variable modifications in a reasonable time.
This approach is still largely limited to known protein sequences and modifications and can miss modifications if they occur inside the sequence tag. Additionally this approach is firmly database-oriented, which makes it rather slow and sensitive to the sequence errors that are present in all databases. Finally this approach does not produce sample-specific fingerprint patterns.

An ideologically similar strategy has been described by Zhang et al. (16 [Zhang N., Li X.-j., Ye M., Pan S., Schwikowski B., Aebersold R. ProbIDtree: an automated software program capable of identifying multiple peptides from a single collision-induced dissociation spectrum collected by a tandem mass spectrometer. Proteomics. 2005; 5: 4096-4106]) for a low resolution ion trap. However, that approach only worked on mixtures containing a few proteins.

Recently Tsur et al. (17 [Tsur D., Tanner S., Zandi E., Bafna V., Pevzner P.A. Identification of post-translational modifications by blind search of mass spectra. Nat. Biotechnol. 2005; 23: 1562-1567]) described MS-Alignment, a software tool for a blind PTM search in large MS/MS datasets. Although it uses an impressively sophisticated alignment algorithm, MS-Alignment has a number of limitations. The integer ΔM values that the algorithm uses mask the underlying complexity of modifications (e.g. modifications with elemental compositions CO (formylation), N2, and C2H4 have the same integer mass of 28). Furthermore MS-Alignment processes CAD-only datasets, and analysis speed requirements limit the ΔM region (−100 to +160 Da in Ref. 17). In contrast, ModifiComb uses accurate mass data, has no limit on ΔM values, and uses combined ECD/CAD datasets.
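The ambiguity of integer ΔM values mentioned above is easy to verify numerically. The sketch below, written for illustration only (the helper `mono_mass` and the variable names are assumptions), computes the monoisotopic masses of the three compositions named in the text and the smallest spacing between them.

```python
# Monoisotopic masses (Da) of the most abundant isotopes, rounded to 6 decimals.
MONO = {"C": 12.000000, "H": 1.007825, "N": 14.003074, "O": 15.994915}


def mono_mass(composition):
    """Monoisotopic mass of an elemental composition given as {element: count}."""
    return sum(MONO[element] * count for element, count in composition.items())


# The three compositions from the text that share the integer mass 28.
candidates = {
    "CO": {"C": 1, "O": 1},
    "N2": {"N": 2},
    "C2H4": {"C": 2, "H": 4},
}

masses = {name: mono_mass(comp) for name, comp in candidates.items()}
for name, mass in sorted(masses.items(), key=lambda item: item[1]):
    print(f"{name:5s} nominal {round(mass)}  exact {mass:.4f} Da")

# Smallest spacing between neighbouring exact masses, in mDa.
ordered = sorted(masses.values())
min_gap_mda = min(b - a for a, b in zip(ordered, ordered[1:])) * 1000.0
print(f"smallest gap: {min_gap_mda:.1f} mDa")
```

The closest two candidates (CO and N2) are separated by roughly 11 mDa, so they are indistinguishable at integer resolution but resolvable when ΔM is measured with mDa accuracy.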
As already mentioned, ModifiComb also makes use of the retention time differences, i.e. uses both dimensions of LC/MS separation, which gives it very high specificity. For instance, a dependent peptide with ΔM = +0.977 Da and a small positive ΔRT is surely a deamidated version of the base peptide, whereas ΔM = 1.003 Da and a large ΔRT is likely due to a monoisotopic mass misassignment in one of the peptides.

There is one more significant difference between ModifiComb and other algorithms. The usual approach to reducing the search space for modifications is to identify first the set of proteins present in the sample and then search for PTMs in that small database. ModifiComb goes one step further and searches for PTMs only on identified peptides, further reducing the search space by an order of magnitude (on average ≈5 peptides per protein are identified in our analysis (18 [Nielsen M.L., Savitski M.M., Zubarev R.A. Improving protein identification using complementary fragmentation techniques in Fourier transform mass spectrometry. Mol. Cell. Proteomics. 2005; 4: 835-845]), whereas an average protein produces 50 tryptic peptides). The search space reduction diminishes the probability of false positive PTM identification and obviates the development and validation of a special scoring algorithm (see below). The explicit requirement in the ModifiComb non-blind search for the unmodified peptide to be present is a limitation, but not an overly narrow one, as most PTMs appear in substoichiometric proportions.

In the current work, we tested ModifiComb and built ΔM and ΔRT histograms and (ΔM, ΔRT) maps for several biological samples. Sensitivity, specificity, and repeatability of the approach were evaluated.
Because the ability of the program to find new and unexpected modifications by far exceeds our current capacity to characterize them, our goal here was not to report all findings, and we limited the current report to the demonstration and validation of the ModifiComb operation. Several examples of new modifications and sample fingerprinting (M-fingerprinting) are provided as an illustration, and their potential biological importance is discussed.

Whole human saliva was obtained from a healthy 32-year-old non-smoking Caucasian male taking no medications and with no overt signs of gingivitis or caries. The mouth of the subject was rinsed with water, and the sample was collected 3 h after food intake. To minimize degradation, the sample was collected on ice and kept constantly at 4 °C during the entire sample preparation procedure. A total of 5 ml of saliva was collected, of which 2 ml was clarified by centrifugation at 12,000 × g for 10 min, thereby removing debris and cells. The obtained supernatant was loaded onto four 10-kDa mass cutoff filters (Microcon YM-10, 500 μl on each) and centrifuged at 14,000 × g for 30 min. In doing so, the endogenous peptides were immediately separated from most of the proteases present in saliva, again minimizing the possibility of further peptide/protein degradation. For the final step of peptide isolation, the flow-through fractions ( […]

[…] 34) base peptide sequences is created for each sample. Another list is created for dependent peptides that were not identified by Mascot (or received a below threshold score). The dependent peptide fragment lists are then compared with those of base peptides, and for each peptide pair, the molecular mass difference ΔM and retention time difference ΔRT are calculated.
The ΔRT value is currently approximated by the difference in the scan numbers of the dependent and base peptides (any given scan duration is a function of the given ion abundance, but the average value of the scan duration fluctuates insignificantly during the peptide elution time). The algorithm determines that the pair "matches" if a certain predefined number (usually four) of fragments of the dependent peptide either coincide within the given mass accuracy with the observed fragments of the base peptide or have masses shifted by ΔM relative to them. The "matched" peptide pair is reported in the output file. Simultaneously their ΔM and ΔRT data are added to the one-dimensional ΔM and ΔRT histograms and to the two-dimensional (ΔM, ΔRT) map.

ModifiComb has two regimes: blind and "open eyed." In the latter regime, the base peptides are identified either through the Mascot search or de novo sequencing. In the blind regime, the base peptide remains unidentified. Below a detailed description is given for the open eyed regime in the case of base peptide identification by Mascot (the procedure for de novo sequencing is easily extrapolated).

First an initial search is performed with no variable modifications except oxidation of methionine for each dta file containing extracted consensus information from ECD and CAD MS/MS (18). The output contains the received Mascot score M (M ≥ 0 if Mascot suggested a sequence, and M = 0 if Mascot did not make any suggestion) and the corresponding Mascot-suggested sequence for M > 0. The user defines three parameters.
1. The M-score threshold M1 above which all suggested sequences are accepted as trustworthy.
2. The M-score threshold M2 below which all suggested sequences are deemed wrong (because a certain minimal S-score is implicitly required for every tandem MS spectrum ModifiComb considers (see requirement 3), this assumption is correct with a high probability).
3. The minimum number n of common cleavage sites present when comparing two different peptides. The definition of a common cleavage is given below.

All dta files with M > M1 are considered to belong to base peptides (A), whereas those with M < M2 are viewed as belonging to potential dependent peptides (B). Each possible pair of A and B peptides is considered. For each compared pair, n is calculated in the following way. First ΔM is determined as the difference between the molecular masses of the B and A peptides. ΔM is then considered as the mass of the potential modification (thus this approach intrinsically favors single modifications), which can assume a positive as well as a negative value. The Mascot-suggested peptide sequence for A is used to generate a list of b ion masses [b1,…, bL] and y ion masses [y1,…, yL], where L + 1 is the length of the sequence. The masses [m1,…, mk] in the dta file are already tagged with the likely type of the ion they represent: y, b, or by (either b or y ion) (18). The masses tagged y and b are compared with the theoretical sets [y1,…, yL] and [b1,…
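The partition into base and dependent peptides by M1 and M2, followed by the pairwise fragment comparison, can be sketched as follows. This is a simplified illustration and not the authors' code: it counts matching observed fragment masses rather than common cleavage sites derived from theoretical b and y ions, and the record layout, the function names, the 0.01-Da tolerance, the M1 = M2 = 41 setting, and the example spectra are all assumptions.

```python
def count_matches(base_fragments, dep_fragments, delta_m, tol=0.01):
    """Count dependent-peptide fragments that coincide with a base-peptide
    fragment either unshifted or shifted by delta_m (tolerance tol, in Da)."""
    matches = 0
    for m in dep_fragments:
        if any(abs(m - b) <= tol or abs(m - (b + delta_m)) <= tol
               for b in base_fragments):
            matches += 1
    return matches


def find_pairs(spectra, m1, m2, n_min=4, tol=0.01):
    """Pair every base peptide (M-score > m1) with every potential dependent
    peptide (M-score < m2); keep pairs with at least n_min matching fragments."""
    bases = [s for s in spectra if s["mscore"] > m1]
    dependents = [s for s in spectra if s["mscore"] < m2]
    pairs = []
    for a in bases:
        for b in dependents:
            delta_m = b["mass"] - a["mass"]  # mass of the putative modification
            n = count_matches(a["fragments"], b["fragments"], delta_m, tol)
            if n >= n_min:
                pairs.append({
                    "base": a["id"],
                    "dependent": b["id"],
                    "delta_m": delta_m,
                    "delta_rt": b["scan"] - a["scan"],  # dRT in scan numbers
                    "n": n,
                })
    return pairs


# Hypothetical spectra: the dependent peptide is 12.000 Da heavier and
# three of its five fragments carry the same shift.
spectra = [
    {"id": "A", "mscore": 47, "mass": 1000.5, "scan": 100,
     "fragments": [200.1, 300.2, 400.3, 500.4, 600.5]},
    {"id": "B", "mscore": 18, "mass": 1012.5, "scan": 190,
     "fragments": [200.1, 300.2, 412.3, 512.4, 612.5]},
]

matched = find_pairs(spectra, m1=41, m2=41)
for pair in matched:
    print(pair)
```

Each reported pair carries the ΔM and ΔRT values that would be added to the one-dimensional histograms and to the two-dimensional (ΔM, ΔRT) map.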

Reference(s)