Experimental Peptide Identification Repository (EPIR)
2004; Elsevier BV; Volume: 3; Issue: 10; Language: English
10.1074/mcp.t400004-mcp200
ISSN 1535-9484
Authors: Dan Bach Kristensen, Jan Christian Brønd, Peter Aagaard Nielsen, Jens R. Andersen, Ole Tang Sørensen, Vibeke Jørgensen, Kenneth Budin, Jesper Matthiesen, Peter Venø, Hans M. Jespersen, Christian H. Ahrens, Soeren Schandorff, Peder Thusgaard Ruhoff, Jacek R. Wiśniewski, Keiryn L. Bennett, Alexandre V. Podtelejnikov
Topic(s): Machine Learning in Bioinformatics
Abstract: LC MS/MS has become an established technology in proteomic studies, and with the maturation of the technology the bottleneck has shifted from data generation to data validation and mining. To address this bottleneck we developed the Experimental Peptide Identification Repository (EPIR), an integrated software platform for storage, validation, and mining of LC MS/MS-derived peptide evidence. EPIR is a cumulative data repository where precursor ions are linked to peptide assignments and protein associations returned by a search engine (e.g. Mascot, Sequest, or PepSea). Any number of datasets can be parsed into EPIR and subsequently validated and mined using a set of software modules that overlay the database. These include a peptide validation module, a protein grouping module, a generic module for extracting quantitative data, a comparative module, and additional modules for extracting statistical information. In the present study, the utility of EPIR and associated software tools is demonstrated on LC MS/MS data derived from a set of model proteins and complex protein mixtures derived from MCF-7 breast cancer cells. Emphasis is placed on the key strengths of EPIR, including the ability to validate and mine multiple combined datasets, and presentation of protein-level evidence in concise, nonredundant protein groups that are based on shared peptide evidence.
LC MS/MS has become a well-established technology for large-scale protein characterization in proteomic research (1. Aebersold R., Mann M. Mass spectrometry-based proteomics. Nature. 2003; 422: 198-207). Briefly, proteins in the sample are digested with an enzyme, typically trypsin, because the tryptic peptides are more compatible with MS/MS analysis. The mass spectrometer is coupled to a reverse-phase LC unit, which reduces sample complexity and increases concentration of the peptides during MS acquisition. Throughout the LC MS/MS analysis, peptides are isolated and fragmented by CID to generate sequence-dependent MS/MS information, and finally the data is matched against a sequence database using a search engine. In a typical LC MS/MS acquisition, hundreds to thousands of precursor ions are subjected to MS/MS. As LC MS/MS can now be performed in a fully automated fashion, a new challenge faces investigators: the ability to generate LC MS/MS data outpaces the ability to analyze it (2. Patterson S.D. Data analysis—The Achilles heel of proteomics. Nat. Biotechnol. 2003; 21: 221-222).
One of the initial challenges when analyzing LC MS/MS data is the assignment of peptides to precursor ions. Currently, this is typically achieved by statistical algorithms that match a theoretical peak list with the measured peak list; these include the cross-correlative Sequest algorithm (3. Eng J.K., McCormack A.L., Yates J.R. 3rd. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994; 5: 976-989) and probability-based algorithms such as Mascot (4. Perkins D.N., Pappin D.J., Creasy D.M., Cottrell J.S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999; 20: 3551-3567). The number of incorrect peptide assignments made using probabilistic or cross-correlative algorithms alone can become an issue, as different peptides may have overlapping or even identical fragmentation patterns (e.g. Leu/Ile substitutions). This issue is particularly relevant for large LC MS/MS datasets and/or when a high sensitivity (i.e. the true positive rate) is required, e.g. in target discovery projects (5. Keller A., Purvine S., Nesvizhskii A.I., Stolyar S., Goodlett D.R., Kolker E. Experimental protein mixture for validating tandem mass spectral analysis. Omics. 2002; 6: 207-212). The end result is that a substantial amount of time and resources is required for manual validation. The sensitivity of the peptide assignment can be improved at different levels, including: additional processing of the MS/MS data (6. Gentzel M., Kocher T., Ponnusamy S., Wilm M. Preprocessing of tandem mass spectrometric data to support automatic protein identification. Proteomics. 2003; 3: 1597-1610); improved charge-state determination (7. Sadygov R.G., Eng J., Durr E., Saraf A., McDonald H., MacCoss M.J., Yates J.R. 3rd. Code developments to improve the efficiency of automated MS/MS spectra interpretation. J. Proteome Res.
2002; 1: 211-215; 8. Colinge J., Magnin J., Dessingy T., Giron M., Masselot A. Improved peptide charge state assignment. Proteomics. 2003; 3: 1434-1440); removal of low-quality MS/MS data (9. Moore R.E., Young M.K., Lee T.D. Method for screening peptide fragment ion mass spectra prior to database searching. J. Am. Soc. Mass Spectrom. 2000; 11: 422-426); and clustering of redundant spectra (10. Tabb D.L., MacCoss M.J., Wu C.C., Anderson S.D., Yates J.R. 3rd. Similarity among tandem mass spectra from proteomic experiments: Detection, significance, and utility. Anal. Chem. 2003; 75: 2470-2477). Alternatively, a higher peptide assignment sensitivity can be achieved using sophisticated scoring schemes that exploit empirical information derived from MS/MS data and the search results. For instance, this could be the presence of consecutive fragment ions (i.e. sequence tag-like information); specific fragmentation signatures (e.g. a relatively intense proline ion); and the number of sibling peptides (NSP) (11. Nesvizhskii A.I., Keller A., Kolker E., Aebersold R. A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem. 2003; 75: 4646-4658). (Footnote 1: The abbreviations used are: NSP, number of sibling peptides; EPIR, Experimental Peptide Identification Repository; PPI, precursor peak intensity; IDA, information-dependent acquisition.) For example, Colinge et al. introduced a new probabilistic scoring scheme termed OLAV (12. Colinge J., Masselot A., Giron M., Dessingy T., Magnin J. OLAV: Towards high-throughput tandem mass spectrometry data identification. Proteomics. 2003; 3: 1454-1463) that exploits structural information in the MS/MS data to assign peptides. Another example is the SALSA algorithm, which seeks specific sequence-dependent features in MS/MS spectra (13. Liebler D.C., Hansen B.T., Davey S.W., Tiscareno L., Mason D.E.
Peptide sequence motif analysis of tandem MS data with the SALSA algorithm. Anal. Chem. 2002; 74: 203-210). SALSA scores peptides based on how well the theoretical ion series for peptide sequence motifs correspond with the actual MS/MS product ion series, regardless of absolute position on the m/z-axis. The approach can be used in the identification of both unmodified and modified peptides (e.g. post-translationally or genetically). In the present study, we address the peptide assignment issue by exploiting various empirical parameters when validating the assignments returned by the Mascot search engine.
Once peptides have been assigned to the precursor ions, the next step is presentation of the protein evidence. This is a challenging task due to the degenerate nature of peptides, i.e. the same peptide can be derived from more than one protein entry. This redundancy may be derived from, e.g. homologous proteins or protein splice variants, or the database itself may be redundant. In many cases the MS/MS evidence therefore points toward a group of proteins rather than a single protein, and it may be impossible to determine which group members are present in the actual biological sample on the basis of the MS/MS evidence alone. Consequently caution should be taken when ranking protein hits, e.g. as seen in the result summary returned by Mascot (4. Perkins D.N., Pappin D.J., Creasy D.M., Cottrell J.S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999; 20: 3551-3567). Nesvizhskii et al. addressed this issue by designing a statistical model for identifying proteins by LC MS/MS (11. Nesvizhskii A.I., Keller A., Kolker E., Aebersold R. A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem. 2003; 75: 4646-4658). Redundant protein identifications (i.e.
assignments that cannot be distinguished by the MS/MS evidence) were collapsed into a single identification, and a minimal protein list was generated using an expectation-maximization algorithm.
A major challenge in the analysis of proteomes is to maximize the information that can be extracted from a biological sample. There are several approaches, such as two-dimensional LC MS/MS, multi-step fractionation, and multiple analysis of the same sample using an exclusion list approach. The challenge remains, however, in how to deal with the huge amounts of data generated from proteomic analysis of complex biological samples, both in terms of database searches and data validation/mining.
All these issues were central in the development of the new generic software platform presented here. A peptide-centric relational database (Experimental Peptide Identification Repository, or EPIR) was developed for the storage, validation, and mining of LC MS/MS data. EPIR is a data storage area for all precursor ions to which peptides have been assigned by a given search engine. At the same time, EPIR is cumulative, meaning that any number of datasets can be parsed into EPIR at any given time, and subsequently validated/mined as a single combined dataset. A set of software modules has been developed to automatically validate and mine datasets stored in EPIR. For instance, one module collapses proteins into groups on the basis of shared peptides. Protein evidence is thus presented in concise protein groups rather than as a ranked list of proteins, and this significantly reduces the complexity of the result summary. All proteins with conclusive, unambiguous MS/MS evidence are automatically highlighted within the group. Using a validation module, peptide assignments returned by the search engine (e.g.
Mascot) are automatically validated or reassigned within EPIR on the basis of different empirical parameters, including the presence of consecutive y/b-ions, the relative intensity of proline fragment ions (14. Kapp E.A., Schutz F., Reid G.E., Eddes J.S., Moritz R.L., O’Hair R.A., Speed T.P., Simpson R.J. Mining a tandem mass spectrometry database to determine the trends and global factors influencing peptide fragmentation. Anal. Chem. 2003; 75: 6251-6264), and the NSP. This functionality greatly enhances the ability to validate peptide assignments from large datasets in an automatic fashion. A generic quantitative module compatible with non-coeluting labels has also been developed to extract quantitative information from any type of differential experiment, regardless of the labeling method used (chemical or metabolic). Statistical modules were developed to extract information related to the quality of the datasets stored in EPIR. A key feature of the system is that no evidence is lost during data validation and mining, because the core data (a list of precursor ions with all potential peptide identifications and protein associations) remains unaffected at all times. This is because the data validation and mining process simply provides a means of filtering and organizing the core data, with the aim of addressing specific biological or analytical questions.
In the present study, the utility of EPIR and associated modules is demonstrated on LC MS/MS datasets generated on a Q-TOF mass spectrometer.
A mixture of six model proteins (bovine albumin, rabbit aldolase, horse ferritin, chicken ovalbumin, bovine ribonuclease A, and bovine thyroglobulin, all from Amersham Biosciences, Uppsala, Sweden) was digested with Lys-C endopeptidase (Achromobacter lyticus; Wako Pure Chemicals, Osaka, Japan) in 4 M urea and 10 mM Tris·HCl pH 8.5 for 4 h at room temperature.
The proteins were reduced with 10 mM DTT for 30 min at 37 °C, alkylated with 50 mM iodoacetamide for 30 min at room temperature, diluted 4-fold in 100 mM NH4HCO3, and digested overnight at 37 °C with trypsin (sequencing grade; Promega, Madison, WI). LC MS/MS (one standard and two exclusion list analyses) was performed as described below using 2 μl of the sample (500 fmol of each protein).
The human breast carcinoma MCF-7 cell line was cultured in Dulbecco’s modified Eagle’s medium supplemented with 10% FCS, 1% penicillin/streptomycin, 0.01 mg/ml insulin, 1.5 g/liter sodium bicarbonate, and nonessential amino acids. The cells were maintained at 37 °C in a humidified atmosphere of 95% air and 5% CO2. For isotopic labeling, the cells were grown for at least six cell divisions in medium deficient in L-leucine supplemented with 10% double dialyzed FCS (Hyclone, Logan, UT) and 52 mg/ml normal L-leucine (LeuD0) or [5,5,5-D3]-L-leucine (LeuD3) from Sigma-Aldrich (St. Louis, MO).
A total of 5 × 10^8 cells were homogenized in 10 ml of GB buffer containing 0.25 M sucrose, 10 mM HEPES·NaOH, 2 mM CaCl2, 2 mM MgCl2, 1 mM AEBSF hydrochloride, 1 mM EDTA, 20 μM leupeptin hemisulfate, 150 μM aprotinin, pH 7.4 (buffer A) using a motor-driven Potter homogenizer (B. Braun Biotech, Allentown, PA). The homogenate was centrifuged at 1,000 × g for 10 min, and the supernatant collected. Homogenization and centrifugation were repeated. The post-nuclear supernatant was centrifuged at 50,000 × g for 30 min. The resultant pellet containing crude membranes (P2) was resuspended in 4 ml of GB buffer and mixed with 3.85 ml of 100% Percoll (Amersham Biosciences) and 0.55 ml 2 M sucrose in an 11.5-ml crimp tube (tube PA 11.5 ml; Sorvall, Asheville, NC). The tube was filled with GB buffer, capped, and centrifuged at 50,000 r.p.m. in a fixed-angle rotor T 890 (Sorvall) at 4 °C for 15 min. The gradient was fractionated from the top by the displacement method.
In order to select fractions containing enriched plasma membranes, individual fractions were assayed for γ-glutamyl transpeptidase, cytochrome c oxidase, and NADH-cytochrome c reductase as described previously (15. Olsen J.V., Andersen J.R., Nielsen P.A., Nielsen M.L., Figeys D., Mann M., Wisniewski J.R. Hystag—A novel proteomic quantification tool applied to differential display analysis of membrane proteins from distinct areas of mouse brain. Mol. Cell. Proteomics. 2004; 3: 82-92). Total protein was determined fluorometrically on solubilized and denatured proteins by measuring the fluorescence of tryptophan (excitation at 295 nm, emission at 360 nm) using tryptophanamide as a standard. Percoll was removed by centrifugation of the fractions in 1-ml 1PC tubes at 900,000 × g in Sorvall RC M150 GX using the S150AT rotor at 4 °C for 20 min. The isolated membrane fractions were washed and the proteins reduced with DTT on membrane as described previously (15. Olsen J.V., Andersen J.R., Nielsen P.A., Nielsen M.L., Figeys D., Mann M., Wisniewski J.R. Hystag—A novel proteomic quantification tool applied to differential display analysis of membrane proteins from distinct areas of mouse brain. Mol. Cell. Proteomics. 2004; 3: 82-92). Finally, the membranes were resuspended in 200 μl of 4 M urea in 0.1 M Tris·HCl, pH 8.0. Next, 20 μl of 1 M iodoacetamide was added and the mixture incubated at room temperature for 2 h. The membranes were collected by centrifugation at 900,000 × g at 4 °C for 20 min, and the pellet was resuspended in 200 μl of 4 M urea in 0.1 M Tris·HCl pH 8.0. Five micrograms of endoproteinase Lys-C were added, and the membranes were incubated overnight at room temperature. The released peptides were separated from the membranes by centrifugation. This procedure yielded 88 ± 10 μg peptide per 5 × 10^8 cells.
The Lys-C peptides were separated over a Dionex Acclaim 300 C18 3-μm column (i.d. 2.1 mm × 150 mm).
The peptides were eluted with an ACN gradient in water containing 0.1% TFA. The flow rate was 100 μl/min. Next, 200-μl fractions were collected and lyophilized. The fractionated peptides were dissolved in 20 μl of 100 mM NH4HCO3 and incubated overnight at 37 °C with 0.5 μg trypsin.
All LC MS/MS experiments were performed on a QStar Pulsar XL (MDS Sciex, Toronto, Canada) connected to an LC Packings Ultimate system equipped with a Famos autosampler and Switchos unit (LC Packings, Sunnyvale, CA). All hardware systems were controlled from the Analyst QS software (MDS Sciex). Samples were loaded onto the precolumn (4 cm × 150 μm, Zorbax SB-C18 5-μm beads) using a flow rate of 5 μl/min solvent A (0.005% heptafluorobutyric acid and 0.4% acetic acid in HPLC-grade water) using the Switchos unit. The peptides were subsequently eluted at 300 nl/min from the precolumn over the analytical column (4 cm × 75 μm, Zorbax SB-C18 3.5-μm beads) using an 80-min gradient from 10–35% solvent B (90% ACN, 0.005% heptafluorobutyric acid and 0.4% acetic acid in HPLC-grade water) delivered by the Ultimate CAP pump. The total duration of the LC run was 120 min, including sample loading and column equilibration.
The QStar XL was operated in information-dependent acquisition (IDA) mode. In MS mode, ions were screened from m/z 350–1,000, and MS/MS data were acquired from m/z 80–1,000 (QStar pulsing mode on). In standard acquisition mode, each acquisition cycle comprised a 1-s MS and a 2-s MS/MS. The MS to MS/MS switch threshold was set to 40 cps. Five exclusion list runs were performed, where all precursor ions subjected to MS/MS in the previous run(s) were excluded for 9 min using a 3-amu window. The broad exclusion window (±4.5 min) was necessary as the retention time for individual precursor ions drifted up to 4 min during the 5 days required to exhaustively analyze a single biological sample.
The exclusion list acquisition methods were generated manually by importing the precursor ion list (text file) into the Analyst method editor. In the first exclusion list analysis, the MS and MS/MS acquisition times and the MS to MS/MS switch threshold were unaltered (1 s, 2 s, and 40 cps, respectively). In the latter exclusion list analyses (runs 3–5), the MS/MS acquisition time was increased to 3 s and the MS to MS/MS switch threshold was lowered to 25 cps.
The IDA processor (Applied Biosystems, Foster City, CA) was used to generate Mascot msm files with peak lists from the Analyst wiff files. The IDA settings were as follows: default charge state was set to 2+, 3+, and 4+; MS centroid parameters were 50% height percentage and 0.05 amu merge distance; all MS/MS data were centroided, with a 50% height percentage and a merge distance of 0.05 amu. The threshold peak intensity was set to 2 cps; MS/MS averaging parameters were set to reject spectra with fewer than 5 peaks or precursor ions with less than 5 or more than 10,000 cps; the precursor mass tolerance for grouping was set to 1; and the maximum and minimum number of cycles between groups was set to 10 and 1, respectively.
MS/MS data from the standard protein sample was searched as a single merged msm file against all entries in the public NCBInr database (downloaded November 23, 2003; 1,543,949 entries in total) from the National Center for Biotechnology Information (www.ncbi.nlm.nih.gov) using the Mascot search engine (version 1.9.05; Matrix Science, London, United Kingdom). MCF-7 data were searched against human database entries only. Alkylation of cysteine residues was set as a fixed modification, and oxidation of methionine was set as a variable modification for all Mascot searches. One missed trypsin cleavage site was allowed, and the peptide MS and MS/MS tolerance was set to 0.3 and 0.13 Da, respectively. All common porcine trypsin autoproteolysis products were excluded after the data was entered into EPIR.
EPIR was implemented as a standard SQL database using MySQL version 3.2.4 running on a PC equipped with RedHat Linux 9.0. EPIR contains structured information concerning samples (name, type, LIMS link); acquisitions (filename, time); raw data preprocessing (filtering parameters, processing application); database identification parameters (MS and MS/MS identification tolerance, database name, database version, species restrictions); spectrum identification results (peptide sequence, score, delta mass, expected ions, calculated ions, retention time); and relationship to the associated proteins.
A software module has been developed to extract peptide identification information from the original Mascot result file into EPIR. The following information is extracted: search parameters; query/precursor; peptide score; suggested modifications; protein assignments; retention time; and fragment ion matches. Parsers for other result formats, e.g. Sequest (ThermoElectron, Waltham, MA) and PepSea (MDS Inc., Odense, Denmark), are under development.
The precursor peak intensity (PPI, i.e. the maximum ion count observed for a precursor ion) is extracted automatically for all suggested peptide matches. The PPI is obtained directly from the raw MS acquisition file. The window in which the PPI is extracted is 60 s pre- and 90 s post-MS/MS acquisition time for both the identified peptides and nonidentified partner ions. For nonidentified peptide partners, the PPI is extracted using the theoretical precursor ion mass predicted from the identified peptide partner. The PPI elution profile is Gaussian fitted using nonlinear least squares. If the profile exceeds a 3-min elution time, and the PPI value is not observed within the analysis window, then the profile is excluded. A 3-min elution time was chosen as the maximum because most precursor ions eluted in less than 2 min. Furthermore, quantitative data was excluded if the difference in PPI times for non-coeluting peptides exceeded 30 s.
This value was chosen because previous experiments have shown that more than 95% of light and heavy peptides have differences in PPI times that were less than 30 s (data not shown).
Besides the standard search engine results used for peptide assignment (score, expected versus calculated fragment ions, delta mass), additional empirical information is computed by the EPIR peptide validation module to assist in the assignment. Currently these include: a) the NSP for all potential peptide hits; b) the presence of consecutive y/b fragment ions; and c) a proline score for potential peptide identifications containing proline residues.
a) For each potential peptide assignment the NSP is computed. In this study, all peptide identifications returned with a Mascot score of ≥20 were included; however, the score threshold can be defined by the user. The average peptide score is calculated for each group of sibling peptides, and in cases where two or more peptide identifications have an identical NSP the identifications are ranked according to the average peptide score.
b) A synthetic structural fragment ion score is introduced for both y- and b-ion fragment series. This score reflects the presence of consecutive fragment ions and thus mimics the specificity of a sequence tag. Consecutive fragment ions separated by at least one nonmatching fragment ion are grouped into partial tags. For each partial tag, a cumulative product score, pts, is computed:
$$\mathrm{pts} = \prod_{i=0}^{n-1} \left( \frac{i}{3} + 2.0 \right)$$
where i is the consecutive matched fragment ion and n is the total number of matched fragment ions of the partial tag.
The total fragment ion score, s, is computed:
$$s = \sum_{i=0}^{n-1} \log_{10}\left( \frac{10}{p_i} \right) \left( \mathrm{pts}_i + \mathrm{pts}_{i+1} \right)$$
where i is the tag index, n is the total number of tags, and p_i is the number of nonmatching fragment ions between the partial tags.
c) In CID-based MS/MS, proline residues show a strong preference for cleavage at the N-terminal amide bond, which produces intense y-ions containing an N-terminal proline residue (14. Kapp E.A., Schutz F., Reid G.E., Eddes J.S., Moritz R.L., O’Hair R.A., Speed T.P., Simpson R.J. Mining a tandem mass spectrometry database to determine the trends and global factors influencing peptide fragmentation. Anal. Chem. 2003; 75: 6251-6264). If the suggested peptide contains at least one proline residue, a proline score, ps, is therefore computed. The score reflects the relative intensity of the y-ion containing an N-terminal proline compared with all fragment ions in the MS/MS spectrum, and is calculated as follows:
$$\mathrm{ps} = 100 - 100 \cdot \frac{\mathrm{pr} - 1}{n}$$
where pr is the intensity rank of the proline-containing y-ion and n is the total number of fragment ions in the MS/MS spectrum. A proline score of 100 therefore indicates that the proline fragment ion is the most intense peak, i.e. pr is 1.
The automatic assignment of peptides to precursor ions is initially based on the NSP information. If any of the suggested peptides have an NSP > 1, these peptides will be the only entries considered. In addition to the NSP information, the suggested peptide is removed from the potential list if the elution profile is invalid (see “Quantitation”); if the suggested number of labels does not match the number of labeling sites (isotopically labeled samples); or if the proline score is below 80 (proline-containing peptides).
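The structural fragment ion score and the proline score defined above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the EPIR implementation: the function names and the boolean match-mask input are hypothetical, and because the printed summation limits for s are ambiguous (the last term would need a tag beyond the final one), the sketch assumes the sum runs over adjacent pairs of partial tags.

```python
import math


def partial_tags(matched):
    """Split a per-position match mask (one boolean per theoretical fragment
    ion in a y- or b-series) into runs of consecutive matches.

    Returns (runs, gaps): runs[k] is the length of partial tag k, and gaps[k]
    is the number of nonmatching ions between tag k and tag k + 1.
    """
    runs, gaps = [], []
    run = gap = 0
    for hit in matched:
        if hit:
            if run == 0 and runs:  # a new tag starts after an earlier one
                gaps.append(gap)
            run += 1
        else:
            if run:  # the current tag has just ended
                runs.append(run)
                run = 0
                gap = 0
            gap += 1
    if run:
        runs.append(run)
    return runs, gaps


def partial_tag_score(n):
    """Cumulative product score pts for a partial tag of n matched ions."""
    score = 1.0
    for i in range(n):
        score *= i / 3 + 2.0
    return score


def fragment_ion_score(matched):
    """Total score s, summed over adjacent pairs of partial tags (assumption)."""
    runs, gaps = partial_tags(matched)
    pts = [partial_tag_score(n) for n in runs]
    return sum(
        (math.log10(10 / gap) * (pts[k] + pts[k + 1]) for k, gap in enumerate(gaps)),
        0.0,
    )


def proline_score(rank, n_fragments):
    """Proline score ps from the intensity rank of the proline-containing y-ion."""
    return 100.0 - 100.0 * (rank - 1) / n_fragments
```

Under the adjacent-pair assumption a spectrum with a single partial tag scores 0, which may not match the original behavior; the weighting `log10(10 / gap)` penalizes tags separated by many nonmatching ions and becomes negative for gaps larger than 10.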
After the list of potential peptides has been generated, the peptide with the highest average group score (average Mascot score for all peptides in the group) and structural score (average y-ion or b-ion scores for all peptides in the group) is selected as the correct assignment. In cases where the same peptide has been identified multiple times, the identification with the highest score will be used. In situations where two or more peptides have the same group and structural score, no peptide is assigned and the spectrum is flagged for manual inspection.
If all suggested peptides have an NSP of 1, a list of possible peptides is generated based on a valid elution profile (see “Quantitation”); correct number of labels (isotopically labeled samples); a proline score greater than 80 (proline-containing peptides); and a minimum of five consecutive fragment ions. The peptide with the highest structural and identification score will be selected. When two or more peptides have the same group and structural score, the spectrum is flagged for manual inspection.
Proteins with shared peptides are collapsed into a group and reported as a single identification, with the highest-scoring protein entry as the anchor. All information on the proteins in a group is stored in a collapsed format. Consequently no protein evidence is removed or lost. Protein groups with unambiguous protein identifications are highlighted. ClustalW was used for aligning the entries within a protein group (16. Thompson J.D., Higgins D.G., Gibson T.J. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994; 22: 4673-4680). The following ClustalW settings were used: -quicktree, -score = absolute, -output = gde, -outorder = input, -case = upper, -GAPOPEN = 200, and -GAPEXT = 100.
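The protein grouping step described above, in which proteins that share peptides are collapsed into one group with the highest-scoring entry as anchor, can be illustrated with a small union-find sketch. The details here are assumptions made for illustration only: groups are taken to be connected components of the shared-peptide relation, a protein's score is taken to be the sum of its peptide scores, and a protein is called unambiguous when at least one peptide maps to it alone. The function name `group_proteins` and the input layout are hypothetical.

```python
from collections import defaultdict


def group_proteins(peptide_hits):
    """Collapse proteins that share peptides into groups.

    peptide_hits maps a peptide sequence to (score, set of protein accessions).
    Returns a list of dicts with the group anchor, all members (best first),
    and the members supported by at least one peptide unique to them.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    protein_score = defaultdict(float)  # assumed: sum of peptide scores
    unique = set()                      # proteins with a peptide of their own
    for _peptide, (score, proteins) in peptide_hits.items():
        ordered = sorted(proteins)
        for accession in ordered:
            protein_score[accession] += score
            union(ordered[0], accession)
        if len(ordered) == 1:
            unique.add(ordered[0])

    members_by_root = defaultdict(list)
    for accession in list(parent):
        members_by_root[find(accession)].append(accession)

    groups = []
    for members in members_by_root.values():
        members.sort(key=lambda a: (-protein_score[a], a))
        groups.append({
            "anchor": members[0],
            "members": members,
            "unambiguous": [a for a in members if a in unique],
        })
    groups.sort(key=lambda g: g["anchor"])
    return groups
```

For example, `group_proteins({"PEPA": (40.0, {"P1", "P2"}), "PEPB": (30.0, {"P2"})})` yields a single group in which every member is retained, so no protein evidence is discarded by the collapse.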
The data entered into EPIR was managed, viewed, merged, and analyzed using a web application module. The module was developed using J2EE and runs under a JBoss web application server version 3.0.4.
To evaluate the robustness and functionality of the EPIR platform, several types of analyses were performed. In order to assess the data validation module and protein grouping functionality of EPIR, a protein mixture containing six model proteins was used. Biochemical treatment of the mixture was identical to that used in the analysis of complex samples. Next, the ability of EPIR to process large LC MS/MS datasets was assessed using plasma membrane fractions generated from an MCF-7 cell line. Each fraction was analyzed six times using an exclusion list approach, resulting in a total of 60 LC MS/MS analyses. The same cell line was chosen to perform quantitative analysis of the proteins in a complex mixture. For this experiment, a series of different heavy:light ratios were prepared, and quantitative analysis was based on the metabolic labeling of Leu with three deuterium atoms (17. Ong S.E., Blagoev B., Kratchmarova I., Kristensen D.B., Steen H., Pandey A., Mann M. Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Mol. Cell. Proteomics. 2002; 1: 376-386).
The mixture of six model proteins was analyzed by one standard and two exclusion list analyses. The individual msm files were merged and sear