PRISM, a Generic Large Scale Proteomic Investigation Strategy for Mammals
2003; Elsevier BV; Volume: 2; Issue: 2; Language: English
DOI: 10.1074/mcp.m200074-mcp200
ISSN: 1535-9484
Authors: Thomas Kislinger, Khaled Rahman, Dragan Radulović, Brian Cox, Janet Rossant, Andrew Emili
Topic(s): Genomics and Phylogenetic Studies
Abstract: We have developed a systematic analytical approach, termed PRISM (Proteomic Investigation Strategy for Mammals), that permits routine, large scale protein expression profiling of mammalian cells and tissues. PRISM combines subcellular fractionation, multidimensional liquid chromatography-tandem mass spectrometry-based protein shotgun sequencing, and two newly developed computer algorithms, STATQUEST and GOClust, as a means to rapidly identify, annotate, and categorize thousands of expressed mammalian proteins. The application of PRISM to adult mouse lung and liver resulted in the high confidence identification of over 2,100 unique proteins including more than 100 integral membrane proteins, 400 nuclear proteins, and 500 uncharacterized proteins, the largest proteome study carried out to date on this important model organism. Automated clustering of the identified proteins into Gene Ontology annotation groups allowed for streamlined analysis of the large data set, revealing interesting and physiologically relevant patterns of tissue and organelle specificity. PRISM therefore offers an effective platform for in-depth investigation of complex mammalian proteomes.
The laboratory mouse is a powerful model organism for investigating fundamental aspects of mammalian cell physiology, development, and disease (1. Rossant J. McKerlie C. Mouse-based phenogenomics for modelling human disease. Trends Mol. Med. 2001; 7: 502-507), and it is currently the focus of systematic efforts aimed at large scale gene prediction and functional annotation (2. Nadeau J.H. Balling R. Barsh G. Beier D. Brown S.D. Bucan M. Camper S. Carlson G. Copeland N. Eppig J. Fletcher C. Frankel W.N. Ganten D. Goldowitz D. Goodnow C. et al. Sequence interpretation. Functional annotation of mouse genome sequences. Science. 2001; 291: 1251-1255, 3. Marra M. Hillier L. Kucaba T. Allen M. Barstead R. Beck C. Blistain A. Bonaldo M. Bowers Y. Bowles L. Cardenas M. Chamberlain A. Chappell J. Clifton S. Favello A. et al. An encyclopedia of mouse genes. Nat. Genet. 1999; 21: 191-194, 4. Okazaki Y. Furuno M. Kasukawa T. Adachi J. Bono H. Kondo S. Nikaido I. Osato N. Saito R. Suzuki H. Yamanaka I. Kiyosawa H. Yagi K. Tomaru Y. Hasegawa Y. et al. Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature. 2002; 420: 563-573).
The use of oligonucleotide- and cDNA-based microarrays, in particular, is providing unprecedented insight into the regulation of global patterns of gene expression (5. Schulze A. Downward J. Navigating gene expression using microarrays—a technology review. Nat. Cell Biol. 2001; 3: E190-E195). Nevertheless, since protein abundance does not always correlate with transcript levels (6. Gygi S.P. Rochon Y. Franza B.R. Aebersold R. Correlation between protein and mRNA abundance in yeast. Mol. Cell. Biol. 1999; 19: 1720-1730) and since the subcellular localization and turnover rate of biologically active protein can only be determined directly, the development of sensitive, accurate methods for comprehensive analysis of cellular protein expression patterns in mouse and other mammals is broadly needed. Two-dimensional polyacrylamide gel electrophoresis has been the traditional method of choice for high resolution proteome analysis (7. Hanash S.M. Biomedical applications of two-dimensional electrophoresis using immobilized pH gradients: current status. Electrophoresis. 2000; 21: 1202-1209, 8. Westbrook J.A. Yan J.X. Wait R. Welson S.Y. Dunn M.J. Zooming-in on the proteome: very narrow-range immobilised pH gradients reveal more protein species and isoforms. Electrophoresis. 2001; 22: 2865-2871). Despite recent advances, this approach is biased against membrane-associated proteins, low abundance proteins, or proteins with extremes in isoelectric point or molecular weight (9. Corthals G.L. Wasinger V.C. Hochstrasser D.F. Sanchez J.C. The dynamic range of protein expression: a challenge for proteomic research. Electrophoresis. 2000; 21: 1104-1115, 10. Gygi S.P. Corthals G.L. Zhang Y. Rochon Y. Aebersold R. Evaluation of two-dimensional gel electrophoresis-based proteome analysis technology. Proc. Natl. Acad. Sci. U. S. A. 2000; 97: 9390-9395).
The identification of gel-separated proteins by mass spectrometry (MS) is also tedious due to the need to extract, digest, and analyze individual gel spots. Consequently techniques for gel-free chromatographic separation of protein or peptide mixtures coupled to on-line MS detection are currently in development. One promising method, based on multidimensional capillary-scale liquid chromatography-electrospray ionization tandem MS (LC-MS) protein identification technology (MudPIT) pioneered by Yates and colleagues (11. Link A.J. Eng J. Schieltz D.M. Carmack E. Mize G.J. Morris D.R. Garvik B.M. Yates III, J.R. Direct analysis of protein complexes using mass spectrometry. Nat. Biotechnol. 1999; 17: 676-682, 12. Washburn M.P. Wolters D. Yates III, J.R. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat. Biotechnol. 2001; 19: 242-247, 13. Wolters D.A. Washburn M.P. Yates III, J.R. An automated multidimensional protein identification technology for shotgun proteomics. Anal. Chem. 2001; 73: 5683-5690), permits shotgun sequencing of large numbers of proteins present in cell extracts. MudPIT has been applied successfully to several model organisms, leading to the identification of 1,484 proteins in yeast (12), 2,363 proteins in rice (14. Koller A. Washburn M.P. Lange B.M. Andon N.L. Deciu C. Haynes P.A. Hays L. Schieltz D. Ulaszek R. Wei J. Wolters D. Yates III, J.R. Proteomic survey of metabolic pathways in rice. Proc. Natl. Acad. Sci. U. S. A. 2002; 99: 11969-11974), and, most recently, 2,415 proteins in Plasmodium (15. Florens L. Washburn M.P. Raine J.D. Anthony R.M. Grainger M. Haynes J.D. Moch J.K. Muster N. Sacci J.B. Tabb D.L. Witney A.A. Wolters D. Wu Y. Gardner M.J. Holder A.A. Sinden R.E. Yates J.R. Carucci D.J. A proteomic view of the Plasmodium falciparum life cycle. Nature. 2002; 419: 520-526). Other powerful gel-free approaches, such as the use of accurate mass tag detection by Fourier transform ion cyclotron resonance MS (16. Lipton M.S. Pasa-Tolic L. Anderson G.A. Anderson D.J. Auberry D.L. Battista J.R. Daly M.J. Fredrickson J. Hixson K.K. Kostandarithes H. Masselon C. Markillie L.M. Moore R.J. Romine M.F. Shen Y. et al. Global analysis of the Deinococcus radiodurans proteome by using accurate mass tags. Proc. Natl. Acad. Sci. U. S. A. 2002; 99: 11049-11054), isotope-coded affinity tags (17. Han D.K. Eng J. Zhou H. Aebersold R. Quantitative profiling of differentiation-induced microsomal proteins using isotope-coded affinity tags and mass spectrometry. Nat. Biotechnol. 2001; 19: 946-951), and high accuracy quadrupole MS (18. Lasonder E. Ishihama Y. Andersen J.S. Vermunt A.M. Pain A. Sauerwein R.W. Eling W.M. Hall N. Waters A.P. Stunnenberg H.G. Mann M. Analysis of the Plasmodium falciparum proteome by high-accuracy mass spectrometry. Nature. 2002; 419: 537-542), also allow for significant proteome coverage. Nevertheless, the mouse proteome is predicted to be very complex and highly regulated (19. Waterston R.H. Lindblad-Toh K. Birney E. Rogers J. Abril J.F. Agarwal P. Agarwala R. Ainscough R. Alexandersson M. An P. Antonarakis S.E. Attwood J. Baertsch R. Bailey J. Barlow K. et al. Initial sequencing and comparative analysis of the mouse genome. Nature. 2002; 420: 520-562), involving many thousands of proteins regulated by means of differential synthesis and selective subcellular localization. Furthermore, current high throughput experimental proteomic approaches do not allow for ready transformation of raw data into meaningful, easy to interpret output. Here we describe the development and application of PRISM, a generic Proteomic Investigation Strategy for Mammals that allows for systematic, efficient, and unbiased detection and simplified follow-up analysis of large numbers of proteins expressed in mammalian cells and tissues. PRISM consists of a series of integrated experimental and analytical steps, starting with subcellular fractionation and high throughput protein shotgun sequencing using an optimized MudPIT procedure followed by automated statistical validation, annotation, and categorization of the identified proteins based on universal Gene Ontology (GO) annotation terms (20. Ashburner M. Ball C.A. Blake J.A. Botstein D. Butler H. Cherry J.M. Davis A.P. Dolinski K. Dwight S.S. Eppig J.T. Harris M.A. Hill D.P. Issel-Tarver L. Kasarskis A. Lewis S. Matese J.C. Richardson J.E. Ringwald M. Rubin G.M. Sherlock G. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 2000; 25: 25-29). PRISM was evaluated on healthy adult mouse lung and liver, and physiologically significant differences in the tissue specificity and subcellular localization of hundreds of proteins were readily detected, confirming the utility of the approach for global analysis of complex mammalian proteomes.
The abbreviations used are: MS, mass spectrometry; LC, liquid chromatography; GO, Gene Ontology; MudPIT, multidimensional protein identification technology; PRISM, Proteomic Investigation Strategy for Mammals; HPLC, high pressure liquid chromatography; DTT, dithiothreitol; CBP, cAMP-response element-binding protein (CREB)-binding protein.
All solid chemicals were from Sigma, while HPLC grade acetonitrile, methanol, and water were purchased from Fisher Scientific, and heptafluorobutyric acid was obtained from BioLynx (Brockville, Ontario, Canada). Bulk Poroszyme immobilized trypsin was obtained from Applied Biosystems (Streetsville, Ontario, Canada), and endoproteinase Lys-C was from Roche Diagnostics.
Healthy adult female mice (ICR) were sacrificed by CO2 asphyxiation. The organs of interest were perfused with cold phosphate-buffered saline, rapidly removed, rinsed, and homogenized for 2 min in ice-cold lysis buffer containing 250 mM sucrose, 50 mM Tris-HCl (pH 7.4), 5 mM MgCl2, 1 mM DTT, and 1 mM phenylmethylsulfonyl fluoride using a tight fitting Teflon pestle attached to a power drill. All subsequent steps were performed at 4 °C. The lysate was centrifuged in a benchtop centrifuge at 800 × g for 15 min; the supernatant served as the source of cytosol, mitochondria, and microsomes. The pellet, which contains the nuclei, was rehomogenized for 1 min in lysis buffer and centrifuged again as above. The nuclei were homogenized in cushion buffer (2 M sucrose, 50 mM Tris-HCl (pH 7.4), 5 mM MgCl2, 1 mM DTT, and 1 mM phenylmethylsulfonyl fluoride), filtered through cheesecloth to remove debris, layered onto 4 ml of cushion buffer, and pelleted in an ultracentrifuge at 80,000 × g for 35 min (Beckman SW41 rotor). Mitochondria were isolated from the crude cytoplasmic fraction by benchtop centrifugation at 6,000 × g for 15 min, whereas the microsomal fraction was isolated by 100,000 × g ultracentrifugation for 1 h (Beckman SW41 rotor). The supernatant was saved as the "cytosol" fraction. Nuclear proteins were extracted by resuspending and incubating the nuclei in 5 volumes of 20 mM HEPES (pH 7.9), 1.5 mM MgCl2, 0.42 M NaCl, 0.2 mM EDTA, and 25% glycerol for 30 min with gentle shaking. The nuclei were then lysed by 10 passages through an 18-gauge needle, and debris was removed by microcentrifugation at 13,000 rpm for 30 min. The supernatant served as the "nuclear" fraction. Mitochondrial proteins were isolated by incubating the mitochondria in a hypotonic lysis buffer containing 10 mM HEPES, pH 7.9 for 30 min on ice. The suspension was briefly sonicated, and debris was pelleted in a benchtop microcentrifuge at 13,000 rpm for 30 min.
The supernatant served as the "soluble mitochondrial" fraction. Membrane proteins were extracted by gently resuspending the insoluble mitochondrial pellet and the microsomes in extraction buffer containing 20 mM Tris-HCl (pH 7.8), 0.4 M NaCl, 15% glycerol, 1 mM DTT, and 1.5% Triton X-100. The suspension was incubated with gentle shaking for 1 h and recentrifuged at 100,000 × g for 1 h (Beckman SW60Ti rotor). The supernatants served as the "microsome" and "mitochondrial pellet" fractions, respectively. For the crude whole tissue extract, mouse liver was homogenized for 2 min in ice-cold homogenization buffer containing 250 mM sucrose, 50 mM Tris-HCl (pH 7.4), 5 mM MgCl2, 1 mM DTT, and 1 mM phenylmethylsulfonyl fluoride. The resulting solution was briefly sonicated and centrifuged at 800 × g, and the supernatant was analyzed.
An aliquot of 150 μg of total protein from each fraction was precipitated overnight with 5 volumes of ice-cold acetone followed by centrifugation at 21,000 × g for 20 min. The protein pellet was solubilized in 8 M urea, 50 mM Tris-HCl, pH 8.5 at 37 °C for 2 h and reduced by the addition of 1 mM DTT for 1 h at room temperature followed by carboxyamidomethylation with 5 mM iodoacetamide for 1 h at 37 °C. The samples were then diluted to 4 M urea with 50 mM ammonium bicarbonate, pH 8.5 and digested with a 1:150 molar ratio of endoproteinase Lys-C at 37 °C overnight. The next day the mixtures were further diluted to 2 M urea with 50 mM ammonium bicarbonate, pH 8.5, supplemented with CaCl2 to a final concentration of 1 mM, and incubated overnight with Poroszyme trypsin beads at 30 °C with rotation. The resulting peptide mixtures were solid phase-extracted with SPEC-Plus PT C18 cartridges (Ansys Diagnostics, Lake Forest, CA) according to the manufacturer's instructions and stored at −80 °C until further use.
A fully automated 15-cycle, 30-h MudPIT chromatographic procedure was set up essentially as described previously (12, 13). Briefly, an HPLC quaternary pump was interfaced with an LCQ DECA XP ion trap tandem mass spectrometer (ThermoFinnigan, San Jose, CA). A 150-μm-inner diameter fused silica capillary microcolumn (Polymicro Technologies, Phoenix, AZ) was pulled to a fine tip using a P-2000 laser puller (Sutter Instruments, Novato, CA) and packed with 10 cm of 5-μm Zorbax Eclipse XDB-C18 resin (Agilent Technologies, Mississauga, Ontario, Canada) and then with 6 cm of 5-μm Partisphere strong cation exchange resin (Whatman). Samples were loaded manually onto separate columns using a pressure vessel. The chromatography was carried out as described by Wolters et al. (13).
The SEQUEST program (a kind gift from Jimmy Eng and John Yates III) was used to search peptide spectra essentially as described previously (21. Cagney G. Emili A. De novo peptide sequencing and quantitative profiling of complex protein mixtures using mass-coded abundance tagging. Nat. Biotechnol. 2002; 20: 163-170). The database was populated with non-redundant mammalian Swiss-Prot and TrEMBL protein sequences in both a normal and inverted amino acid orientation (22. Moore R.E. Young M.K. Lee T.D. Qscore: an algorithm for evaluating SEQUEST database search results. J. Am. Soc. Mass Spectrom. 2002; 13: 378-386). Statistical analysis (error modeling) was performed on the SEQUEST scores obtained for over 30,000 peptide matches.
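The role of the inverted-orientation entries can be illustrated with a short sketch. It is not the authors' implementation (the paper reports that computations were run in FORTRAN); the function names, the `INV_` prefix, and the single score cutoff standing in for a multidimensional score region are all assumptions made for illustration. The estimate `2 * alpha - 1` follows the relation the paper derives (Eq. 8) under its assumption that an incorrect match is equally likely to hit a normal or an inverted sequence.

```python
from typing import Dict, List, Tuple


def add_inverted_entries(proteins: Dict[str, str], tag: str = "INV_") -> Dict[str, str]:
    """Return a database holding every sequence in normal and inverted order.

    The "INV_" name prefix is a hypothetical convention used only in this sketch.
    """
    combined = dict(proteins)
    for name, seq in proteins.items():
        combined[tag + name] = seq[::-1]  # fully inverted amino acid order
    return combined


def estimate_fraction_correct(
    hits: List[Tuple[str, float]], min_score: float, tag: str = "INV_"
) -> float:
    """Estimate the fraction of correct matches among hits scoring >= min_score.

    alpha is the share of retained hits that map to normal-orientation entries;
    if wrong matches land on normal and inverted entries equally often, the
    fraction of correct matches is approximately 2 * alpha - 1.
    """
    kept = [name for name, score in hits if score >= min_score]
    if not kept:
        raise ValueError("no hits at or above min_score")
    forward = sum(1 for name in kept if not name.startswith(tag))
    alpha = forward / len(kept)
    return 2 * alpha - 1


# Toy data: sequences, entry names, and scores are invented for illustration.
db = add_inverted_entries({"P1": "MKTAYIAK", "P2": "GAVLIPFW"})
hits = [("P1", 3.1), ("P2", 2.8), ("INV_P1", 2.6), ("P1", 2.9)]
print(estimate_fraction_correct(hits, min_score=2.5))  # 0.5
```

In the toy data three of the four retained hits are to normal-orientation entries, so alpha is 0.75 and the estimated fraction of correct matches is 0.5; raising the cutoff to 3.0 leaves only a normal-orientation hit and the estimate rises to 1.0.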
Formally, the output of the analysis, $Y_i$, was given as: $Y_i = 0$ (the spectrum is incorrectly matched to an inverted peptide sequence); $Y_i = 1$ (the spectrum is matched to a normal peptide sequence, possibly incorrectly); $Y_i = 2$ (the spectrum matches the correct peptide sequence). We estimated a function $F(x,y,z,\ldots)$ that characterizes the likelihood that a peptide match with score $\vec{X}_i = (x,y,z,\ldots)$ is correct as
$$F(x,y,z,\ldots) = P\{Y_i = 2 \mid \vec{X}_i = (x,y,z,\ldots)\} \tag{Eq. 1}$$
For a protein with multiple peptide matches, $\{\vec{X}_1, \vec{X}_2, \ldots, \vec{X}_m\}$, one can then estimate the probability of correct identification by
$$P(\vec{X}_1, \vec{X}_2, \ldots, \vec{X}_m) = \max_{i \le m} P_i \tag{Eq. 2}$$
where $P_i$ denotes the estimate for the $i$th peptide match. To compute the detection sensitivity (coverage), an estimate of the number of proteins actually present in the sample was made using
$$N_{Prot} \approx \frac{\lambda \, PepN}{AProL / APepL} \tag{Eq. 3}$$
where $PepN$ is the number of observed peptides, $AProL$ is the average amino acid length of proteins in the database, $APepL$ is the average length of a peptide in the database, and $\lambda \le 1$ is a positive constant proportional to the number of matches to an actual peptide.
F was approximated by first carving the α regions and then fitting a smooth function. Monotonicity implies that for every β, there is a rectangular region, R, for which the function F will have a value of at least β. Assumption (i) implies that for a large β and a rectangle R with a large number of observations (K > 100) the following applies
$$\beta = F(\vec{X}_i) = P(Y_i = 2 \mid \vec{X}_i \in R) \approx 2\alpha - 1 \tag{Eq. 4}$$
where α represents the proportion of 1's in region R. By continuity, for a not too large R, $P(Y_i = k \mid \vec{X}_i \in R)$ is approximately equal for all $\vec{X}_i \in R$. Hence we let
$$p_k = P(Y_i = k), \quad k = 0, 1, 2 \tag{Eq. 5}$$
and
$$q = p_1 + p_2 = 1 - p_0 \tag{Eq. 6}$$
The probability that region R with K observations has α × K of 1's (meaning either 1 or 2 since 2's cannot be recognized a priori) and (1 − α) × K of 0's is given by
$$\binom{K}{\alpha K} \, q^{\alpha K} (1 - q)^{K(1 - \alpha)} \tag{Eq. 7}$$
It is well known that the maximum likelihood estimator of q is α (23. Casella G. Berger R. Statistical Inference. Duxbury Press, Belmont, CA 1990); we therefore let q ≈ α. Assumption (i) implies $p_1 = p_0$, resulting in
$$p_2 \approx 2\alpha - 1 \tag{Eq. 8}$$
Since
$$p_2 = P(Y_i = 2 \mid \vec{X}_i \in R) \tag{Eq. 9}$$
Equation 4 is proven. Therefore, if R is a rectangle in which the proportion of 1's is at least α and the SEQUEST scores $\vec{X}_i \in R$, the probability that a peptide match is correct is approximated by
$$F(x,y,z) \approx 2\alpha - 1 \tag{Eq. 10}$$
Next, rectangular regions are identified for which $F(x,y,z) \ge \beta$, with β ∈ {0.98, 0.96, 0.9, 0.8, 0.7, 0.6, 0.5}. To this end, for fixed β, we defined a function H(x,y,z) = 1 if (x,y,z) ∈ R and H(x,y,z) = 0 otherwise. For α = (β + 1)/2, which follows from Equation 4, we minimized the weighted $l_1$ distance between function H and the data points. In other words, since the rectangle R is easily parameterized (for example, R = {(x,y,z) such that a < x < b and c < y < d and e < z < f}), one looks for values "a,b,c,…" that minimize the following quantity
$$\sum_i \left| H(\vec{X}_i) - Y_i \right| W_i \tag{Eq. 11}$$
To ensure the target proportion α, the weights were computed as follows
$$W_i = \begin{cases} 1 & \text{if } Y_i = 0 \\ 1 - \gamma & \text{if } Y_i = 1 \end{cases} \tag{Eq. 12}$$
where γ = 2 − 1/α. Assumption (ii) implies the existence of a smooth monotone function that can approximate the data. The actual optimization algorithm was an accelerated random search (24. Radulovic, D., and Appel, M. (2000) Accelerated random search, in Proceedings of 16th IMACS World Congress on Scientific Computing, Lausanne, Switzerland, August 21–25, 2000). Computations were run on a desktop computer using FORTRAN.
GOClust takes as input a tab-delimited text file of validated proteins. To facilitate comparison across multiple samples, the programs DTASelect and Contrast (a generous gift from Dave Tabb, Scripps Research Institute, La Jolla, CA) were used to arrange the data sets (25. Tabb D.L. McDonald W.H. Yates J.R. DTASelect and Contrast: tools for assembling and comparing protein identifications from shotgun proteomics. J. Proteome Res. 2002; 1: 21-26).
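As a rough illustration of the kind of grouping GOClust performs (clustering validated proteins into Gene Ontology annotation groups), the sketch below reads accessions from a tab-delimited table and collects them under user-selected GO terms. It is a minimal sketch under stated assumptions, not the GOClust program: the function names, the column layout (accession in the first column, no header row), the `ACC1` to `ACC3` accessions, and the in-memory annotation dictionary are all hypothetical, and parsing of the actual annotation file is deliberately left out.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def read_validated_proteins(handle) -> List[str]:
    """Read accessions from the first column of a tab-delimited table.

    Assumes (hypothetically) one protein per row and no header row.
    """
    return [row[0] for row in csv.reader(handle, delimiter="\t") if row and row[0]]


def cluster_by_go(
    accessions: Iterable[str],
    annotations: Dict[str, Set[str]],
    selected_terms: Iterable[str],
) -> Dict[str, List[str]]:
    """Group accessions under each selected GO term they are annotated with."""
    terms = list(selected_terms)
    groups: Dict[str, List[str]] = defaultdict(list)
    for acc in accessions:
        annotated = annotations.get(acc, set())  # unannotated proteins are skipped
        for term in terms:
            if term in annotated:
                groups[term].append(acc)
    return dict(groups)


# Invented example input; accessions and protein names are placeholders.
table = io.StringIO("ACC1\tprotein one\nACC2\tprotein two\nACC3\tprotein three\n")
annotations = {
    "ACC1": {"GO:0005634"},                # nucleus
    "ACC2": {"GO:0005739"},                # mitochondrion
    "ACC3": {"GO:0005634", "GO:0005739"},
}
groups = cluster_by_go(
    read_validated_proteins(table), annotations, ["GO:0005634", "GO:0005739"]
)
for term, members in groups.items():
    print(term, members)
```

A protein annotated to several selected terms appears in each matching group, which is one reasonable reading of grouping by shared annotation; a real tool would also need to decide how to treat parent and child terms in the ontology, which this sketch ignores.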
Protein matches to other sequence databases are first mapped to a corresponding Swiss-Prot or TrEMBL entry using the Sequence Retrieval System at the Canadian Bioinformatics Resources (www.cbr.ncr.ca). The GOA flat file (regularly updated) that provides GO annotations for non-redundant Swiss-Prot, TrEMBL, and Ensembl entries was downloaded from the European Bioinformatics Institute (www.ebi.ac.uk). The final output is a series of tables of grouped proteins that share a common annotation to one or more preselected GO terms. The choice of terms is fully flexible to satisfy user interests.
The apparent complexity of the mammalian proteome, defined here as the set of proteins produced by cells or tissues, represents a considerable experimental challenge even to high performance LC-MS techniques such as MudPIT (11, 12, 13). Hence we chose to use subcellular fractionation, an effective technique for selective enrichment of specific subsets of cellular proteins (26. Rappsilber J. Ryder U. Lamond A.I. Mann M. Large-scale proteomic analysis of the human spliceosome. Genome Res. 2002; 12: 1231-1245, 27. Cronshaw J.M. Krutchinsky A.N. Zhang W. Chait B.T. Matunis M.J. Proteomic analysis of the mammalian nuclear pore complex. J. Cell Biol. 2002; 158: 915-927) and organelles (28. Andersen J.S. Lyon C.E. Fox A.H. Leung A.K. Lam Y.W. Steen H. Mann M. Lamond A.I. Directed proteomic analysis of the human nucleolus. Curr. Biol. 2002; 12: 1-11, 29. Scherl A. Coute Y. Deon C. Calle A. Kindbeiter K. Sanchez J.C. Greco A. Hochstrasser D. Diaz J.J. Functional proteomic analysis of human nucleolus. Mol. Biol. Cell. 2002; 13: 4100-4109) as a means of increasing both the proteome coverage and functional insight gained from LC-MS analysis. The entire PRISM methodology is outlined schematically in Fig. 1. As the goal was to develop a simple, generic methodology, we opted for a straightforward procedure based on differential centrifugation (outlined in Fig. 2A; see "Experimental Procedures"), which nonetheless results in respectable enrichment of nuclear, mitochondrial, microsomal, and cytosolic compartments and, importantly, a significant increase in the number of proteins detected by LC-MS (see below).
Fig. 2. A, schematic representation of the subcellular fractionation procedure using differential centrifugation. Centrifugal forces and times (′, minutes) are indicated inside circular arrows. The final protein fractions analyzed by LC-MS are highlighted in bold (N, nuclear; C, cytosol; MI, microsomes; MS, soluble mitochondria; MP, insoluble mitochondrial). B, STATQUEST, a statistical algorithm to validate putative protein identifications. Graphical representation of the distribution of SEQUEST database search scores for doubly charged mouse peptides and the derived probability function G(x,y). C, GOClust, a program for automatic annotation of identified proteins. Matches to proteins within the Protein Information Resource (PIR) and GenPept (GenBankTM) databases are first linked to the TrEMBL database. Next Swiss-Prot and TrEMBL linked proteins are annotated using an annotation (GOA) flat file. The annotated proteins are then clustered into user-selected GO subcategories, and a summary spreadsheet is produced.
Proteins are extracted from each of five well defined subcellular fractions and digested with endoproteinase Lys-C and trypsin, and the peptide mixtures are analyzed using a 15-step MudPIT procedure (see "Experimental Procedures"). The MS instrumentation is set to automatically record both the mass-to-charge ratio and the fragmentation pattern of each eluting peptide that undergoes collision-induced dissociation. The fragmentation spectra are then compared with non-redundant human and mouse protein sequences using SEQUEST (30. Eng J.K. McCormack A.L. Yates III, J.R. An approach to correlate tandem mass-spectral data of peptides with amino-acid-sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994; 11: 976-989, 31. Yates III, J.R. Eng J.K. McCormack A.L. Schieltz D. Method to correlate tandem mass spectra of modified peptides to amino acid sequences in the protein database. Anal. Chem. 1995; 67: 1426-1436), a database search program that infers amino acid sequence identity by matching fragment ions to translated genomic sequences. The output of SEQUEST is a series of putative protein matches and associated peptide scores, which include a cross-correlation score based on spectral fit (Xcorr), the normalized difference between the Xcorr of the top and second best matches (ΔCn), and a preliminary ranking based on the number of matched ion peaks (RSp). A subjective combination of these scores as well as other factors such as the charge of the precursor ion, the presence of tryptic termini (relevant in experiments where the peptides are generated by digestion with trypsin), and the number of peptides that map to a given protein is typically used to evaluate the accuracy of a prediction (30).
To provide a more rigorous estimate of the accuracy of SEQUEST predictions, we developed a statistical algorithm, STATQUEST, that uses an empirical, probabilistic method for determining the likelihood of each putative peptide match. We began our error modeling by considering the criteria mentioned above as a collection of variables and evaluating whether a subset, d, of these might describe a region of d-dimensional space enriched for correctly identified proteins. The goal was to produce a function corresponding to the probability that a given protein with SEQUEST scores $\vec{X}_i = (x,y,z,\ldots)$ is correctly identified. To this end, we evaluated the distribution of SEQUEST scores for tens of thousands of mouse peptide spectra obtained by searching a database populated with mouse and human protein sequences in both the normal amino acid order as well as a fully inverted order (see "Experimental Procedures"). Our analysis had two assumptions. (i) If a match is incorrect, SEQUEST has an equal chance to return a forward (a 1) or an inverted sequence (a 0). (ii) The likelihood of a correct match is a smooth and monotone function dependent on the Xcorr, ΔCn, RSp, charge, and tryptic status of the peptide. Since a match to an inverted sequence, or 0, clearly indicates an incorrect match, we located regions of variable space (i.e. Xcorr, ΔCn, and RSp) where the concentration of 0's is low. Since a low concentration of 0's relates to a high probability of correct matches, we were able to derive a likelihood function (see "Experimental Procedures"). (To further justify the above approach, we offer Supplemental Figs. F1 and F2 that show that the distribution of Xcorr and ΔCn for matches to normal peptide sequences (or 1's; Supplemental Fig. F2) is biased compared with matches to inverted sequences (or 0's; Supplemental Fig. F1).) While the exact form of this function is not known, a good (least squares) fit is achieved with
$$\alpha \approx G(x,y) = 1 - \frac{1}{Q(x,y)} \tag{Eq. 13}$$
where Q(x,y) is a polynomial expression of second degree. Singly, doubly, and triply charged peptides were treated separately, and the three predictors (variables) were fixed as x = Xcorr, y = ΔCn, and z = RSp. The final function is
$$F(x,y,z) = \begin{cases} 2G(x,y) - 1 & \text{if } z \le 5 \\ 2G(x,0) - 1 & \text{if } z > 5 \end{cases} \tag{Eq. 14}$$
The output of this function is a probability value for putative matches, which allows for easy assignment of a confidence factor. We found that the dimension of the problem could be reduced by fixing the third variable, z < 5, since this results in virtually the same entries; an accelerated random search (24) was used as the optimization algorithm. Heuristically the optimization "slides" the