Article Open access Peer-reviewed

A Bioinformatics Workflow for Variant Peptide Detection in Shotgun Proteomics

2011; Elsevier BV; Volume: 10; Issue: 5; Language: English

10.1074/mcp.m110.006536

ISSN

1535-9484

Authors

Jing Li, Zengliu Su, Zeqiang Ma, Robbert J.C. Slebos, Patrick J. Halvey, David L. Tabb, D.C. Liebler, William Pao, Bing Zhang

Topic(s)

Genomics and Rare Diseases

Abstract

Shotgun proteomics data analysis usually relies on database search. However, commonly used protein sequence databases do not contain information on protein variants and thus prevent variant peptides and proteins from being identified. Including known coding variations into protein sequence databases could help alleviate this problem. Based on our recently published human Cancer Proteome Variation Database, we have created a protein sequence database that comprehensively annotates thousands of cancer-related coding variants collected in the Cancer Proteome Variation Database as well as noncancer-specific ones from the Single Nucleotide Polymorphism Database (dbSNP). Using this database, we then developed a data analysis workflow for variant peptide identification in shotgun proteomics. The high risk of false positive variant identifications was addressed by a modified false discovery rate estimation method. Analysis of colorectal cancer cell lines SW480, RKO, and HCT-116 revealed a total of 81 peptides that contain either noncancer-specific or cancer-related variations. Twenty-three out of 26 variants randomly selected from the 81 were confirmed by genomic sequencing. We further applied the workflow to data sets from three individual colorectal tumor specimens. A total of 204 distinct variant peptides were detected, and five carried known cancer-related mutations. Each individual showed a specific pattern of cancer-related mutations, suggesting potential use of this type of information for personalized medicine. Compatibility of the workflow has been tested with four popular database search engines including Sequest, Mascot, X!Tandem, and MyriMatch. In summary, we have developed a workflow that effectively uses existing genomic data to enable variant peptide detection in proteomics.

DNA sequence variation is associated with diseases and differential drug response. As a paradigmatic example, cancers are diseases of clonal proliferations caused by mutations in oncogenes and tumor suppressor genes (1. Vogelstein B. Kinzler K.W.
Cancer genes and the pathways they control. Nat. Med. 2004; 10: 789-799). After several decades of searching through traditional biology approaches, many mutant genes have been causally implicated in oncogenesis (2. Futreal P.A. Coin L. Marshall M. Down T. Hubbard T. Wooster R. Rahman N. Stratton M.R. A census of human cancer genes. Nat. Rev. Cancer. 2004; 4: 177-183). Facilitated by new genomic techniques such as SNP (single nucleotide polymorphism) arrays and deep sequencing, the identification of cancer genes has made enormous progress over the past several years (3. Wood L.D. Parsons D.W. Jones S. Lin J. Sjöblom T. Leary R.J. Shen D. Boca S.M. Barber T. Ptak J. Silliman N. Szabo S. Dezso Z. Ustyanksky V. Nikolskaya T. Nikolsky Y. Karchin R. Wilson P.A. Kaminker J.S. Zhang Z. Croshaw R. Willis J. Dawson D. Shipitsin M. Willson J.K. Sukumar S. Polyak K. Park B.H. Pethiyagoda C.L. Pant P.V. Ballinger D.G. Sparks A.B. Hartigan J. Smith D.R. Suh E. Papadopoulos N. Buckhaults P. Markowitz S.D. Parmigiani G. Kinzler K.W. Velculescu V.E. Vogelstein B. The genomic landscapes of human breast and colorectal cancers. Science. 2007; 318: 1108-1113; 4. Weir B.A. Woo M.S. Getz G. Perner S. Ding L. Beroukhim R. Lin W.M. Province M.A. Kraja A. Johnson L.A. Shah K. Sato M. Thomas R.K. Barletta J.A. Borecki I.B. Broderick S. Chang A.C. Chiang D.Y. Chirieac L.R. Cho J. Fujii Y. Gazdar A.F. Giordano T. Greulich H. Hanna M. Johnson B.E. Kris M.G. Lash A. Lin L. Lindeman N. Mardis E.R. McPherson J.D. Minna J.D. Morgan M.B. Nadel M. Orringer M.B. Osborne J.R. Ozenberger B. Ramos A.H. Robinson J. Roth J.A. Rusch V. Sasaki H. Shepherd F. Sougnez C. Spitz M.R. Tsao M.S. Twomey D. Verhaak R.G. Weinstock G.M. Wheeler D.A. Winckler W. Yoshizawa A. Yu S. Zakowski M.F. Zhang Q. Beer D.G. Wistuba II Watson M.A. Garraway L.A. Ladanyi M. Travis W.D. Pao W. Rubin M.A.
Gabriel S.B. Gibbs R.A. Varmus H.E. Wilson R.K. Lander E.S. Meyerson M. Characterizing the cancer genome in lung adenocarcinoma. Nature. 2007; 450: 893-898; 5. TCGA. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008; 455: 1061-1068; 6. Sjöblom T. Jones S. Wood L.D. Parsons D.W. Lin J. Barber T.D. Mandelker D. Leary R.J. Ptak J. Silliman N. Szabo S. Buckhaults P. Farrell C. Meeh P. Markowitz S.D. Willis J. Dawson D. Willson J.K. Gazdar A.F. Hartigan J. Wu L. Liu C. Parmigiani G. Park B.H. Bachman K.E. Papadopoulos N. Vogelstein B. Kinzler K.W. Velculescu V.E. The consensus coding sequences of human breast and colorectal cancers. Science. 2006; 314: 268-274; 7. Greenman C. Stephens P. Smith R. Dalgliesh G.L. Hunter C. Bignell G. Davies H. Teague J. Butler A. Stevens C. Edkins S. O'Meara S. Vastrik I. Schmidt E.E. Avis T. Barthorpe S. Bhamra G. Buck G. Choudhury B. Clements J. Cole J. Dicks E. Forbes S. Gray K. Halliday K. Harrison R. Hills K. Hinton J. Jenkinson A. Jones D. Menzies A. Mironenko T. Perry J. Raine K. Richardson D. Shepherd R. Small A. Tofts C. Varian J. Webb T. West S. Widaa S. Yates A. Cahill D.P. Louis D.N. Goldstraw P. Nicholson A.G. Brasseur F. Looijenga L. Weber B.L. Chiew Y.E. DeFazio A. Greaves M.F. Green A.R. Campbell P. Birney E. Easton D.F. Chenevix-Trench G. Tan M.H. Khoo S.K. Teh B.T. Yuen S.T. Leung S.Y. Wooster R. Futreal P.A. Stratton M.R. Patterns of somatic mutation in human cancer genomes. Nature. 2007; 446: 153-158). The genomic abnormalities of cancer are expressed through aberrant proteins and proteomes and their altered functions. Although proteins reflecting the genomic changes in cancer have the potential to become clinically meaningful biomarkers, their discovery and validation have proven to be challenging.
As a result, few biomarker candidates have translated into clinical use. Over the past decade, mass spectrometry (MS)-based shotgun proteomics has emerged as a high-throughput, unbiased method for the identification of proteins in complex samples (8. Foster L.J. de Hoog C.L. Zhang Y. Zhang Y. Xie X. Mootha V.K. Mann M. A mammalian organelle map by protein correlation profiling. Cell. 2006; 125: 187-199; 9. Kislinger T. Cox B. Kannan A. Chung C. Hu P. Ignatchenko A. Scott M.S. Gramolini A.O. Morris Q. Hallett M.T. Rossant J. Hughes T.R. Frey B. Emili A. Global survey of organ and organelle protein expression in mouse: combined proteomic and transcriptomic profiling. Cell. 2006; 125: 173-186). Its application to tumor specimens holds great potential in identifying mutant proteins in human cancers. However, because shotgun proteomics data analysis usually relies on database search and because commonly employed protein sequence databases do not contain protein variation information, the application of shotgun proteomics to the detection of protein sequence variants remains a big challenge. Several research groups have made valuable efforts to enable the identification of variant peptides based on the exhaustive search of all possible sequence variants. A modified version of Sequest provides automated search of human hemoglobin gene variants by dynamically generating all possible single-nucleotide variations and then constructing a database that translates these sequences to peptides (10. Gatlin C.L. Eng J.K. Cross S.T. Detter J.C. Yates 3rd, J.R. Automated identification of amino acid sequence variations in proteins by HPLC/microspray tandem mass spectrometry. Anal. Chem. 2000; 72: 757-763). Roth et al. (11. Roth M.J. Forbes A.J. Boyne 2nd, M.T. Kim Y.B. Robinson D.E. Kelleher N.L.
Precise and parallel characterization of coding polymorphisms, alternative splicing, and modifications in human proteins by mass spectrometry. Mol. Cell. Proteomics. 2005; 4: 1002-1008) developed a human protein database tailored for the “top-down” MS approach by combinatorial consideration of protein variability in a search. Similarly, the error-tolerant search in Mascot (12. Creasy D.M. Cottrell J.S. Error tolerant searching of uninterpreted tandem mass spectrometry data. Proteomics. 2002; 2: 1426-1434) and the refinement search in X!Tandem (13. Craig R. Beavis R.C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics. 2004; 20: 1466-1467) allow exhaustive testing of all amino acid substitutions that can arise from single-base nucleotide substitutions in each protein. Because of the greatly expanded search space, it is difficult to apply a meaningful measure of statistical significance to the variant identifications, and the results require careful interpretation (12. Creasy D.M. Cottrell J.S. Error tolerant searching of uninterpreted tandem mass spectrometry data. Proteomics. 2002; 2: 1426-1434). An effective approach to limit the search space of protein variants is to consider only those derived from known coding SNPs. A SNP annotation method was presented by Bunger et al. in which MS/MS spectra were searched against reference protein databases and a separate SNP database created from peptides from the National Center for Biotechnology Information (NCBI) dbSNP database (14. Bunger M.K. Cargile B.J. Sevinsky J.R. Deyanova E. Yates N.A. Hendrickson R.C. Stephenson Jr., J.L. Detection and validation of non-synonymous coding SNPs from orthogonal analysis of shotgun proteomics data. J. Proteome Res. 2007; 6: 2331-2340).
Schandorff et al. established the MSIPI protein sequence database by extending the original International Protein Index sequences with coding SNPs from dbSNP, sequence conflicts, and N-terminal peptides (15. Schandorff S. Olsen J.V. Bunkenborg J. Blagoev B. Zhang Y. Andersen J.S. Mann M. A mass spectrometry-friendly database for cSNP identification. Nat. Methods. 2007; 4: 465-466). More recently, a web-based platform, SysPIMP, was created for identifying human disease-related mutant sequences based on the X!Tandem search of shotgun proteomics data (16. Xi H. Park J. Ding G. Lee Y.H. Li Y. SysPIMP: the web-based systematical platform for identifying human disease-related mutated sequences from mass spectrometry. Nucleic Acids Res. 2009; 37: D913-920). SysPIMP collects human disease-related mutant sequences from the Online Mendelian Inheritance in Man (17. Hamosh A. Scott A.F. Amberger J.S. Bocchini C.A. McKusick V.A. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 2005; 33: D514-517), Protein Mutant Database (18. Kawabata T. Ota M. Nishikawa K. The Protein Mutant Database. Nucleic Acids Res. 1999; 27: 355-357), and SwissProt database (19. Boeckmann B. Bairoch A. Apweiler R. Blatter M.C. Estreicher A. Gasteiger E. Martin M.J. Michoud K. O'Donovan C. Phan I. Pilbout S. Schneider M. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003; 31: 365-370). Despite these exciting developments, the problem of applying shotgun proteomics to the identification of protein variants in human cancers has not been addressed adequately. First, mutations, especially cancer-specific ones, are not specifically considered in existing approaches.
NCBI's dbSNP database provides a general catalog of genome variation to address large-scale sampling designs required by association studies. It has been an invaluable resource for applying genetic approaches to understanding the etiology of different cancers (20. Packer B.R. Yeager M. Staats B. Welch R. Crenshaw A. Kiley M. Eckert A. Beerman M. Miller E. Bergen A. Rothman N. Strausberg R. Chanock S.J. SNP500Cancer: a public resource for sequence validation and assay development for genetic variation in candidate genes. Nucleic Acids Res. 2004; 32: D528-532). However, cancer somatic mutations are collected in the Catalogue of Somatic Mutations in Cancer (http://www.sanger.ac.uk/genetics/CGP/cosmic) (21. Bamford S. Dawson E. Forbes S. Clements J. Pettett R. Dogan A. Flanagan A. Teague J. Futreal P.A. Stratton M.R. Wooster R. The COSMIC (Catalogue of Somatic Mutations in Cancer) database and website. Br. J. Cancer. 2004; 91: 355-358) and other cancer-specific databases (22. Olivier M. Petitjean A. Teague J. Forbes S. Dunnick J.K. den Dunnen J.T. Langerod A. Wilkinson J.M. Vihinen M. Cotton R.G. Hainaut P. Somatic mutation databases as tools for molecular epidemiology and molecular pathology of cancer: proposed guidelines for improving data collection, distribution, and integration. Hum. Mutat. 2009; 30: 275-282) rather than dbSNP. As a result, most cancer-specific mutations have been omitted from previous studies. Recently, we developed a human Cancer Proteome Variation Database (CanProVar, http://bioinfo.vanderbilt.edu/canprovar/) (23. Li J. Duncan D.T. Zhang B. CanProVar: a human cancer proteome variation database. Hum. Mutat.
2010; 31: 219-228) that comprehensively integrates proteome variation data from a variety of cancer-specific variation data sources including HPI (24. Boeckmann B. Blatter M.C. Famiglietti L. Hinz U. Lane L. Roechert B. Bairoch A. Protein variety and functional diversity: Swiss-Prot annotation in its biological context. C. R. Biol. 2005; 328: 882-899; 25. O'Donovan C. Apweiler R. Bairoch A. The human proteomics initiative (HPI). Trends Biotechnol. 2001; 19: 178-181), COSMIC, OMIM, and large-scale mutational profiling studies on cancer genes and cancer genomes (6. Sjöblom T. Jones S. Wood L.D. Parsons D.W. Lin J. Barber T.D. Mandelker D. Leary R.J. Ptak J. Silliman N. Szabo S. Buckhaults P. Farrell C. Meeh P. Markowitz S.D. Willis J. Dawson D. Willson J.K. Gazdar A.F. Hartigan J. Wu L. Liu C. Parmigiani G. Park B.H. Bachman K.E. Papadopoulos N. Vogelstein B. Kinzler K.W. Velculescu V.E. The consensus coding sequences of human breast and colorectal cancers. Science. 2006; 314: 268-274; 7. Greenman C. Stephens P. Smith R. Dalgliesh G.L. Hunter C. Bignell G. Davies H. Teague J. Butler A. Stevens C. Edkins S. O'Meara S. Vastrik I. Schmidt E.E. Avis T. Barthorpe S. Bhamra G. Buck G. Choudhury B. Clements J. Cole J. Dicks E. Forbes S. Gray K. Halliday K. Harrison R. Hills K. Hinton J. Jenkinson A. Jones D. Menzies A. Mironenko T. Perry J. Raine K. Richardson D. Shepherd R. Small A. Tofts C. Varian J. Webb T. West S. Widaa S. Yates A. Cahill D.P. Louis D.N. Goldstraw P. Nicholson A.G. Brasseur F. Looijenga L. Weber B.L. Chiew Y.E. DeFazio A. Greaves M.F. Green A.R. Campbell P. Birney E. Easton D.F. Chenevix-Trench G. Tan M.H. Khoo S.K. Teh B.T. Yuen S.T. Leung S.Y. Wooster R. Futreal P.A. Stratton M.R. Patterns of somatic mutation in human cancer genomes. Nature.
2007; 446: 153-158). Confirmed coding variations in NCBI's dbSNP are also included in CanProVar. This cancer-centric proteome variation repository provides an opportunity to create a protein sequence database that can facilitate protein variant detection in shotgun proteomics analysis of human cancer samples. Second, although limiting protein variants to known coding SNPs and mutations could effectively reduce the search space as compared with the exhaustive test of all possible amino acid substitutions, this method still significantly increases the number of entries in a protein sequence database, which in turn increases the risk of false positive identifications. Many previous reports failed to address this critical problem (14. Bunger M.K. Cargile B.J. Sevinsky J.R. Deyanova E. Yates N.A. Hendrickson R.C. Stephenson Jr., J.L. Detection and validation of non-synonymous coding SNPs from orthogonal analysis of shotgun proteomics data. J. Proteome Res. 2007; 6: 2331-2340). In the study by Bunger et al. (14. Bunger M.K. Cargile B.J. Sevinsky J.R. Deyanova E. Yates N.A. Hendrickson R.C. Stephenson Jr., J.L. Detection and validation of non-synonymous coding SNPs from orthogonal analysis of shotgun proteomics data. J. Proteome Res. 2007; 6: 2331-2340), a peptide is assigned as an “alternative allele” SNP if the search score for its match against the dbSNP is at least 15% higher than the score for the corresponding reference hit. The threshold of 15% was chosen based on manual examination to provide the best balance between false positives and false negatives (14. Bunger M.K. Cargile B.J. Sevinsky J.R. Deyanova E. Yates N.A. Hendrickson R.C. Stephenson Jr., J.L. Detection and validation of non-synonymous coding SNPs from orthogonal analysis of shotgun proteomics data. J.
Proteome Res. 2007; 6: 2331-2340). Although this approach proved successful in that study, selection of the score threshold requires manual examination by experienced researchers and cannot be generalized and automated. Other problems introduced by adding variations to sequence databases include (1) efficient storage of variation information in the database, (2) compatibility of the database with different search engines, and (3) interpretability of reports that include both variant and wild-type peptides. In this paper, we present an integrated workflow to address the above problems. First, we created a variation-containing protein sequence database based on the CanProVar database. Next, we developed a workflow for identifying both wild-type and variant peptides simultaneously from shotgun proteomics data. We used data sets from colorectal cancer cell lines and human patient samples to demonstrate our workflow. Identified variants were validated through genomic sequencing. Moreover, we tested the compatibility of the workflow with popular search engines including MyriMatch (26. Tabb D.L. Fernando C.G. Chambers M.C. MyriMatch: highly accurate tandem mass spectral peptide identification by multivariate hypergeometric analysis. J. Proteome Res. 2007; 6: 654-661), Sequest (27. Eng J.K. McCormack A.L. Yates 3rd, J.R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994; 5: 976-989), Mascot (28. Perkins D.N. Pappin D.J. Creasy D.M. Cottrell J.S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999; 20: 3551-3567), and X!Tandem (13. Craig R. Beavis R.C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics.
2004; 20: 1466-1467). A postprocessing tool was also developed to generate easily interpretable reports based on the output from different search engines. Finally, we benchmarked our workflow against the exhaustive search-based methods.

The human proteomics datasets from colorectal adenocarcinoma cell lines (RKO, SW480, and HCT-116) and three colorectal tumor specimens were generated in the Ayers Institute at Vanderbilt. The cell lines were obtained from American Type Culture Collection (ATCC, Manassas, VA) and grown and harvested within 6 months of date of purchase, or grown from frozen stocks that had been made within 6 months of original purchase. They were grown at 37 °C with 5% CO2 in medium supplemented with 10% fetal bovine serum, penicillin, and streptomycin. SW480 was grown in RPMI 1640 medium, whereas HCT-116 and RKO were grown in McCoy's 5A medium. Cells were grown to 80% confluency, the growth medium was aspirated, and cells were washed once in 1× phosphate-buffered saline and collected in 1× phosphate-buffered saline. Cells were centrifuged at 300 × g for 5 min and the supernatant was discarded. Cell pellets were stored at −80 °C until cell lysis could be carried out. Biological replicates were harvested ∼1 week apart from the identical cell culture. These replicates were processed separately and independently through the complete analysis procedure. Colorectal tumor specimens were obtained from the Vanderbilt colorectal cancer repository under an IRB-approved protocol that included informed consent from the patients. We obtained three Stage III sigmoid carcinoma specimens based on availability of the biological material; each was confirmed by a certified pathologist (Dr M.K. Washington) to contain more than 70% tumor cells. A total of 60 μm thickness for each of the frozen specimens was sectioned and collected into microcentrifuge tubes. Mass spectrometry methods have been described in detail (29. Slebos R.J.
Brock J.W. Winters N.F. Stuart S.R. Martinez M.A. Li M. Chambers M.C. Zimmerman L.J. Ham A.J. Tabb D.L. Liebler D.C. Evaluation of strong cation exchange versus isoelectric focusing of peptides for multidimensional liquid chromatography-tandem mass spectrometry. J. Proteome Res. 2008; 7: 5286-5294; 30. Sprung Jr., R.W. Brock J.W. Tanksley J.P. Li M. Washington M.K. Slebos R.J. Liebler D.C. Equivalence of protein inventories obtained from formalin-fixed paraffin-embedded and frozen tissue in multidimensional liquid chromatography-tandem mass spectrometry shotgun proteomic analysis. Mol. Cell Proteomics. 2009; 8: 1988-1998). In summary, proteins from cell line or tissue samples were reduced, alkylated with iodoacetamide, and digested with trypsin. The resulting peptides were separated on isoelectric focusing strips that were cut into 15 (for cell lines) or 20 (for human tissues) separate fractions. Each of these fractions was analyzed by a second separation on a liquid chromatography column, followed by MS/MS analysis on an LTQ-Orbitrap. Binary spectral data present in the raw files were converted to the mzML format using the msConvert tool in the ProteoWizard library (v2.0.1757, 01/27/2010) (31. Kessner D. Chambers M. Burke R. Agus D. Mallick P. ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics. 2008; 24: 2534-2536).

Protein variation data were downloaded from the CanProVar database on 9/26/2009, which included 41,541 nonsynonymous SNPs in 30,322 proteins and 8570 cancer-related variations in 2921 proteins (23. Li J. Duncan D.T. Zhang B. CanProVar: a human cancer proteome variation database. Hum. Mutat. 2010; 31: 219-228).
A corresponding normal protein database was downloaded from Ensembl (human, v53) at ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/pep/. We tested our workflow against four popular database search engines, including MyriMatch (26. Tabb D.L. Fernando C.G. Chambers M.C. MyriMatch: highly accurate tandem mass spectral peptide identification by multivariate hypergeometric analysis. J. Proteome Res. 2007; 6: 654-661) (v1.5.6), Sequest (27. Eng J.K. McCormack A.L. Yates 3rd, J.R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994; 5: 976-989) (TurboSEQUEST v27), Mascot (28. Perkins D.N. Pappin D.J. Creasy D.M. Cottrell J.S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999; 20: 3551-3567) (v2.2.04), and X!Tandem (13. Craig R. Beavis R.C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics. 2004; 20: 1466-1467) (X!TANDEM TORNADO v2008.02.01.3). MyriMatch was used as the primary search engine in this study. All cysteines were assumed to be carbamidomethylated, and methionines were allowed to be oxidized. A precursor error of up to 0.007 m/z was permitted, whereas fragment ions were required to fall within 0.5 m/z of their expected locations. Ambiguous identifications that mapped to three or more peptide sequences with equal scores were excluded. One missed cleavage was permitted and no nonspecific cleavage was allowed. The configurations for all search engines are provided in supplemental File S1.

Genomic DNA from cell lines RKO, SW480, and HCT-116 was isolated using a DNeasy® kit (Qiagen).
After identification of putative variant peptides by shotgun proteomics, the corresponding exons encoding the protein sequences were amplified using a HotStarTaq® Master Mix Kit (Qiagen). The following polymerase chain reaction (PCR) conditions were used: 96 °C × 15 min, followed by 40 cycles of 95 °C × 30 s, 60 °C × 30 s, 72 °C × 60 s, and a final extension of 72 °C × 10 min. A list of all the primers used for the PCR amplifications is provided in supplemental File S2. Excess primers and nucleotides were digested using ExoSAP (USB). Sequencing reactions were performed using Applied Biosystems Version 3.1 Big Dye Terminator chemistry and then analyzed on an Applied Biosystems 3730XL Sequencer. All sequence chromatograms were read in both forward (F) and reverse (R) directions.

As illustrated in Fig. 1, our workflow for identifying wild-type and variant peptides based on shotgun proteomics data includes three steps: database creation, peptide identification, and post-processing. The variation-containing protein sequence database was created based on the Ensembl protein database (human, v53) and the CanProVar database (23. Li J. Duncan D.T. Zhang B. CanProVar: a human cancer proteome variation database. Hum. Mutat. 2010; 31: 219-228). Missense variations, nonsense variations, and single amino acid deletions and insertions were included in the database. Following the naming convention in dbSNP, each cancer-related variation in CanProVar was given an identifier prefixed with “cs.” For each single amino acid alteration, the sequence covering the enclosing tryptic peptide and the two flanking tryptic peptides was taken as an independent entry in the FASTA format. Peptide entries with fewer than four residues were excluded because they cannot be confidently identified in shotgun proteomics. Adding the flanking peptides allows for the identification of peptides with missed enzyme cleavage (15. Schandorff S. Olsen J.V. Bunkenborg J. Blagoev B.
Zhang Y. Andersen J.S. Mann M. A mass spectrometry-friendly database for cSNP identification. Nat. Methods. 2007; 4: 465-466). This database construction approach shares the same space-saving advantage as appending sequence variants to the original protein sequence, which was adopted in the study of Schandorff et al. (15. Schandorff S. Olsen J.V. Bunkenborg J. Blagoev B. Zhang Y. Andersen J.S. Mann M. A mass spectrometry-friendly database for cSNP identification. Nat. Methods. 2007; 4: 465-466). We chose to keep these peptides as independent entries because related variation information can be easily recorded in the seq
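
The entry construction described above (the variant-containing tryptic peptide plus its two flanking tryptic peptides, written as an independent FASTA entry) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It rests on several assumptions that the text does not state: trypsin is modeled as cleaving after K or R unless the next residue is P, only missense substitutions are handled, the digest is performed on the mutated sequence, the length filter is applied to the whole entry, and the function names, identifiers, and FASTA header layout are hypothetical.

```python
import re
from typing import List, Optional, Tuple


def tryptic_spans(seq: str) -> List[Tuple[int, int]]:
    """Half-open (start, end) spans of fully tryptic peptides in seq.

    Assumed cleavage rule: cut after K or R unless the next residue is P.
    """
    cuts = [m.end() for m in re.finditer(r"[KR](?!P)", seq)]
    bounds = [0] + [c for c in cuts if c < len(seq)] + [len(seq)]
    return list(zip(bounds[:-1], bounds[1:]))


def variant_entry(protein: str, pos: int, ref: str, alt: str,
                  min_len: int = 4) -> Optional[str]:
    """Sequence for one missense variant: enclosing peptide plus two flanks.

    pos is 1-based. Returns None when the entry is shorter than min_len
    (assumed reading of the "fewer than four residues" exclusion).
    """
    if protein[pos - 1] != ref:
        raise ValueError(f"reference residue mismatch at position {pos}")
    mutated = protein[:pos - 1] + alt + protein[pos:]
    # Digest the mutated sequence so that substitutions creating or
    # removing a K/R cleavage site are reflected in the peptide borders.
    spans = tryptic_spans(mutated)
    idx = next(i for i, (s, e) in enumerate(spans) if s <= pos - 1 < e)
    start = spans[max(idx - 1, 0)][0]             # left flank, if any
    end = spans[min(idx + 1, len(spans) - 1)][1]  # right flank, if any
    entry = mutated[start:end]
    return entry if len(entry) >= min_len else None


def fasta_record(protein_id: str, variant_id: str, pos: int,
                 ref: str, alt: str, entry: str) -> str:
    """Format one entry; this header layout is purely illustrative."""
    return f">{protein_id}|{variant_id}|{ref}{pos}{alt}\n{entry}\n"


# Toy protein and identifiers, invented for illustration only.
toy_protein = "MAAAKDDDDRGGGGKLLLLRPPPPK"
entry = variant_entry(toy_protein, 12, "G", "V")
if entry is not None:
    print(fasta_record("PROTEIN_EXAMPLE", "cs_example", 12, "G", "V", entry), end="")
```

For the toy protein, the G12V substitution falls in the third tryptic peptide, so the entry spans the second through fourth peptides. Including both flanks means that a search allowing one missed cleavage can still match a peptide that bridges the variant peptide and either neighbor.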

Reference(s)