Article Open access Peer reviewed

Protein Significance Analysis in Selected Reaction Monitoring (SRM) Measurements

2011; Elsevier BV; Volume: 11; Issue: 4; Language: English

10.1074/mcp.m111.014662

ISSN

1535-9484

Authors

Ching-Yun Chang, Paola Picotti, Ruth Hüttenhain, Viola Heinzelmann‐Schwarz, Marko Jovanović, Ruedi Aebersold, Olga Vitek

Topic(s)

Metabolomics and Mass Spectrometry Studies

Abstract

Selected reaction monitoring (SRM) is a targeted mass spectrometry technique that provides sensitive and accurate protein detection and quantification in complex biological mixtures. Statistical and computational tools are essential for the design and analysis of SRM experiments, particularly in studies with large sample throughput. Currently, most such tools focus on the selection of optimized transitions and on processing signals from SRM assays. Little attention is devoted to protein significance analysis, which combines the quantitative measurements for a protein across isotopic labels, peptides, charge states, transitions, samples, and conditions, and detects proteins that change in abundance between conditions while controlling the false discovery rate. We propose a statistical modeling framework for protein significance analysis. It is based on linear mixed-effects models and is applicable to most experimental designs for both isotope label-based and label-free SRM workflows. We illustrate the utility of the framework in two studies: one with a group comparison experimental design and the other with a time course experimental design. We further verify the accuracy of the framework in two controlled data sets, one from the NCI-CPTAC reproducibility investigation and the other from an in-house spike-in study. The proposed framework is sensitive and specific, produces accurate results in broad experimental circumstances, and helps to optimally design future SRM experiments. The statistical framework is implemented in an open-source R-based software package SRMstats, and can be used by researchers with a limited statistics background as a stand-alone tool or in integration with the existing computational pipelines.

Selected reaction monitoring (SRM) is a mass spectrometry technique that can accurately and reproducibly quantify proteins in complex biological mixtures (1 Kiyonami R. Schoen A. Prakash A. Peterman S. Zabrouskov V. Picotti P. Aebersold R. Huhmer A. Domon B.
Increased selectivity, analytical precision, and throughput in targeted proteomics. Mol. Cell. Proteomics. 2011; 10 (002931): M110; 2 Lange V. Picotti P. Domon B. Aebersold R. Selected reaction monitoring for quantitative proteomics: a tutorial. Mol. Syst. Biol. 2008; 4: 222-235; 3 Picotti P. Bodenmiller B. Mueller L.N. Domon B. Aebersold R. Full dynamic range proteome analysis of S. cerevisiae by targeted proteomics. Cell. 2009; 138: 795-806). It can cover nearly the complete dynamic range of abundance of a cellular proteome, with a lower boundary of detection below 50 copies per cell for single-celled organisms (3). Considerable efforts are currently invested in developing high-throughput SRM assays, even for whole proteomes (4 Ahrens C.H. Brunner E. Qeli E. Basler K. Aebersold R. Generating and navigating proteome maps using mass spectrometry. Nat. Rev. Mol. Cell Biol. 2010; 27: 789-801; 5 Picotti P. Rinner O. Stallmach R. Dautel F. Farrah T. Domon B. Wenschuh H. Aebersold R. High-throughput generation of selected reaction-monitoring assays for proteins and proteomes. Nat. Methods. 2010; 10: 43-46). These assays are then used to simultaneously quantify hundreds of proteins with a high degree of reproducibility across multiple samples, and as a result the assays are increasingly used in systems biology and in clinical investigations (3; 6 Cima I. Schiess R. Wild P. Kaelin M. Schüffler P. Lange V. Picotti P. Ossola R. Templeton A. Schubert O. Fuchs T. Leippold T. Wyler S. Zehetner J. Jochum W. Buhmann J. Cerny T. Moch H. Gillessen S. Aebersold R. Krek W. Cancer genetics-guided discovery of serum biomarker signatures for diagnosis and prognosis of prostate cancer. Proc. Natl. Acad. Sci. U.S.A. 2011; 108: 3342-3347; 7 Whiteaker J.R. Lin C. Kennedy J. Hou L. Trute M. Sokal I. Yan P. Schoenherr R.M. Zhao L. Voytovich U.J. Kelly-Spratt K.S. Krasnoselsky A. Gafken P.R. Hogan J.M. Jones L.A. Wang P. Amon L. Chodosh L.A. Nelson P.S. McIntosh M.W. Kemp C.J. Paulovich A.G. A targeted proteomics-based pipeline for verification of biomarkers in plasma. Nat. Biotechnol. 2011; 29: 625-634; 8 Wolf-Yadlin A. Hautaniemi S. Lauffenburger D.A. White F.M. Multiple reaction monitoring for robust quantitative proteomic analysis of cellular signaling networks. Proc. Natl. Acad. Sci. U.S.A. 2007; 104: 5860-5865).

SRM experiments quantify a priori known protein species. They require knowledge of the peptides that are unique to the target proteins and can be observed by a mass spectrometer (9 Kuster B. Schirle M. Mallick P. Aebersold R. Scoring proteomes with proteotypic peptide probes. Nat. Rev. Mol. Cell Biol. 2005; 6: 577-583; 10 Mallick P. Schirle M. Chen S.S. Flory M.R. Lee H. Martin D. Ranish J. Raught B. Schmitt R. Werner T. Kuster B. Aebersold R. Computational prediction of proteotypic peptides for quantitative proteomics. Nat. Biotechnol.
2007; 25: 125-131), and of the mass spectrometric characteristics of these peptides such as fragment ion mass, signal intensity distribution, and optimal collision energy (2). Enzymatically digested proteins are subjected to liquid chromatography separation and are monitored in a triple quadrupole mass spectrometer, and the ion signals for an a priori selected set of fragment ions are recorded over chromatographic time (11 Domon B. Aebersold R. Mass spectrometry and protein analysis. Science. 2006; 312: 212-217; 12 Domon B. Aebersold R. Options and considerations when selecting a quantitative proteomics strategy. Nat. Biotechnol. 2010; 28: 710-721). The intensities of precursor/fragment ion pairs of a peptide, called transitions, are then used as measurements of protein abundance. Label-based workflows further enhance the accuracy of the quantification by spiking an isotopically labeled reference version of each target peptide into the samples and then comparing the relative intensities of the endogenous and the reference transitions.

Many SRM experiments aim at class comparison, i.e. comparing protein abundance across conditions or time points of interest. Statistical and computational tools are essential for this task. In addition to increasing sample throughput, the tools allow us to conduct and interpret the experiment in an objective and reproducible fashion. A typical SRM bioinformatics analysis workflow is outlined in Fig. 1A. It includes assay development, signal processing, and protein significance analysis. Currently, most available bioinformatics tools focus on the first two steps in Fig. 1A.
A variety of stand-alone software packages such as MRMmaid, MRM worksheet and others reviewed in (13 Cham J.A. Bianco L. Bessant C. Free computational resources for designing selected reaction monitoring transitions. Proteomics. 2010; 10: 1106-1126) help to automatically select transition lists, schedule retention times, optimize collision energy, etc. Other stand-alone tools, e.g. MultiQuant (Applied Biosystems/MDS Sciex) and Pinpoint (Thermo Scientific), quantify and visualize the acquired signals, and AuDIT (14 Abbatiello S.E. Mani D.R. Keshishian H. Carr S.A. Automated detection of inaccurate and imprecise transitions in peptide quantification by multiple reaction monitoring mass spectrometry. Clin. Chem. 2010; 56: 291-305) and mProphet (15 Reiter L. Rinner O. Picotti P. Hüttenhain R. Beck M. Brusniak M.-Y. Hengartner M.O. Aebersold R. mProphet: automated data processing and statistical validation for large scale SRM experiments. Nat. Methods. 2011; 8: 430-435) detect and filter out poor quality transitions. Alternatively, comprehensive pipelines such as Skyline (16 MacLean B. Tomazela D.M. Shulman N. Chambers M. Finney G.L. Frewen B. Kern R. Tabb D.L. Liebler D.C. MacCoss M.J. Skyline: an open source document editor for creating and analyzing targeted proteomics experiments. Bioinformatics. 2010; 26: 966-968) and ATAQS (17 Brusniak M.-Y. K. Kwok S.-T. Christiansen M. Campbell D. Reiter L. Picotti P. Kusebauch U. Ramos H. Deutsch E.W. Chen J. Moritz R.L. Aebersold R. Ataqs: A computational software tool for high throughput transition optimization and validation for selected reaction monitoring mass spectrometry. BMC Bioinformatics. 2011; 12: 1-15) integrate the transition design and the signal processing steps in user-friendly workflows.
Therefore, the upstream setup of SRM measurements and the analysis of the raw primary data to quantify the targeted peptides are well supported by the available software tools. At the same time, there is no generally accepted approach to significance analysis of the identified and quantified proteins. We define protein significance analysis as a procedure that appropriately combines the quantitative measurements for a targeted protein across isotopic labels, peptides, charge states, transitions, samples, and conditions, and detects proteins that change in abundance between conditions more systematically than expected by random chance, while controlling the false discovery rate (18 Benjamini Y. Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Statistical Soc. 1995; 57: 289-300). Significance analysis is challenging, in part, because of the natural biological variation of protein abundance and the experimental variation in sample handling (19 Addona T.A. Abbatiello S.E. Schilling B. Skates S.J. Mani D.R. Bunk D.M. Spiegelman C.H. Zimmerman L.J. Ham A.L. Keshishian H. Hall S.C. Allen S. Blackman R.K. Borchers C.H. Buck C. Cardasis H.L. Cusack M.P. Dodder N.G. Gibson B.W. Held J.M. Hiltke T. Jackson A. Johansen E.B. Kinsinger C.R. Li J. Mesri M. Neubert T.A. Niles R.K. Pulsipher T.C. Ransohoff D.F. Rodriguez H. Rudnick P.A. Smith D. Tabb D.L. Tegeler T.J. Variyath A.M. Vega-Montoto L.J. Wahlander A. Waldemarson S. Wang M. Whiteaker J.R. Zhao L. Anderson N.L. Fisher S.J. Liebler D.C. Paulovich A.G. Regnier F.E. Tempst P. Carr S.A. Multi-site assessment of the precision and reproducibility of multiple reaction monitoring-based measurements of proteins in plasma. Nat. Biotechnol. 2009; 27: 633-641). Signal processing also contributes to the variation when individual transitions are missed, misidentified or imprecisely quantified (14). Statistical inference allows us to draw objective conclusions in such situations; however, the development of statistical methods for protein significance analysis in SRM experiments has not yet received sufficient attention.

Many investigations perform significance analysis using simple statistical methods such as the two-sample t test, which compares the abundance of all the transitions from one condition to another. Such tests take as input intensities of the individual transitions of the protein (or their averages within a run) in label-free experiments, or ratios of the endogenous and reference transitions in label-based experiments. In this manuscript we argue that more accurate conclusions can be obtained by more detailed probabilistic modeling.

We propose a general and flexible statistical framework for SRM experiments, which is schematically illustrated in Fig. 1B. The framework consists of the following components: (a) definition of the biological populations of interest and of the desired scope of conclusions, (b) exploratory data analysis to control the quality of MS runs, (c) joint representation of the quantitative measurements of the protein using a flexible linear mixed-effects model, and model-based determination of proteins that change in abundance from one condition to another, and (d) statistical design of future follow-up experiments. The proposed framework is implemented in an open source R-based software package SRMstats, and can be used stand-alone or as a module integrated with comprehensive pipelines such as Skyline and ATAQS. It is implemented for both label-free and label-based SRM workflows.

We evaluated the statistical framework using two controlled spike-in experimental data sets.
The first was generated by a multilaboratory investigation of the Clinical Proteomic Technology Assessment for Cancer network of the National Cancer Institute (NCI-CPTAC), described in detail in (19). The original goal of this investigation was to assess reproducibility, recovery, linear dynamic range, and limits of detection and quantification of SRM assays. Here we used the data sets to evaluate the ability of significance analysis to detect known changes in protein concentration. Briefly, the investigation targeted seven proteins, each represented by up to three peptides and each peptide by up to three transitions. The proteins were spiked into a complex background in a series of nine concentrations, and heavy isotope-labeled synthetic peptides (20 Gerber S.A. Rush J. Stemman O. Kirschner M.W. Gygi S.P. Absolute quantification of proteins and phosphoproteins from cell lysates by tandem MS. Proc. Natl. Acad. Sci. U.S.A. 2003; 100: 6940-6945) were used as internal standards. The spike-in samples were prepared according to three mixing and digestion protocols (called Study I, II, and III).
Study III was subject to the sources of technical variation closest to those of a real-life experiment. The samples were shipped to eight participating sites, which independently performed SRM analyses to detect and quantify the transitions. We took as input the tabulated transition abundances as quantified by each participating site, and evaluated the ability of significance analysis to detect known fold changes, separately for each study and site.

The second controlled data set was an in-house spike-in experiment (supplemental Material Sec. 1.1). To evaluate the sensitivity of significance analysis, six proteins were spiked into a complex background in varying concentrations according to a Latin Square design (21 Montgomery D. Design and analysis of experiments. John Wiley & Sons, Inc., Hoboken, NJ. 2005). The design is advantageous in that it allows us to evaluate the ability to detect a range of fold changes over a series of proteins, baseline concentrations, and replicate runs, while limiting the number of mixtures. To evaluate the specificity of the significance analysis, six additional proteins were spiked into the background in constant concentrations. Each protein was represented by two peptides and each peptide by up to three transitions. Heavy isotope-labeled synthetic peptides (20) were used as references, and each mixture was profiled in two mass spectrometry runs. We evaluated the ability of significance analysis to detect known fold changes, as well as the extent of false positive changes among proteins spiked in constant concentrations.

We tested the statistical framework using example data sets from two biological investigations.
The first studied plasma samples from six patients with epithelial ovarian cancer and ten healthy controls in a group comparison experimental design (supplemental Material Sec. 1.2). Each specimen was analyzed in a single mass spectrometry run. Fourteen N-glycosylated proteins were selected based on prior evidence of differential abundance from the literature and profiled as potential diagnostic ovarian cancer biomarkers. Heavy isotope-labeled reference peptides (20) were spiked into the samples as internal references, and up to three transitions were measured for each endogenous and heavy-labeled peptide.

The second example is the study of central carbon metabolism of Saccharomyces cerevisiae described in detail in (3) (supplemental Material Sec. 1.3). The experiment targeted 45 proteins in the glycolysis/gluconeogenesis/TCA cycle/glyoxylate cycle network in yeast, which spans the range of protein abundance from less than 128 to 10E6 copies per cell. Unlike the previous example, this study had a time course experimental design. Three biological replicates were analyzed at ten time points (T1-T10), while the cells transitioned through exponential growth in a glucose-rich medium (T1-T4), diauxic shift (T5-T6), post-diauxic phase (T7-T9), and stationary phase (T10). Prior to trypsinization, the samples were mixed with an equal amount of proteins from a common 15N-labeled yeast sample, which was used as a reference.
Each sample was profiled in a single mass spectrometry run, where each protein was represented by up to two peptides and each peptide by up to three transitions. Transcriptional activity under the same experimental conditions has been previously investigated in (22 DeRisi J.L. Iyer V.R. Brown P.O. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science. 1997; 278: 680-686). Genes coding for 29 out of the 45 targeted proteins were found differentially expressed between conditions similar to those represented by T7 and T1 in this proteomic study and are used in this manuscript for external validation.

Additional simulated data sets were generated to demonstrate the performance of the proposed significance analysis in a variety of circumstances, such as in the presence of missing transitions, relatively large between-run variation, and the presence/absence of technical replication of the same biological sample. Pseudocode of the basic algorithm used to generate these synthetic data sets is provided in supplemental Material Sec. 4.1.

In the following sections we detail each step of the proposed statistical framework for protein significance analysis in SRM measurements (Fig. 1B). To detect differences in protein abundance with high sensitivity and specificity, the key step is to select an appropriate statistical model, i.e. STEP 3 (model-based analysis) of the proposed framework. Prior to the statistical modeling, it is necessary to examine the properties of the data as demonstrated in STEP 1 (problem statement) and STEP 2 (exploratory data analysis). Finally, STEP 4 uses the model and the data to plan the design of future experiments.

The important first step in designing and analyzing an SRM experiment is to determine the comparisons of interest, the experimental design, and the scope of the conclusions of the study, in order to select an appropriate statistical model for the data.
In this manuscript we refer to this step as problem statement. First, we define the comparisons of interest, i.e. the conditions that will be compared in the study. For example, in the ovarian cancer study the only comparison of interest was between the mean protein abundance of subjects with the disease and the controls; in the yeast metabolism study researchers could compare the mean protein abundances in many pairs of time points. In the following we focus on comparing the time points 1 and 7.

Next, we distinguish between group comparison and time course experimental designs. Group comparison studies acquire measurements on distinct individuals in each condition. For example, the ovarian cancer study had a group comparison design. In contrast, time course studies acquire measurements on each biological source repeatedly at several conditions or time points. The repeated nature of the experiment allows us to profile changes in protein abundance for each individual over time. The yeast metabolism study had the time course design. Different statistical models will be required for these two experimental designs, and we will discuss this in the following section.

An important aspect of the problem statement is the specification of the desired scope of validity of the conclusions, i.e. the extent to which the conclusions from the analysis will be valid, with respect to the biological and technical MS run variation. As an illustration, Fig. 3A shows a hypothetical protein measured with two subjects per condition (Healthy and Disease) and three endogenous transitions per subject in a scenario with large biological variation. The black curves represent the distribution of protein abundances in the populations of subjects that are of interest to the investigation. The solid dots are the protein abundances of the subjects selected in the study. The black circles are the endogenous transitions measured for this protein and this subject.
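
As a rough illustration of why the group comparison and time course designs call for different models, the sketch below contrasts a two-sample (Welch) t statistic, which treats the measurements in each condition as coming from distinct individuals, with a paired t statistic, which uses the repeated measurements on each biological source. The log2 abundances, the function names `welch_t` and `paired_t`, and the reduction of each run to a single value per protein are assumptions made for illustration only; this is not the linear mixed-effects model of the proposed framework.

```python
import math
from statistics import mean, stdev


def welch_t(group_a, group_b):
    """Two-sample (Welch) t statistic: distinct subjects in each condition."""
    var_a = stdev(group_a) ** 2 / len(group_a)
    var_b = stdev(group_b) ** 2 / len(group_b)
    return (mean(group_b) - mean(group_a)) / math.sqrt(var_a + var_b)


def paired_t(first, second):
    """Paired t statistic: the same biological sources measured twice."""
    diffs = [s - f for f, s in zip(first, second)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical per-run log2 abundances of one protein for three
# biological replicates at two time points (made-up numbers).
t1 = [10.0, 12.0, 14.0]
t7 = [11.0, 13.1, 14.9]

t_unpaired = welch_t(t1, t7)  # ignores which replicate each value came from
t_paired = paired_t(t1, t7)   # uses the within-replicate differences

print(f"two-sample t = {t_unpaired:.2f}, paired t = {t_paired:.2f}")
```

With these made-up values the replicates differ from each other far more than each replicate changes between the two time points, so the paired statistic is much larger than the two-sample one. Analyzing a time course experiment as if it were a group comparison would discard that information.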

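The false discovery rate control that is part of the definition of protein significance analysis can be sketched with the Benjamini-Hochberg step-up adjustment (18). The function name `benjamini_hochberg`, the example p-values, and the 0.05 cutoff are assumptions for illustration; the source does not show how SRMstats computes its adjusted p-values.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical per-protein p-values from one comparison (made-up numbers).
p = [0.001, 0.04, 0.03, 0.5]
adj = benjamini_hochberg(p)

# Proteins called significant at an assumed FDR cutoff of 0.05.
significant = [q <= 0.05 for q in adj]
print(adj, significant)
```

In this example only the first protein passes the cutoff: the proteins with raw p-values of 0.03 and 0.04 both receive an adjusted value of about 0.053.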
Reference(s)