Peptide-level Robust Ridge Regression Improves Estimation, Sensitivity, and Specificity in Data-dependent Quantitative Label-free Shotgun Proteomics
2015; Elsevier BV; Volume: 15; Issue: 2; Language: English
10.1074/mcp.m115.055897
ISSN 1535-9484
Authors: Ludger J.E. Goeminne, Kris Gevaert, Lieven Clement
Topic(s): Biosensors and Analytical Detection
Abstract: Peptide intensities from mass spectra are increasingly used for relative quantitation of proteins in complex samples. However, numerous issues inherent to the mass spectrometry workflow turn quantitative proteomic data analysis into a crucial challenge. We and others have shown that modeling at the peptide level outperforms classical summarization-based approaches, which typically also discard many proteins at the data preprocessing step. Peptide-based linear regression models, however, still suffer from unbalanced datasets due to missing peptide intensities, outlying peptide intensities and overfitting. Here, we further improve upon peptide-based models by three modular extensions: ridge regression, improved variance estimation by borrowing information across proteins with empirical Bayes and M-estimation with Huber weights. We illustrate our method on the CPTAC spike-in study and on a study comparing wild-type and ArgP knock-out Francisella tularensis proteomes. We show that the fold change estimates of our robust approach are more precise and more accurate than those from state-of-the-art summarization-based methods and peptide-based regression models, which leads to an improved sensitivity and specificity. We also demonstrate that ionization competition effects already come into play at very low spike-in concentrations and confirm that analyses with peptide-based regression methods on peptide intensity values aggregated by charge state and modification status (e.g. MaxQuant's peptides.txt file) are slightly superior to analyses on raw peptide intensity values (e.g. MaxQuant's evidence.txt file).

High-throughput LC-MS-based proteomic workflows are widely used to quantify differential protein abundance between samples. Relative protein quantification can be achieved by stable isotope labeling workflows such as metabolic (1. Oda Y. Huang K. Cross F.R. Cowburn D. Chait B.T. Accurate quantitation of protein expression and site-specific phosphorylation. Proc. Natl. Acad. Sci. U.S.A. 1999; 96: 6591-6596; 2. Ong S.-E. Blagoev B. Kratchmarova I. Kristensen D.B. Steen H. Pandey A. Mann M.
Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Mol. Cell. Proteomics. 2002; 1: 376-386) and postmetabolic labeling (3. Gygi S.P. Rist B. Gerber S.A. Turecek F. Gelb M.H. Aebersold R. Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat. Biotech. 1999; 17: 994-999; 4. Hsu J.-L. Huang S.-Y. Chow N.-H. Chen S.-H. Stable-isotope dimethyl labeling for quantitative proteomics. Anal. Chem. 2003; 75: 6843-6852; 5. Ross P.L. Huang Y.N. Marchese J.N. Williamson B. Parker K. Hattan S. Khainovski N. Pillai S. Dey S. Daniels S. Purkayastha S. Juhasz P. Martin S. Bartlet-Jones M. He F. Jacobson A. Pappin D.J. Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents. Mol. Cell. Proteomics. 2004; 3: 1154-1169; 6. Thompson A. Schäfer J. Kuhn K. Kienle S. Schwarz J. Schmidt G. Neumann T. Hamon C. Tandem mass tags: A novel quantification strategy for comparative analysis of complex protein mixtures by MS/MS. Anal. Chem. 2003; 75: 1895-1904). These types of experiments generally avoid run-to-run differences in the measured peptide (and thus protein) content by pooling and analyzing differentially labeled samples in a single run.
Abbreviations used: LFQ, label-free quantitation; CPTAC, Clinical Proteomic Technology Assessment for Cancer Network; DDA, data-dependent acquisition; DIA, data-independent acquisition; FC, fold change; FDR, false discovery rate; FN, false negatives; FP, false positives; pAUC, partial area under the curve; PPV, positive predictive value; PSM, peptide-to-spectrum match; ROC, receiver operating curve; RR, robust ridge; rpAUC, relative partial area under the curve; SILAC, stable isotope labeling of amino acids in cell culture; TN, true negatives; TP, true positives; UPS1, Universal Proteomics Standard 1.

Label-free quantitative (LFQ) workflows are becoming increasingly popular as the often expensive and time-consuming labeling protocols are omitted. Moreover, LFQ proteomics allows for more flexibility in comparing samples and tends to cover a larger area of the proteome at a higher dynamic range (7. Bantscheff M. Schirle M. Sweetman G. Rick J. Kuster B. Quantitative mass spectrometry in proteomics: a critical review. Anal. Bioanal. Chem. 2007; 389: 1017-1031; 8. Patel V.J. Thalassinos K. Slade S.E. Connolly J.B. Crombie A. Murrell J.C. Scrivens J.H. A comparison of labeling and label-free mass spectrometry-based proteomics approaches. J. Proteome Res. 2009; 8: 3752-3759). Nevertheless, the nature of the LFQ protocol makes shotgun proteomic data analysis a challenging task. Missing values are omnipresent in proteomic data generated by data-dependent acquisition workflows, for instance because low-abundant peptides are not always fragmented in complex peptide mixtures and because only a limited number of modifications and mutations can be accounted for in the feature search. Moreover, the overall abundance of a peptide is determined by the surroundings of its corresponding cleavage sites as these influence protease cleavage efficiency (9. Rodriguez J. Gupta N. Smith R.D. Pevzner P.A. Does trypsin cut before proline? J. Proteome Res.
2008; 7: 300-305). Similarly, some peptides are more easily ionized than others (10. Abaye D.A. Pullen F.S. Nielsen B.V. Peptide polarity and the position of arginine as sources of selectivity during positive electrospray ionisation mass spectrometry. Rapid Commun. Mass Spectrom. 2011; 25: 3597-3608). These issues not only lead to missing peptides, but also increase variability in individual peptide intensities. The discrete nature of MS1 sampling following continuous elution of peptides from the LC column leads to increased variability in peptide quantifications. Finally, competition for ionization and co-elution of other peptides with similar m/z values may cause biased quantifications (11. Schliekelman P. Liu S. Quantifying the effect of competition for detection between coeluting peptides on detection probabilities in mass-spectrometry-based proteomics. J. Proteome Res. 2013; 13: 348-361). Note, however, that with data-independent acquisition (DIA), all peptide ions (or all peptide ions within a certain m/z range, depending on the method used) are fragmented simultaneously, resulting in multiplexed MS/MS spectra (12. Venable J.D. Dong M.-Q. Wohlschlegel J. Dillin A. Yates J.R. Automated approach for quantitative analysis of complex peptide mixtures from tandem mass spectra. Nat. Meth. 2004; 1: 39-45; 13. Gillet L.C. Navarro P. Tate S. Röst H. Selevsek N. Reiter L. Bonner R. Aebersold R. Targeted data extraction of the MS/MS spectra generated by data-independent acquisition: A new concept for consistent and accurate proteome analysis. Mol. Cell. Proteomics. 2012; 11: 1-17).
Hence, missing fragment spectra are less of a problem with DIA; however, some of its challenges lie in deconvoluting MS/MS spectra and mapping their features to their corresponding peptides (14. Bilbao A. Varesio E. Luban J. Strambio-De-Castillia C. Hopfgartner G. Müller M. Lisacek F. Processing strategies and software solutions for data-independent acquisition in mass spectrometry. Proteomics. 2015; 15: 964-980).

Standard data analysis pipelines for DDA-LFQ proteomics can be divided into two groups: spectral counting techniques, which are based on counting the number of peptide features as a proxy for protein abundance (15. Liu H. Sadygov R.G. Yates J.R. A Model for Random Sampling and Estimation of Relative Protein Abundance in Shotgun Proteomics. Anal. Chem. 2004; 76: 4193-4201), and intensity-based methods that quantify peptide features by measuring their corresponding spectral intensities or areas under the peaks in either MS or MS/MS spectra. Spectral counting is intuitive and easy to perform, but the determination of differences in peptide and thus protein levels is not as precise as with intensity-based methods, especially when analyzing rather small differences (16. Old W.M. Meyer-Arendt K. Aveline-Wolf L. Pierce K.G. Mendoza A. Sevinsky J.R. Resing K.A. Ahn N.G. Comparison of label-free methods for quantifying human proteins by shotgun proteomics. Mol. Cell. Proteomics.
2005; 4: 1487-1502). More fundamentally, spectral counting ignores a large part of the information that is available in high-precision mass spectra. Further, dynamic exclusion during LC-MS/MS analysis, meant to increase the overall number of peptides that are analyzed, can worsen the linear dynamic range of these methods (17. Bantscheff M. Lemeer S. Savitski M. Kuster B. Quantitative mass spectrometry in proteomics: critical review update from 2007 to the present. Anal. Bioanal. Chem. 2012; 404: 939-965). Also, any changes in the MS/MS sampling conditions will prevent comparisons between runs. Intensity-based methods are more sensitive than spectral counting (18. Milac T.I. Randolph T.W. Wang P. Analyzing LC-MS/MS data by spectral count and ion abundance: two case studies. Statistics Interface. 2012; 5: 75-87). Among intensity-based methods, quantification on the MS-level is somewhat more accurate than summarizing the MS/MS-level feature intensities (19. Krey J.F. Wilmarth P.A. Shin J.-B. Klimek J. Sherman N.E. Jeffery E.D. Choi D. David L.L. Barr-Gillespie P.G. Accurate label-free protein quantitation with high- and low-resolution mass spectrometers. J. Proteome Res. 2014; 13: 1034-1044). Therefore, we further focus on improving data analysis methods for MS-level quantification. Typical intensity-based workflows summarize peptide intensities to protein intensities before assessing differences in protein abundances (20. Zhang Y. Fonslow B.R. Shan B. Baek M.-C. Yates J.R. Protein analysis by shotgun/bottom-up proteomics. Chem. Rev. 2013; 113: 2343-2394).
Peptide-based linear regression models estimate protein fold changes directly from peptide intensities and outperform summarization-based methods by reducing bias and generating more correct precision estimates (21. Goeminne L.J.E. Argentini A. Martens L. Clement L. Summarization vs peptide-based models in label-free quantitative proteomics: Performance, pitfalls, and data analysis guidelines. J. Proteome Res. 2015; 14: 2457-2465; 22. Clough T. Key M. Ott I. Ragg S. Schadow G. Vitek O. Protein quantification in label-free LC-MS experiments. J. Proteome Res. 2009; 8: 5275-5284). However, peptide-based linear regression models suffer from overfitting due to extreme observations and the unbalanced nature of proteomics data; i.e. different peptides and a different number of peptides are typically identified in each sample. We illustrate this using the CPTAC spike-in data set where 48 human UPS1 proteins were spiked at five different concentrations in a 60 ng protein/μl yeast lysate. Thus, when comparing different spike-in concentrations, only the human proteins should be flagged as differentially abundant (DA), whereas the yeast proteins should not be flagged as DA (null proteins). Fig. 1 illustrates the structure of missing data in label-free shotgun proteomics experiments using a representative DA UPS1 protein from the CPTAC spike-in study: peptides that are missing in the lowest spike-in condition tend to have rather low log2 intensity values in higher spike-in conditions compared to peptides that were observed in both conditions, which supports the fact that the missing value problem in label-free shotgun proteomic data is largely intensity-dependent (23. Karpievitch Y.V. Dabney A.R. Smith R.D. Normalization and missing value imputation for label-free LC-MS analysis. BMC Bioinformatics. 2012; 13: S5). Fig.
2 shows the quantile normalized log2 intensity values for the peptides corresponding to the yeast null protein CG121 together with average log2 intensity estimates for each condition based on protein-level MaxLFQ intensities, as well as estimates derived from a peptide-based linear model. Here, three important remarks can be made:
(1) CG121 is a yeast background protein, for which the true concentration is thus equal in all conditions, which appears to be monitored as such by MaxLFQ, except in conditions 6B and 6E (for the latter, no estimate is available). The LM estimate, however, is more reliable but seems to suffer from overfitting.
(2) Many shotgun proteomic datasets are very sparse, causing a large sample-to-sample variability. Constructing a linear model based on a limited number of observations will thus lead to unstable variance estimates. Intuitively, a small sample drawn from a given population might "accidentally" show a very small variance while another small sample from the same population might display a very large variance just by random chance. This effect is clear from the sizes of the boxes. The interquartile range is twice as large in condition 6E as in condition 6C. This issue leads to false positives since some proteins with very few observations are flagged as DA with very high statistical evidence solely due to their low observed variance (24. Ting L. Cowley M.J. Hoon S.L. Guilhaus M. Raftery M.J. Cavicchioli R. Normalization and statistical analysis of quantitative proteomics data generated by metabolic labeling. Mol. Cell. Proteomics. 2009; 8: 2227-2242).
(3) Two observed features at log2 intensities 14.0 and 14.3 in condition 6B have a strong influence on the parameter estimate for this condition. Without these extreme observations, the 6B estimate lies closer to the estimates in the other conditions.
As missingness is strongly intensity-dependent, these low intensity values could easily become missing values in subsequent experiments. More generally, a strong influence of only one or two peptides on the average protein level intensity estimate for a condition is an unfavorable property.

These issues illustrate that state-of-the-art analysis methods experience difficulties in coping with peptide imbalances that are inherent to DDA LFQ proteomics data. We here propose three modular improvements to deal with the problems of overfitting, sample-to-sample variability and outliers:
(1) Ridge regression, which penalizes the size of the model parameters. Shrinkage estimators can strongly improve reproducibility and overall performance as they have a lower overall mean squared error compared to ordinary least squares estimators (25. Ahmed S.E. Raheem S.M.E. Shrinkage and absolute penalty estimation in linear regression models. Computational Stat. 2012; 4: 541-553; 26. Stein C. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. University of California Press, Berkeley, Calif. 1956: 197-206; 27. Copas J.B. Regression, prediction and shrinkage. J. Roy. Statist. Soc. 1983; 45: 311-354).
(2) Empirical Bayes variance estimation, which shrinks the individual protein variances toward a common prior variance, hence stabilizing the variance estimation.
(3) M-estimation with Huber weights, which will make the estimators more robust toward outliers (28. Huber P.J. Robust estimation of a location parameter. The Annals of Mathematical Statistics. 1964; 35: 73-101).
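The three ingredients listed above can be illustrated on a deliberately tiny case. The sketch below is not the authors' implementation: it reduces the regression to a single parameter (one mean), uses the conventional Huber tuning constant 1.345 with a MAD-based scale, and takes the prior variance and prior degrees of freedom as hand-picked inputs, whereas in practice they would be estimated from all proteins. All function names and data values are hypothetical.

```python
import statistics


def huber_weights(residuals, k=1.345):
    """Return Huber weights: 1 inside the band |r/s| <= k, k/|r/s| outside."""
    med = statistics.median(residuals)
    mad = statistics.median([abs(r - med) for r in residuals])
    # Robust scale from the MAD; the fallback to 1.0 is an assumption of this sketch.
    scale = mad / 0.6745 if mad > 0 else 1.0
    weights = []
    for r in residuals:
        u = abs(r) / scale
        weights.append(1.0 if u <= k else k / u)
    return weights


def robust_ridge_mean(y, lam=0.0, k=1.345, n_iter=50):
    """Minimize sum_i w_i * (y_i - b)**2 + lam * b**2 by iterative reweighting."""
    b = statistics.median(y)  # robust starting value
    for _ in range(n_iter):
        w = huber_weights([yi - b for yi in y], k)
        # Weighted ridge solution for a single parameter: lam shrinks b toward 0.
        b = sum(wi * yi for wi, yi in zip(w, y)) / (sum(w) + lam)
    return b


def squeeze_variance(s2, df, s2_prior, df_prior):
    """Posterior variance: df-weighted average of protein and prior variances."""
    return (df_prior * s2_prior + df * s2) / (df_prior + df)


# Hypothetical log2 differences for one null protein; the last value is an outlier.
log2_fc = [0.1, -0.2, 0.0, 0.2, -0.1, 5.0]
plain_mean = sum(log2_fc) / len(log2_fc)
robust = robust_ridge_mean(log2_fc)
robust_ridge = robust_ridge_mean(log2_fc, lam=5.0)
# A protein with an "accidentally" small variance is pulled toward the prior.
moderated = squeeze_variance(s2=0.01, df=2, s2_prior=0.25, df_prior=4)
```

In this toy setting the outlier dominates the plain mean, the Huber weights down-weight it, the ridge penalty pulls the estimate further toward zero, and the moderated variance lies between the observed and prior values. A full peptide-based model would apply the same ideas to a multi-parameter design matrix.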
We illustrate our method on the CPTAC Study 6 spike-in data and a published ArgP knock-out Francisella tularensis proteomics experiment and show that our method provides more stable log2 FC estimates and a better DA ranking than competing methods.

The publicly available Study 6 of the Clinical Proteomic Technology Assessment for Cancer (29. Paulovich A.G. Billheimer D. Ham A.-J. L. Vega-Montoto L. Rudnick P.A. Tabb D.L. Wang P. Blackman R.K. Bunk D.M. Cardasis H.L. Clauser K.R. Kinsinger C.R. Schilling B. Tegeler T.J. Variyath A.M. Wang M. Whiteaker J.R. Zimmerman L.J. Fenyo D. Carr S.A. Fisher S.J. Gibson B.W. Mesri M. Neubert T.A. Regnier F.E. Rodriguez H. Spiegelman C. Stein S.E. Tempst P. Liebler D.C. Interlaboratory study characterizing a yeast performance standard for benchmarking LC-MS platform performance. Mol. Cell. Proteomics. 2010; 9: 242-254) is used to evaluate the performance of our method. Raw data can be accessed at https://cptac-data-portal.georgetown.edu/cptac/public?scope=Phase+I. In this study, the Sigma Universal Protein Standard mixture 1 (UPS1, Sigma-Aldrich, St. Louis, MO) containing 48 different human proteins was spiked into a 60 ng protein/μl Saccharomyces cerevisiae strain BY4741 (MATa, leu2Δ0, met15Δ0, ura3Δ0, his3Δ1) lysate in five different concentrations (6A: 0.25 fmol UPS1 proteins/μl; 6B: 0.74 fmol UPS1 proteins/μl; 6C: 2.22 fmol UPS1 proteins/μl; 6D: 6.67 fmol UPS1 proteins/μl; and 6E: 20 fmol UPS1 proteins/μl). These samples were sent to five independent laboratories and analyzed on seven different instruments. For convenience, we limited ourselves to the data originating from the LTQ-Orbitrap at site 86, LTQ-Orbitrap O at site 65 and LTQ-Orbitrap W at site 56. Samples were run three times on each instrument. The dataset used thus features five different samples, each analyzed in triplicate on three different instruments.
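Because the spike-in concentrations are known, the log2 fold change that an ideal method would report for the UPS1 proteins in each pairwise comparison follows directly from the concentration ratios, while the yeast background proteins have an expected log2 fold change of zero. A minimal sketch of that arithmetic, using only the concentrations stated above (the function and variable names are illustrative):

```python
import math
from itertools import combinations

# UPS1 spike-in concentrations in fmol/μl, as stated for CPTAC Study 6.
CONCENTRATIONS = {"6A": 0.25, "6B": 0.74, "6C": 2.22, "6D": 6.67, "6E": 20.0}


def expected_log2_fc(conc):
    """Map each pairwise comparison 'high-low' to its expected log2 fold change."""
    return {
        f"{high}-{low}": math.log2(conc[high] / conc[low])
        for low, high in combinations(conc, 2)
    }


expected = expected_log2_fc(CONCENTRATIONS)
for comparison, fc in expected.items():
    print(f"{comparison}: {fc:.2f}")
```

Consecutive conditions differ by roughly a factor of three, so adjacent comparisons have an expected log2 fold change of about 1.6, and the most extreme comparison (6E versus 6A) corresponds to an 80-fold difference.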
Raw data files were searched using MaxQuant version 1.5.2.8 (30. Cox J. Mann M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 2008; 26: 1367-1372) with the following settings. As variable modifications we allowed acetylation (protein N terminus), methionine oxidation (to methionine-sulfoxide) and N-terminal glutamine to pyroglutamate conversion. As a fixed modification, we selected carbamidomethylation on cysteine residues as all samples were treated with iodoacetamide. We used the enzymatic rule of trypsin/P with a maximum of 2 missed cleavages and allowed MaxQuant to perform matching between runs with a match time window of 0.7 min and an alignment time window of 20 min. The main search peptide tolerance was set to 4.5 ppm and the ion trap MS/MS match tolerance was set to 0.5 Da. Peptide-to-spectrum match level was set at 1% FDR with an additional minimal Andromeda score of 40 for modified peptides as these settings are most commonly used by researchers. Protein FDR was set at 1% and estimated by using the reversed search sequences. We performed label-free quantitation with MaxQuant's standard settings. The maximal number of modifications per peptide was set to 5. As a search FASTA file we used the 6718 reviewed proteins present in the Saccharomyces cerevisiae (strain ATCC 204508/S288c) proteome downloaded from UniProt on March 27, 2015, supplemented with the 48 human UPS1 protein sequences. Potential contaminants present in the contaminants.fasta file that comes with MaxQuant were automatically added to the search space by the software. For protein quantification in the proteinGroups.txt file, we used unique and razor peptides and allowed all modifications as all samples originate in essence from the same yeast lysate and the same UPS1 spike-in sample.

The data of Ramond et al. (31. Ramond E. Gesbert G.
Guerrera I.C. Chhuon C. Dupuis M. Rigard M. Henry T. Barel M. Charbit A. Importance of host cell arginine uptake in Francisella phagosomal escape and ribosomal protein amounts. Mol. Cell. Proteomics. 2015; 14: 870-881) is used to illustrate our method on a real biological experiment. Both raw and processed data are publicly available and can be found in the PRIDE repository at http://www.ebi.ac.uk/pride/archive/projects/PXD001584. The authors explored changes in the proteome of the facultative intracellular pathogenic coccobacillus Francisella tularensis after gene deletion of a newly identified arginine transporter, ArgP. Both wild-type and ArgP mutants were grown in biological triplicate. Each biological replicate was analyzed in technical triplicate via label-free LC-MS/MS. Data were processed with MaxQuant version 1.4.1.2 and potential contaminants and reverse sequences were removed. In addition, only proteins present with at least two peptides in at least 9 out of the 18 replicates were retained. Subsequent data analysis via t-tests on imputed LFQ intensities was performed.

This is a standard summarization-based analysis pipeline that is available in the popular MaxQuant-Perseus software package (21. Goeminne L.J.E. Argentini A. Martens L. Clement L. Summarization vs peptide-based models in label-free quantitative proteomics: Performance, pitfalls, and data analysis guidelines. J. Proteome Res. 2015; 14: 2457-2465). Briefly, the MaxQuant ProteinGroups.txt file was loaded into Perseus version 1.5.1.6; potential contaminants that did not correspond to any UPS1 protein as well as reversed sequences and proteins that were only identified by site (thus only by a peptide carrying a modified residue) were removed from the data set. MaxLFQ intensities (32. Cox J. Hein M.Y. Luber C.A. Paron I. Nagaraj N. Mann M.
Accurate proteome-wide label-free quantification by delayed normalization and maximal peptide ratio extraction, termed MaxLFQ. Mol. Cell. Proteomics. 2014; 13: 2513-2526) were log2 transformed and pairwise comparisons between conditions were done via t-tests.

The MaxQuant ProteinGroups.txt file was used as input for R version 3.1.2 (Pumpkin Helmet) (33. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2014). Potential contaminants and reversed sequences (see above) were removed from the data set. The MaxLFQ intensities were log2 transformed and analyzed in limma, an R/Bioconductor package for the analysis of microarray and next-generation sequencing data (34. Ritchie M.E. Phipson B. Wu D. Hu Y. Law C.W. Shi W. Smyth G.K. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015; 43: e47). Limma makes use of posterior variance estimators to stabilize the naive variance estimator by borrowing strength across proteins (see also below).

MaxQuant's peptides.txt file was read into R version 3.1.2, and the peptide intensities were log2 transformed and quantile normalized (35. Amaratunga D. Cabrera J. Analysis of data from viral DNA microchips. J. Am. Statist. Assoc. 2001; 96: 1161-1170; 36. Bolstad B.M. Irizarry R.A. Åstrand M. Speed T.P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003; 19: 185-193). Many other normalization approaches exist; however, comparing them is beyond the scope of this paper (24. Ting L. Cowley M.J. Hoon S.L. Guilhaus M. Raftery M.J. Cavicchioli R.
Normalization and statistical analysis of quantitative proteomics data generated by metabolic labeling. Mol. Cell. Proteomics. 2009; 8: 2227-2242; 36. Bolstad B.M. Irizarry R.A. Åstrand M. Speed T.P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003; 19: 185-193; 37. Callister S.J. Barry R.C. Adkins J.N. Johnson E.T. Qian W.-j. Webb-Robertson B.-J. M. Smith R.D. Lipton M.S. Normalization approaches for removing systematic biases associated with mass spectrometry and label-free proteomics. J. Proteome Res. 2006; 5: 277-286; 38. Rudnick P.A. Wang X. Yan X. Sedransk N. Stein S.E. Improved normalization of systematic biases affecting ion current measurements in label-free proteomics data. Mol. Cell. Proteomics. 2014; 13: 1341-1351). Reversed sequences and potential contaminants were removed from the data. For the CPTAC dataset, we only removed potential contaminants that did not map to any UPS1 protein. MaxQuant assigns proteins to protein groups using an Occam's razor approach. However, to avoid the added complexity of proteins mapped to multiple protein groups, we discarded peptides belonging to protein groups that contained one or more proteins that were also present in a smaller protein group. Next, peptides were grouped per protein group in a data frame. Finally, values belonging to peptide sequences that appeared only once were removed as the model parameter for the peptide effect for these sequences is unidentifiable. For notational convenience, a unique protein or protein group is referred to as a protein in the remainder of this article.

We start from the peptide-based linear regression models as proposed by Daly et al. (39. Daly D.S. Anderson K.K. Panisko E.A. Purvine S.O.
Fang R. Monroe M.E. Baker S.E. Mixed-effects statistical model for comparative LC-MS proteomics studies. J. Proteome Res. 2008; 7: 1209-1217), Clough et al. (22. Clough T. Key M. Ott I. Ragg S. Schadow G. Vitek O. Protein quantification in label-free LC-MS experiments. J. Proteome Res. 2009; 8: 5275-5284) and Karpievitch et al. (40. Karpievitch Y. Stanley J. Taverner T. Huang J. Adkins J.N. Ansong C. Heffron F. Metz T.O. Qian W.-J. Yoon H. Smith R.D. Dabney A.R. A statistical framework for protein quantitation in bottom-up MS-based proteomics. Bioinformatics. 2009; 25: 2028-2034), whose superior performance compared to summarization-based workflows we have independently demonstrated (21. Goeminne L.J.E. Argentini A. Martens L. Clement L. Summarization vs peptide-based models in label-free quantitative proteomics: Performance, pitfalls, and data analysis guidelines. J. Proteome Res. 2015; 14: 2457-2465). In general, the following model is proposed:

$$y_{ijklmn} = \beta_{ij}^{treat} + \beta_{ik}^{pep} + \beta_{il}^{biorep} + \beta_{im}^{techrep} + \varepsilon_{ijklmn} \quad \text{(Eq. 1)}$$

with $y_{ijklmn}$ the nth log2-transformed normalized feature intensity for the ith protein under the jth treatment (treat), the kth peptide sequence (pep), the lth biological repeat (biorep) and the mth technical repeat (techrep), and $\varepsilon_{ijklmn}$ a normally distributed error term with mean zero and protein-specific variance $\sigma_i^2$. The β's denote the effect sizes for treat, pep, biore