PolySTest: Robust Statistical Testing of Proteomics Data with Missing Values Improves Detection of Biologically Relevant Features

Article; Open access; Peer-reviewed


2020; Elsevier BV; Volume: 19; Issue: 8; Language: English

10.1074/mcp.ra119.001777

ISSN

1535-9484

Authors

Veit Schwämmle, Christina E. Hagensen, Adelina Rogowska‐Wrzesinska, Ole N. Jensen

Topic(s)

Metabolomics and Mass Spectrometry Studies

Abstract

Statistical testing remains one of the main challenges for high-confidence detection of differentially regulated proteins or peptides in large-scale quantitative proteomics experiments by mass spectrometry. Statistical tests need to be sufficiently robust to deal with experiment-intrinsic data structures and variations, and often also with reduced feature coverage across different biological samples due to ubiquitous missing values. A robust statistical test provides accurate confidence scores of large-scale proteomics results, regardless of instrument platform, experimental protocol and software tools. However, the multitude of different combinations of experimental strategies, mass spectrometry techniques and informatics methods complicates the choice of appropriate statistical approaches. We address this challenge by introducing PolySTest, a user-friendly web service for statistical testing, data browsing and data visualization. We introduce a new method, Miss test, that simultaneously tests for missingness and feature abundance, thereby complementing common statistical tests by rescuing otherwise discarded data features. We demonstrate that PolySTest with integrated Miss test achieves higher confidence and higher sensitivity for artificial and experimental proteomics data sets with known ground truth. Application of PolySTest to mass spectrometry-based large-scale proteomics data obtained from differentiating muscle cells resulted in the rescue of 10–20% additional proteins in the identified molecular networks relevant to muscle differentiation. We conclude that PolySTest is a valuable addition to existing tools and instrument enhancements that improve coverage and depth of large-scale proteomics experiments. A fully functional demo version of PolySTest and Miss test is available via http://computproteomics.bmb.sdu.dk/Apps/PolySTest.
A typical mass spectrometry-based proteomics study identifies and quantifies thousands of proteins in large-scale experiments, including different perturbations, cell types or organisms, dose-responses or time courses. Experimental and biological variance is captured by analyzing technical and biological replicates to enhance reproducibility and repeatability. However, the extensive resource requirements of large-scale proteomics experiments usually result in low replicate numbers, i.e. n = 2–5. This limitation requires careful application of appropriate statistical methods. Given the measured values of thousands of peptides or proteins, biological interpretation of the large data sets calls for data processing and filtering to detect the biologically relevant features. This is usually achieved by detecting differentially regulated features (DRFs), such as differentially expressed proteins, peptides, or post-translationally modified peptides. Statistical testing, also known as significance analysis, helps to identify biological changes in the experimental setup. Statistical tests aim to recognize DRFs by providing probabilities for a significant change of protein abundance. Here, the significance is determined by the variance between biological samples, technical variation and "computational variance" coming from applying a particular data processing workflow (1. Palmblad M., Lamprecht A.-L., Ison J., Schwämmle V. Automated workflow composition in mass spectrometry-based proteomics. Bioinformatics. 2019; 35: 656-664. 2. Griss J., Vinterhalter G., Schwämmle V. Isoprot: A complete and reproducible workflow to analyze itraq/tmt experiments. J. Proteome Res. 2019; 18: 1751-1759).
Optimal application of suitable statistical tests relies on estimating these variances to yield the correct false discovery rate (FDR, defined by the number of false positives divided by the number of DRFs) while maximizing the number of correctly identified proteins (sensitivity). In particular, "missing values" can compromise statistical testing as they severely decrease the statistical power of the data analysis. Missing values are values of a feature that are absent in a given replicate because they were not detected and reported by the measurement equipment. A missing value arises when a signal is below the detection limit, from sample loss, and/or from stochastic precursor selection in mass spectrometry. The different origins of missing values in a data set impede accurate prediction of the correct values. Thus, imputation methods that replace missing values with estimated abundances can be considered inappropriate as they add knowingly false measurements. Even when assuming missing values to be due to absent proteins, imputation by 0-values would lead to an estimated variance of zero and prevent transforming the values to their logarithm. Accordingly, data imputation has been reported to lead to erroneous results (3. Wang J., Li L., Chen T., Ma J., Zhu Y., Zhuang J., Chang C. In-depth method assessments of differentially expressed protein detection for shotgun proteomics data with missing values. Sci. Reports. 2017; 7: 3367. 4. Goeminne L.J.E., Gevaert K., Clement L. Experimental design and data-analysis in label-free quantitative lc/ms proteomics: A tutorial with msqrob. J. Proteomics. 2018; 171: 23-36). Currently, most computational approaches apply a prior filter to exclude proteins of low coverage that are missing too many values and, with only 1–2 values per experimental condition, do not provide sufficient replication to estimate their variance.
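
The two quantities traded off above can be made concrete for a data set with known ground truth. The following is a minimal sketch, not part of PolySTest: the function name, the feature identifiers and the example sets are hypothetical. The FDR is computed as defined in the text (false positives divided by the number of features called as DRFs), and sensitivity as the fraction of truly regulated features that were recovered.

```python
def fdr_and_sensitivity(called, truth):
    """Empirical FDR and sensitivity for a ground-truth data set.

    called: set of feature ids reported as differentially regulated (DRFs)
    truth:  set of feature ids that are truly regulated
    """
    true_pos = len(called & truth)
    false_pos = len(called - truth)
    # FDR: false positives divided by the number of reported DRFs.
    fdr = false_pos / len(called) if called else 0.0
    # Sensitivity: fraction of truly regulated features that were found.
    sensitivity = true_pos / len(truth) if truth else 0.0
    return fdr, sensitivity


# Hypothetical example: 5 features called, 4 of them truly regulated,
# out of 8 truly regulated features in total.
called = {"P1", "P2", "P3", "P4", "X9"}
truth = {"P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8"}
fdr, sens = fdr_and_sensitivity(called, truth)
```

In this example one of five calls is wrong and half of the regulated features are recovered, which is the kind of trade-off the statistical tests discussed here try to improve.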
However, proteins that are represented in datasets with both "present values" and "missing values" still contain highly valuable information. Missing values coming from proteins of low or no abundance can contribute additional evidence to a statistical test. For instance, consider a case where a protein is completely repressed, so that all replicate values are missing in one condition, while it is highly abundant in the other condition. Including these otherwise discarded cases using a statistically sound method would rescue more proteins for the statistical testing and so is likely to increase the number of differentially regulated features (DRFs). However, current tests that can include missing values are usually binary, i.e. they oversimplify the analysis by dividing all data values into absent and present, or by assuming all missing values to be below the detection limit (5. Webb-Robertson B.-J.M., McCue L.A., Waters K.M., Matzke M.M., Jacobs J.M., Metz T.O., Varnum S.M., Pounds J.G. Combined statistical analyses of peptide intensities and peptide occurrences improves identification of significant peptides from MS-based proteomics data. J. Proteome Res. 2010; 9: 5748-5756. 6. Taylor S.L., Leiserowitz G.S., Kim K. Accounting for undetected compounds in statistical analyses of mass spectrometry 'omic studies. Statistical Appl. Genetics Mol. Biol. 2013; 12: 703-722).

The fraction of missing values often increases drastically on the peptide level. Peptide-centric approaches, where the statistical test includes peptide quantifications instead of summarized protein amounts, are very promising. They extrapolate protein variance from peptide variance, and some tools are already available, including MSstats (7. Choi M., Chang C.-Y., Clough T., Broudy D., Killeen T., MacLean B., Vitek O. Msstats: an r package for statistical analysis of quantitative mass spectrometry-based proteomic experiments. Bioinformatics. 2014; 30: 2524-2526) and MSqRob (4). However, in most cases rather low numbers of quantified peptides per protein complicate the performance of such approaches. In addition, different mass spectrometry data acquisition methods (e.g. label-free versus stable isotope labeling protocols) require specialized methods for protein summarization. Therefore, we here consider summarization as a separate task.

Several methods can process summarized data with few missing values and produce satisfactory results. The LIMMA method (8. Smyth G.K. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 2004; 3 (Article3)), originally developed for micro-array data, is widely used in proteomics research. LIMMA was shown to perform well for proteomics experiments with as few as three replicates per condition (9. Schwämmle V., Jensen O.N. A simple and fast method to determine the parameters for fuzzy c-means cluster analysis. Bioinformatics. 2010; 26: 2841-2848). However, depending on the data set and its structure, other statistical tests such as rank products (10. Breitling R., Armengaud P., Amtmann A., Herzyk P. Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Lett. 2004; 573: 83-92) were found to improve performance in a complementary manner.
Distinct data properties originating from different numbers of differentially regulated features and different variations within them lead to either rank products or LIMMA becoming the best-performing method in ground truth data (9). Another approach that holds potential to deal with differently structured data is based on comparing the observed values to randomizations of the data. Such permutation tests have been implemented e.g. in the Perseus suite (11. Tyanova S., Temu T., Sinitcyn P., Carlson A., Hein M.Y., Geiger T., Mann M., Cox J. The perseus computational platform for comprehensive analysis of (prote)omics data. Nat. Methods. 2016; 13: 731-740). Given the current state of available methods, there is a need for significance analysis that a) considers missing values, b) uses robust and complementary tests to cover different intrinsic data properties and c) is simple to use through a user-friendly program interface. We here introduce a novel statistical test (Miss test) for datasets with missing values. The Miss test is included in our new PolySTest web service that provides a set of complementary statistical tests for quantitative data and versatile data browsing and visualization. Validation of the methods by extensive tests with artificial and real data sets shows the power of our approach. We demonstrate the performance of PolySTest using a proteomics data set to achieve improved coverage of biologically significant features in muscle cell differentiation.

The probability $p_{\mathrm{NA}}$ to find a missing value is given by the fraction of missing values in the data set. A binomial distribution describes the case of having multiple missing measurements.
The probability to find $i$ missing values in $r$ replicates is given by

$$b_i = \binom{r}{i}\, p_{\mathrm{NA}}^{\,i}\, (1 - p_{\mathrm{NA}})^{\,r-i}. \tag{1}$$

From this we derive the probability to observe a difference in the number of missing values between two groups of values. The probability of finding a difference of $k$ missing values between sample groups is

$$P_k = 2 \sum_{j=0}^{r-k} b_{j+k}\, b_j. \tag{2}$$

For an example of calculating the probabilities, see the Results section. The probabilities $P_k$ only account for presence/absence of protein quantifications but not for their abundance-dependence, i.e. whether missing values are more likely to occur when measuring low abundant proteins. Therefore, the algorithm applies the following additional steps. The distribution of all quantitative values is divided into 100 quantiles. Then, for each quantile, (a) values below the quantile are set to be missing; (b) the probabilities described by Eq. 1 are calculated on the basis of the new $p_{\mathrm{NA}}$; (c) the difference in the number of missing values between conditions is calculated for each feature; and (d) the probabilities for the differences using Eq. 2 are stored. This method gives, for each feature and each comparison between conditions, a vector of 100 $P_k$ values. The smallest $P_k$ of each vector is multiplied by $r + 1$ and represents the p value of the Miss test. p values larger than one are set to one.

Artificial data with ground truth were simulated by superposing a fraction of the normally distributed (mean 0, standard deviation 1) features with an offset. Out of $N$ features per replicate ($R$ replicates per condition), $N_R$ features in one of the two conditions were displaced by $\delta$ to either side, with random assignment of up- and down-regulations. Then, for the simulation of abundance-dependent missingness, we randomly removed $m$% of the values by elimination with weights $[1 - r(i)/N]^{\mu}$, where $r(i)$ is the rank of feature $i$ in a sample after sorting by abundance.
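
The simulation of artificial data described above can be sketched as follows. This is an illustrative reading of the description rather than the authors' code: the function name, the default parameter values, the per-sample removal of values, the convention that rank 1 is the lowest abundance, and the use of Efraimidis-Spirakis keys for weighted sampling without replacement are all assumptions made for the example.

```python
import math
import random


def simulate(n_features=500, n_reps=3, pct_regulated=10.0, delta=2.0,
             pct_missing=20.0, mu=5.0, seed=0):
    """Return (data, regulated); data[c][k][i] is feature i in replicate k
    of condition c, and regulated maps feature index -> applied offset."""
    rng = random.Random(seed)
    # Normally distributed background (mean 0, sd 1) for two conditions.
    data = [[[rng.gauss(0.0, 1.0) for _ in range(n_features)]
             for _ in range(n_reps)] for _ in range(2)]

    # Displace a fraction of the features by +/- delta in condition 1 only.
    n_reg = round(n_features * pct_regulated / 100.0)
    regulated = {}
    for i in rng.sample(range(n_features), n_reg):
        regulated[i] = rng.choice((-delta, delta))
        for k in range(n_reps):
            data[1][k][i] += regulated[i]

    # Abundance-dependent missingness, applied to each sample separately
    # (assumption): remove values with weights [1 - rank/N]^mu.
    n_remove = round(n_features * pct_missing / 100.0)
    for cond in data:
        for sample in cond:
            order = sorted(range(n_features), key=lambda i: sample[i])
            keys = {}
            for rank, i in enumerate(order, start=1):  # rank 1 = lowest value
                w = (1.0 - rank / n_features) ** mu
                u = 1.0 - rng.random()                 # uniform in (0, 1]
                # Weighted sampling without replacement: keep largest keys.
                keys[i] = math.log(u) / w if w > 0 else -math.inf
            for i in sorted(keys, key=keys.get, reverse=True)[:n_remove]:
                sample[i] = None                       # None marks missing
    return data, regulated
```

With `mu = 0` all weights equal one and values are removed independently of abundance; larger `mu` concentrates the missing values among low-abundance features.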
Table I shows a summary of the parameter ranges used for creating the artificial data sets.

Table I. Parameter values of artificial data including their range

Parameter                                Symbol   Range
Feature number                           N        500–10,000
Replicates per condition                 R        3–10
Percentage regulated features            N_R      0–50%
Difference regulated features            δ        1–5
Percentage missing values                m        0–50%
Abundance-dependence of missing values   μ        0–100

The original experiment (12. Navarro P., Kuharev J., Gillet L.C., Bernhardt O.M., MacLean B., Röst H.L., Tate S.A., Tsou C.-C., Reiter L., Distler U., Rosenberger G., Perez-Riverol Y., Nesvizhskii A.I., Aebersold R., Tenzer S. A Multicenter study benchmarks software tools for label-free proteome quantification. Nat. Biotechnol. 2016; 34: 1130-1136) comprises two hybrid proteome samples with a "ground truth" mix of human, yeast and Escherichia coli proteins (ratios 1:1, 2:1, and 1:4, respectively), referred to as HYE124. The samples were analyzed in technical triplicates on a TripleTOF 6600 instrument using a 64 variable window of 10 min for SWATH-MS acquisition. The protein quantification summary from data analysis with SWATH 2.0 was chosen. The file PViewBuiltinProteins_TTOF6600_64w_shift_iRT_extractionWindow10min_30ppm_proteins.csv was downloaded from the PRIDE repository (PXD002952) and the protein quantification values were log-transformed.

We retrieved data from a study where Escherichia coli digest was spiked into a HeLa digest in four different concentrations: 3%, 7.5%, 10% and 15% (13. Shalit T., Elinger D., Savidor A., Gabashvili A., Levin Y. Ms1-based label-free proteomics using a quadrupole orbitrap mass spectrometer. J. Proteome Res. 2015; 14: 1979-1986). The data were acquired via label-free MS on a Q Exactive Plus instrument. Protein abundances were given as a supplementary data file. These were log-transformed and normalized using the normalizeCyclicLoess function of the LIMMA R package (14. Ritchie M.E., Phipson B., Wu D., Hu Y., Law C.W., Shi W., Smyth G.K. limma powers differential expression analyses for rna-sequencing and microarray studies. Nucleic Acids Res. 2015; 43: e47).

Immortalized human satellite cells (KM155C25) were differentiated into myoblasts following procedures established before (15. Zhu C.-H., Mouly V., Cooper R.N., Mamchaoui K., Bigot A., Shay J.W., Di Santo J.P., Butler-Browne G.S., Wright W.E. Cellular senescence in human myoblasts is overcome by human telomerase reverse transcriptase and cyclin-dependent kinase 4: consequences in aging muscle and therapeutic strategies for muscular dystrophies. Aging Cell. 2007; 6: 515-523. 16. Mamchaoui K., Trollet C., Bigot A., Negroni E., Chaouch S., Wolff A., Kandalla P.K., Marie S., Di Santo J., St Guily J.L., Muntoni F., Kim J., Philippi S., Spuler S., Levy N., Blumen S.C., Voit T., Wright W.E., Aamiri A., Butler-Browne G., Mouly V. Immortalized pathological human myoblasts: towards a universal tool for the study of neuromuscular disorders. Skeletal Muscle. 2011; 1: 34. 17. Thorley M., Duguez S., Mazza E.M.C., Valsoni S., Bigot A., Mamchaoui K., Harmon B., Voit T., Mouly V., Duddy W. Skeletal muscle characteristics are preserved in htert/cdk4 human myogenic cell lines. Skeletal Muscle. 2016; 6: 43). Biological triplicate protein extracts were prepared at 6 time points: proliferating activated myoblasts (Day −1) and 5 days following the initiation of differentiation into mature myocytes (Day 0, 1, 2, 3 and 4).

Protein extraction and sample preparation were performed following previously published protocols (18. Wang W.-Q., Jensen O.N., Møller I.M., Hebelstrup K.H., Rogowska-Wrzesinska A. Evaluation of sample preparation methods for mass spectrometry-based proteomic analysis of barley leaves. Plant Methods. 2018; 14: 72. 19. Kovalchuk S.I., Jensen O.N., Rogowska-Wrzesinska A. Flashpack: Fast and simple preparation of ultrahigh-performance capillary columns for lc-ms. Mol. Cell. Proteomics. 2019; 18: 383-390) with minor modifications. Cells were lysed using 50 mM TEAB, 1% SDC, 10 mM TCEP, 40 mM chloroacetamide buffer in the presence of protease inhibitors (cOmplete, EDTA-free Protease Inhibitor Mixture, Roche) and phosphatase inhibitors (PhosSTOP, Roche, Switzerland). Lysates were heated to 80°C for 10 min and vortexed for 1 min, followed by sonication for 3 × 15 s at 40% output intensity, with 30 s breaks on ice in between, using a Q125 sonicator (Qsonica, CT). Protein concentration was determined by ProStain™ (Active Motif) and 100 μg of protein was taken through digestion.

Each sample was analyzed by mass spectrometry in technical triplicates and in randomized order. The desalted peptides were captured on a μ-precolumn (5 μm, 5 mm × 300 μm, 100 Å pore size, Acclaim PepMap 100 C18, Thermo Fisher Scientific) before being separated using a home-packed fused silica column (50 cm × 100 μm, 120 Å pore size) of 2 μm InertSil ODS-3 beads (GLSciences, Japan). Peptides were separated using a 120 min gradient of 8 to 35% buffer B (99.99% ACN, 0.01% FA) at 200 nl/min flow with a Dionex UltiMate 3000 nanoLC system (Thermo Fisher Scientific) and analyzed by MS/MS using an Orbitrap Fusion Lumos Tribrid mass spectrometer (Thermo Fisher Scientific). For each cycle, a full MS scan across the mass range 350–1800 m/z was performed within the orbitrap with a resolution of 120,000 and an AGC target of 5 × 10⁵ ions, with a maximum injection time (IT) of 60 ms; a dynamic exclusion window of 40 s was used. This was followed by fragmentation of ions with charge states of +2–6 by HCD within the ion trap. MS/MS scans were performed at a rapid ion trap scan rate, with a collision energy of 35%, a maximum IT of 35 ms and an AGC target of 1 × 10⁴.
An isolation window of 0.7 m/z was used, with an isolation offset of 0.2 m/z. Data were acquired using Xcalibur (Thermo Fisher Scientific).

Generated data were analyzed using MaxQuant (v 1.5.5.1) (20. Cox J., Mann M. Maxquant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 2008; 26: 1367-1372) and the built-in Andromeda search engine (21. Cox J., Neuhauser N., Michalski A., Scheltema R.A., Olsen J.V., Mann M. Andromeda: a peptide search engine integrated into the maxquant environment. J. Proteome Res. 2011; 10: 1794-1805). The human UniProt Reference Proteome database, containing Swiss-Prot proteins including isoforms (downloaded 26 April 2017, containing 20,198 entries), was used for identification. Data were searched with the following parameters: trypsin digestion with a maximum of 2 missed cleavages, a fixed modification of Carbamidomethylation (C), and variable modifications of Oxidation (M), Acetylation (Protein N-term) and Deamidation (NQ). Label-Free Quantitation was performed with "match between runs" and with "second peptide search" disabled. The false discovery rate (FDR) for both PSM and protein identification was set at 0.01%. All other parameters remained as default, including a mass tolerance for precursor ions of 20 ppm and a mass tolerance for fragment ions of 10 ppm. Raw data and search results are available via ProteomeXchange identifier PXD018588.

Protein quantifications were taken from MaxQuant, and contaminants and decoy hits were removed. Quantifications from technical replicates were summed, log2-transformed and normalized by median. Furthermore, a minimum of 2 unique peptides per protein and 6 non-missing values over all 18 samples was required to reduce the effect of wrong identifications.

Paired or unpaired t tests were carried out using the default t test in R.
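
The preprocessing of the protein quantifications described above (summing technical replicates, log2 transformation, median normalization and coverage filtering) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function and argument names are hypothetical, missing intensities are assumed to be encoded as 0 or `None`, and "normalized by median" is read as subtracting each sample's median log2 value.

```python
import math
from statistics import median


def preprocess(intensities, tech_groups, unique_peptides,
               min_peptides=2, min_values=6):
    """intensities: {protein: [raw intensity per run]}, 0 or None = missing.
    tech_groups: list of run-index lists, one list per biological sample.
    unique_peptides: {protein: number of unique peptides}.
    Returns {protein: [normalized log2 value or None per sample]}.
    """
    logged = {}
    for prot, runs in intensities.items():
        row = []
        for group in tech_groups:
            # Sum the technical replicates of one biological sample.
            total = sum(runs[j] or 0.0 for j in group)
            row.append(math.log2(total) if total > 0 else None)
        logged[prot] = row

    # Median normalization: subtract each sample's median log2 value.
    for s in range(len(tech_groups)):
        observed = [row[s] for row in logged.values() if row[s] is not None]
        if not observed:
            continue
        med = median(observed)
        for row in logged.values():
            if row[s] is not None:
                row[s] -= med

    # Keep proteins with enough unique peptides and non-missing values.
    return {prot: row for prot, row in logged.items()
            if unique_peptides.get(prot, 0) >= min_peptides
            and sum(v is not None for v in row) >= min_values}
```

The defaults mirror the thresholds stated in the text (2 unique peptides, 6 non-missing values); for the 18 samples of the muscle data set, `tech_groups` would hold 18 lists of three run indices each.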
Paired or unpaired moderated tests were calculated applying the default procedure of the LIMMA package (14), including linear model estimation and empirical Bayes moderation of standard errors.

The p values for the paired tests were calculated according to ref. 22 (Schwämmle V., León I.R., Jensen O.N. Assessment and improvement of statistical tools for comparative proteomics analysis of sparse data sets with few experimental replicates. J Proteome Res. 2013; 12: 3874-3883). Unpaired samples were simulated by obtaining the p values from 100 randomly picked pairings and taking their means.

This method requires a minimum of 7 replicates per condition (unpaired) or 7 ratios between pairings. For lower replicate numbers, the method adds samples with values randomly drawn from the pool containing all values of the considered samples. Then, for each feature, $u_{\mathrm{paired}}$ or $u_{\mathrm{unpaired}}$ was obtained for 1,000 randomizations of the feature values. Similar to calculating the t-values in a t test, the mean of all paired ratios was divided by their standard deviation. For feature $i$, this gives $u_{\mathrm{paired}}(i) = \left|\mathrm{mean}(r_k(i)) / \mathrm{std}(r_k(i))\right|\, n(i)$, where $r_k(i)$ is the abundance of feature $i$ in replicate $k$ and $n(i)$ is the number of non-missing values. The unpaired comparison of values of feature $i$ between conditions $t$ and $c$ was calculated by

$$u_{\mathrm{unpaired}}(i) = \frac{\left|\mathrm{mean}\big(r_k(i)^{(t)}\big) - \mathrm{mean}\big(r_k(i)^{(c)}\big)\right|}{\mathrm{var}\big(r_k(i)^{(t)}\big) + \mathrm{var}\big(r_k(i)^{(c)}\big)}\; n(i) \tag{3}$$

where lower replicate numbers are punished by multiplying by the number of non-missing values. We then calculated the p value of each feature by comparing $u(i)$ of the feature to 1,000 instances of $u(r)$ that were calculated for the randomizations $r$ explained above.
The resulting p value is given by

$$p_i = \left(\sum_{r=1}^{1000} \theta\big(u(i) - u(r)\big) + 1\right) / 1000 \tag{4}$$

where $\theta(x)$ is the Heaviside function, which is 1 for $x > 0$ and 0 otherwise.

For details on this statistical test, see above.

The p values from the statistical tests need to be corrected for multiple testing to estimate the false discovery rates. The p values of each test are corrected by either the Benjamini-Hochberg procedure (Miss test, rank products and permutation test) or the numerical qvalue method (23. Storey J.D. A direct approach to false discovery rates. J. R. Statist. Soc. B. 2002; 64: 479-498) (t test and LIMMA), where the latter requires sufficient p values for a good estimation of the background distribution. The Benjamini-Hochberg method was used for the statistical tests that discard a fraction of the features by setting their p values to p = 1 and thus might not provide the required number of p values to apply the qvalue method. Combining the resulting FDRs into a unified PolySTest FDR requires an additional step of correction for multiple testing. For the calculation of the PolySTest FDR, the FDRs from LIMMA, Miss test, rank products and permutation test were corrected for multiple testing by the method of Hommel (24. Hommel G. A stagewise rejective multiple test procedure based on a modified bonferroni test. Biometrika. 1988; 75: 383-386), which allows for positive dependence. The PolySTest value is then given by the smallest of the resulting corrected FDR values.

Pathway enrichment analysis was carried out using the ClusterProfiler R package (25. Yu G., Wang L.G., Han Y., He Q.Y. clusterprofiler: an r package for comparing biological themes among gene clusters. OMICS. 2012; 16: 284-287) (version 3.10.1) using default parameters.
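
The Benjamini-Hochberg step used for the Miss test, rank products and permutation test can be sketched in a few lines. This is a generic standard-library implementation of the procedure for illustration, not code taken from PolySTest; the qvalue and Hommel corrections mentioned above are not shown.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p values (FDRs), returned in input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [1.0] * n
    running_min = 1.0
    # Walk from the largest to the smallest p value so that the adjusted
    # values stay monotone: adj_(i) = min over j >= i of p_(j) * n / j.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


fdrs = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Features that a test discarded enter this correction with p = 1, so they still count towards the number of tests but are never reported as significant.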
All methods are freely available through our web service at http://computproteomics.bmb.sdu.dk/Apps/PolySTest or can be downloaded to be run on a local computer. The software was written as a Shiny app, which is highly interactive and allows extensive data browsing and visualization including upset plots (26. Conway J.R., Lex A., Gehlenborg N. Upsetr: an r package for the visualization of intersecting sets and their properties. Bioinformatics. 2017; 33: 2938-2940), circlize plots (27. Gu Z., Gu L., Eils R., Schlesner M., Brors B. circlize implements and enhances circular visualization in r. Bioinformatics. 2014; 30: 2811-2812) and an interactive heatmap (28. Galili T., O'Callaghan A., Sidi J., Sievert C. heatmaply: an r package for creating interactive cluster heatmaps for online publishing. Bioinformatics. 2018; 34: 1600-1602). The full source code is available through https://bitbucket.org/veitveit/polystest.

We introduce a new statistical method, Miss test, to combine the occurrence of missing values with protein abundance for improved detection of DRFs in data with high amounts of missing values. To assess its performance, robustness to different data structures and complementarity to other statistical tests, we assessed five different statistical tests applied to hundreds of artificial data sets, experimental data sets with ground truth and a data set with well-defined biological content. The tests are implemented and combined in the PolySTest web service to analyze and visualize the statistically tested data prior to down-stream biological interpretation (Fig. 1A). Miss test integrates missingness with measured protein abundance by reapplying a binary statistics method over a range of different data representations.
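
The idea of reapplying a binary missingness statistic over a range of abundance thresholds can be sketched as follows. This is a literal, simplified reading of Eqs. 1 and 2 and the quantile steps from the methods, not the PolySTest implementation: all function names are hypothetical, only two conditions are compared, the thresholds are taken as percentiles of the observed values, and the factor 2 of Eq. 2 is applied only for k > 0 (for k = 0 there is just one direction, which matches the probabilities in the worked example with three replicates).

```python
from math import comb


def missing_count_probs(r, p_na):
    """Eq. 1: probability b_i of i missing values among r replicates."""
    return [comb(r, i) * p_na**i * (1 - p_na)**(r - i) for i in range(r + 1)]


def diff_probs(r, p_na):
    """Eq. 2: probability P_k of a difference of k missing values."""
    b = missing_count_probs(r, p_na)
    probs = []
    for k in range(r + 1):
        s = sum(b[j + k] * b[j] for j in range(r - k + 1))
        # The factor 2 counts both directions; k = 0 has only one.
        probs.append(s if k == 0 else 2 * s)
    return probs


def miss_test(cond_a, cond_b, n_quantiles=100):
    """cond_a, cond_b: lists of features, each a list of r values
    (None = missing). Returns one Miss test p value per feature."""
    r = len(cond_a[0])
    features = cond_a + cond_b
    n_values = sum(len(f) for f in features)
    observed = sorted(v for f in features for v in f if v is not None)
    best = [1.0] * len(cond_a)
    if not observed:
        return best
    for q in range(n_quantiles):
        # (a) values below the current threshold are treated as missing
        cut = observed[(q * len(observed)) // n_quantiles]
        miss_a = [sum(v is None or v < cut for v in f) for f in cond_a]
        miss_b = [sum(v is None or v < cut for v in f) for f in cond_b]
        # (b) probabilities for the new fraction of missing values
        p_k = diff_probs(r, (sum(miss_a) + sum(miss_b)) / n_values)
        # (c, d) keep the smallest P_k seen for each feature
        for idx, (ma, mb) in enumerate(zip(miss_a, miss_b)):
            best[idx] = min(best[idx], p_k[abs(ma - mb)])
    # Multiply the smallest P_k by r + 1 and cap the p value at one.
    return [min(1.0, (r + 1) * p) for p in best]
```

For three replicates and a missingness probability of 0.5, `diff_probs(3, 0.5)` returns 5/16, 15/32, 3/16 and 1/32 for differences of 0, 1, 2 and 3 missing values.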
To describe the impact of the missing values, we derived the probabilities for each combination of missing values being distributed over the replicates of the experimental conditions. We illustrate the calculation by a simple scenario with 50% of missing values and 3 replicates for each of 2 experimental conditions A and B. The probability for observing a missing value is assumed to be $p_{\mathrm{NA}} = 0.5$. The null hypothesis assumes that the missing values are missing at random, i.e. they are randomly distributed over the entire data set. This leads to a probability of $p_{\mathrm{NA}}^{6} = 1/64$ for each combination of missing/non-missing values across the six values (Fig. 1B). We then count the number of cases where the number of missing values in both conditions differs by 0, 1, 2 and 3. A difference of 3 missing values corresponds to only missing values in condition A or only in condition B, thus giving a probability of 2 · 1/64 = 1/32. A difference of 0, 1 or 2 missing values between the conditions occurs in 20, 30 and 12 different cases, respectively. Hence, the probability to find a difference of 0, 1, 2 and 3 missing values is given by the probabilities 5/16, 15/32, 3/16 and 1/32, respectively. Finding only missing values in one condition would then correspond to a p value of $2p_{\mathrm{NA}}^{6} = 0.03125$. This calculati
