Article · Open access · Peer reviewed

MOGSA: Integrative Single Sample Gene-set Analysis of Multiple Omics Data

2019; Elsevier BV; Volume: 18; Issue: 8; Language: English

10.1074/mcp.tir118.001251

ISSN

1535-9484

Authors

Chen Meng, Azfar Basunia, Bjoern Peters, Amin Moghaddas Gholami, Bernhard Küster, Aedín C. Culhane

Topic(s)

Genetic Mapping and Diversity in Plants and Animals

Abstract

Gene-set analysis (GSA) summarizes individual molecular measurements into more interpretable pathways or gene-sets and has become an indispensable step in the interpretation of large-scale omics data. However, GSA methods are limited to the analysis of single omics data. Here, we introduce a new computational method termed multi-omics gene-set analysis (MOGSA), a multivariate single sample gene-set analysis method that integrates multiple experimental and molecular data types measured over the same set of samples. The method learns a low dimensional representation of the most variant correlated features (genes, proteins, etc.) across multiple omics data sets, transforms the features onto the same scale and calculates an integrated gene-set score from the most informative features in each data type. MOGSA does not require filtering data to the intersection of features (gene IDs); therefore, all molecular features, including those that lack annotation, may be included in the analysis. Using simulated data, we demonstrate that integrating multiple diverse sources of molecular data increases the power to discover subtle changes in gene-sets and may reduce the impact of unreliable information in any single data type. Using real experimental data, we demonstrate three use-cases of MOGSA. First, we show how to remove a source of noise (technical or biological) in integrative MOGSA of NCI60 transcriptome and proteome data. Second, we apply MOGSA to discover similarities and differences in mRNA, protein and phosphorylation profiles of a small study of stem cell lines and assess the influence of each data type or feature on the total gene-set score. Finally, we apply MOGSA to cluster analysis and show that three molecular subtypes are robustly discovered when copy number variation and mRNA data of 308 bladder cancers from The Cancer Genome Atlas are integrated using MOGSA. MOGSA is available in the Bioconductor R package "mogsa."
Increasing numbers of studies report comprehensive molecular profiling using multiple different experimental approaches on the same set of biological samples. These multi-omics studies can potentially yield great insights into the complex molecular machinery of biological systems. High-throughput sequencing allows quantification of global DNA variation and whole transcriptome RNA expression (1. Metzker M.L. Sequencing technologies - the next generation. Nat. Rev. Genet. 2010; 11: 31-46; 2. Ozsolak F. Milos P.M. RNA sequencing: advances, challenges and opportunities. Nat. Rev. Genet. 2011; 12: 87-98). Mass spectrometry (MS)-based proteomics can identify and quantify most proteins expressed in human tissues or cell lines (3. Wilhelm M. Schlegl J. Hahne H. Gholami A.M. Lieberenz M. Savitski M.M. Ziegler E. Butzmann L. Gessulat S. Marx H. Mathieson T. Lemeer S. Schnatbaum K. Reimer U. Wenschuh H. Mollenhauer M. Slotta-Huspenina J. Boese J.H. Bantscheff M. Gerstmair A. Faerber F. Kuster B. Mass-spectrometry-based draft of the human proteome. Nature. 2014; 509: 582-587). Emerging single cell sequencing technologies enable simultaneous measurement of transcriptomes and protein markers expressed in the same cell, using CITE-seq or REAP-seq (4. Peterson V.M. Zhang K.X. Kumar N. Wong J. Li L. Wilson D.C. Moore R. McClanahan T.K. Sadekova S. Klappenbach J.A. Multiplexed quantification of proteins and transcripts in single cells. Nat. Biotechnol. 2017; 35: 936-939; 5. Stoeckius M. Hafemeister C. Stephenson W. Houck-Loomis B. Chattopadhyay P.K. Swerdlow H. Satija R. Smibert P. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods. 2017; 14: 865-868).
Integrating, interpreting and generating biological hypotheses from such complex data sets is a considerable challenge. Gene-set analysis (GSA)¹ is widely used in the analysis of genome scale data and is often the first step in the biological interpretation of lists of genes or proteins that are differentially expressed between phenotypically distinct groups (6. Khatri P. Sirota M. Butte A.J. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput. Biol. 2012; 8: e1002375). These methods use external biological information, including gene ontologies, to reduce thousands of genes or proteins into lists of gene-sets that describe cellular pathways, subcellular localization, transcription factors or miRNA targets etc., thus facilitating hypothesis generation. Large scale omics studies or single cell studies may have limited a priori knowledge of phenotype groups or may aim to discover new molecular subtypes in a panel of experimental conditions or tissues with complex phenotypes, exemplified by The Cancer Genome Atlas (TCGA) (7. Cancer Genome Atlas Research Network, Weinstein J.N. Collisson E.A. Mills G.B. Shaw K.R. Ozenberger B.A. Ellrott K. Shmulevich I. Sander C. Stuart J.M. The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. 2013; 45: 1113-1120) and the Clinical Proteomic Tumor Analysis Consortium (CPTAC) (8. Ellis M.J. Gillette M. Carr S.A. Paulovich A.G. Smith R.D. Rodland K.K. Townsend R.R. Kinsinger C. Mesri M. Rodriguez H. Liebler D.C. Clinical Proteomic Tumor Analysis Consortium. Connecting genomic alterations to cancer biology with proteomics: the NCI Clinical Proteomic Tumor Analysis Consortium. Cancer Discov. 2013; 3: 1108-1112).
¹ The abbreviations used are: GSA, gene-set analysis; ANOVA, analysis of variance; AUC, area under the receiver operating characteristic curve; BLCA, bladder cancer; BP, biological process; CC, cellular component; CCA, canonical correlation analysis; CIA, co-inertia analysis; CLT, central limit theorem; CPTAC, Clinical Proteomic Tumor Analysis Consortium; DE, differentially expressed; DEGS, differentially expressed gene-set; EMT, epithelial to mesenchymal transition; GIS, gene influential score; GO, gene ontology; GS, gene-set; GSEA, gene-set enrichment analysis; GSS, gene-set score; MAD, median absolute deviation; MCIA, multiple co-inertia analysis; MF, molecular function; MF, matrix factorization; MFA, multiple factorial analysis; MVA, multivariate analysis; NMM, naïve matrix multiplication; PCA, principal component analysis; ROC, receiver operating characteristic; SVD, singular value decomposition; TCGA, The Cancer Genome Atlas; TF, transcriptional factor; TFT, transcriptional factor target; t-SNE, t-distributed stochastic neighbor embedding.
Classical GSA methods that require phenotypically distinct groups (6) have limited application in such cases, and several unsupervised, single sample GSA (ssGSA) methods have been developed (9. Hanzelmann S. Castelo R. Guinney J. GSVA: gene set variation analysis for microarray and RNA-seq data. BMC Bioinformatics. 2013; 14: 7; 10. Barbie D.A. Tamayo P. Boehm J.S. Kim S.Y. Moody S.E. Dunn I.F. Schinzel A.C. Sandy P. Meylan E. Scholl C. Frohling S. Chan E.M. Sos M.L. Michel K. Mermel C. Silver S.J. Weir B.A. Reiling J.H. Sheng Q. Gupta P.B. Wadlow R.C. Le H. Hoersch S. Wittner B.S. Ramaswamy S. Livingston D.M. Sabatini D.M. Meyerson M. Thomas R.K. Lander E.S. Mesirov J.P. Root D.E. Gilliland D.G. Jacks T. Hahn W.C. Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature. 2009; 462: 108-112; 11. Tomfohr J. Lu J. Kepler T.B. Pathway level analysis of gene expression using singular value decomposition. BMC Bioinformatics. 2005; 6: 225; 12. Lee E. Chuang H.Y. Kim J.W. Ideker T. Lee D. Inferring pathway activity toward precise disease classification. PLoS Comput. Biol. 2008; 4: e1000217). These methods do not require prior availability of phenotypic or clinical data. Arguably, one of the most popular approaches is single-sample GSEA (ssGSEA), which ranks genes according to the empirical cumulative distribution function and calculates a single sample-wise gene-set score by comparing the scores of genes that are inside and outside a gene-set (10). A related method, gene-set variation analysis (GSVA), also calculates sample-wise gene-set enrichment as a function of the genes that are inside and outside a gene-set and uses a similar Kolmogorov-Smirnov-like rank statistic to assess the enrichment score, but genes are ranked using a kernel estimation of a cumulative density function (9). These single-sample GSA methods are designed for the analysis of a single data set, and do not integrate or calculate a single sample GSA score on multiple data sets simultaneously. Here, we present a novel unsupervised single-sample gene-set analysis that calculates an integrated enrichment score using all the information in multiple omics data sets, named "multi-omics GSA" (MOGSA). The method relies on matrix factorization (MF), a family of powerful methods that can be used to learn patterns of biological significance in high dimensional data (13. Stein-O'Brien G.L. Arora R. Culhane A.C. Favorov A.V. Garmire L.X. Greene C.S. Goff L.A. Li Y. Ngom A. Ochs M.F. Xu Y. Fertig E.J. Enter the matrix: factorization uncovers knowledge from omics. Trends Genet. 2018; 34: 790-805) as well as to identify and exclude batch effects (14. Leek J.T. Storey J.D. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 2007; 3: 1724-1735).
Coupled or tensor MF methods can learn latent correlated structure within and between omics data sets (15. Meng C. Kuster B. Culhane A.C. Gholami A.M. A multivariate approach to the integration of multi-omics datasets. BMC Bioinformatics. 2014; 15: 162; 16. de Tayrac M. Le S. Aubry M. Mosser J. Husson F. Simultaneous analysis of distinct Omics data sets with integration of biological knowledge: Multiple Factor Analysis approach. BMC Genomics. 2009; 10: 32; 17. Fagan A. Culhane A.C. Higgins D.G. A multivariate analysis approach to the integration of proteomic and gene expression data. Proteomics. 2007; 7: 2162-2171; 18. Le Cao K.A. Martin P.G. Robert-Granie C. Besse P. Sparse canonical methods for biological data integration: application to a cross-platform study. BMC Bioinformatics. 2009; 10: 34) and have been applied to the analysis of molecular data from different technology platforms (19. Culhane A.C. Perriere G. Higgins D.G. Cross-platform comparison and visualisation of gene expression data using co-inertia analysis. BMC Bioinformatics. 2003; 4: 59) and integration of diverse multi-omics data (15, 17). An attractive characteristic of coupled or tensor MF approaches is that they identify global correlated patterns among samples or observations.
They can be applied to integrating data from experimental platforms that include known and unknown molecules (for example lipidomics, metabolomics) or molecules that are difficult to map one-to-one between data sets (e.g. transcript variants to proteins). Therefore, these approaches do not require pre-filtering of gene identifiers in each data set to a common intersecting subset of features. Although coupled MF or latent factor methods are powerful, they identify components whose interpretation can be challenging and may require domain knowledge (15; 20. Abdi H. Williams L.J. Valentin D. Multiple factor analysis: principal component analysis for multitable and multiblock data sets. Wiley Interdisciplinary Reviews: Computational Statistics. 2013; 5: 149-179; 21. Meng C. Zeleznik O.A. Thallinger G.G. Kuster B. Gholami A.M. Culhane A.C. Dimension reduction techniques for the integrative analysis of multi-omics data. Brief Bioinform. 2016; 17: 628-641; 22. Tenenhaus A. Tenenhaus M. Regularized generalized canonical correlation analysis. Psychometrika. 2011; 76: 257-284). To solve this problem, MOGSA incorporates gene-set annotation in the correlated patterns of molecules resulting from MF and calculates scores for gene-sets in each biological sample, providing simple, accessible biological interpretation. We showed that integrative ssGSA by MOGSA has higher sensitivity and specificity for the detection of differentially expressed gene-sets compared with popular ssGSA approaches when applied to simulated data. To demonstrate result interpretation and application, we applied MOGSA to both small- and large-scale biological data from high throughput experiments.
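The idea of combining a low-rank factorization with gene-set annotation can be illustrated with a small sketch. This is not the MOGSA algorithm, whose details are given in the Supplementary Methods; it is a simplified stand-in that assumes each data block is centered and scaled by its first singular value (an MFA-style weighting), that the stacked blocks are approximated by a rank-k SVD, and that scores are obtained by multiplying a binary annotation matrix with the reconstruction. The function name `lowrank_gene_set_scores` and its parameters are hypothetical.

```python
import numpy as np

def lowrank_gene_set_scores(blocks, annots, k=3):
    """Illustrative gene-set scores from a rank-k reconstruction of stacked blocks.

    blocks: list of (features x samples) arrays sharing the same samples.
    annots: list of (features x gene-sets) 0/1 arrays, one per block.
    """
    scaled = []
    for B in blocks:
        B = B - B.mean(axis=1, keepdims=True)                      # center each feature
        scaled.append(B / np.linalg.svd(B, compute_uv=False)[0])   # weight block by first singular value
    X = np.vstack(scaled)                                          # all features x samples
    A = np.vstack(annots).astype(float)                            # all features x gene-sets
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    X_k = (U[:, :k] * d[:k]) @ Vt[:k]                              # rank-k approximation
    return A.T @ X_k                                               # gene-sets x samples
```

Because the annotation matrices are stacked block by block, the features of different data sets never need to be matched to each other, which mirrors the point above that no intersection of gene identifiers is required.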
This project describes a new algorithm called MOGSA. The mathematical details of the algorithm are provided in the Supplementary Methods. We rigorously tested and validated the performance of the method using four distinct use-cases. First, we used simulated data to benchmark the performance of MOGSA and compared it to the most widely used methods in the field. Using diverse scenarios, each with 100 simulated data sets, we demonstrated that MOGSA can integrate multiple data sets, thus increasing the sensitivity to identify gene-sets with subtle perturbations. Second, using well-characterized multi-omics cell line data, we demonstrate the benefit of removing a source of noise (batch effect/biological bias) by excluding a component to amplify the signal in gene-set analysis. We applied this to removing the effect of cell line doubling time in multiple transcriptomics data sets of 59 cell lines. The third use-case examined one of the most common needs of biological laboratories: the integrative analysis of diverse molecular data obtained on a small number of biological samples. We integrated transcriptome, proteome and phosphoproteome data on four iPS ES cell lines and demonstrated how to interpret the gene-set scores to reveal which data set contributed most to a specific biological process. Finally, the fourth use-case examined molecular subtype discovery using multi-omics data by the integrative analysis of copy number variation and transcriptome data of 308 bladder tumors. We demonstrate how to rigorously apply the multi-omics single sample GSA method to discover molecular subtypes and interpret the biological basis for each tumor subtype in large-scale multi-omics studies. In each case, the data are publicly available, and we provide details on the methods and code such that our analysis can be reproduced.
We simulated 100 multiple omics data projects. Each simulated data set was a triplet (K = 3) containing three data matrices (supplemental Fig.
S1); each matrix had the dimension 1000 × 30, representing 30 matched observations (n = 30) and 1,000 features (pk = 1000). Each data set had an annotation matrix, which assigned each feature to one of 20 "gene-sets." The gene-sets were set to be either overlapping or non-overlapping with each other. In the non-overlap setting, there were no shared candidate features in different gene-sets and therefore the number of DE genes could be controlled exactly. In the overlap setting, the candidate features of different gene-sets were randomly selected and thus one feature may belong to more than one gene-set. As a result, the number of DE features could not be controlled exactly, but this scenario is more analogous to real gene-set annotations. The binary annotation matrix had dimensions of 1000 features × 20 gene-sets. Each gene-set contained 50 genes. The 30 observations were defined by 6 equal sized clusters with 5 samples per cluster. In each observation, 5 out of 20 gene-sets were simulated as differentially expressed (DE). For observations in the same cluster, the same set of DE gene-sets was randomly selected, as we assumed that DE gene-sets define the difference between clusters. For a DE gene-set, several genes were randomly simulated as DE genes (DEG), denoted as DEGj. Random selection of DEGs means that the DEGs in different data sets may overlap. In separate simulations, we varied the number of DEGs per gene-set (e.g. 5, 10, and 25 out of 50) or the mean signal-to-noise ratio. We used the following linear additive model, adapted from (9), in which the expression or abundance of the gene on the ith row and jth column is simulated as
$$X_{ij} = \alpha_i + \beta_l + \gamma_{ij} + \varepsilon_{ij} \tag{1}$$
where αi, with i = 1, …, p, is a gene specific effect. βl ∼ N(μ = 0, σ = s) is the cluster effect.
For observations belonging to the same cluster l, the same βl was applied. The cluster effect factor (categorical variable) was introduced following the hypothesis that observations from the same clusters are driven by some common pathways or "gene-sets" and ensures that observations from the same cluster have a higher within than between cluster correlation. The six correlated clusters in the simulated data were captured by the first five components. The cluster effect βl ∼ N(μ = 0, σ = s) was sampled from a distribution with a mean of 0 and standard deviation s. The standard deviation (s) adjusts the correlation between observations in the same cluster, and thus each cluster can have different within cluster variance and different proportions of variance would be captured by the top five components. In this study, we set s = 0.3, 0.5 and 1.0, which led to 25, 30 and 50% of total variance captured by the top 5 components. ε ∼ N(μ = 0, σ = 1) is the noise factor. γij is the differential expression factor describing if a gene is differentially expressed (DE):
$$\gamma_{ij} \begin{cases} \sim N(\mu = m, \sigma = 1) & \text{if } i \in \mathrm{DEG}_j \\ = 0 & \text{otherwise} \end{cases} \tag{2}$$
Apart from the variance retained by the top five components, two other parameters were tuned in the simulation study. The first was the number of DEGs in a DE gene-set (5, 10, and 25 out of 50 genes). The second was the signal-to-noise ratio, which was tuned by modifying m in Equation 2. The candidate values of m were 0.3, 0.5, and 0.8, standing for low, medium and high signal-to-noise ratio. In total, 100 projects of triplet data sets were generated. The matrix triplets were analyzed by MOGSA. NMM, GSVA and ssGSEA accept only one matrix as input; therefore, the three simulated matrices were concatenated into one grand matrix in these analyses.
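The simulation model in Equations 1 and 2 can be sketched for a single data matrix as follows. This is a minimal illustration of the non-overlap setting, not the authors' code, and several choices are assumptions because the text does not fix them: αi is drawn from N(0, 1), βl is drawn independently for each feature and cluster, and the DEGs of a DE gene-set are drawn separately for each observation j. The function name `simulate_matrix` and its parameter names are hypothetical.

```python
import numpy as np

def simulate_matrix(rng, p=1000, n_clusters=6, per_cluster=5, n_sets=20,
                    set_size=50, n_de_sets=5, n_deg=10, s=0.5, m=0.5):
    """Return one p x n matrix simulated from X = alpha + beta + gamma + eps."""
    n = n_clusters * per_cluster
    cluster = np.repeat(np.arange(n_clusters), per_cluster)   # cluster label l per observation

    # Non-overlapping annotation (requires p == n_sets * set_size):
    # each feature belongs to exactly one gene-set.
    members = rng.permutation(p).reshape(n_sets, set_size)
    annot = np.zeros((p, n_sets), dtype=bool)
    for g in range(n_sets):
        annot[members[g], g] = True

    # The same DE gene-sets are shared by all observations of a cluster.
    de_sets = np.zeros((n_sets, n), dtype=bool)
    for l in range(n_clusters):
        chosen = rng.choice(n_sets, size=n_de_sets, replace=False)
        de_sets[np.ix_(chosen, np.flatnonzero(cluster == l))] = True

    alpha = rng.normal(0.0, 1.0, size=(p, 1))                     # ASSUMPTION: alpha_i ~ N(0, 1)
    beta = rng.normal(0.0, s, size=(p, n_clusters))[:, cluster]   # ASSUMPTION: one draw per feature and cluster
    eps = rng.normal(0.0, 1.0, size=(p, n))                       # noise, N(0, 1)

    gamma = np.zeros((p, n))
    for j in range(n):                                            # ASSUMPTION: DEGs drawn per observation
        for g in np.flatnonzero(de_sets[:, j]):
            degs = rng.choice(members[g], size=n_deg, replace=False)
            gamma[degs, j] = rng.normal(m, 1.0, size=n_deg)       # Equation 2

    return alpha + beta + gamma + eps, annot, de_sets, cluster
```

A triplet as described above would call the model three times while reusing the same cluster labels and DE gene-set choices; the sketch draws them anew on each call for brevity.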
The performance was assessed by the area under the ROC curve (AUC).
Processed mRNA expression data (normalized score averaged from 5 microarray platforms) and clinical information were downloaded from CELLMINER (download date: 06/01/2017) (23. Shankavaram U.T. Varma S. Kane D. Sunshine M. Chary K.K. Reinhold W.C. Pommier Y. Weinstein J.N. CellMiner: a relational database and query tool for the NCI-60 cancer cell lines. BMC Genomics. 2009; 10: 277). Quantitative proteome profiles were downloaded from the supplementary table of (24. Gholami A.M. Hahne H. Wu Z. Auer F.J. Meng C. Wilhelm M. Kuster B. Global proteome analysis of the NCI-60 cell line panel. Cell Rep. 2013; 4: 609-620). The proteome data were quantified and normalized using the iBAQ method (25. Schwanhausser B. Busse D. Li N. Dittmar G. Schuchhardt J. Wolf J. Chen W. Selbach M. Global quantification of mammalian gene expression control. Nature. 2011; 473: 337-342) and iBAQ values were transformed by xi = log10(iBAQi + 1). The LC cell line SNB19 and melanoma cell line MDAN were missing mRNA data and were therefore excluded from the analysis. A random sampling method was applied to determine the number of components that represented significant correlated structure between data sets. First, MFA was applied to NCI60 transcriptome and proteome data and the (true) variance associated with each MFA component was recorded. Next, cell line labels were randomly shuffled in both transcriptomics and proteomics data and the variance of the components was calculated from the randomly relabeled data. We repeated this process 20 times to estimate the null distribution of variances associated with each component. The variance of the top three components was significantly higher than the null distribution (supplemental Fig. S2).
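The label-shuffling procedure used to choose the number of components can be sketched as below. The MFA step here is a minimal stand-in, assuming that each block is centered, divided by its first singular value, and then stacked before an SVD; it is not the implementation in the "mogsa" package. The function names `mfa_variances` and `permutation_null` are hypothetical, and shuffling the sample order independently within each block is one reading of "cell line labels were randomly shuffled in both" data sets.

```python
import numpy as np

def mfa_variances(blocks, n_comp=5):
    """Variances of the top components of a minimal MFA on (features x samples) blocks."""
    scaled = []
    for B in blocks:
        B = B - B.mean(axis=1, keepdims=True)                      # center each feature
        scaled.append(B / np.linalg.svd(B, compute_uv=False)[0])   # first singular value becomes 1
    sv = np.linalg.svd(np.vstack(scaled), compute_uv=False)
    return sv[:n_comp] ** 2

def permutation_null(blocks, n_perm=20, n_comp=5, seed=0):
    """Null distribution of component variances after breaking the sample matching."""
    rng = np.random.default_rng(seed)
    null = np.empty((n_perm, n_comp))
    for b in range(n_perm):
        # Shuffle sample labels independently in every block.
        shuffled = [B[:, rng.permutation(B.shape[1])] for B in blocks]
        null[b] = mfa_variances(shuffled, n_comp)
    return null
```

Components whose observed variance exceeds the corresponding null values would be retained; with only 20 permutations the null distribution is coarse, so the comparison is best read alongside a plot such as supplemental Fig. S2.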
The transcriptomics (RNA-sequencing), proteomics and phosphoproteomics data were downloaded from the Stem Cell-Omic Repository (supplemental Table S1, S2, and S5 from http://scor.chem.wisc.edu/data.php) (26. Phanstiel D.H. Brumbaugh J. Wenger C.D. Tian S. Probasco M.D. Bailey D.J. Swaney D.L. Tervo M.A. Bolin J.M. Ruotti V. Stewart R. Thomson J.A. Coon J.J. Proteomic and phosphoproteomic comparison of human ES and iPS cells. Nat. Methods. 2011; 8: 821-827). In this study, we used the 4-plex data, which consisted of 17,347 genes, 7952 proteins and 10,499 sites of phosphorylation in four cell lines. For the transcriptomics data, the expression levels of genes were represented by RPKM values. Raw RPKM and MS-based proteomics intensity data may quantify many low-abundance molecules and are therefore strongly skewed and contain outliers (27. Zwiener I. Frisch B. Binder H. Transforming RNA-Seq data to improve the performance of prognostic gene signatures. PLoS ONE. 2014; 9: e85150). MFA, the matrix decomposition method used in MOGSA, is sensitive to outliers because it minimizes the sum of squared errors between the original matrix and its low-rank approximation. Therefore, low-abundance genes with a mean RPKM (across the 12 samples) lower than 1 were excluded and RPKM values were further log transformed (log10). The distribution of log transformed data before and after filtering is shown in supplemental Fig. S3. The mean RPKM value of the three replicates was used. When a gene symbol was present more than once in a data set, the one with the higher average RPKM was retained. The iTRAQ quantification of protein and phosphorylation sites was performed by TagQuant (28. Wenger C.D. Phanstiel D.H. Lee M.V. Bailey D.J. Coon J.J. COMPASS: a suite of pre- and post-search proteomics software tools for OMSSA. Proteomics.
2011; 11: 1064-1074), as described in (26). The intensities of proteins and phospho-sites were log transformed (base 10). Proteins and phospho-sites with low intensity (summed log intensity across samples < 20) were removed to exclude low-abundance proteins/sites that are detected in a small number of samples. In this data set, we observed that an intensity threshold of 20 was optimal to retain proteins/sites that were consistently measured in all three replicates of one of the four cell types (average 6.7 in each replicate; see supplemental Fig. S3) but missing in other cell types. Therefore, proteins/sites exclusive to a cell type, which may represent interesting biology, were retained. In the proteomics data, proteins that were not mapped to an official gene symbol were removed. After filtering, 10,961, 5817, and 7912 features were retained in the transcriptomic, proteomic and phospho-proteomic data sets, respectively. A few missing values were still present and were replaced with zero values. The enrichment analysis was performed at the gene symbol level; the specific phosphorylation sites were not considered. PCA of each individual data set is shown in supplemental Fig. S4. The strongest signal (first PCs) in all three data sets was the difference between NFF cells and the stem cell lines, and this difference was particularly apparent in the proteomics data sets. The second and third components represented subtle differences between iPSC and ESC lines.
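The two filtering rules described above (mean RPKM of at least 1 followed by a log10 transform, and a summed log10 intensity of at least 20 with missing values set to zero) can be sketched as follows. The function names, the `pseudocount` argument and the treatment of zero or NaN intensities as missing are assumptions made for illustration; the text does not state whether a pseudo-count was added before taking logarithms of RPKM values.

```python
import numpy as np

def filter_rpkm(rpkm, min_mean=1.0, pseudocount=0.0):
    """Keep genes (rows) whose mean RPKM across samples is >= min_mean, then log10."""
    keep = rpkm.mean(axis=1) >= min_mean
    # ASSUMPTION: pseudocount defaults to 0; set it > 0 if retained genes contain zeros.
    return np.log10(rpkm[keep] + pseudocount), keep

def filter_intensity(intensity, min_sum_log=20.0):
    """Keep proteins/sites (rows) whose summed log10 intensity is >= min_sum_log."""
    with np.errstate(divide="ignore"):
        logged = np.log10(intensity)
    logged[~np.isfinite(logged)] = 0.0    # missing (NaN) or zero intensities become 0
    keep = logged.sum(axis=1) >= min_sum_log
    return logged[keep], keep
```

Under this rule a feature measured at roughly 10^6.7 in only three samples (for example, the three replicates of one cell type) already passes the threshold of 20, which is consistent with the stated aim of retaining cell type-exclusive proteins/sites.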
Normalized Illumina HiSeq platform mRNA gene expression, copy number variation (CNV) and clinical information of BLCA patients were downloaded from TCGA (Date: 09/26/2014) using TCGA assembler (29. Zhu Y. Qiu P. Ji Y. TCGA-assembler: open-source software for retrieving and processing TCGA data. Nat. Methods. 2014; 11: 599-600). MapSplice and RSEM algorithms were used for the short-read alignment and quantification of the mRNA sequencing data (referred to as RNASeqV2 in TCGA) (30. Wang K. Singh D. Zeng Z. Coleman S.J. Huang Y. Savich G.L. He X. Mieczkowski P. Grimm S.A. Perou C.M. MacLeod J.N. Chiang D.Y. Prins J.F. Liu J. MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res. 2010; 38: e178; 31. Li B. Ruotti V. Stewart R.M. Thomson J.A. Dewey C.N. RNA-Seq gene expression estimation wi

Reference(s)