Epigenetic Variability Confounds Transcriptome but Not Proteome Profiling for Coexpression-based Gene Function Prediction
2018; Elsevier BV; Volume: 17; Issue: 11 Linguagem: Inglês
10.1074/mcp.ra118.000935
ISSN1535-9484
AutoresPiotr Grabowski, Georg Kustatscher, Juri Rappsilber,
Tópico(s)Gene expression and cancer classification
ResumoGenes are often coexpressed with their genomic neighbors, even if these are functionally unrelated. For small expression changes driven by genetic variation within the same cell type, non-functional mRNA coexpression is not propagated to the protein level. However, it is unclear if protein levels are also buffered against any non-functional mRNA coexpression accompanying large, regulated changes in the gene expression program, such as those occurring during cell differentiation. Here, we address this question by analyzing mRNA and protein expression changes for housekeeping genes across 20 mouse tissues. We find that a large proportion of mRNA coexpression is indeed non-functional and does not lead to coexpressed proteins. Chromosomal proximity of genes explains a proportion of this nonfunctional mRNA coexpression. However, the main driver of non-functional mRNA coexpression across mouse tissues is epigenetic similarity. Both factors together provide an explanation for why monitoring protein coexpression outperforms mRNA coexpression data in gene function prediction. Furthermore, this suggests that housekeeping genes translocating during evolution within genomic subcompartments might maintain their broad expression pattern. Genes are often coexpressed with their genomic neighbors, even if these are functionally unrelated. For small expression changes driven by genetic variation within the same cell type, non-functional mRNA coexpression is not propagated to the protein level. However, it is unclear if protein levels are also buffered against any non-functional mRNA coexpression accompanying large, regulated changes in the gene expression program, such as those occurring during cell differentiation. Here, we address this question by analyzing mRNA and protein expression changes for housekeeping genes across 20 mouse tissues. We find that a large proportion of mRNA coexpression is indeed non-functional and does not lead to coexpressed proteins. Chromosomal proximity of genes explains a proportion of this nonfunctional mRNA coexpression. However, the main driver of non-functional mRNA coexpression across mouse tissues is epigenetic similarity. Both factors together provide an explanation for why monitoring protein coexpression outperforms mRNA coexpression data in gene function prediction. Furthermore, this suggests that housekeeping genes translocating during evolution within genomic subcompartments might maintain their broad expression pattern. Genes are not arranged randomly but tend to be clustered in the genome into coexpressed domains (1Hurst L.D. Pál C. Lercher M.J. The evolutionary dynamics of eukaryotic gene order.Nat. Rev. Genet. 2004; 5: 299-310Crossref PubMed Scopus (538) Google Scholar). Such clustering can be a regulatory strategy of both prokaryotic and eukaryotic genomes. Interestingly, this does not mean that genes that are coexpressed are necessarily also linked functionally. There exist gene clusters that tend to be coexpressed, yet lack evident cofunctionality (1Hurst L.D. Pál C. Lercher M.J. The evolutionary dynamics of eukaryotic gene order.Nat. Rev. Genet. 2004; 5: 299-310Crossref PubMed Scopus (538) Google Scholar, 2Williams E.J.B. Bowles D.J. Coexpression of neighboring genes in the genome of Arabidopsis thaliana.Genome Res. 2004; 14: 1060-1067Crossref PubMed Scopus (198) Google Scholar). This is especially visible for bidirectional gene pairs which are coexpressed because of shared regulatory context, but commonly seem to lack a functional relationship (3Xu C. Chen J. Shen B. The preservation of bidirectional promoter architecture in eukaryotes: what is the driving force?.BMC Syst. Biol. 2012; 6: S21Crossref PubMed Scopus (17) Google Scholar). This has an impact on gene coexpression studies which infer functional associations between genes based on similar gene activity. Coexpression of spatially close genes can be driven by stochastic transcriptional bursting (4Raj A. Peskin C.S. Tranchina D. Vargas D.Y. Tyagi S. Stochastic mRNA synthesis in mammalian cells.PLos Biol. 2006; 4: e309Crossref PubMed Scopus (1171) Google Scholar) or transcriptional interference between neighboring genes (5Ebisuya M. Yamamoto T. Nakajima M. Nishida E. Ripples from neighbouring transcription.Nat. Cell Biol. 2008; 10: 1106-1113Crossref PubMed Scopus (205) Google Scholar). The existence of coexpressed gene clusters that lack a functional connection is intriguing given that nonspecific gene expression should have a negative impact on cell fitness. Interestingly, Hurst and colleagues have shown that clustered genes mutually reinforce their active state and are less likely to be accidentally silenced, for example by stochastic fluctuations of chromatin states (6Batada N.N. Hurst L.D. Evolution of chromosome organization driven by selection for reduced gene expression noise.Nat. Genet. 2007; 39: 945-949Crossref PubMed Scopus (150) Google Scholar). Therefore, clustered genes show lower expression noise, a benefit that may offset the negative impact of their coincidental coexpression. In agreement with this, we have recently demonstrated that coexpression of proximal genes, both in terms of sequence and 3D genomic proximity, is pervasive in the human genome. Importantly, however, coexpression of spatially close, functionally unrelated genes is restricted to their mRNA abundances and is not propagated to the protein level (7Kustatscher G. Grabowski P. Rappsilber J. Pervasive coexpression of spatially proximal genes is buffered at the protein level.Mol. Syst. Biol. 2017; 13: 937Crossref PubMed Scopus (48) Google Scholar). This protein-level buffering of non-functional mRNA coexpression supports the idea that reduction of expression noise is a key driver of the evolution of genome organization. Consequently, function prediction is based better on protein coexpression than mRNA coexpression data (8Wang J. Ma Z. Carr S.A. Mertins P. Zhang H. Zhang Z. Chan D.W. Ellis M.J.C. Townsend R.R. Smith R.D. McDermott J.E. Chen X. Paulovich A.G. Boja E.S. Mesri M. Kinsinger C.R. Rodriguez H. Rodland K.D. Liebler D.C. Zhang B. Proteome Profiling Outperforms Transcriptome Profiling for Coexpression Based Gene Function Prediction.Mol. Cell Proteomics. 2017; 16: 121-134Abstract Full Text Full Text PDF PubMed Scopus (75) Google Scholar, 9Lapek Jr, J.D. Greninger P. Morris R. Amzallag A. Pruteanu-Malinici I. Benes C.H. Haas W. Detection of dysregulated protein-association networks by high-throughput proteomics predicts cancer vulnerabilities.Nat. Biotechnol. 2017; 35: 983-989Crossref PubMed Scopus (95) Google Scholar). Our previous analysis was based on a panel of human lymphoblastoid cell lines (LCLs) 1The abbreviations used are:LCLlymphoblastoid cell linesCDScoding sequenceCVcoefficient of variationFDRfalse discovery rateFPKMfragments per kilobase millionGAMgeneralized additive modelGEOgene expression omnibusGOgene ontologyPCCPearson correlation coefficientRPKMreads per kilobase millionSILACstable isotope labeling with amino acids in cell cultureTPMtranscripts per millionTSStranscription start siteUTRuntranslated region. 1The abbreviations used are:LCLlymphoblastoid cell linesCDScoding sequenceCVcoefficient of variationFDRfalse discovery rateFPKMfragments per kilobase millionGAMgeneralized additive modelGEOgene expression omnibusGOgene ontologyPCCPearson correlation coefficientRPKMreads per kilobase millionSILACstable isotope labeling with amino acids in cell cultureTPMtranscripts per millionTSStranscription start siteUTRuntranslated region. for which the expression changes had a prominent noise component owing to the little variability between the cell lines. A related analysis of human cancer panels also found mRNA—but not protein—coexpression to reflect chromosomal gene colocalization (8Wang J. Ma Z. Carr S.A. Mertins P. Zhang H. Zhang Z. Chan D.W. Ellis M.J.C. Townsend R.R. Smith R.D. McDermott J.E. Chen X. Paulovich A.G. Boja E.S. Mesri M. Kinsinger C.R. Rodriguez H. Rodland K.D. Liebler D.C. Zhang B. Proteome Profiling Outperforms Transcriptome Profiling for Coexpression Based Gene Function Prediction.Mol. Cell Proteomics. 2017; 16: 121-134Abstract Full Text Full Text PDF PubMed Scopus (75) Google Scholar). However, it remains to be seen if a similar uncoupling of transcriptome and proteome exists also for strong, regulated and biologically important expression changes. For example, different cell types have different metabolic needs, morphology, organelle numbers and sizes. Even for ubiquitously expressed housekeeping genes, this can amount to large quantitative differences in expression levels. Here, we investigate the impact of genome organization and epigenetic states on mRNA and protein coexpression across different mouse tissues by integrating multiple published omics data sets. We show that the observations made on cell lines regarding factors governing mRNA and protein coexpression also hold in tissues, with changes in the relative weights of the contributions from genome position versus epigenetic state. We point at possible biases in expression profiling for functional genomics that researchers should consider. lymphoblastoid cell lines coding sequence coefficient of variation false discovery rate fragments per kilobase million generalized additive model gene expression omnibus gene ontology Pearson correlation coefficient reads per kilobase million stable isotope labeling with amino acids in cell culture transcripts per million transcription start site untranslated region. lymphoblastoid cell lines coding sequence coefficient of variation false discovery rate fragments per kilobase million generalized additive model gene expression omnibus gene ontology Pearson correlation coefficient reads per kilobase million stable isotope labeling with amino acids in cell culture transcripts per million transcription start site untranslated region. SILAC mouse tissue proteomes were downloaded from (10Geiger T. Velic A. Macek B. Lundberg E. Kampf C. Nagaraj N. Uhlen M. Cox J. Mann M. Initial quantitative proteomic map of 28 mouse tissues using the SILAC mouse.Mol. Cell Proteomics. 2013; 12: 1709-1722Abstract Full Text Full Text PDF PubMed Scopus (166) Google Scholar), normalized SILAC H/L ratios for each tissue extracted and log2-transformed. SILAC kidney values were obtained by averaging expression values for kidney cortex and medulla. Transcriptomics profiling data of tissues were obtained from (11ENCODEProject Consortium An integrated encyclopedia of DNA elements in the human genome.Nature. 2012; 489: 57-74Crossref PubMed Scopus (11104) Google Scholar, 12Barrett T. Wilhite S.E. Ledoux P. Evangelista C. Kim I.F. Tomashevsky M. Marshall K.A. Phillippy K.H. Sherman P.M. Holko M. Yefanov A. Lee H. Zhang N. Robertson C.L. Serova N. Davis S. Soboleva A. NCBI GEO: archive for functional genomics data sets–update.Nucleic Acids Res. 2013; 41: D991-D995Crossref PubMed Scopus (5129) Google Scholar, 13Brosens J.J. Salker M.S. Teklenburg G. Nautiyal J. Salter S. Lucas E.S. Steel J.H. Christian M. Chan Y.-W. Boomsma C.M. Moore J.D. Hartshorne G.M. Sućurović S. Mulac-Jericevic B. Heijnen C.J. Quenby S. Koerkamp M.J.G. Holstege F.C.P. Shmygol A. Macklon N.S. Uterine selection of human embryos at implantation.Sci. Rep. 2014; 4: 3894Crossref PubMed Scopus (184) Google Scholar, 14Kim H. Toyofuku Y. Lynn F.C. Chak E. Uchida T. Mizukami H. Fujitani Y. Kawamori R. Miyatsuka T. Kosaka Y. Yang K. Honig G. van der Hart M. Kishimoto N. Wang J. Yagihashi S. Tecott L.H. Watada H. German M.S. Serotonin regulates pancreatic beta cell mass during pregnancy.Nat. Med. 2010; 16: 804-808Crossref PubMed Scopus (427) Google Scholar, 15Mustafi D. Kevany B.M. Genoud C. Okano K. Cideciyan A.V. Sumaroka A. Roman A.J. Jacobson S.G. Engel A. Adams M.D. Palczewski K. Defective photoreceptor phagocytosis in a mouse model of enhanced S-cone syndrome causes progressive retinal degeneration.FASEB J. 2011; 25: 3157-3176Crossref PubMed Scopus (68) Google Scholar) (links in supplemental Table S1). Data downloaded from ENCODE were in Gencode M4-aligned bam format with the only exception of the skeletal muscle data which were downloaded in fastq format and aligned using TopHat v2.0.9 and Gencode M4 annotation. The TopHat settings were set to default apart from using "bowtie1" parameter and library type set to "fr-secondstrand." The bam files were subsequently processed using Cufflinks 2.2.1 with default settings to obtain gene expression (fragments per kilobase of exon model per million mapped reads, FPKM) values. The three tissues downloaded from GEO were in normalized FPKM or RPKM format. All the mRNA expression data were transformed into a common transcripts per million (TPM) unit. To make the RNAseq data set comparable with the proteomics data, each mRNA expression value was divided by a median expression value for a given gene in all 20 tissues (analogously to the Super-SILAC approach (16Geiger T. Cox J. Ostasiewicz P. Wisniewski J.R. Mann M. Super-SILAC mix for quantitative proteomics of human tumor tissue.Nat. Methods. 2010; 7: 383-385Crossref PubMed Scopus (433) Google Scholar) used in the proteomics data set). Finally, the normalized TPM ratios were log2-transformed. The resulting mRNA and protein expression data set contains 3391 genes with expression values in at least 8 tissues on both mRNA and protein levels. The proteomics data and mRNA data contain 15.5% and 6.7% missing values, respectively. The processed data set is available as supplemental File S1. ChIPseq data for 9 mouse tissues (marks: H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me3, H3K79me2) were obtained from ENCODE in bigWig format (fold change versus control). The data for H3K9ac was available only for two tissues. To extract mean ChIPseq signal per gene body for all tissues, a UCSC bigWigAverageOverBed command line tool was used in conjunction with a custom-made bed file based on Gencode M4 mouse gene annotation. The processed ChIPseq data set is available as supplemental File S2. To obtain the between-gene correlation values the data were centered at 0 for each experiment and a Pearson correlation coefficient was calculated using R function "corr.test" from the psych package with the "use" parameter set to "pairwise.complete.obs." For improved statistical power, correlations were calculated only for genes which had data in at least 8 overlapping tissues (both on protein and mRNA levels). Gene pairs were considered correlated if their PCC value was > 0.5. For subsequent analyses, only correlations with Benjamini-Hochberg adjusted p values < 0.05 were considered. Mouse gene positions on mm10 genome were obtained from Ensembl Biomart (17Yates A. Akanni W. Amode M.R. Barrell D. Billis K. Carvalho-Silva D. Cummins C. Clapham P. Fitzgerald S. Gil L. Girón C.G. Gordon L. Hourlier T. Hunt S.E. Janacek S.H. Johnson N. Juettemann T. Keenan S. Lavidas I. Martin F.J. Maurel T. McLaren W. Murphy D.N. Nag R. Nuhn M. Parker A. Patricio M. Pignatelli M. Rahtz M. Riat H.S. Sheppard D. Taylor K. Thormann A. Vullo A. Wilder S.P. Zadissa A. Birney E. Harrow J. Muffato M. Perry E. Ruffier M. Spudich G. Trevanion S.J. Cunningham F. Aken B.L. Zerbino D.R. Flicek P. Ensembl 2016.Nucleic Acids Res. 2016; 44: D710-D716Crossref PubMed Scopus (1042) Google Scholar, 18Durinck S. Spellman P.T. Birney E. Huber W. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt.Nat. Protoc. 2009; 4: 1184-1191Crossref PubMed Scopus (1725) Google Scholar) (state on 29.06.2017). For gene distance calculation, first base pair of each gene's outermost transcription start site (TSS) was used and distances between those positions calculated for each gene pair. Two Pearson Chi-squared tests were performed on two 2 × 2 contingency tables (for mRNA and protein levels). The first contingency table (mRNA-level) divided gene pairs by two variables. The first variable considered genomic distances between the gene pairs (close-by/other) and the second variable divided the gene pairs according to their mRNA coexpression (gene pairs with mRNA Pearson correlation coefficient > 0.5 and BH-adjusted p value < 0.05 were considered correlated and all other pairs were considered uncorrelated). Similarly, for the protein-level analysis, the first variable was genomic proximity. In the second variable, pairs were correlated if they both had mRNA and protein PCC > 0.5 and the BH-adjusted p value < 0.05. The miRNA/gene mapping data for mouse brain were obtained from (19Moore M.J. Scheel T.K.H. Luna J.M. Park C.Y. Fak J.J. Nishiuchi E. Rice C.M. Darnell R.B. miRNA–target chimeras reveal miRNA 3′-end pairing as a major determinant of Argonaute target specificity.Nat. Commun. 2015; 6: 8864Crossref PubMed Scopus (186) Google Scholar). The CDS lengths of coexpressed genes were obtained from Biomart using Ensembl Genes 92 database and the GRCm38.p6 data set. The genes were considered to have similar CDS length if the ratio of the length of the longer CDS to the shorter CDS was below 1.5. The liver time-series ribosome profiling data was obtained from (20Janich P. Arpat A.B. Castelo-Szekely V. Lopes M. Gatfield D. Ribosome profiling reveals the rhythmic liver translatome and circadian clock regulation by upstream open reading frames.Genome Res. 2015; 25: 1848-1859Crossref PubMed Scopus (108) Google Scholar). Ribosome profiling matrices were scaled using the accompanying mRNA expression data and the resulting ratios were log2-transformed. Finally, Pearson correlation coefficients between genes were calculated using R function "corr.test" from the "psych" package (21Revelle, W. R., (2017) psych: Procedures for personality and psychological research.Google Scholar). Gene pairs with Pearson correlation coefficient > 0.5 and the Holm-adjusted p value < 0.001 were considered as correlated. Protein translation rates were obtained from (22Schwanhäusser B. Busse D. Li N. Dittmar G. Schuchhardt J. Wolf J. Chen W. Selbach M. Global quantification of mammalian gene expression control.Nature. 2011; 473: 337-342Crossref PubMed Scopus (4095) Google Scholar). For each gene pair, a ratio of their translation rates was calculated, log2-transformed and the absolute values taken. Gene pairs were considered to have similar translation rates if this absolute log2 ratio was lower or equal to 1. The protein degradation profiles were obtained from (23McShane E. Sin C. Zauber H. Wells J.N. Donnelly N. Wang X. Hou J. Chen W. Storchova Z. Marsh J.A. Valleriani A. Selbach M. Kinetic Analysis of Protein Stability Reveals Age-Dependent Degradation.Cell. 2016; 167: 803-815.e21Abstract Full Text Full Text PDF PubMed Scopus (170) Google Scholar) and gene pairs coding at least one nonexponentially degraded protein were counted. The Pearson correlation coefficients for all gene pairs were used to cluster the mRNA and protein data separately. An R clustering function "kmeans" was used for this purpose. The first k value that explained 50% of the variance in the data was selected. The percentage of variance explained was defined as the ratio of the between sum of squares to the total sum of squares for every given k. The parameter "nstart" was set to 3 and "max.iter" set to 20. Subcellular localization annotation was obtained from Uniprot (24The UniProt Consortium UniProt: the universal protein knowledgebase.Nucleic Acids Res. 2017; 45: D158-D169Crossref PubMed Scopus (3136) Google Scholar). Proteins localized to more than one subcellular compartment were removed. Endoplasmic reticulum was joined with Golgi as "ER/Golgi" to balance the group sizes. Only "nucleus," "mitochondrion," and joined "ER/Golgi" groups were considered for subcellular localization enrichments. The expected value for each cluster was defined as the percentage of proteins with the given subcellular localization annotation in the data. The observed value was calculated as a percentage of those proteins in the given cluster. Finally, log2 observed/expected values were calculated for each of the cluster and subcellular localization. Gene Ontology enrichments were performed using DAVID online service (25Huang D.W. Sherman B.T. Lempicki R.A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources.Nat. Protoc. 2009; 4: 44-57Crossref PubMed Scopus (25496) Google Scholar). All Uniprot Accession numbers belonging to each of the clusters were used as a query and the whole mouse genome used as background for statistical analysis. The top 5 significantly enriched terms were reported for each cluster (FDR < 0.01). The median log2 fold-change values used in Fig. 2E were calculated as follows: the median of the epigenetic signal of genes over all clusters in each tissue served as the expected value. The observed value was the median epigenetic signal in a given combination of cluster and tissue. Finally, a log2 observed/expected value was obtained showing the relative enrichment of the epigenetic signal between clusters for each tissue.Fig. 3The impact of gene proximity and epigenetic similarity on mRNA- and protein-level coregulation. A, Positional gene clustering reduces the expression variability on mRNA level. We calculated the expression variability (coefficient of variation, CV) of the 5% most and 5% least positionally clustered genes on the genome (i.e. considering their sequence proximity). The difference is significant (using Mann-Whitney test) on both mRNA level (***p value = 0.00029) and protein level (*p value = 0.019). When using 10 and 1% most and least clustered genes, we obtain the same statistical results as with 5% (data not shown). Boxplot drawn in the style of Tukey, i.e. box limits indicate the first and third quartiles, central lines the median, whiskers extend 1.5 times the interquartile range from the box limits. Notches indicate the 95% confidence interval for comparing medians. B, Gene coregulation increases with epigenetic similarity at the mRNA level, whereas it remains largely independent from epigenetic similarity at the protein level. C, Epigenetic similarity is the major driver of the mRNA coregulation. Gene pairs were considered coregulated if their mRNA level correlation was > 0.5 and the BH-adjusted p values < 0.05. The bins were created by dividing gene pair distances and epigenetic similarity (1/Mahalanobis distance) into 10 roughly equal sets. This yielded 100 unique bin combinations. The color signifies the percentage of coregulated mRNA in each bin. The mean gene pair distance in the left-most column is 115 Mb and 2 Mb in the right-most column. White stars (*) mark corner sectors which have significantly higher mRNA coexpression compared with an equal-sized random background sample as judged by Kolmogorov-Smirnov test. The procedure was repeated 1000 times. The mean p values for sectors 1, 2, 3 and 4 were 0, 10−13, 0.039 and 6*10−9, respectively. p value of 0 is reported by the KS test for extremely low values. D, Effects in linear distance are confined to very close proximity. The 10 bins constituting the right-most column in Fig. 3C were extracted and magnified. The mean gene pair distance for the left-most column is 4 Mb and 240 Kb for the right-most column. E, Protein-level coregulation of housekeeping genes is not generally affected by epigenetic similarity or linear distance.View Large Image Figure ViewerDownload Hi-res image Download (PPT) Inverted Mahalanobis distance (1/Mahalanobis distance) was used to calculate the similarity between epigenetic profiles of genes. The "mahalanobis" R function was used with a user-specified covariance matrix. Distances between all possible pairs of genes located on same chromosomes were calculated. For each gene, the mean distance to its 5 nearest neighbors was calculated. The list of genes was sorted by increasing mean distance to their 5 nearest neighbors. Finally, the genes at the top and bottom 5% of the list were labeled as most and least positionally clustered, respectively. Gene expression variability at the mRNA and protein levels was calculated as the coefficient of variation (CV; standard deviation divided by the mean) of log2-transformed TPM and SILAC ratios. To avoid dividing by zero (for unchanged genes with a log2 ratio of zero), a constant value of 10 was added to all mRNA and protein log2 ratios before calculating the variability. All data processing was performed in R (26RCore Team. (2017) R: A Language and Environment for Statistical Computing.Google Scholar) and the plots made using the ggplot2 package (27Wickham H. ggplot2: Elegant graphics for data analysis. Springer, New York2009Crossref Google Scholar). The R scripts used to analyze data and generate most of the figures can be found on our GitHub (https://github.com/Rappsilber-Laboratory/tissue_mRNA_protein_scripts_MCP). We assembled a mouse tissue expression data set comprising 3391 genes in 20 different tissues by combining proteomics and transcriptomics from different sources. Protein abundance data were derived from a quantitative proteomics data set based on metabolic isotope labeling of mice (10Geiger T. Velic A. Macek B. Lundberg E. Kampf C. Nagaraj N. Uhlen M. Cox J. Mann M. Initial quantitative proteomic map of 28 mouse tissues using the SILAC mouse.Mol. Cell Proteomics. 2013; 12: 1709-1722Abstract Full Text Full Text PDF PubMed Scopus (166) Google Scholar). Transcriptomics data were obtained from the ENCODE Consortium (11ENCODEProject Consortium An integrated encyclopedia of DNA elements in the human genome.Nature. 2012; 489: 57-74Crossref PubMed Scopus (11104) Google Scholar) and Gene Expression Omnibus (GEO) repository (12Barrett T. Wilhite S.E. Ledoux P. Evangelista C. Kim I.F. Tomashevsky M. Marshall K.A. Phillippy K.H. Sherman P.M. Holko M. Yefanov A. Lee H. Zhang N. Robertson C.L. Serova N. Davis S. Soboleva A. NCBI GEO: archive for functional genomics data sets–update.Nucleic Acids Res. 2013; 41: D991-D995Crossref PubMed Scopus (5129) Google Scholar) (Fig. 1A). The tissue collection comprises few main broad functional categories such as the nervous system (cerebellum, brain cortex), digestive system (stomach, intestine, pancreas), immune system (thymus, spleen) and multifunctional organs such as the liver and kidney. To compare the gene expression between multiple tissues with enough statistical power, we used only genes expressed ubiquitously in all tissues as opposed to using tissue-specific genes. These so-called housekeeping genes account for about half of the genome in human (28Uhlén M. Fagerberg L. Hallström B.M. Lindskog C. Oksvold P. Mardinoglu A. Sivertsson Å Kampf C Sjöstedt E. Asplund A. Olsson I. Edlund K. Lundberg E. Navani S. Szigyarto C.A.-K. Odeberg J. Djureinovic D. Takanen J.O. Hober S. Alm T. Edqvist P.-H. Berling H. Tegel H. Mulder J. Rockberg J. Nilsson P. Schwenk J.M. Hamsten M. von Feilitzen K. Forsberg M. Persson L. Johansson F. Zwahlen M. von Heijne G. Nielsen J. Pontén F. Proteomics. Tissue-based map of the human proteome.Science. 2015; 347: 1260419Crossref PubMed Scopus (7354) Google Scholar) and presumably also in mouse. They are involved in basic cellular functions such as energy metabolism (including mitochondrial proteins), genome integrity maintenance, gene expression, protein trafficking, and cell structural functions. To generate a coexpression matrix for all observed gene pairs on both mRNA and protein level, we calculated their Pearson correlation coefficients (PCCs) across the 20 tissues (exemplified in supplemental Fig. S1). Importantly, compared with a previous study on lymphoblastoid cell lines (LCLs) (7Kustatscher G. Grabowski P. Rappsilber J. Pervasive coexpression of spatially proximal genes is buffered at the protein level.Mol. Syst. Biol. 2017; 13: 937Crossref PubMed Scopus (48) Google Scholar), the expression changes observed between tissues and consequently many different cell types were substantially larger (fold-change increased by a mean of ∼75% for both mRNA and proteins, Fig. 1B). We then assessed the quality and information content of the integrated data set by plotting the mRNA- and protein-level correlations for functionally related gene pairs. As expected, functional gene pairs have much higher correlation coefficients than randomly shuffled gene pairs (supplemental Fig. S2). This effect is more pronounced on protein than mRNA level (Fig. 1C). Subunits of the same complex correlated to a median of 0.59 at protein level and 0.35 at mRNA level. For comparison, in lymphoblastoid cell lines we observed 0.61 and 0.27, respectively. As one would expect, mRNA coexpression appears to be closer linked to function across tissues than closely related cell lines. Nevertheless, protein coexpression remains more indicative of shared function. Next, we wondered about the impact of gene proximity on their correlated expression. We took gene pairs separated by less than 50 Kb between their transcription start sites ("close-by genes") and looked at their mRNA correlation compared with gene pairs further apart (Fig. 1D). We observe 13% of close-by genes to have coregulated mRNAs. However, only a quarter of these (3.3%) are also coregulated on the protein level. This suggests that only a fraction of those coregulated mRNA pairs is functionally related. It is worth noting that even though our mRNA and protein data have similar numbers of data points per gene, the protein data is slightly sparser (15.5% and 6.7% missing values, respectively). Despite the numerical disadvantage of the protein data set, protein-level correlations are still more informative on the function than mRNA (Fig. 1C, supplemental Fig. S2). The data also differs in their measurement-based variation as they were acquired by different technol
Referência(s)