Mammalian gene expression variability is explained by underlying cell state
2020; Springer Nature; Volume: 16; Issue: 2 Linguagem: Inglês
10.15252/msb.20199146
ISSN1744-4292
Autores Tópico(s)Bioinformatics and Genomic Networks
ResumoArticle11 February 2020Open Access Transparent process Mammalian gene expression variability is explained by underlying cell state Robert Foreman Robert Foreman Institute for Quantitative and Computational Biosciences, University of California, Los Angeles, Los Angeles, CA, USA Program in Bioinformatics and Systems Biology, University of California, San Diego, San Diego, CA, USA Search for more papers by this author Roy Wollman Corresponding Author Roy Wollman [email protected] orcid.org/0000-0003-3865-2605 Institute for Quantitative and Computational Biosciences, University of California, Los Angeles, Los Angeles, CA, USA Program in Bioinformatics and Systems Biology, University of California, San Diego, San Diego, CA, USA Department of Integrative Biology and Physiology, Department of Chemistry and Biochemistry, University of California, Los Angeles, Los Angeles, CA, USA Search for more papers by this author Robert Foreman Robert Foreman Institute for Quantitative and Computational Biosciences, University of California, Los Angeles, Los Angeles, CA, USA Program in Bioinformatics and Systems Biology, University of California, San Diego, San Diego, CA, USA Search for more papers by this author Roy Wollman Corresponding Author Roy Wollman [email protected] orcid.org/0000-0003-3865-2605 Institute for Quantitative and Computational Biosciences, University of California, Los Angeles, Los Angeles, CA, USA Program in Bioinformatics and Systems Biology, University of California, San Diego, San Diego, CA, USA Department of Integrative Biology and Physiology, Department of Chemistry and Biochemistry, University of California, Los Angeles, Los Angeles, CA, USA Search for more papers by this author Author Information Robert Foreman1,2 and Roy Wollman *,1,2,3 1Institute for Quantitative and Computational Biosciences, University of California, Los Angeles, Los Angeles, CA, USA 2Program in Bioinformatics and Systems Biology, University of California, San Diego, San Diego, CA, USA 3Department of Integrative Biology and Physiology, Department of Chemistry and Biochemistry, University of California, Los Angeles, Los Angeles, CA, USA *Corresponding author. Tel: +1 858 210 0905; E-mail: [email protected] Molecular Systems Biology (2020)16:e9146https://doi.org/10.15252/msb.20199146 PDFDownload PDF of article text and main figures. Peer ReviewDownload a summary of the editorial decision process including editorial decision letters, reviewer comments and author responses to feedback. ToolsAdd to favoritesDownload CitationsTrack CitationsPermissions ShareFacebookTwitterLinked InMendeleyWechatReddit Figures & Info Abstract Gene expression variability in mammalian systems plays an important role in physiological and pathophysiological conditions. This variability can come from differential regulation related to cell state (extrinsic) and allele-specific transcriptional bursting (intrinsic). Yet, the relative contribution of these two distinct sources is unknown. Here, we exploit the qualitative difference in the patterns of covariance between these two sources to quantify their relative contributions to expression variance in mammalian cells. Using multiplexed error robust RNA fluorescent in situ hybridization (MERFISH), we measured the multivariate gene expression distribution of 150 genes related to Ca2+ signaling coupled with the dynamic Ca2+ response of live cells to ATP. We show that after controlling for cellular phenotypic states such as size, cell cycle stage, and Ca2+ response to ATP, the remaining variability is effectively at the Poisson limit for most genes. These findings demonstrate that the majority of expression variability results from cell state differences and that the contribution of transcriptional bursting is relatively minimal. Synopsis Single-cell coupled measurements of Ca2+ signaling dynamics, cell size, cell cycle, and expression of Ca2+ signaling-related genes show that most of the gene expression variability is not gene-specific. The remaining gene-specific variability approaches Poisson limit for most genes. Cell state information, i.e. cell size, cell cycle stage, and Ca2+ signaling dynamic response to ligand was measured alongside single-cell measurements of gene expression for 150 genes related to Ca2+ signaling using MERFISH. The majority of observed gene expression variance can be explained by 13 features related to cell state. The remaining unexplained variance approaches the Poisson limit for most genes. Introduction Gene expression variability is ubiquitous in all biological systems. In multicellular organisms, heterogeneity between different cell types and states confers specialized function giving rise to complexity in whole-system behavior (Raj & van Oudenaarden, 2008; Eldar & Elowitz, 2010; Symmons & Raj, 2016; Suo et al, 2018; Tabula Muris Consortium et al, 2018). Similarly, single-cell organisms and viruses were shown to utilize heterogeneity at the population level to create diverse phenotypes, such as bet-hedging strategies in changing environments (Veening et al, 2008; Vega & Gore, 2014; Rouzine et al, 2015). While variability can provide useful functional heterogeneity in a multicellular organism or cell population, it is not necessarily always beneficial (Raj & van Oudenaarden, 2008; Symmons & Raj, 2016). Unregulated stochastic events, i.e., noise, can limit cells' ability to respond accurately to changing environments and can introduce phenotypic variability that can have a negative contribution to overall fitness. Indeed, many biological mechanisms including buffering (Stoeger et al, 2016) and feedback loops (Jangi & Sharp, 2014; Schmiedel et al, 2015) have been suggested to limit the detrimental effect of gene expression variability. Quantification of the different contributions of mechanisms that cause gene expression variability is an important step toward determining to what degree the variability represents uncontrolled "noise" or cellular stratification and function. Two key contributors of gene expression variability are allele-specific sources and global factors related to underlying cell state. The analysis of expression covariance between genes is a powerful approach to decompose gene expression variability into these two classes. Landmark works used this approach to investigate expression variability in bacterial cells, which laid a foundation for decomposing variability into allele-specific (intrinsic) sources and variability that originate from sources that affect multiple alleles and relate to the underlying cell state (extrinsic) (Elowitz, 2002; Paulsson, 2005). This work was later extended to yeast (Raser & O'Shea, 2004) and mammalian systems (Raj et al, 2006; Sigal et al, 2006; Singh et al, 2012). The decomposition into allele-specific and cell state components is not always simple. Allele-specific noise in an upstream component can propagate into downstream genes (Sigal et al, 2006), whereas temporal fluctuations in the shared components can have nontrivial consequences on expression distributions (Paulsson, 2004; Pedraza & van Oudenaarden, 2005; Shahrezaei et al, 2008). Finally, use of the terms "intrinsic" and "extrinsic" is sometimes ill-defined and some models include a "coupled intrinsic" mode as well, which is a form of shared variability and hence "extrinsic" (Rodriguez et al, 2019). Despite the sometimes confusing nomenclature, the use of expression covariance to distinguish between allele-specific and shared factors is a powerful decomposition approach. In addition to covariance-based approaches, the relationship between gene expression distribution variance and mean provides a useful quantitative framework to gain insights into sources of expression variability (Munsky et al, 2012). The comparison of expression variability between genes is not straightforward as expression variance scales with its mean. Three statistical tools are commonly used to describe mean normalized variance: the coefficient of variation (CV), CV squared (CV2), and Fano factor. CV and CV2 are both unitless measures where the CV is defined as the standard deviation divided by the mean and the CV2 is simply the CV squared, or the variance divided by the mean squared. The CV and CV2 are useful to compare the scale of variance between different genes because of their unitless nature. The third measure, the Fano factor, is the variance divided by the mean and therefore not unitless, but it has a special property of being equal to one in the case of a Poisson process. Many biological processes have a variance to mean ratio that is at least Poisson so the Fano factor can define a "standard dispersion", as a result, distributions with Fano factor smaller/bigger than one are considered under/over-dispersed, respectively. Therefore, a simple quantification of the distribution variance scaled by its mean can provide key insights into the underlying mechanism generating the observed distribution (Choubey et al, 2015; Hansen et al, 2018a). Multiple studies across bacteria, yeast, and mammalian cells measured over-dispersed gene expression distributions. This observation can have two main interpretations. One interpretation is that the observed over-dispersion is simply a result of the superposition of an allele-specific Poisson variability and cell state variability (Battich et al, 2015). The other interpretation is that the allele-specific variability itself is not a simple Poisson process (Suter et al, 2011; Dar et al, 2015; Corrigan et al, 2016; Tantale et al, 2016). The latter interpretation was popularized by the introduction of a simple phenomenological model named the two-state or random telegraph model that represented genes as existing in either "on" or "off" states (Peccoud & Ycart, 1995; Kepler & Elston, 2001; Paulsson, 2004; Thattai & van Oudenaarden, 2004; Kaern et al, 2005; Friedman et al, 2006; Raj et al, 2006; Shahrezaei & Swain, 2008; Suter et al, 2011; Molina et al, 2013; Sanchez & Golding, 2013; Fukaya et al, 2016; Lenstra et al, 2016). More complex models with multiple states were also considered (Suter et al, 2011; Zoller et al, 2015; Corrigan et al, 2016; Tantale et al, 2016; Nicolas et al, 2018) but the addition of multiple states does not change the model in a qualitative way. These models suggest that transcription should occur in distinct bursts with multiple transcripts generated when the gene is "on". These two-state models can be described by two overall key parameters: the burst size and frequency that control the resulting gene expression distributions with lower burst frequency and larger burst size contributing to the over-dispersion of the underlying distribution. Overall, both interpretations, bursting and cell state, can explain the observed over-dispersion. There is mounting evidence that for at least many genes, most of the over-dispersion is explained by cell state variables rather than intrinsically noisy transcriptional bursting (Battich et al, 2015). Nonetheless, the transcriptional bursting model is still widely used (Larsson et al, 2019; Ochiai et al, 2019) calling for more systematic investigation. The relative scales and sources of variability are very important to understand in the modern world of single-cell highly multiplexed measurements. These new technologies are revealing the complex structure of "cell space" with cells occupying a large array of types (Han et al, 2018; Rosenberg et al, 2018; Tabula Muris Consortium et al, 2018), states (Trapnell, 2015; Cheng et al, 2019), and fronts (Shoval et al, 2012) that reflect functional stratification. Despite our knowledge that cell types and states manifest as gene expression heterogeneity, sometimes total gene expression variability is interpreted as arising from two-state transcriptional bursting alone (Larsson et al, 2019). The gap in our understanding of the relative contribution of cell state and allele-specific factors is hindering progress in assigning functional roles to observed variability (Dueck et al, 2016). To address this knowledge gap, we utilized the two key properties of expression variability: covariance and dispersion. We measured gene covariance and dispersion using joint measurements of individual cells, where for each cell, multiple cell state features were measured, as well as a highly multiplexed measurement of gene expression. We used sequential hybridization smFISH (MERFISH implementation) (Moffitt et al, 2016) that allowed us to accurately measure the expression of 150 genes in ~ 5000 single cells. Since expression covariance between genes from the same pathway is higher compared to genes that have distinct functions (Sigal et al, 2006; Stewart-Ornstein et al, 2012), we focused on a single signaling network and biological function, Ca2+ response to ATP in epithelial cells, and a biological response important to wound healing (Funaki et al, 2011; Handly et al, 2015; Handly & Wollman, 2017). The key advantage of Ca2+ response is that the overall signaling response can be measured in < 15 min, a fast timescale that precludes any ATP-induced changes in transcription. Using the combined dataset, we were able to separate the correlated and uncorrelated components using a simple multiple linear regression model guided by the changes in the covariance matrix. We found that after removing all shared components, the remaining allele-specific variability shows very little over-dispersion for most genes measured. Overall, these results indicate that transcriptional bursting is only a minor contributor to the overall observed expression variability. Results To assess the relative contribution of the overall expression variability that stems from allele-specific sources vs underlying cell state variability, we took advantage of the fact that these two sources have different expression covariance signatures. Figure 1 shows simulated data to illustrate how covariance signatures can be utilized to decompose sources of variability. By definition, allele-specific variability is uncorrelated to any other gene, whereas variability that is due to heterogeneity in the underlying cell state will likely be shared between genes with similar function (Fig 1A). When transcriptional bursting dominates (Fig 1B top) the shared regulatory factors will have a small contribution, there will be little correlation between genes and the expression variance will remain largely unchanged after conditioning expression level on any cell state factors (Fig 1B top right). The residual intrinsic variance will have a Fano factor greater than one. On the other hand, when cell state variability dominates (Fig 1B bottom), expression between genes will be highly correlated and conditioning the expression on cell state factors will reduce both the variance and correlation between genes. At the limit, when all shared factors are accounted for, the correlation between genes will approach zero and the Fano factor of the residuals will approach one, the Poisson limit (Fig 1B bottom right). When the contribution of bursting and cell state is comparable (Fig 1B middle), conditioning on cell state factors will have some effect but the final Fano factor will be higher than one even when the correlation is zero (Fig 1B middle right). Conditioning on cell state factors has a dual effect on correlation and Fano factor, and therefore, it is possible to assess whether the conditioning removed all the obvious extrinsic variability. When all the extrinsic variability is conditioned out, one can confidently interpret whether the residual intrinsic variability is under- or over-dispersed. Figure 1. Transcriptional bursting and trans-acting factors are two distinct causes of cell-to-cell heterogeneity Cartoon depicting that different cells can have different activities of trans-factor (TF) regulatory molecules in addition to the effects of transcriptional bursting. Simulated data showing that variability from shared regulatory factors results in correlation between two genes with three example cases: intrinsic dominated noise (top three panels), mixture of cell state and allele-specific sources (middle three), and cell state dominated (bottom three). This correlation is diminished when the expression levels are conditioned on the levels of these shared regulatory factors (middle and right). After conditioning on all trans-acting regulatory factors, the remaining variability due to transcriptional bursting alone is potentially significantly smaller (right). Inset text is the Pearson correlation coefficient between gene A and gene B (brown) and the Fano factor of gene A (blue). Download figure Download PowerPoint To distinguish between the possible situations described above requires accurate highly multiplexed single-cell measurements of gene expression and a sufficient number of cellular features that correlate with the underlying cell state factors controlling gene expression. To achieve this, we developed an experimental protocol that combines MERFISH, a multiplexed and error robust protocol of counting RNA transcripts using fluorescent in situ hybridization (Chen et al, 2015; Moffitt et al, 2016) with rich profiling of the underlying cell state (Fig 2). We used the MCF10A mammary epithelial cell line, which is often used in studies of cellular variability due to their nontransformed nature and their accessibility to imaging (Selimkhanov et al, 2014; Qu et al, 2015). We focused on genes that share biological function: involvement in the Ca2+ signaling network, a key pathway important to the cellular response to tissue wounding (Minns & Trinkaus-Randall, 2016; Justet et al, 2019). The two advantages of Ca2+ signaling are that (i) we expect that genes that share a function will show a high degree of correlation in their expression levels (Stewart-Ornstein et al, 2012). (ii) Ca2+ signaling is fast, and we can measure the overall emergent phenotype of the network in < 15 min (Fig 2A), a timescale faster than that of gene expression in mammalian cells (Shamir et al, 2016). In our protocol, cells were rapidly fixed after live cell imaging (10–15 min from ATP stimulation to fixation, Appendix Fig S1), and therefore, the gene expression measured in the same cell is unlikely to have changed as a result of the agonist. Figure 2. Paired single-cell MERFISH and live cell calcium imaging Experimental overview—live cells are imaged for their calcium response to ATP before being fixed and imaged to measure gene expression of 150 genes. smFISH spots are imaged over several rounds of hybridization and aligned such that individual genes are encoded as specific series of dark and bright spots throughout all rounds of hybridization. Left, representative calcium trajectories demonstrating the heterogeneous response to ATP stimulation, top vs bottom left. The right panel is an image plot of all 5000+ successfully paired to smFISH cells. Cellular volume is measured and the correlation between total transcripts per cell and the cellular volume is shown. Left, shows marker gene expression for cell cycle-related genes used to derive a g2m score (coloring). Right, is the same as the left panel with a representative gene used to derive the S score for each cell. Correlation of a representative gene (ATP2A2) with a gene that marks the differentiation status of MCF10A cells (CD44). Table of the cell state features categories and the complete list of the 13 factors used in the multiple linear regression (MLR) statistical model. Download figure Download PowerPoint MERFISH is a multiplexing scheme of smFISH where transcript identity is barcode-based, and the barcodes are imaged over several rounds of hybridization. During each hybridization round, dye-labeled oligos are hybridized to a subset of RNA species being measured, the sample is imaged, and RNA appears as diffraction-limited spots; then, the dye molecules are quenched, and the process is repeated until all barcode "bits" are imaged. By linking diffraction-limited spots across imaging rounds, we can decode the RNA barcodes by identifying the subset of images where a bright diffraction-limited spot appears at the same XYZ coordinate (Fig 2B). The use of combinatorial labeling allows exponential scaling of the number of gene images with the number of imaging rounds. The scaling is mostly limited by the built-in error correction (Chen et al, 2015). In this experiment, we used 24 imaging rounds (eight hybs × three colors) where each RNA molecule was labeled in four imaging rounds. An example of the MERFISH data is shown in Fig 2B. Overall, we measured the expression of 150 genes including 131 genes annotated as involved in Ca2+ signaling network (Kanehisa & Goto, 2000; Bandara et al, 2013; Kanehisa et al, 2019), 17 genes to mark stages of the cell cycle (Whitfield et al, 2002), and two genes that correlate with the sub-differentiated state of MCF10A cells (Qu et al, 2015). We estimate our detection efficiency to be ~ 95.5% and false-positive rate < 1% per gene per cell. Overall spearman correlation with bulk RNAseq was 0.84 (Appendix Fig S3). Our decomposition into allele-specific and cell state components is based on conditioning on multiple cell state factors. While it would be ideal to directly measure the regulatory factors that causatively control gene expression variability, more accessible measurements, e.g., cell size or cell cycle stage, that are correlated with these causative regulatory factors are sufficient for the conditioning process. Given that the genes we probe are related to Ca2+ signaling, we first extracted key features from time series of cytoplasmic Ca2+ response measured with a calibrated GCaMP5 biosensor (Appendix Fig S2A). The live cell imaging of cytoplasmic Ca2+ levels (Fig 2C) showed a highly heterogeneous response, qualitatively and quantitatively similar to previous work on Ca2+ signaling in MCF10A cells where we observed a mixed population response with a wide range of response phenotypes (Yao et al, 2016; Handly & Wollman, 2017). We used a feature-based representation of Ca2+ response to represent cellular factors that we anticipate correlate with underlying cell state (Fig 2G and Appendix Fig S2). In addition to Ca2+ features that are specific to Ca2+ signaling, we also measured a few global features of the cell that are likely to be correlated with expression changes of most genes. Specifically, we measured cell volume, cell cycle stage, and two markers of MCF10A differentiation status (Fig 2D–F). As was shown in the past, cell volume strongly correlated with the total number of transcripts per cell (Fig 2D) indicating that at least for some genes, cell state factors must be important contributors to their expression variability (das Neves et al, 2010; Shalek et al, 2014; Battich et al, 2015; Padovan-Merhar et al, 2015; Hansen et al, 2018a). However, not all genes show the same strength correlation with volume, and some cell cycle genes are more complexly related to volume (Appendix Fig S4). Similarly, the cell cycle stage and MCF10A differentiation status were correlated with specific genes (Fig 2E and F) (Buettner et al, 2015). Overall, we measured 13 different cellular features that will be used to decompose variance in all 131 Ca2+-related genes we measured. By focusing on a smaller number of specific features that relate to the Ca2+ response augmented by established global cell state features like cell size and cell cycle state, we expected to be able to capture most of the expression variability that comes from underlying cell state heterogeneity. These results are consistent with previous work demonstrating widespread cell cycle and differentiation-related variability in the transcriptome (Battich et al, 2015). To decompose the observed expression into multiple components, we used standard multiple linear regression (MLR) (Battich et al, 2015; Hansen et al, 2018a). Figure 3A shows the scatter plots of expression of two representative genes (ATP2A2 and RRM1) plotted against cell volume, cell cycle, differentiation markers, and Ca2+ feature. The scatter plots show that (i) there is indeed a correlation between expression and some of these cell state features. (ii) The amount of variance that is explained by each cell state feature can change between genes. Overall, the simple MLR model with 13 independent measurements was able to explain between ~ 15 and 85% of the observed variance with a median of 0.62 (Fig 3B). To assess the relative contribution of each cell state feature, we looked into the relative fraction of explanatory power for each feature category (Fig 3C). Overall, cell volume has the most explanatory power, but for some genes, cell cycle and Ca2+ features contribute meaningfully to the explained variance. While some of the features had a small effect in terms of the overall variance explained by the feature, in most cases, the effects were very unlikely to be a result of pure random sampling, permutation-based statistical testing showed that most genes measured here are statistically correlated with at least one calcium feature (Fig 3D). Figure 3. Decomposition of gene expression variability using multiple linear regression Representative scatter plots of correlation between two individual genes (rows) and different cell state factors (columns). The percent of variance explained by each factor in the MLR model for each gene is annotated in the corner. A histogram of the overall explained variance for each gene. Stacked bar plot showing which cell state feature categories contribute to the explained variance of the MLR. The significance of calcium features for 150 genes was estimated by Z-scoring the slope of the feature in a null distribution of 1,600 bootstrapped shuffled data slopes. The number of statistically significant genes for each feature is shown above [adjusted P-value (Bonferroni) < 0.05]. Whiskers are at 1.5 times the interquartile range. Download figure Download PowerPoint A key uniqueness of our approach is that gene expression is measured in a multiplexed fashion allowing the estimation of the correlation between genes. Figure 4A shows the correlation matrix of the raw counts, and the counts conditioned on cell state features. As expected, as we increase the number of cell state features included in the MLR, the overall gene-to-gene correlation goes down. Interestingly, the full MLR model that only includes 13 identical terms for all genes is able to reduce the overall correlation between genes significantly. To quantify the bulk correlation, we measured the amount of variance that is explained by the first two components of a principal component analysis (PCA; Fig 4B). Without conditioning on any cellular feature, the first two components explain > 40% of the variance. This is reduced substantially to < 10% of the overall variance, in the full MLR. The substantial reduction in the gene-to-gene correlation demonstrates that we were able to condition away most of the shared components. Still, the remaining correlation was not completely removed, and therefore, we added another term to the model that is based on the first two principal components of a PCA after taking all other features into account. These two components most likely represent some cell state features that were not sufficiently captured by our 13 cellular features. With the addition of the last "hidden" feature, the overall variance that is shared is very close to values from shuffled data. Overall, the analysis of expression covariance demonstrates that our simple MLR sufficiently captures most of the information related to cell state that is required for conditioning expression distribution. Figure 4. Residual variability from MLR models contains significantly less covariation between genes and close to Poisson variability within individual genes Gene–gene correlation matrices showing the reduction of covariance after conditioning on cell state features. Explained variance of first two components of PCA for each stage of MLR models showing reduction in shared variability with increasing number of cell state factors. The dotted line shows the variance explained by first two PCA components when the data are shuffled. Fano factor distributions of 150 genes measured at different levels of cell state conditioning are shown as boxplots. The whiskers are at 1.5 times the interquartile range. Dashed line is the Poisson expectation with technical noise. Scatter plot of residual gene expression coefficient variation squared for each gene after decomposition of all cell state features. Poisson expectation is shown as dashed line. Download figure Download PowerPoint Finally, we wanted to determine the overall dispersion remaining in the allele-specific gene expression distribution. The allele-specific variability is estimated as the residual variability in the raw gene expression counts after conditioning on cell state factors. As we increase the number of cell state features we conditioned on, we saw a substantial reduction in the distributions of Fano factor magnitudes (Fig 4C). When all 13 cell state features and the two hidden features estimated based on PCA are included, the Fano factor is very close to one for most of the genes. Note that we do not perform any correction for technical noise; so, the limit of one is only theoretical. Similarly, analysis of the CV2 vs the expression means on a log–log plot shows that all genes are very close to the Poisson limit (Fig 4D). The proximity to the Poisson limit is similar across all expression levels. Therefore, these data indicate that super-Poissonian transcriptional bursting plays a very minor role in allele-specific variability. It is unclear whether the few genes that do show over-dispersion whether they have significant levels of
Referência(s)