RNA sequencing reveals two major classes of gene expression levels in metazoan cells
2011; Springer Nature; Volume: 7; Issue: 1 Linguagem: Inglês
10.1038/msb.2011.28
ISSN1744-4292
AutoresDaniel Hebenstreit, Miaoqing Fang, Muxin Gu, Varodom Charoensawan, Alexander van Oudenaarden, Sarah A. Teichmann,
Tópico(s)Genomics and Phylogenetic Studies
ResumoReport7 June 2011Open Access RNA sequencing reveals two major classes of gene expression levels in metazoan cells Daniel Hebenstreit Corresponding Author Daniel Hebenstreit Structural Studies Division, MRC Laboratory of Molecular Biology, Cambridge, UK Search for more papers by this author Miaoqing Fang Miaoqing Fang Department of Biological Engineering, Massachusetts Institute of Technology, MA, USA Search for more papers by this author Muxin Gu Muxin Gu Structural Studies Division, MRC Laboratory of Molecular Biology, Cambridge, UK Search for more papers by this author Varodom Charoensawan Varodom Charoensawan Structural Studies Division, MRC Laboratory of Molecular Biology, Cambridge, UK Search for more papers by this author Alexander van Oudenaarden Alexander van Oudenaarden Department of Physics and Department of Biology, Massachusetts Institute of Technology, MA, USA Search for more papers by this author Sarah A Teichmann Corresponding Author Sarah A Teichmann Structural Studies Division, MRC Laboratory of Molecular Biology, Cambridge, UK Search for more papers by this author Daniel Hebenstreit Corresponding Author Daniel Hebenstreit Structural Studies Division, MRC Laboratory of Molecular Biology, Cambridge, UK Search for more papers by this author Miaoqing Fang Miaoqing Fang Department of Biological Engineering, Massachusetts Institute of Technology, MA, USA Search for more papers by this author Muxin Gu Muxin Gu Structural Studies Division, MRC Laboratory of Molecular Biology, Cambridge, UK Search for more papers by this author Varodom Charoensawan Varodom Charoensawan Structural Studies Division, MRC Laboratory of Molecular Biology, Cambridge, UK Search for more papers by this author Alexander van Oudenaarden Alexander van Oudenaarden Department of Physics and Department of Biology, Massachusetts Institute of Technology, MA, USA Search for more papers by this author Sarah A Teichmann Corresponding Author Sarah A Teichmann Structural Studies Division, MRC Laboratory of Molecular Biology, Cambridge, UK Search for more papers by this author Author Information Daniel Hebenstreit 1, Miaoqing Fang2, Muxin Gu1, Varodom Charoensawan1, Alexander van Oudenaarden3 and Sarah A Teichmann 1 1Structural Studies Division, MRC Laboratory of Molecular Biology, Cambridge, UK 2Department of Biological Engineering, Massachusetts Institute of Technology, MA, USA 3Department of Physics and Department of Biology, Massachusetts Institute of Technology, MA, USA *Corresponding authors. Structural Studies Division, MRC Laboratory of Molecular Biology, Hills Road, Cambridge, CB2 0QH, UK. Tel.: +44 122 340 2479; Fax: +44 122 321 3556; E-mail: [email protected] or Tel.: +44 122 325 2947; Fax: +44 122 321 3556; E-mail: [email protected] Molecular Systems Biology (2011)7:497https://doi.org/10.1038/msb.2011.28 PDFDownload PDF of article text and main figures. Peer ReviewDownload a summary of the editorial decision process including editorial decision letters, reviewer comments and author responses to feedback. ToolsAdd to favoritesDownload CitationsTrack CitationsPermissions Figures & Info The expression level of a gene is often used as a proxy for determining whether the protein or RNA product is functional in a cell or tissue. Therefore, it is of fundamental importance to understand the global distribution of gene expression levels, and to be able to interpret it mechanistically and functionally. Here we use RNA sequencing (RNA-seq) of mouse Th2 cells, coupled with a range of other techniques, to show that all genes can be separated, based on their expression abundance, into two distinct groups: one group comprised of lowly expressed and putatively non-functional mRNAs, and the other of highly expressed mRNAs with active chromatin marks at their promoters. These observations are confirmed in many other microarray and RNA-seq data sets of metazoan cell types. Introduction Expression level is frequently used as a way of characterizing gene function, by northern blotting, PCR, microarrays, and, more recently, RNA-sequencing (Wang et al, 2009a; RNA-seq). Therefore, it is a central issue in molecular biology to know how many transcripts are expressed in a cell at what levels. This question was studied very early in the history of molecular biology using methods such as reassociation kinetics (Hastie and Bishop, 1976), which indicated the existence of distinct abundance classes, and recently, we pointed out that separate peaks are visible in the abundance distributions of a number of microarray data sets (Hebenstreit et al, 2011). At the same time, microarrays or RNA-seq data have been described as displaying broad, roughly lognormal distributions of expression levels with no clear separation into discrete classes (Hoyle et al, 2002; Lu and King, 2009; Ramskold et al, 2009). There are several reasons for this: many samples are heterogeneous in terms of cell type (Hebenstreit and Teichmann, 2011) or are based on a previous generation of less sensitive microarrays, many are from unicellular organisms rather than animals, and finally, data processing and plotting methods can obscure the presence of distinct abundance classes. Here, we provide experimental and computational support for two overlapping major mRNA abundance classes. Our findings hold for metazoan data sets including human, mouse and Drosophila sources. Results and discussion We initially based our analysis on murine Th2 cells (Zhu et al, 2010), as these cells can be obtained in large quantities ex vivo and can be prepared as a pure and homogeneous cell population. Furthermore, there is a well-characterized set of genes whose proteins are known to be expressed and functional in Th2 cells, as well as a set of genes known to be not expressed in these cells (Supplementary Table S1 lists the genes we used in our study, Supplementary Figure S1 shows expression of two marker proteins in the cells). We generated Th2 poly(A)+ RNA-seq data for two biological replicates and calculated gene expression levels using the standard measure of Reads Per Kilobase per Million (RPKM; Mortazavi et al, 2008, Supplementary Table S2 gives the number of reads and mappings we obtained). The expression levels of the biological replicates are highly correlated (r2=0.94, Supplementary Figure S2). We then calculated the mean RPKMs of the two samples for all genes and log2 transformed these values. Displaying the distribution of all gene expression levels as a kernel density estimate (KDE) reveals an interesting structure: the majority of genes follow a normal distribution, which is centered at a value of ∼4 log2 RPKM (∼16 RPKM), whereas the remaining genes form a shoulder to the left of this main distribution (Figure 1A, solid line). This was conserved under different KDE bandwidths (Supplementary Figure S3, left panel) or different histogram representations (Supplementary Figure S3, right panel). As genes with zero reads cannot be included on the log scale, we prepared an alternative version of Figure 1A, where we assigned low RPKM values to these. This helps to illustrate the fraction of zero-read genes (Supplementary Figure S4). As a comparison, we studied microarray data for the same cell type from a recent publication (Wei et al, 2009). The correlation between the microarray and the RNA-seq data was very good and highly statistically significant (Pearson r2=0.83, Spearman ρ=0.84, Supplementary Figure S5). Surprisingly, displaying the distribution of microarray expression levels results in a clearly bimodal distribution (Figure 1B). Again, the appearance of the distribution was insensitive to the KDE bandwidth choice or histogram bin size (Supplementary Figure S6). The bimodality was conserved when alternative normalization and processing schemes were used, independent of KDE bandwidths (Supplementary Figure S7). Figure 1.Distribution of gene expression levels. (A) Kernel density estimates of RPKM distributions of RNA-seq data within exons, introns and intergenic regions as indicated. The fragments used to estimate intron and intergenic RPKM were based on randomizations using the same length distribution as the exonic parts of genes. The 90% quantile of the intergenic distribution is indicated. (B) Kernel density estimate of expression level distribution of microarray data (Wei et al, 2009). (C) Expectation-maximization-based curve fitting of RNA-seq data of A. Download figure Download PowerPoint Visual inspection of both microarray and RNA-seq data thus reveals two overlapping main components of the distribution of gene expression levels. Quantifying this by curve fitting confirms a good fit to two distributions: the goodness-of-fit (measured by Akaike Information criterion, AIC (Akaike, 1974), Bayesian Information Criterion, BIC (Schwarz, 1978) or Likelihood ratio tests (Casella and Berger, 2001)) shows strong increases for both microarray and RNA-seq data when two-component models are fit by expectation-maximization (compared to single- or more-component models; Supplementary Figure S8). We designate the two groups of genes as the lowly expressed (LE) and highly expressed (HE) genes (Figure 1C), because we will present evidence below that the LE genes are expressed rather than simply being experimental background. Our findings are not limited to Th2 cells and hold for virtually all recently published metazoan RNA-seq data sets (e.g., Marioni et al, 2008; Mortazavi et al, 2008; Mudge et al, 2008; Wang et al, 2008; Supplementary Figure S9; Cloonan et al, 2008; and Supplementary Figure S10A and B) and all microarray data sets (e.g., Cui et al, 2009; GNF Atlas 3 (Chintapalli et al, 2007); FlyAtlas (Lattin et al, 2008); and Supplementary Figure S11) we have studied. The existence of further minor groups of genes cannot be excluded, but is not clear at this point due to the diverse curve-fitting results for the different data sets if higher-order (more than two components) models are considered. The difference between the microarray and RNA-seq distributions is explained by the fact that the microarrays yield a signal for all genes, part of which is due to cross-hybridization of oligonucleotide probes if the gene is not strongly expressed. RNA-seq on the other hand yields a signal for a gene only if at least one sequencing read is found. The accuracy of RNA-seq is biased toward longer and more highly expressed genes, e.g., 5% of all genes account for 50% of all reads in our data as well as in other data sets (Mortazavi et al, 2008; Oshlack and Wakefield, 2009; Bullard et al, 2010). To explore how this accuracy bias affects the shape of the LE distribution, we studied the RNA-seq detection limit. We first plotted the number of genes with zero reads as a function of the total number of reads (taking subsets of the total reads). The number of genes without reads decreases slowly, with no change in slope and hence no indication of reaching a plateau. Even at a total of 25 million reads, ∼30% of all genes are undetected (Figure 2A). We further estimated the numbers of genes remaining undetected at each expression level by assuming Poisson-distributed read numbers (Jiang and Wong, 2009) and by determining the expected frequency of zeroes. This confirms the sensitivity drop at the lower end of the LE peak (Figure 2B). Extrapolating the numbers of expressed genes including the undetected ones reveals an emerging LE peak (Figure 2B). Thus, the smaller portion of LE genes in the RNA-seq data compared with the microarray data is at least partially due to the RNA-seq detection limit, although this only becomes a problem for genes at less than approximately −3 to −4 log2 RPKM. It should be noted that these low expression levels correspond to an absence of transcripts in the majority of cells, as we demonstrate further below. Figure 2.Sensitivity of RNA-seq. (A) Detection of genes in dependency of the total read numbers on linear scale and log2 scale (inset). Random subsets of the total reads for the two RNA-seq replicates were taken and the number of genes with zero reads were plotted versus the total read numbers used. The figure represents an average of five independent subsets for each data point. (B) Prediction of genes remaining undetected due to Poisson statistics underlying RNA-seq. The theoretically expected fraction of genes remaining undetected (red, y axis on the right side of the figure in red) was determined for each expression level. This was used to infer the expressed genes including the undetected ones (blue) from the actual expression data (black, bins indicated by tick marks across top). In addition to the RPKM scale, the reads per kilobase (RPK) scale (without normalization to the total number of mapped reads) is shown on top, which was used for the calculation of the (integer-) Poisson statistic and which, in contrast to the RPKM scale, depends on the total number of sequencing reads. (C) RT–PCR for the genes are listed in Supplementary Table S1. The RNA-seq expression levels of the genes are plotted versus the negative threshold cycles (Ct) of the PCRs. The plot is overlaid (with the same x axis scaling) upon the kernel density estimate of the RNA-seq expression level distribution (black line) to show the positions of the genes in the total expression distribution. Genes either in the LE peak of the RNA-seq distribution or which have been previously characterized as not expressed in Th2 cells are shown in orange. Genes known to be expressed are shown in purple. Error bars indicate s.e.m. from three independent biological replicates. Please refer to Supplementary Tables S1 and S6 for details of genes and PCR primers. (D) Correlation of RPKM within exons and introns based on the RNA-seq data from Figure 1A. Correlation and significance of correlation were calculated for the whole distribution (gray) or for LE and HE genes separately. Division into LE and HE was performed along a line (white) perpendicular to a fitted trendline (gray), centered at Exon RPKM=1. The data points are shown as 2D kernel density estimate. Download figure Download PowerPoint To further confirm that the LE genes correspond to low expression and not experimental noise, we performed real-time PCRs. We tested amplification by exon-spanning primers of a set of genes that are known to be expressed or not expressed in Th2 cells, plus five random genes that we detected between −3.7 and −5 log2 RPKM in the RNA-seq experiment (Supplementary Table S1). We were able to successfully PCR-amplify all genes with high specificity. The expressed genes map to the HE peak, while almost all unexpressed genes map to the LE peak, if we align the PCR results with the microarray/RNA-seq data (Figure 2C). We also tested the extent to which genomic DNA can be detected in our polyA-purified mRNA sample, as proposed by Ramskold et al (2009) as a means of quantifying experimental background. We randomly selected intergenic fragments with the same length distribution as genes, 10 kb away from genes. The resulting RPKM distribution contains a high number of zero-RPKM fragments (79%), while the majority of non-zero fragments peaks slightly left of the LE shoulder (Figure 1A). The 90% quantile of this intergenic background distribution is at −4.97 log2 RPKM, which means that we can be quite confident (with probability >90%) that genes with an RPKM value above this level are truly expressed rather than representing experimental background noise (Figure 1A). Further, the overlap between the intergenic and the normalized LE fit is small (Supplementary Figure S12). We cannot rule out that detection of intergenic DNA corresponds to transcription as well, which would make the case for transcription of LE genes even stronger. Analysis of the strand-specific mRNA-sequencing data of ES cells by Cloonan et al (2008) yields similar conclusions. The poly(A)-purification protocol selects for reads that are antisense to genes (the antisense reads correspond to mRNA). In the distribution of 'sense' reads (corresponding to antisense transcripts in genic regions), more than 50% of genic regions have zero reads. This distribution is unimodal and shifted by ∼2 log2 RPKM with respect to the LE distribution, and overlaps almost perfectly with the distribution of reads in intergenic regions (Supplementary Figure S10A). We next determined the distribution of RPKM within introns, again using fragments with the same length distribution as transcripts (please note that our intronic read densities are not enriched at 5′ or 3′ ends of the intronic regions (Supplementary Figure S13)). The resulting intronic distribution is significantly higher than the intergenic background (two-sided Wilcoxon rank-sum test, P<2.2 × 10−16) and peaks at roughly −1 log2 RPKM (Figure 1A). Introns thus have one- to two orders of magnitude lower read density than exons. This suggests that we are detecting incompletely processed transcripts at a low but significant and uniform level across the whole range of transcript abundances. As introns are one- to two orders of magnitude longer than exons, introns should be detectable with roughly the same accuracy as exons, if the full-length set of introns of a gene is used. If we plot the RPKM in exonic regions versus the RPKM in intronic regions for each gene, there is significant correlation (r2=0.86, P<2.2 × 10−16) across the whole spectrum of expression levels. Calculating the correlation for LE and HE genes separately yields only slightly lower correlations among LE genes compared with HE genes, and both correlations are highly significant (Figure 2D). This provides evidence that confirms that LE genes are transcribed rather than experimental background: there would not be such a high correlation between introns and exons, particularly in the low-abundance region, if their detection were due to noise. We next studied gene expression using a single-cell approach by performing single-molecule RNA-fluorescence in situ hybridization (FISH; Raj et al, 2008) for five genes that are expressed at different levels according to the literature and our RNA-seq data. The distributions of mRNA numbers per cell were very broad for expressed genes (e.g., Gata3), while low mRNA numbers from 'not-expressed' genes (e.g., Il2) were still detected (Figure 3A). All genes had Fano factors (σ2/μ) larger than 4, indicating that they had extra-Poisson variation (a Poisson random variable would have σ2/μ=1) and therefore burst-like transcription (Raj and van Oudenaarden, 2009) (Supplementary Table S3). Importantly, cells expressing Tbx21 were not anticorrelated with cells expressing Gata3 (Figure 3B), meaning that we do not have a sub-population of Th1 cells in our Th2 cell populations. This demonstrates that LE expression is not due to a contaminating cell type, as the same cells express groups of genes at HE and others at LE levels. Figure 3.(A) Distribution of mRNA numbers among single cells. Histograms for Gata3 and Tbx21 (with an inset histogram starting from 1 instead of 0 to better illustrate higher expressions) and a sample fluorescence microscopy image are shown. Tbx21 transcripts are marked with white arrows to ease identification. (B) Correlation between Gata3 and Tbx21 expression. Correlation coefficient and significance are inset. (C) Plot of mean mRNA numbers per cell versus RNA-seq RPKM of five genes. Error bars indicate s.e.m. from two RNA-seq biological replicates. (D, E) 2D kernel density estimates of gene expression level versus ChIP-seq signal for each gene for RNA-seq (D) and microarray (E) data. Divisions between background and signal for the ChIP-seq component were determined by curve fitting with the software EpiChIP (Hebenstreit et al, 2011) and are indicated. Divisions between LE and HE groups of genes are indicated. (F) Scheme summarizing the results. Download figure Download PowerPoint It should be further noted that the LE/HE groups cannot theoretically result from a mixture of different cell types. Mixing of different cell types leads to gene expression levels for each gene that are an average across cell types. Hence, such distributions will become more unimodal, not less so (following the central limit theorem). As the RPKM as measured by RNA-seq should be proportional to the mean mRNA numbers per cell, we can use the RNA-FISH results to estimate how our RPKM values translate into mRNA numbers. We find that one RPKM corresponds to an average of roughly one transcript per cell in our Th2 data set (Figure 3C). Please note that the value of one RPKM/one transcript on average per cell serves as an estimate only as it is based on a limited number of data points. See Supplementary Figure S14 for log-transformed versions of Figure 3A–C. To study the nature of the LE and HE groups in more detail, we prepared Th2 ChIP-seq data for the activating H3K9/14 acetylation histone modification (Roh et al, 2005; Wang et al, 2009b) (H3K9/14ac) and one IgG control. We calculated the histone-modification level at each gene by identifying a globally enriched window around the transcription start sites of genes, and using reads in this window as a measure of each gene's modification level, normalized by the total reads (giving the normalized locus-specific chromatin state, NLCS, as used in Hebenstreit et al (2011)). Thus, we were able to plot histone-modification levels of each gene against expression levels from the RNA-seq or microarray data using a heatmap representation (Figure 3D, RNA-seq and Figure 3E, microarrays). Supplementary Figure S15 is an alternative version of this figure, where we randomly assigned low RPKM values to the zero-read genes. This strikingly confirms the two groups of gene expression levels, as there is a very good agreement between LE genes and absence of histone marks on one hand, and HE genes and presence of H3K9/14ac marks on the other hand (Figure 3D–E). This is seen for both the microarrays as well as the RNA-seq data. This extends previous findings of the relationship between H3K9/14ac and transcriptional activation by revealing an on/off-type of correlation between this histone mark and the LE/HE groups of genes. It should be noted that there is a very weak correlation within the LE and HE groups. The strongest correlation is within the RNA-seq HE group with a correlation coefficient r2=0.29 in log space and r2=0.097 on linear space. As the LE group of genes is still expressed at low levels and contains at least five genes that are characterized as not expressed and non-functional in Th2 cells, it seems likely that the HE group of genes represents the active and functional transcriptome of cells. This is supported by SILAC proteomics data (Graumann et al, 2008), which is available for the embryonic stem cell data we presented earlier (Supplementary Figure S10) and which indicates protein expression of HE genes only (Supplementary Figure S10C). The tight correlation recently observed between RNA and protein levels in three mammalian cell lines also supports this (Lundberg et al, 2010). Gene ontology (GO) analysis of LE and HE genes in the Th2 cells supports the notion that HE comprises the functional transcriptome, as many T-cell-specific processes (e.g., GO:0050863, GO:0045582, GO:0042110) and housekeeping processes are enriched (Supplementary Table S4). On the other hand, many GO terms referring to differentiation of other cell types (e.g., ear development GO:0043583, neuron fate commitment GO:0048663) are enriched among the LE set of genes (Supplementary Table S5). In conclusion, our data show that two large groups of genes can be discriminated based on the distribution of expression levels. RNA-FISH indicates that the boundary between the groups is found at an expression level of roughly one transcript per cell. In addition, H3K9/14ac marks are associated with the promoters of HE genes only (Figure 3F). It thus seems likely that the LE/HE groups reflect different transcription kinetics depending on the chromatin state or vice versa. The LE group is likely to correspond to 'leaky' expression, producing non-functional transcripts. The majority of LE genes are expressed at less than one copy per cell on average, and it would be interesting to know whether such stochastic expression has any function, e.g., in cell differentiation, or any deleterious effects. There may be a trade-off between the cost of repressing expression entirely and unwanted consequences of stochastic expression. Regulation of gene expression is mostly mediated by transcription factor binding events at promoters and enhancers (Heintzman et al, 2009). Often, differential regulation induces only small changes in expression levels, probably serving to fine-tune expression and shifting genes within the HE group. Our data suggest that in addition to this, there is a key decision about whether a gene becomes 'switched on' and expressed, which coincides with a boost in both transcription and H3K9/14ac histone modification. Materials and methods Th2 cell differentiation culture Spleens of C57BL/6 mice aged from 7 weeks to 4 months were removed and softly homogenized through a nylon mesh. The medium used throughout the cell cultures was IMDM supplemented with 10% FCS, 2 μM L-glutamine, penicillin, streptomycin and 50 μM β-mercaptoethanol. Cells were washed twice and purified by a Ficoll density gradient centrifugation. CD4+CD62L+ cells were isolated by a two-step MACS purification using the CD4+CD62L+ T Cell Isolation Kit II (Miltenyi Biotec). Cells were seeded into 24-well plates that had been coated with a mix of anti-CD3 (1 μg/ml, clone 145-2C11, eBioscience) and anti-CD28 (5 μg/ml, clone 37.51, eBioscience) antibodies overnight, at a density of 250 000 cells/ml and a total volume of 2 ml. The following cytokines and antibodies, respectively, were added to the Th2 culture: recombinant murine IL-4 (10 ng/ml, R&D Systems), and neutralizing IFN-γ (5 μg/ml, Sigma). Cells were cultured for 4–5 days at 37°C, 5% CO2. After this, cells were taken away from the activation stimulus, diluted 1:2 in fresh medium containing the same cytokine concentration as before. After 2–3 days of resting time, cells were directly crosslinked in formaldehyde for preparing ChIP-seq samples. For FACS stainings, cells were restimulated with phorbol dibutyrate and ionomycin (both used at 500 ng/ml, both from Sigma) for 4 h in the presence of Monensin (2 μM, eBioscience) for the last 2 h after the resting phase. For real-time PCRs, the cells were lysed in Trizol. FACS staining After restimulation, cells were washed in PBS and fixed in IC fixation buffer from the Foxp3 staining kit (eBioscience). Staining for intracellular transcription factor expression was carried out according to the eBioscience protocol, using Permeabilization buffer (eBioscience) and the following antibodies: anti-GATA3-Alexa647 (one test, TWAJ, eBioscience), anti-Tbx21-PE (1/400, clone eBio4B10, eBioscience). Stained cells were analyzed on a FACSCalibur (BD Biosciences) flow cytometer using Cellquest Pro and FlowJo software. Real-time PCR RNA of ∼106 cells was isolated with Trizol (Invitrogen) according to the manufacturer's protocol. cDNA was produced using Superscript III reverse transcriptase (Invitrogen), following the protocol supplied by the manufacturer. The cDNA was subjected to real-time PCR, using the SYBR green PCR master mix (Applied Biosystems) and a 7900 HT Real-Time PCR system (Applied Biosystems). The threshold cycles (Ct) were determined. The primer sequences used are listed in Supplementary Table S6 and were mostly obtained from 'Primerbank' (http://pga.mgh.harvard.edu/primerbank/; Spandidos et al, 2010). RNA-seq data generation Poly-(A)+ RNA was purified from ∼500 000 cells using the Oligotex kit (Qiagen). The manufacturer's protocol was slightly modified to include additional final elution steps resulting in a larger volume. After precipitation of RNA to concentrate it, first- and second-strand cDNA synthesis was performed using the Just cDNA kit (Stratagene), skipping the blunting step and directly proceeding to PCI extraction. Quality of the cDNA was tested by real-time PCR for a housekeeping gene. After this, the cDNA was sonicated for a total of 45 min using the Diagenode Bioruptor at maximum power settings, cycling 30 s sonications with 30 s breaks. After precipitation, the sample was processed using the ChIP-seq sample prep kit (Illumina) with a slightly modified protocol (PCR before gel extraction). Sequencing for 36 or 41 bp was carried out on an Illumina GAII genome analyzer. The data were deposited at Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/), accession number GSE28666. RNA-seq data processing Reads were mapped to the mouse genome (mm9) with Bowtie (Langmead et al, 2009) using the command options -m 1--best –strata--solexa1.3-quals, and were assigned to exons of RefSeq genes using custom perl scripts. We used the gene symbol as the primary identifier. Supplementary Table S2 shows the numbers of mapped reads. We further generated a library of splice junctions based on RefSeq genes, mapped unmapped reads to these and added the numbers of hits to the genes. The numbers of mapped reads per gene were corrected for mapability based on the 'CRG' tracks of the UCSC genome browser. RPKM were then calculated. In the case that multiple splice variants existed, the most highly expressed one was selected as representative for a gene's expression level. For generating the RPKM distributions of intergenic regions, we considered regions with a distance of at least 10 kb to any RefSeq or Ensembl gene. The distribution was based on random fragments of the same length distribution as gene lengths. Mapability was accounted for, and the randomization was perfo
Referência(s)