Prevalence of transcription promoters within archaeal operons and coding sequences
2009; Springer Nature; Volume: 5; Issue: 1; Language: English
10.1038/msb.2009.42
ISSN 1744-4292
Authors: Tie Koide, David J. Reiss, J Christopher Bare, Wyming Lee Pang, Marc T. Facciotti, Amy K. Schmid, Min Pan, Bruz Marzolf, Phu T. Van, Fang‐Yin Lo, Abhishek Pratap, Eric W. Deutsch, Amelia C. Peterson, Dan Martin, Nitin S. Baliga
Topic(s): Genomics and Chromatin Dynamics
Article, 16 June 2009, Open Access

Author Information

Tie Koide1,‡, David J Reiss1,‡, J Christopher Bare1, Wyming Lee Pang1, Marc T Facciotti1,2, Amy K Schmid1, Min Pan1, Bruz Marzolf1, Phu T Van1, Fang-Yin Lo1, Abhishek Pratap1, Eric W Deutsch1, Amelia Peterson3, Dan Martin1,3 and Nitin S Baliga*,1,4

1 Institute for Systems Biology, Seattle, WA, USA
2 Department of Biomedical Engineering and UC Davis Genome Center, One Shields Avenue, University of California, Davis, CA, USA
3 Divisions of Human Biology and Clinical Research, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
4 Departments of Microbiology, and Molecular and Cellular Biology, University of Washington, Seattle, WA, USA
‡ These authors contributed equally to this work
* Corresponding author. Institute for Systems Biology, Departments of Microbiology, and Molecular and Cellular Biology, University of Washington, 1441 N 34th Street, Seattle, WA 98103, USA. Tel.: +1 206 732 1266; Fax: +1 206 732 1299; E-mail: [email protected]
Present address (Tie Koide): Departamento de Bioquímica e Imunologia, Faculdade de Medicina de Ribeirão Preto, Universidade de São Paulo, Brazil. E-mail: [email protected]

Molecular Systems Biology (2009) 5: 285; https://doi.org/10.1038/msb.2009.42

Abstract
Despite the knowledge of complex prokaryotic-transcription mechanisms, generalized rules, such as the simplified organization of genes into operons with well-defined promoters and terminators, have had a significant role in systems analysis of regulatory logic in both bacteria and archaea. Here, we have investigated the prevalence of alternate regulatory mechanisms through genome-wide characterization of transcript structures of ∼64% of all genes, including putative non-coding RNAs in Halobacterium salinarum NRC-1. Our integrative analysis of transcriptome dynamics and protein–DNA interaction data sets showed widespread environment-dependent modulation of operon architectures, transcription initiation and termination inside coding sequences, and extensive overlap in 3′ ends of transcripts for many convergently transcribed genes. A significant fraction of these alternate transcriptional events correlate with binding locations of 11 transcription factors and regulators (TFs) inside operons and annotated genes, events usually considered spurious or non-functional. Using experimental validation, we illustrate the prevalence of overlapping genomic signals in archaeal transcription, casting doubt on the general perception of rigid boundaries between coding sequences and regulatory elements.

Synopsis

Evidence is mounting that the standard model of transcription factor (TF) binding to intergenic regions is not always the rule. Although there is isolated prior evidence for functional consequences of TF binding inside coding sequences, this issue had not been systematically evaluated genome wide. We have conducted a study to investigate the genome-wide consequence of internal TF binding for nearly 10% of all TFs in an archaeal extremophile, Halobacterium salinarum NRC-1.
We show that a significant number of TF-binding sites (TFBS) inside the coding sequences are functional and have marked consequences, such as conditionally modulating the architecture of at least 43% of all operons in this organism. We present the integrated analysis of complementary systems-wide data on TFBS locations and dynamic modulation of transcriptome structure that led to this striking discovery.

Using ChIP–chip and the MeDiChI algorithm (Reiss et al, 2008), we precisely located TFBSs and determined their corresponding local false discovery rates (LFDRs) from new and previously reported genome-wide ChIP–chip measurements for 11 TFs: all TFBs (TFBa, TFBb, TFBc, TFBd, TFBe, TFBf and TFBg), one TBP (TBPb) and three transcriptional regulators (TRs) (Trh3, Trh4, VNG1451C) in H. salinarum NRC-1. Our conclusion from this analysis was that as many as 10% of all multi-TFBS loci were within coding regions.

To show that these TFBS have significant functional consequences on transcriptional regulation and cellular physiology, we used high-density genome tiling arrays to analyze the transcriptome structure (TS) of H. salinarum NRC-1 at different phases of growth in a batch culture, which is associated with differential regulation of over 65% of all genes. Through this analysis we assigned transcription start sites (TSSs) to 64% of all annotated genes, termination sites (TTSs) to 46% of the genes, verified the expression of 203 operons and discovered 5′ and 3′ UTRs for ∼65% of all genes and operons. Further, by correlating the transcribed units with chromosomal coordinates of predicted genes (Ng et al, 2000) and experimentally mapped peptides from large-scale proteomics studies (Van et al, 2008), we revised the translation start site for 61 genes, detected 10 new protein-coding genes, and discovered 61 new putative ncRNAs.
Although the physiological roles and mechanisms of action of specific ncRNAs remain to be uncovered, the bimodal distribution of correlations between the expression of ncRNAs and that of their antisense strands is consistent with the characterized roles of ncRNAs in the regulation of their cognate antisense transcripts. Finally, this analysis also showed a large mRNA population that has variable 3′-end locations and transcripts with extensive overlaps in their 3′ termini.

By integrating TFBS locations with the TS, we identified internal binding sites that are functional in the conditional modulation of operon organization. We assessed the global prevalence of such operons by devising a quantitative measure for classifying operons as conditional. Specifically, by integrating probe intensities of transcripts hybridized to the genome tiling array with gene-expression correlations derived from expression analysis of H. salinarum NRC-1 in 719 microarray experiments, we found that 43% of all operons are conditionally modulated. Remarkably, there was a strong functional link between transcription-factor binding inside operons and their classification as 'conditional' (P<10^−9). We transcriptionally fused two of these conditionally activated promoters inside coding sequences to a reporter gene encoding a fast-degrading GFP variant optimized for the high-salt cytoplasm of halophilic archaea. FACS analysis of cells harboring these internal promoter–reporter transcriptional fusions provided in vivo validation of growth-phase regulated transcription initiation inside coding sequences.

Although earlier studies have discovered internal promoters within a single gene or operon (Tsui et al, 1994; Guillot and Moran, 2007), we have significantly extended these findings to a genome-wide scale to show that biologically meaningful promoters do exist inside coding sequences at a frequency that is much higher than was previously appreciated.
Further, this discovery also shows how a simple prokaryote can use the same set of genes in different combinations to elicit complex responses according to an environmental challenge. Irrespective of the specific underlying mechanisms, our observations of widespread modulation of operon architecture, as well as transcription initiation and termination inside genes, all constitute evidence that archaea can intersperse regulatory logic within their coding sequence and thus blur the boundaries between coding and non-coding elements. We have shown that it is possible to use new high-throughput technologies to find these biologically important instances where transcriptional regulation does occur within coding sequences and, furthermore, that it is possible to globally characterize specific regulatory mechanisms responsible for these phenomena. Combined with new high-throughput sequencing technologies, our results will expand the view of genetic-information processing that can be investigated at high resolution (Nagalakshmi et al, 2008; Wilhelm et al, 2008). These data will enable construction of mechanistically accurate models for reliable systems re-engineering of biological circuits. Moreover, these findings suggest that the incorporation of mechanistic accuracy into GRN models would require operons, promoters, and terminators to be treated as dynamic entities.

Introduction

Systems-biology approaches have been successfully applied to construct quantitative and predictive models of biological networks (Bonneau et al, 2007; Faith et al, 2007). However, a significant amount of information is missing from these models because of incomplete parts lists (unannotated genes, non-coding RNAs (ncRNAs), poorly understood protein modifications and so on) as well as a lack of molecular detail associated with these processes.
Incorporating such detail will make these models mechanistically accurate and useful for synthetic-biology approaches targeting large-scale biological-circuit re-engineering. Among the current systems-scale models most amenable for such large-scale redesign are those that describe gene-regulatory networks (GRNs). GRN models are usually built upon transcriptome data, in which genes or gene modules (with similar expression patterns and shared regulatory motifs) are typically associated with their transcriptional regulators through linear or Bayesian models. However, although these models can be predictive (Bonneau et al, 2007), they often rely on approximations of the transcription process and lack finer details of dynamic environment-dependent assembly of transcription complexes at each of the numerous promoters in the genome. High-density tiling arrays can be used to define transcribed regions (David et al, 2006), start sites (McGrath et al, 2007), and protein–DNA interaction sites (Reiss et al, 2008), which can be used to identify some of these missing details associated with transcriptional regulation, and thereby enable us to construct systems-scale predictive models of GRNs that are also mechanistically accurate.

We recently constructed a model of an environment and gene-regulatory influence network (EGRIN) for the halophilic archaeon Halobacterium salinarum NRC-1. This model accurately predicts the transcriptional changes in 80% of all genes in response to new environmental and genetic perturbations (Bonneau et al, 2007). Using an integrated biclustering algorithm to identify regulons and their putative cis-regulatory motifs (Reiss et al, 2006), and a sparse regression procedure to statistically pair these regulons with their putative regulators (Bonneau et al, 2006), we were able to discover the combinatorial and conditional regulation of genes by multiple TFs and EFs (environmental factors) (Bonneau et al, 2007).
Although several of the statistically inferred influences in this network were shown to be likely mediated through direct interactions with the promoters of regulated genes, a large number of influences are thought to be indirect. The logical next step is to make this quantitative and predictive network also mechanistically accurate on a systems scale.

Construction of a mechanistically accurate systems-scale model is a reasonable expectation for Halobacterium salinarum NRC-1, as its transcription is driven by a simplified version of a eukaryotic RNA polymerase (RNAP) II (Hirata et al, 2008) in a genome with prokaryotic organization. The archaeal RNAP requires only two general transcription factors (GTFs), TATA-binding protein (TBP) and transcription factor B (TFB), for promoter recruitment and basal transcription initiation. Furthermore, only ∼130 putative transcriptional regulators (TRs) are present among the ∼2400 genes encoded in the genome of H. salinarum NRC-1 (Ng et al, 2000). A relatively small number of genes and few TFs (GTFs and TRs) together make H. salinarum NRC-1 an attractive model system for characterizing gene-regulatory mechanisms at all promoters. Notably, the combinatorial action of multiple TFBs and TBPs (H. salinarum NRC-1 possesses 6 TBPs and 7 TFBs) in defining basal promoter architecture in most archaea (Baliga et al, 2000; Facciotti et al, 2007) provides a unique opportunity to characterize dynamic conditional regulation of a large fraction of genes during cellular responses to complex changes.

Here, we report a significant step toward a mechanistically accurate EGRIN model by characterizing the dynamic remodeling of the transcriptome structure of H. salinarum NRC-1 during a complex cellular response, and correlating these changes to genome-wide binding locations of 50% of all predicted GTFs as well as several specific TRs.
By integrating diverse data types, we identified: (i) transcription start sites (TSSs) and termination sites (TTSs) for ∼64% of the genes, including new and revised protein-coding genes; (ii) 61 new ncRNA candidates; (iii) 5′ and 3′ untranslated regions (UTRs) of mRNAs; (iv) functional promoters upstream and internal to coding regions; (v) instances of transcription termination inside coding sequences; (vi) mRNA populations with variable 3′-end locations; (vii) transcripts with extensive overlaps in their 3′ termini; and (viii) operon-encoding transcripts of variable length. Significantly, these findings suggest that the incorporation of mechanistic accuracy into GRN models would require genes, operons, promoters, and terminators to be treated as dynamic entities.

Results

Genome-wide protein–DNA binding data show TF binding inside genes and operons

A detailed map of genomic locations where TFs bind DNA and modulate transcription is essential to model mechanisms of gene regulation on a systems scale. Chromatin immunoprecipitation of transcription complexes coupled to microarray (ChIP–chip; Ren et al, 2000) or sequencing (ChIP–seq; Robertson et al, 2007) is a commonly used approach to construct such maps. In ChIP–chip, the resolution to which the protein–DNA binding sites (TFBSs) can be identified is often limited by the genomic spacing of the probes in the array. We utilized the MeDiChI algorithm (Reiss et al, 2008) to estimate precise TFBS locations and their corresponding local false discovery rates (LFDRs) from new and previously reported genome-wide ChIP–chip measurements for 11 TFs (with two or more biological replicates for each): all TFBs (TFBa, TFBb, TFBc, TFBd, TFBe, TFBf, and TFBg), one TBP (TBPb) and three TRs (Trh3, Trh4, and VNG1451C) in H. salinarum NRC-1 (see Materials and methods).
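As an illustration of how per-TF site calls of this kind can be combined, the sketch below filters sites by LFDR and groups nearby sites from different TFs into shared loci. The LFDR cutoff of 0.1, the ±50 nt window, and the three-TF minimum are the values reported in the Results; the greedy grouping rule, the function name `multi_tf_loci`, and the example coordinates are assumptions made for illustration and should not be read as the authors' published procedure.

```python
from typing import List, Tuple

Site = Tuple[int, str, float]  # (position in nt, TF name, local FDR)
Locus = Tuple[int, List[str]]  # (locus centre in nt, sorted TF names)


def multi_tf_loci(sites: List[Site], window: int = 50,
                  max_lfdr: float = 0.1, min_tfs: int = 3) -> List[Locus]:
    """Group significant sites into loci bound by >= min_tfs distinct TFs.

    Assumed rule (hypothetical, not taken from the paper): walk along the
    genome and keep adding sites to the current group while they lie within
    2 * window of the group's first site, so a group spans at most 2 * window.
    """
    # Keep only sites that pass the LFDR cutoff, ordered by position.
    kept = sorted((pos, tf) for pos, tf, lfdr in sites if lfdr < max_lfdr)
    loci: List[Locus] = []
    group: List[Tuple[int, str]] = []

    def flush() -> None:
        # Report the group only if enough distinct TFs contribute to it.
        tfs = sorted({tf for _, tf in group})
        if len(tfs) >= min_tfs:
            centre = round(sum(pos for pos, _ in group) / len(group))
            loci.append((centre, tfs))

    for pos, tf in kept:
        if group and pos - group[0][0] > 2 * window:
            flush()
            group = []
        group.append((pos, tf))
    if group:
        flush()
    return loci


# Made-up coordinates, for illustration only.
example_sites = [
    (1000, "TFBa", 0.01), (1020, "TFBb", 0.05), (1060, "TFBd", 0.02),
    (1070, "TFBe", 0.50),                        # fails the LFDR cutoff
    (5000, "TFBa", 0.01), (5010, "TFBa", 0.01),  # one TF only
    (9000, "Trh3", 0.01), (9040, "TFBg", 0.01),  # two TFs only
]
print(multi_tf_loci(example_sites))  # -> [(1027, ['TFBa', 'TFBb', 'TFBd'])]
```

Only the first cluster survives, because it is the only one where three distinct TFs pass the LFDR filter within the window. A production analysis would also need to handle the circular replicons and propagate the per-site uncertainties, which this sketch ignores.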
On the basis of simulations similar to those of Reiss et al (2008), with a noise model customized to mimic the data used in this study, we estimated that the positional uncertainty in TFBS locations identified by MeDiChI averaged ∼50 nucleotides (nt) (1 SE) over all ChIP–chip data sets used in this study. We found that the 3072 significant (LFDR<0.1) individual TFBSs for all data sets often fell within distinct loci where at least three different TFs were observed within a ±50 nt window (P<10^−8). We therefore refined this TFBS list to a conservative set of 318 such distinct 'multi-TF-binding loci' (hereafter 'TFBS loci') throughout the genome (Table 1; see Supplementary Table 1 for each locus). As we applied to each individual data set an LFDR cutoff of 0.1, which by itself is rather stringent, the joint LFDR of these 318 TFBS loci is significantly smaller than that. Although each individual TF had a significant bias of binding in annotated intergenic regions (∼60%, on average, versus ∼16% expected), this fraction increased to ∼70% (276) when considering the 318 TFBS loci (P∼10^−31). Monte Carlo simulations of TFBSs placed only in non-coding regions in the genome with a ∼50–75 nt positional uncertainty and an LFDR between 0.1 and 0.01 show that 80–85% of detected TFBSs should fall in intergenic regions (for more details, see Materials and methods). Thus, our assessment was that a small but significant fraction of the TFBS loci in our ChIP–chip data sets (as many as ∼10% of the multi-TFBS loci) fell within coding regions. From here onwards, we present detailed and systematic experimental validation that shows that many of these TF-binding events inside coding sequences have significant consequences on the transcriptional regulation of diverse aspects of cellular physiology.

Table 1. Numbers of TFBS loci comprised of varying numbers of individual TFBS and their distribution in annotated coding sequences, predicted operons, and conditional predicted operons

Number of loci    Total    In annotated coding sequence    In predicted operons    In conditional predicted operons (P-value)
With ⩾1 TFBS      1249     368                             82                      58 (1.4 × 10^−10)
With ⩾2 TFBS      649      231                             34                      28 (4.3 × 10^−8)
With ⩾3 TFBS      318      96                              13                      13 (P-value lost in extraction)
With >3 TFBS      196      56                              10                      10 (<1 × 10^−30)

The reported results of this paper utilize the 318 very stringent ⩾3 TFBS loci, but clearly the same conclusion holds (although the numbers increase) as this threshold is relaxed. The P-values were estimated for the probability of observing as many TFBS loci internal to conditional operons (last column), given the number of TFBS loci observed internal to all predicted operons (preceding column), and the estimated fraction of conditional operons (∼43%; see Results and Discussion).

Analysis of transcriptome structure shows new expression features

The location of a TFBS in the vicinity of a TSS or a TTS could indicate whether a given binding event is functional, especially for the interactions localized within a gene or operon. We investigated this by systematically mapping transcript boundaries and their dynamic changes at the whole-genome level using genome-wide tiling array data and then integrating this information with the TF-binding information. We define transcriptome structure as the collection of TSSs and TTSs that together characterize transcriptional units (mono- and polycistronic mRNAs, tRNAs, rRNAs, and other ncRNAs). Sequence signatures for these features are yet to be characterized in archaea, and computational predictions based on known signatures in bacteria and eukaryotes remain error prone due to incomplete understanding of transcription processes in all organisms (Jones, 2006). Therefore, we experimentally mapped the transcriptome structure of H.
salinarum NRC-1 by hybridizing total RNA (including RNA species <200 nt) to genome-wide high-density tiling arrays (60mer probes with 40 nt overlap between contiguous probes). We first applied a segmentation algorithm based on regression trees (see Materials and methods) to map transcript boundaries in cells cultured under standard laboratory growth conditions (mid-logarithmic phase, 37°C, 225 r.p.m. shaking; hereafter 'reference RNA') (Figure 1A). Although this approach effectively mapped TSSs for mRNAs, tRNAs, rRNAs and probable ncRNAs with significant expression levels, it was ambiguous for genes with low expression levels. Moreover, TTSs proved difficult to determine in general, even for highly expressed genes, because no sharp boundaries were observed for most transcripts at the 3′ termini (Figure 1A; Supplementary Figure 1B). We overcame these challenges and recovered further information by analyzing dynamic modulation of the transcriptome structure during typical growth of a batch culture under standard conditions (Figure 1B).

Figure 1. Transcriptome structure and growth-phase-dependent changes in Halobacterium salinarum NRC-1. (A) Genome map of a segment of the main chromosome of H. salinarum NRC-1 (NC_002607) with corresponding signal intensity of total RNA from a mid-log phase culture ('reference RNA') hybridized to 60mer overlapping probes in a high-density tiling array. Genes in the forward and reverse strands are shown in yellow and orange, respectively. Each blue dot represents probe intensity (in log2 scale) at the given genomic location in the forward (upper panel) or reverse (lower panel) strands. The overlaid red line is the result of a segmentation algorithm that was applied to determine transcription start sites (TSS and black arrows), transcription termination sites (TTS), untranslated regions in mRNAs (3′ UTR), and putative non-coding RNAs. (B) Dynamic changes in transcriptome structure were evaluated (Figure 2) at different phases of growth in a standard laboratory batch culture. Important physiological changes that are reflected in differential expression of corresponding mRNAs during the various phases of growth are indicated with a heat map (Facciotti et al, submitted).

H. salinarum NRC-1 presents a number of interesting switches in metabolism during growth (Facciotti et al, submitted) because of complex changes in EFs, including pH, oxygen, nutrition, and so on (Schmid et al, 2007). Although most single perturbations (radiation, oxygen, metals, and so on) affect the expression of only ∼10% of all genes (Baliga et al, 2004; Kaur et al, 2006; Whitehead et al, 2006), the changes during growth resulted in differential regulation of a significantly higher proportion of genes (∼63%, 1518 genes) (Figure 1B). These conditions thus enabled the investigation of a wider transcriptional landscape, which includes not only modulation of transcript levels (Figure 1B), but also extensive changes in transcriptome structure. We observed altered TSSs, TTSs, operon organizations, and differential regulation of putative ncRNAs (Supplementary Figure 1). By integrating hybridization signals (Figures 1A and 2B) with dynamic growth-related changes (Figure 2C and D), we estimated the probability that each tiling array probe was complementary to a transcribed region, mapped locations of putative transcript boundaries (Figure 2E; see Materials and methods) and identified 1574 TSSs and 1952 TTSs for most genes with some transcriptional variation. Subsequently, we manually assessed and curated gene assignment to each TSS and TTS. The error of these assignments is given by the resolution of probes on the tiling array (20 nt). In sum, TSSs were assigned to 64% (1156 singletons and 544 genes in 203 operons) of all annotated genes and TTSs were assigned to 1114 genes and 202 operons.
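To make the idea of regression-tree segmentation concrete, the sketch below recursively splits a one-dimensional series of probe intensities at the breakpoint that most reduces the within-segment squared error, which is how a regression tree partitions a single ordered predictor. It is a minimal illustration only: the stopping parameters `min_size` and `min_gain`, the function names, and the toy signal are assumptions, and the authors' actual algorithm (see Materials and methods) also integrates growth-related changes that are not modeled here. The 20 nt step follows from the 60mer probes with 40 nt overlap described above.

```python
from typing import List, Optional, Tuple

PROBE_LENGTH = 60                           # nt, from the text
PROBE_OVERLAP = 40                          # nt, from the text
PROBE_STEP = PROBE_LENGTH - PROBE_OVERLAP   # 20 nt between probe starts


def sse(values: List[float]) -> float:
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(values: List[float], min_size: int) -> Tuple[Optional[int], float]:
    """Return the split index with the lowest total SSE and the SSE reduction."""
    total = sse(values)
    best_idx, best_cost = None, total
    for i in range(min_size, len(values) - min_size + 1):
        cost = sse(values[:i]) + sse(values[i:])
        if cost < best_cost:
            best_idx, best_cost = i, cost
    return best_idx, total - best_cost


def segment(values: List[float], min_size: int = 3, min_gain: float = 1.0,
            offset: int = 0) -> List[int]:
    """Recursively split the signal; return sorted breakpoint probe indices.

    min_size and min_gain are hypothetical stopping rules for this sketch.
    """
    idx, gain = best_split(values, min_size)
    if idx is None or gain < min_gain:
        return []
    left = segment(values[:idx], min_size, min_gain, offset)
    right = segment(values[idx:], min_size, min_gain, offset + idx)
    return left + [offset + idx] + right


def probe_start(index: int, first_start: int, step: int = PROBE_STEP) -> int:
    """Genomic start coordinate of the probe at a given index."""
    return first_start + index * step


# Toy log2 intensities: background, one expressed block, background.
signal = [2.0] * 5 + [9.0] * 6 + [2.0] * 5
breaks = segment(signal)
print(breaks)                                   # -> [5, 11]
print([probe_start(b, 1000) for b in breaks])   # -> [1100, 1220]
```

The quadratic cost of re-computing the error at every candidate split is acceptable for a toy signal; cumulative sums would be the usual optimization for a genome-scale array. Because each breakpoint falls between two probes, its genomic position is known only to within one probe step, consistent with the 20 nt error stated in the text.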
A TSS and a TTS together define a unit of transcription (Supplementary Table 2). We describe below how, by correlating locations of these transcriptional units to predicted coding sequences in the genome, we were able to characterize and discover new features within the transcriptome structure.

Transcription of mono- and polycistronic mRNAs. In many organisms, especially prokaryotes, genes of related function are often co-transcribed as a single polycistronic mRNA (operons). Operon predictions based on genome-specific distance models, combined with comparative genomics and functional features, identified 299 operons in H. salinarum NRC-1 (Price et al, 2005). According to our analysis of 1,698 genes with significant transcription signal, at least 544 (32%) genes were transcribed as polycistronic mRNAs in 203 operons. Comparative analysis with the predicted operon structures identified 123 new or truncated operons, which are dynamically regulated during growth.

Discovery of leaderless transcripts and 5′ and 3′ UTRs. UTRs in the proximal (5′) end of transcripts often contain signals, such as the Shine–Dalgarno (SD) sequence signature for ribosome loading (Sartorius-Neef and Pfeifer, 2004). Although some mRNAs spanned a short distance beyond the coding-sequence boundaries, others were significantly longer (greater than the error in transcript-boundary assignment, 20 nt), with 5′ (457 transcripts, 40% of genes assigned to an experimentally determined TSS) and/or 3′ (857 transcripts, 77% of the genes assigned to an experimentally determined TTS) UTRs (Supplementary Table 2 and Supplementary Figure 2). We validated the TSS and 5′ UTR lengths by comparing our observed UTR lengths with those experimentally measured in a closely related strain, H. salinarum R-1 (Brenneis et al, 2007).
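The UTR bookkeeping described above can be sketched as a strand-aware subtraction of coding-sequence boundaries from transcript boundaries, followed by a comparison against the 20 nt mapping error. The function names, the coordinate convention (`cds_left` < `cds_right` as genome coordinates regardless of strand), and the example values below are assumptions for illustration; only the 20 nt threshold comes from the text.

```python
from typing import Tuple

MAPPING_ERROR_NT = 20  # transcript-boundary resolution stated in the text


def utr_lengths(tss: int, tts: int, cds_left: int, cds_right: int,
                strand: str) -> Tuple[int, int]:
    """Return (5' UTR length, 3' UTR length) in nt.

    cds_left/cds_right are the leftmost and rightmost coding coordinates.
    Boundaries mapped inside the coding sequence are clipped to 0 here.
    """
    if strand == "+":
        five_prime = cds_left - tss
        three_prime = tts - cds_right
    elif strand == "-":
        # On the reverse strand, upstream means a higher coordinate.
        five_prime = tss - cds_right
        three_prime = cds_left - tts
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(five_prime, 0), max(three_prime, 0)


def exceeds_mapping_error(length: int, error: int = MAPPING_ERROR_NT) -> bool:
    """True if the extension is longer than the boundary-assignment error."""
    return length > error


# Made-up coordinates, for illustration only.
fwd = utr_lengths(tss=950, tts=2100, cds_left=1000, cds_right=2000, strand="+")
rev = utr_lengths(tss=2010, tts=700, cds_left=1000, cds_right=2000, strand="-")
print(fwd, rev)  # -> (50, 100) (10, 300)
print(exceeds_mapping_error(rev[0]), exceeds_mapping_error(rev[1]))  # -> False True
```

In this toy case the reverse-strand gene has a 5′ extension of 10 nt, which is within the mapping error and so would be a candidate leaderless transcript under this rule, whereas its 300 nt 3′ extension would count as a UTR.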
We found that, on average, the predicted 5′-UTR lengths correlate strongly with those determined by Brenneis et al (2007) (P<0.001); however, the predicted NRC-1 3′ UTRs are usually longer (on average 1.8 ± 1.3 times longer than those of R-1) (Supplementary Figure 2). Interestingly, 137 transcript pairs had overlapping 3′ ends (Supplementary Table 3) ranging from 25 to 788 nt in length, with a median length of 264 nt.

Distance between newly mapped TSSs and GTF-binding sites agrees with earlier knowledge of GTF binding. It is known that the archaeal pre-initiation complex lies 25–30 nt upstream of the TSS (Bell et al, 1999). Although the relatively large uncertainty in the MeDiChI-mapped TFBSs precludes the quantification of this distance for individual TSSs, we found that the 318 TFBS loci (defined above) lie at an average of 24 nt (95% probability that the average falls between 35 and 16 nt) upstream of the nearest TSS. This may be compared with an average upstream distance of 59 nt (95% probability that the average lies between 69 and 49 nt) between the TFBS loci and the first (annotated) translation codon. This difference is further evidence of the significant number of genes with 5′ UTRs (see above).

Revisions of predicted translation start sites and discovery of new protein-codin
Reference(s)