Open reading frame dominance indicates protein‐coding potential of RNAs
2022; Springer Nature; Volume: 23; Issue: 6; Language: English
10.15252/embr.202154321
ISSN 1469-3178
Authors: Yusuke Suenaga, Mamoru Kato, Momoko Nagai, Kazuma Nakatani, Hiroyuki Kogashi, Miho Kobatake, Takashi Makino
Topic(s): RNA and protein synthesis mechanisms
Summary

Article. 19 April 2022. Open Access. Transparent process.

Yusuke Suenaga (Corresponding Author), [email protected], orcid.org/0000-0001-6902-5386. Department of Molecular Carcinogenesis, Chiba Cancer Centre Research Institute, Chiba, Japan. Contribution: Conceptualization, Data curation, Formal analysis, Supervision, Funding acquisition, Validation, Investigation, Visualization, Methodology, Writing - original draft, Project administration, Writing - review & editing

Mamoru Kato, orcid.org/0000-0002-8485-8316. Division of Bioinformatics, National Cancer Centre Research Institute, Tokyo, Japan. Contribution: Conceptualization, Data curation, Supervision, Validation, Investigation, Methodology, Writing - original draft, Project administration, Writing - review & editing

Momoko Nagai. Division of Bioinformatics, National Cancer Centre Research Institute, Tokyo, Japan. Contribution: Data curation, Software, Formal analysis, Validation, Methodology, Writing - original draft, Writing - review & editing

Kazuma Nakatani. Department of Molecular Carcinogenesis, Chiba Cancer Centre Research Institute, Chiba, Japan; Department of Molecular Biology and Oncology, Chiba University School of Medicine, Chiba, Japan; Innovative Medicine CHIBA Doctoral WISE Program, Chiba University School of Medicine, Chiba, Japan. Contribution: Data curation, Formal analysis, Validation, Investigation

Hiroyuki Kogashi, orcid.org/0000-0002-3855-6309. Department of Molecular Carcinogenesis, Chiba Cancer Centre Research Institute, Chiba, Japan; Department of Molecular Biology and Oncology, Chiba University School of Medicine, Chiba, Japan. Contribution: Data curation, Formal analysis, Validation, Investigation

Miho Kobatake. Department of Molecular Carcinogenesis, Chiba Cancer Centre Research Institute, Chiba, Japan. Contribution: Data curation, Formal analysis, Validation, Investigation

Takashi Makino, orcid.org/0000-0003-4600-9353. Laboratory of Evolutionary Genomics, Graduate School of Life Sciences, Tohoku University, Sendai, Japan. Contribution: Data curation, Formal analysis, Supervision, Methodology, Writing - original draft

Author Information

Yusuke Suenaga*,1,†, Mamoru Kato2,†, Momoko Nagai2, Kazuma Nakatani1,3,4, Hiroyuki Kogashi1,3, Miho Kobatake1 and Takashi Makino5
1 Department of Molecular Carcinogenesis, Chiba Cancer Centre Research Institute, Chiba, Japan
2 Division of Bioinformatics, National Cancer Centre Research Institute, Tokyo, Japan
3 Department of Molecular Biology and Oncology, Chiba University School of Medicine, Chiba, Japan
4 Innovative Medicine CHIBA Doctoral WISE Program, Chiba University School of Medicine, Chiba, Japan
5 Laboratory of Evolutionary Genomics, Graduate School of Life Sciences, Tohoku University, Sendai, Japan
† These authors contributed equally to this work
* Corresponding author. Tel: +81 43 264 5431; E-mail: [email protected]

EMBO Reports (2022) 23: e54321. https://doi.org/10.15252/embr.202154321
Abstract

Recent studies have identified numerous RNAs with both coding and noncoding functions. However, the sequence characteristics that determine this bifunctionality remain largely unknown. In the present study, we develop and test the open reading frame (ORF) dominance score, which we define as the fraction of the longest ORF in the sum of all putative ORF lengths. This score correlates with translation efficiency in coding transcripts and with translation of noncoding RNAs. In bacteria and archaea, coding and noncoding transcripts have narrow distributions of high and low ORF dominance, respectively, whereas those of eukaryotes show relatively broader ORF dominance distributions, with considerable overlap between coding and noncoding transcripts. The extent of overlap positively and negatively correlates with the mutation rate of genomes and the effective population size of species, respectively. Tissue-specific transcripts show higher ORF dominance than ubiquitously expressed transcripts, and the majority of tissue-specific transcripts are expressed in mature testes. These data suggest that the decrease in population size and the emergence of testes in eukaryotic organisms allowed for the evolution of potentially bifunctional RNAs.

Synopsis

The open reading frame dominance score correlates with translation efficiency of coding transcripts and with translation of noncoding RNAs. The score distributions in eukaryotes show overlap between coding and noncoding transcripts.
ORF dominance score correlates with translation efficiency of coding transcripts and with translation of noncoding RNAs.
Coding and noncoding transcripts of eukaryotes show broader ORF dominance distributions, with overlap between coding and noncoding transcripts.
The extent of overlap between coding and noncoding transcripts positively and negatively correlates with the mutation rate of genomes and the effective population size of species, respectively.

Introduction

Recent advances in RNA-sequencing technology have revealed that most of the eukaryotic genome is transcribed, primarily producing noncoding RNAs (Okazaki et al, 2002; Djebali et al, 2012; Ulitsky & Bartel, 2013; Kopp & Mendell, 2018). Noncoding RNAs longer than 200 nucleotides are termed long noncoding RNAs (lncRNAs) and are defined as not being translated into proteins (Ulitsky & Bartel, 2013; Kopp & Mendell, 2018). lncRNAs have been reported to participate in multiple biological phenomena, including the regulation of transcription, modulation of protein or RNA functions, and nuclear organization (Ulitsky & Bartel, 2013; Kopp & Mendell, 2018). Paradoxically, however, a large fraction of lncRNAs is associated with ribosomes and translated into peptides (Frith et al, 2011; Ingolia et al, 2011; Bazzini et al, 2014; Ingolia, 2014; Ruiz-Orera et al, 2014). Peptides translated from transcripts annotated as lncRNAs have multiple biological functions in several eukaryotes (Li & Liu, 2019; Huang et al, 2021), and some of these translation events are specific to the cellular context (Douka et al, 2021). Conversely, known protein-coding genes, such as TP53, can also function as RNAs (Candeias, 2011; Kloc et al, 2011; Huang et al, 2021). The discovery of these RNAs with binary functions has blurred the distinction between coding and noncoding RNAs, and the characteristics of RNA sequences that explain the continuum between noncoding and coding transcripts remain unclear.
During evolution, new genes originate from preexisting genes via gene duplication or from nongenic regions via the generation of new open reading frames (ORFs) (Ohno, 1970; Chen et al, 2013; Zhang & Long, 2014; McLysaght & Guerzoni, 2015; McLysaght & Hurst, 2016; Holland et al, 2017). The latter are de novo genes (Begun et al, 2006, 2007; Levine et al, 2006; Knowles & McLysaght, 2009; Toll-Riera et al, 2009; Li et al, 2009, 2010a), which have been shown to regulate biological processes and diseases (Chen et al, 2013; Zhang & Long, 2014; McLysaght & Guerzoni, 2015), including brain function and carcinogenesis in humans (Li et al, 2010b; Suenaga et al, 2014). lncRNAs can serve as sources of de novo genes (Ruiz-Orera et al, 2014), some of which evolve to encode proteins. In addition to ORFs exposed to natural selection, neutrally evolving ORFs are also translated from lncRNAs that stably express peptides (Ruiz-Orera et al, 2018), providing a basis for the development of new functional peptides/proteins. High levels of lncRNA expression (Ruiz-Orera et al, 2018), hexamer frequencies of ORFs (Sun et al, 2013; Wang et al, 2013; Ruiz-Orera et al, 2014), and intrinsic disorder of protein products (Heames et al, 2020) have been proposed as determinants of coding potential; however, the molecular mechanisms by which lncRNAs evolve into new coding transcripts remain unclear (Van Oss & Carvunis, 2019).

In the present study, we sought to identify a new indicator for determining RNA protein-coding potential. First, we defined the primary ORF as the longest of all ORFs of a given RNA, and we defined the indicator as the fraction that the primary ORF length constitutes of the sum of all putative ORF lengths. We subsequently examined the associations between this indicator and protein-coding potential.
More than 3.4 million transcripts in 100 organisms belonging to all three domains of life were analyzed to investigate the relationship between this indicator and protein-coding potential over evolutionary history.

Results

Coding transcripts show higher ORF dominance in humans and mice

We previously identified a de novo gene, NCYM, and revealed its biochemical function (Suenaga et al, 2014, 2020; Kaneko et al, 2015; Shoji et al, 2015; Matsuo et al, 2021). However, NCYM was previously registered as a noncoding RNA in the National Center for Biotechnology Information (NCBI) nucleotide database, and the coding potential assessment tool (CPAT), which is the established predictor for protein-coding potential (Wang et al, 2013), showed NCYM had a coding probability of 0.022, labeling it as a noncoding RNA (Appendix Fig S1). Therefore, we sought to identify a new indicator for coding potential by comparing NCYM with a small subset of coding and noncoding RNAs to determine whether its sequence features would allow NCYM to be registered as a coding transcript. We found that predicted ORFs, other than major ORFs, were short in coding RNAs. In addition, it has been reported that upstream ORFs inhibit the translation of major ORFs (Calvo et al, 2009). Therefore, we hypothesized that the predicted ORFs may reduce the translation of major ORFs, thereby becoming short in the coding transcripts, including NCYM, during evolution. Major ORFs are often the longest ORFs (hereafter primary ORFs or pORFs) in coding transcripts. Thus, to investigate the importance of pORFs relative to other ORFs (hereafter secondary ORFs or secORFs) for the evolution of coding genes, we defined ORF dominance as the occupancy of the pORF length relative to the total ORF length (Fig 1A and B) and assumed that ORF dominance was high in coding transcripts. To examine this hypothesis, we first calculated ORF dominance for all human transcripts.
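The definition of ORF dominance given above (pORF length divided by the summed length of all ORFs in the three reading frames) can be illustrated with a minimal sketch. The ORF-calling rules used here are assumptions that this section does not spell out: an ORF runs from the first ATG after the previous in-frame stop to the next in-frame stop codon, the stop codon is counted in the length, only the three forward frames are scanned, and ORFs lacking a stop codon are ignored. The function names `find_orf_lengths` and `orf_dominance` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orf_lengths(seq):
    """Return ORF lengths (in nucleotides) over the three forward frames.

    Assumed rules (not taken from the source): ATG start, first in-frame
    stop ends the ORF, stop codon included, unterminated ORFs skipped.
    """
    seq = seq.upper().replace("U", "T")
    lengths = []
    for frame in range(3):
        start = None  # position of the current open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                lengths.append(i + 3 - start)
                start = None
    return lengths


def orf_dominance(seq):
    """Longest ORF length divided by the sum of all ORF lengths.

    Returns None when no ORF is found under the assumed rules.
    """
    lengths = find_orf_lengths(seq)
    if not lengths:
        return None
    return max(lengths) / sum(lengths)


# Toy transcript: a 12-nt ORF in frame 0 and a 6-nt ORF in frame 1.
toy = "ATGAAAAAATAGCATGTGA"
print(find_orf_lengths(toy), orf_dominance(toy))
```

Under these assumptions a transcript with a single ORF scores 1, and every additional secondary ORF lowers the score in proportion to its length.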
We analyzed the human transcripts in the NCBI nucleotide database, including both coding and noncoding (RefSeq accession numbers starting with NM and NR, respectively) transcripts. The data were downloaded using the Table Browser (https://genome.ucsc.edu/cgi-bin/hgTables) after setting the track tab as “RefSeq Genes”. A total of 50,052 coding (NM) and 13,550 noncoding (NR) RNAs were registered in the database in 2018 (Dataset EV1). To analyze putative lncRNAs with protein-coding potential, we excluded small RNAs (shorter than 200 nucleotides) or RNAs with a short pORF (less than 20 amino acids) from the NR transcripts, as reported previously (Bazzini et al, 2014; Ruiz-Orera et al, 2014; Schmitz et al, 2018), focusing on the remaining 12,827 transcripts.

Figure 1. ORF dominance predicts the protein-coding potential of human transcripts
A. Conceptual schematic representation of ORFs in the three reading frames of an RNA and definition of ORF dominance. Black and white rectangles indicate primary and secondary ORFs, respectively. The primary ORF is the longest ORF, while secondary ORFs are all others; l is ORF length.
B. Schematic representation of ORF distributions in RNAs with low (0-0.5), medium (0.5), and high (1) ORF dominance.
C. Relative frequencies of ORF dominance of coding, f(x), and noncoding, g(x), transcripts (upper) and of random controls (bottom).
D. Explanation of F(x) for an ORF dominance of 0.15.
E. ORF dominance correlations with protein-coding potential, F(x), at ORF dominance ≤ 0.65 (upper) and those in random controls (lower).
F. Relationship between ORF dominance and percentages of NR transcripts reregistered as NM during the past 3 years. N.D., not detected.
G. Relationship between ORF dominance and F(x) in human transcripts syntenic to chimpanzee (upper left) and mouse (bottom left). The relative frequency of transcripts with negative selection, h(x), is plotted for each ORF dominance (upper and bottom right). The transcripts are syntenic to the genome of chimpanzee (upper right) and mouse (bottom right). The open circles indicate NR transcripts, and the full circles indicate NM transcripts.

We analyzed the relative frequencies of NM and NR transcripts, designated as f(x) and g(x), respectively (Fig 1C), where x indicates ORF dominance. In human transcripts, g(x) showed a distribution shifted to the left with an apex of 0.15; in contrast, the distribution of f(x) shifted to the right with an apex of 0.55 (Fig 1C, upper panel). We generated nucleic acid control sequences in which A/T/G/C bases were randomly assigned with equal probabilities. In these controls, the relative frequencies of ORF dominance shifted to the left in both coding and noncoding transcripts (Fig 1C, bottom panel). The controls that randomly shuffled the original sequence without affecting the number of A/T/C/G bases in each transcript also had relative frequencies of ORF dominance shifted to the left in both coding and noncoding transcripts (Appendix Fig S2A). Similar results were obtained using a dataset from the Ensembl database (Appendix Fig S2B). We also calculated the ORF dominance of mouse transcripts from RefSeq and Ensembl and found that the distribution of f(x) was shifted to the right with an apex of 0.55 (Appendix Fig S2C), similar to that of human transcripts.

ORF dominance correlates with protein-coding potential in human and mouse

Next, we examined the relationship between ORF dominance and protein-coding potential. Based on the ORF dominance distributions of coding and noncoding transcripts, protein-coding potential, F(x), was defined as the probability of a transcript being a coding RNA given an ORF dominance of x. A sample F(0.15) calculation for human transcripts is shown in Fig 1D. This result indicates that any given human RNA transcript with a calculated ORF dominance of 0.15 has a protein-coding potential F(x) of 0.183.
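The step from the two distributions to F(x) can be sketched as follows. This section does not give the formula behind Fig 1D, so the sketch assumes that F(x) is the coding share of the two relative frequencies in a given ORF dominance bin, F(x) = f(x) / (f(x) + g(x)); a count-based variant would weight the two classes by their sizes instead. The bin width, the toy scores, and the function names `relative_frequencies` and `coding_potential` are all hypothetical, and the example does not reproduce the reported F(0.15) = 0.183 because the underlying frequencies are not listed here.

```python
import math
from collections import Counter


def relative_frequencies(scores, n_bins=10):
    """Map bin index -> fraction of scores falling in that bin.

    Scores lie in [0, 1]; a score of exactly 1 is placed in the last bin.
    """
    counts = Counter(min(math.floor(s * n_bins), n_bins - 1) for s in scores)
    return {b: c / len(scores) for b, c in counts.items()}


def coding_potential(f, g, bin_index):
    """Assumed form of F(x): f(x) / (f(x) + g(x)) for one bin.

    Returns None for bins in which neither class is observed.
    """
    fx = f.get(bin_index, 0.0)
    gx = g.get(bin_index, 0.0)
    if fx + gx == 0:
        return None
    return fx / (fx + gx)


# Toy ORF dominance scores (illustrative only, not data from the study).
coding_scores = [0.52, 0.57, 0.12, 0.61]
noncoding_scores = [0.12, 0.13, 0.14, 0.58]

f = relative_frequencies(coding_scores)
g = relative_frequencies(noncoding_scores)
for b in sorted(set(f) | set(g)):
    print(f"bin {b / 10:.1f}-{(b + 1) / 10:.1f}: F = {coding_potential(f, g, b):.3f}")
```

Working with relative frequencies rather than raw counts keeps the estimate from being driven solely by the much larger number of NM than NR transcripts, which is one reason this form was chosen for the sketch.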
F(x) correlated with ORF dominance in the range ORF dominance ≤ 0.65 (Fig 1E and Appendix Fig S3A). The protein-coding potentials of the sequences in the RefSeq database slightly decreased after peaking at 0.65 (Fig 1E), whereas those of sequences in the Ensembl database remained high (Appendix Fig S3A). The F(x) of the human transcripts was estimated using the following linear regressions:

For Ensembl data, $F(x) = 1.301x + 0.0072$ ($x \le 0.65$), $R^2 = 0.984$
For RefSeq data, $F(x) = 1.313x + 0.0189$ ($x \le 0.65$), $R^2 = 0.990$

The intercepts were near zero, and the slopes were approximately 1.3 for both equations. Using these equations, the F(x) of any given human transcript with an ORF dominance ≤ 0.65 can be calculated. For example, the F(x) of NCYM was estimated to be 0.746 or 0.765 based on Ensembl or RefSeq data, respectively (Appendix Fig S1D). In contrast, the F(x) of the control sequences was not correlated with ORF dominance (Fig 1E, bottom panel, and Appendix Fig S3A). Similar results were obtained for the mouse transcripts (Appendix Fig S3B). The F(x) of the mouse transcripts (ORF dominance ≤ 0.65) was estimated as follows:

For Ensembl data, $F(x) = 1.142x + 0.067$, $R^2 = 0.982$
For RefSeq data, $F(x) = 1.482x - 0.061$, $R^2 = 0.990$

For both human and mouse transcripts, ORF dominance correlated linearly with the protein-coding potential at ORF dominance ≤ 0.65. Moreover, as ORF dominance approached 0, the probability of the transcript being a coding RNA approached 0 (Fig 1E and Appendix Fig S3).

Characterization of high-scoring human lncRNAs

Next, we investigated whether ORF dominance is useful for identifying coding RNAs among NR transcripts. From the 7,144 transcripts registered as noncoding genes in 2015, we excluded small RNAs (< 200 nucleotides) and those with short pORFs (< 20 amino acids).
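As a rough illustration of this screening step, the sketch below applies the exclusion thresholds just described (at least 200 nucleotides, pORF of at least 20 amino acids) and then evaluates the human linear fits reported earlier in this section. The coefficients and the 0.65 upper limit are taken from the text; the function names, the choice to raise an error outside the fitted range, and the example inputs are assumptions made for illustration only.

```python
# Human linear fits from the text, as (slope, intercept); valid for x <= 0.65.
HUMAN_ENSEMBL_FIT = (1.301, 0.0072)
HUMAN_REFSEQ_FIT = (1.313, 0.0189)
MAX_FITTED_DOMINANCE = 0.65

MIN_TRANSCRIPT_NT = 200  # shorter transcripts are treated as small RNAs
MIN_PORF_AA = 20         # shorter pORFs are excluded


def passes_lncrna_filter(transcript_nt, porf_aa):
    """True if a noncoding transcript survives the two length exclusions."""
    return transcript_nt >= MIN_TRANSCRIPT_NT and porf_aa >= MIN_PORF_AA


def estimate_coding_potential(orf_dominance, fit=HUMAN_REFSEQ_FIT):
    """Evaluate F(x) = slope * x + intercept inside the fitted range only."""
    if not 0 <= orf_dominance <= MAX_FITTED_DOMINANCE:
        raise ValueError("linear fit was reported only for ORF dominance <= 0.65")
    slope, intercept = fit
    return slope * orf_dominance + intercept


# Hypothetical transcript: 1,500 nt, pORF of 110 aa, ORF dominance 0.5.
if passes_lncrna_filter(1500, 110):
    print(round(estimate_coding_potential(0.5), 4))                     # RefSeq fit
    print(round(estimate_coding_potential(0.5, HUMAN_ENSEMBL_FIT), 4))  # Ensembl fit
```

Refusing to extrapolate beyond 0.65 mirrors the observation above that the RefSeq-based coding potential declines slightly past that value.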
Among the remaining 6,617 NR genes, 219 were reassigned as NM over the past 3 years (Dataset EV2), including the previously identified de novo gene MYCNOS/NCYM (Suenaga et al, 2014). The percentage of reclassification increased for NR transcripts with high ORF dominance (Fig 1F). Thus, high ORF dominance is a useful indicator of coding transcripts. NR transcripts with high protein-coding potential (0.6 ≤ ORF dominance < 0.8) were then extracted, and the domain structure of each pORF amino acid sequence was assessed using the basic local alignment search tool for protein sequences (BLASTP). A total of 217 transcripts showed putative domain structures in the pORF, whereas 310 did not (Dataset EV3). Transcripts with domain structures often derive from transcript variants, pseudogenes, or readthrough of coding genes; those without domain structures often derive from antisense or long intergenic noncoding RNAs (lincRNAs) (Table 1).

Table 1. Numbers of original transcripts that produced NR transcripts with high coding frequency (0.6 ≤ ORF dominance < 0.8).

Transcript                           With domain   Without domain   Total   P-value
Antisense                            4             61               65      7.79E-08
lincRNA                              3             65               68      7.60E-09
Pseudogene                           50            17               67      4.32E-07
Readthrough                          7             0                7       6.00E-03
Transcript variant of coding gene    146           35               181     1.05E-19
Divergent                            0             2                2       N.S.
Intronic                             0             6                6       N.S.
Small nuclear RNA                    0             3                3       N.S.
miRNA host gene                      0             3                3       N.S.
Other lncRNA                         7             118              125     1.12E-13
Total                                217           310              527

P-values were calculated using Yates' continuity correction. N.S., not significant.

We next examined the functions of the genes that gave rise to NR transcripts with high coding potential (0.6 ≤ ORF dominance < 0.8). We divided the NR transcripts into those with and without putative domains to investigate novel coding gene candidates, originating either from preexisting genes or from nongenic regions.
Analysis using the Database for Annotation, Visualization, and Integrated Discovery (DAVID) functional annotation tool (Huang et al, 2009a, 2009b) showed that NR transcripts without domain structures were derived from genes related to transcriptional regulation, multicellular organismal processes, and developmental processes (Dataset EV4). Among the target genes of transcription factors, NMYC (MYCN), TGIF, and ZIC2 were ranked in the top three, and are all necessary for forebrain development (Dataset EV4) (Brown et al, 1998; Gripp et al, 2000; van Bokhoven et al, 2005). NR transcripts with domain structures originating from genes with alternative splicing were related to organelle function and are expressed in multiple cancers, including respiratory tract tumors, gastrointestinal tumors, retinoblastomas, and medulloblastomas (Dataset EV5). Similar analyses were conducted for mouse (Datasets EV6–8) and Caenorhabditis elegans (Datasets EV9–11). In mouse, original genes related to protein dimerization activity (Dataset EV7) and to nucleotide binding or organelle function (Dataset EV8) were enriched among lncRNAs with high ORF dominance with and without conserved domains, respectively. In C. elegans, original genes related to embryo development (Dataset EV10) and chromosome V or single-organism cellular processes (Dataset EV11) were enriched. Therefore, the relationship between brain development and cancer in the function of lncRNAs with high ORF dominance seems to be specific to humans.

ORF dominance affects the protein-coding potential predicted by Ka/Ks

To examine the relationship between ORF dominance and natural selection in the prediction of protein-coding potential, we calculated the ratio of nonsynonymous (Ka) to synonymous (Ks) mutations by comparing human transcripts with syntenic genomic regions of chimpanzee and mouse (Fig 1G). Transcripts were selected based on syntenically conserved regions: 44,593 (vs. chimpanzee) and 14,016 (vs. mouse).
We found a linear relationship between F(x) and ORF dominance in the conserved transcripts (Fig 1G, left panels). As predicted, coding transcripts exhibited Ka/Ks < 0.5 at a higher frequency than noncoding transcripts, with large differences observed for ORF dominance > 0.9 or < 0.1 and the smallest difference for ORF dominance between approximately 0.35 and 0.45 (Fig 1G, right panels). These results indicated that for transcripts with ORF dominance near the highest or lowest values, the conservation of pORF sequences (negative selection, Ka/Ks < 0.5) determines the coding potential. Therefore, noncoding transcripts showing both negative selection (Ka/Ks < 0.5) and the highest ORF dominance may correspond to new coding transcript candidates. We list 23 such transcripts in Dataset EV12, including four transcript variants of a previously identified lncRNA that encodes a tumor-suppressive small peptide, HOXB-AS3 (Huang et al, 2017).

Translation of small peptides shifts ORF dominance distributions

To investigate the effect of translation on ORF dominance, we calculated the ORF dominance of lincRNAs with translation registered in two independent databases (SmProt and sORFs.org) and compared them with that of lincRNAs without evidence of translation. Results showed that lincRNAs with translation had higher ORF dominance than those without translation evidence (Fig 2A, top left panel).

Figure 2. Effects of translation on the distributions of ORF dominance
A. The ORF dominance distribution for lincRNAs with translation registered in the SmProt database (http://bioinfo.ibp.ac.cn/SmProt/) (red line, n = 87) or sORF database (http://www.sorfs.org/) (blue line, n = 594) shifted to higher scores relative to lincRNAs without evidence of translation (black line, n = 11,657, not registered in these databases) (top left). The relative frequency of corresponding ORF coverage (top center), transcript length (top right), ORF size (bottom left), and sum of secORF length (bottom right) are also shown.
B. Cluster I genes (n = 1,149) show higher ORF dominance than cluster III genes (n = 2,918). Central bands, whiskers, and boxes are median values, ranges, and interquartile ranges, respectively. P-values were calculated by the Mann–Whitney U-test. ***P < 10^-9.
C. Genes with translation of multiple ORFs (n = 7,961) show lower or higher percentage of cluster I or cluster III genes, respectively, than genes without evidence of translation of multiple ORFs (n = 1,786, not registered in sORF databases). P-values were calculated using Yates' continuity correction. ****P = 1.46E-68 and ***P = 4.86E-18.

Transcript length and the coverage and size (pORF length) of ORFs have been used as indicators to predict the coding potential of transcripts (Wang et al, 2013; Zeng & Hamada, 2018), including de novo genes (Schmitz et al, 2018). We calculated these three values for lincRNAs with translation products, and their distributions were compared with those of lincRNAs without evidence of translation. The comparison revealed a slight shift toward high values of ORF coverage in the lincRNAs registered in SmProt, whereas negligible changes were found in the distribution of lincRNAs registered in sORFs.org (Fig 2A, top center panel). In addition, there was no shift in ORF size (Fig 2A, bottom left panel), and transcripts were rather short in lincRNAs with translation (Fig 2A, top right panel), reducing the sum of secORF lengths (Fig 2A, bottom right panel). Therefore, the translated lincRNAs showed high ORF dominance, to which their shorter transcript lengths contributed by reducing the sum of secORF lengths. Next, we examined whether ORF dominance was associated with translation efficiency in coding RNAs.
Transcript translation in spermatocytes and spermatids is strongly downregulated on average. However, Wang et al (2020) identified gene sets (cluster I genes) efficiently translated in the spermatocytes and spermatids of mouse that therefore escaped the overall translational repression (Wang et al, 2020). We found that cluster I genes had higher ORF dominance than cluster III genes showing translational repression in spermatocytes and spermatids (Fig 2B). Furthermore, coding transcripts with translation from multiple ORFs showed significantly low or high percentages of cluster I or cluster III genes, respectively, compared with those without evidence of translation (Fig 2C). These results supported the hypothesis that ORF dominance is associated with translation efficiency in coding transcripts.

Relationship between ORF dominance and relative frequencies of coding/noncoding transcripts in 100 organisms

To analyze the relationship between ORF dominance and protein-coding potential in a broad lineage of 100 organisms, we selected five bacteria, ten archaea, and 85 eukaryote species (Dataset EV1) and calculated ORF dominance for more than 3.4 million transcripts (Dataset EV1). Phylogenetic trees of the cellular organisms are presented using a logarithmic timescale and display the number of species in each lineage (Fig 3). To examine the evolutionary conservation of the linear relationship between ORF dominance and protein-coding potential observed in human and mouse, we selected a relatively large number of mammalian species (36). Species with fewer than three lncRNAs were not used to calculate g(x) and were not included in the histograms illustrating the relationship between g(x) and ORF dominance (Figs 4 and 5). For all organisms, the relative frequency of coding transcripts, f(x), was shifted to the right (higher ORF dominance) compared with random or random shuffling controls (Figs 4 and 5A–C; Appendix Figs S4 and S5).

Figure 3. Phylogenetic tree
Numbers of species are indicated in each lineage. The lineages of five species, including one archaeon (Nitrososphaera viennensis EN76), two fungi (Puccinia graminis f. sp. tritici and Pyricularia oryzae), and two animals (Strongylocentrotus purpuratus and Lingula anatina), are unknown and therefore were excluded from the figure.

Figure 4. Relationships between ORF dominance and the relative frequencies of coding and noncoding transcripts from bacteria to mammals
Histograms of f(x) (white) or g(x) (black) in observed data (left) and in nucleic acid-scrambled controls (right) for each species analyzed. ORF dominance with the highest f(x) is presented in the histograms. Odom was calculated using the ORF dominance distribution from observed data, and it is indicated in the left panels. LC, Least Concern; NT, Near Threatened; CR, Critically Endangered; and EX, Extinct in International Union for Conservation of Nature (IUCN) Red List.

Figure 5. Relationships between ORF dominance and the relative freque
Reference(s)