Generation of an isoform-level transcriptome atlas of macrophage activation
2021; Elsevier BV; Volume: 296; Linguagem: Inglês
10.1016/j.jbc.2021.100784
ISSN1083-351X
AutoresApple Cortez Vollmers, Honey Mekonen, Sophia Campos, Susan Carpenter, Christopher Vollmers,
Tópico(s)RNA Research and Splicing
ResumoRNA-seq is routinely used to measure gene expression changes in response to cell perturbation. Genes upregulated or downregulated following some perturbation are designated as genes of interest, and their most expressed isoform(s) would then be selected for follow-up experimentation. However, because of its need to fragment RNA molecules, RNA-seq is limited in its ability to capture gene isoforms and their expression patterns. This lack of isoform-specific data means that isoforms would be selected based on annotation databases that are incomplete, not tissue specific, or do not provide key information on expression levels. As a result, minority or nonexistent isoforms might be selected for follow-up, leading to loss in valuable resources and time. There is therefore a great need to comprehensively identify gene isoforms along with their corresponding levels of expression. Using the long-read nanopore-based R2C2 method, which does not fragment RNA molecules, we generated an Isoform-level transcriptome Atlas of Macrophage Activation that identifies full-length isoforms in primary human monocyte-derived macrophages. Macrophages are critical innate immune cells important for recognizing pathogens through binding of pathogen-associated molecular patterns to toll-like receptors, culminating in the initiation of host defense pathways. We characterized isoforms for most moderately-to-highly expressed genes in resting and toll-like receptor–activated monocyte-derived macrophages, identified isoforms differentially expressed between conditions, and validated these isoforms by RT-qPCR. We compiled these data into a user-friendly data portal within the UCSC Genome Browser (https://genome.ucsc.edu/s/vollmers/IAMA). Our atlas represents a valuable resource for innate immune research, providing unprecedented isoform information for primary human macrophages. RNA-seq is routinely used to measure gene expression changes in response to cell perturbation. Genes upregulated or downregulated following some perturbation are designated as genes of interest, and their most expressed isoform(s) would then be selected for follow-up experimentation. However, because of its need to fragment RNA molecules, RNA-seq is limited in its ability to capture gene isoforms and their expression patterns. This lack of isoform-specific data means that isoforms would be selected based on annotation databases that are incomplete, not tissue specific, or do not provide key information on expression levels. As a result, minority or nonexistent isoforms might be selected for follow-up, leading to loss in valuable resources and time. There is therefore a great need to comprehensively identify gene isoforms along with their corresponding levels of expression. Using the long-read nanopore-based R2C2 method, which does not fragment RNA molecules, we generated an Isoform-level transcriptome Atlas of Macrophage Activation that identifies full-length isoforms in primary human monocyte-derived macrophages. Macrophages are critical innate immune cells important for recognizing pathogens through binding of pathogen-associated molecular patterns to toll-like receptors, culminating in the initiation of host defense pathways. We characterized isoforms for most moderately-to-highly expressed genes in resting and toll-like receptor–activated monocyte-derived macrophages, identified isoforms differentially expressed between conditions, and validated these isoforms by RT-qPCR. We compiled these data into a user-friendly data portal within the UCSC Genome Browser (https://genome.ucsc.edu/s/vollmers/IAMA). Our atlas represents a valuable resource for innate immune research, providing unprecedented isoform information for primary human macrophages. The use of RNA-seq is a primary strategy in biomedical research to identify genes involved in biological processes of interest and how gene expression is impacted upon gene editing or use of chemical or biological agonists. Notably, short-read sequencing technology has reliably been used to quantify changes in gene expression levels or the inclusion level of individual exons and splice junctions. However, because short-read RNA-seq relies on fragmenting RNA molecules before sequencing, even advanced computational tools fail at leveraging this ubiquitous data type into isoform-level information (1Pertea M. Pertea G.M. Antonescu C.M. Chang T.-C. Mendell J.T. Salzberg S.L. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads.Nat. Biotechnol. 2015; 33: 290-295Crossref PubMed Scopus (4506) Google Scholar, 2Bankevich A. Nurk S. Antipov D. Gurevich A.A. Dvorkin M. Kulikov A.S. Lesin V.M. Nikolenko S.I. Pham S. Prjibelski A.D. Pyshkin A.V. Sirotkin A.V. Vyahhi N. Tesler G. Alekseyev M.A. et al.SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing.J. Comput. Biol. 2012; 19: 455-477Crossref PubMed Scopus (13451) Google Scholar, 3Grabherr M.G. Haas B.J. Yassour M. Levin J.Z. Thompson D.A. Amit I. Adiconis X. Fan L. Raychowdhury R. Zeng Q. Chen Z. Mauceli E. Hacohen N. Gnirke A. Rhind N. et al.Full-length transcriptome assembly from RNA-seq data without a reference genome.Nat. Biotechnol. 2011; 29: 644-652Crossref PubMed Scopus (12578) Google Scholar). Short-read RNA-seq ultimately falls short in providing comprehensive and accurate full-length isoform structures as well as the level of expression of each isoform under specific conditions. More recently, long-read technologies from Pacific Biosciences and Oxford Nanopore Technologies (ONT) have been used for sequencing and analyzing full-length cDNA molecules at the transcriptome scale (4Gupta I. Collier P.G. Haase B. Mahfouz A. Joglekar A. Floyd T. Koopmans F. Barres B. Smit A.B. Sloan S.A. Luo W. Fedrigo O. Ross M.E. Tilgner H.U. Single-cell isoform RNA sequencing characterizes isoforms in thousands of cerebellar cells.Nat. Biotechnol. 2018; https://doi.org/10.1038/nbt.4259Crossref Scopus (134) Google Scholar, 5Workman R.E. Tang A.D. Tang P.S. Jain M. Tyson J.R. Razaghi R. Zuzarte P.C. Gilpatrick T. Payne A. Quick J. Sadowski N. Holmes N. de Jesus J.G. Jones K.L. Soulette C.M. et al.Nanopore native RNA sequencing of a human poly(A) transcriptome.Nat. Methods. 2019; 16: 1297-1305Crossref PubMed Scopus (200) Google Scholar, 6Lebrigand K. Magnone V. Barbry P. Waldmann R. High throughput error corrected nanopore single cell transcriptome sequencing.Nat. Commun. 2020; 11: 4025Crossref PubMed Scopus (54) Google Scholar, 7Wyman D. Balderrama-Gutierrez G. Reese F. Jiang S. Rahmanian S. Zeng W. Williams B. Trout D. England W. Chu S. Spitale R.C. Tenner A. Wold B. Mortazavi A. A technology-agnostic long-read analysis pipeline for transcriptome discovery and quantification.bioRxiv. 2019; ([preprint])https://doi.org/10.1101/672931Crossref Scopus (0) Google Scholar). In contrast to RNA-seq, this technology can determine which isoforms, down to the exact transcription start and poly(A) sites, are expressed at what level by each gene. The comprehensive transcriptome scale isoform information these technologies provide has the potential to remove the need for targeted and work intensive methods like RT-PCR and 5'/3' RACE to identify and characterize transcript and/or protein isoforms expressed by a gene. Therefore, comprehensive transcriptome scale isoform information is bound to simplify and improve the outcome of single gene focused follow-up studies which include knock-down and knock-out experiments, overexpression assays, Western Blots, ELISAs, pull-downs, and many more. This is because of the fact that these assays rely on prior knowledge of what isoform(s) the gene of interest actually expresses in the condition and experimental system being investigated. Finally, detailed knowledge of transcription start sites (TSSs) for each expressed gene in a cell type will also improve the use of CRISPR interference technology to knock down genes because guide RNAs can be targeted to TSSs with greater accuracy. To build on our previous work (8Byrne A. Beaudin A.E. Olsen H.E. Jain M. Cole C. Palmer T. DuBois R.M. Forsberg E.C. Akeson M. Vollmers C. Nanopore long-read RNAseq reveals widespread transcriptional variation among the surface receptors of individual B cells.Nat. Commun. 2017; 8: 16027Crossref PubMed Scopus (196) Google Scholar, 9Robinson E.K. Jagannatha P. Covarrubias S. Cattle M. Safavi R. Song R. Viswanathan K. Shapleigh B. Abu-Shumays R. Jain M. Cloonan S.M. Wakeland E. Akeson M. Brooks A.N. Carpenter S. Inflammation drives alternative first exon usage to regulate immune genes including a novel iron regulated isoform of Aim2.bioRxiv. 2020; ([preprint])https://doi.org/10.1101/2020.07.06.190330Crossref Scopus (0) Google Scholar) and further push the limits of long-read technology to provide a resource for the innate immune research community, we set out to generate an isoform-level transcriptome atlas of macrophage activation by determining (1) what isoform of a given gene is expressed, (2) at what level, and (3) how isoform and gene expression change following toll-like receptor (TLR) activation. Macrophages are a key cellular component of the innate immune system which represents the first line of host defense against infection and is critical for the development of adaptive immunity (10Medzhitov R. Horng T. Transcriptional control of the inflammatory response.Nat. Rev. Immunol. 2009; 9: 692-703Crossref PubMed Scopus (766) Google Scholar, 11Carpenter S. Aiello D. Atianand M.K. Ricci E.P. Gandhi P. Hall L.L. Byron M. Monks B. Henry-Bezy M. Lawrence J.B. O'Neill L.A.J. Moore M.J. Caffrey D.R. Fitzgerald K.A. A long noncoding RNA mediates both activation and repression of immune response genes.Science. 2013; 341: 789-792Crossref PubMed Scopus (727) Google Scholar). Macrophages recognize conserved structures of microbial-derived molecules or pathogen associated molecular patterns using TLRs. The regulation of this TLR repertoire fundamentally alters the response to infection (12Kawai T. Akira S. Toll-like receptor and RIG-I-like receptor signaling.Ann. N. Y. Acad. Sci. 2008; 1143: 1-20Crossref PubMed Scopus (752) Google Scholar). TLR activation induces the expression of hundreds of genes that encode inflammatory response genes including cytokines, type I interferons, antimicrobial proteins, and regulators of metabolism and regeneration; these molecules in turn mediate inflammation, antimicrobial immunity, and tissue regeneration. Here, we investigated transcriptional responses of human macrophages treated with lipopolysaccharide (LPS), Pam3CSK4 (PAM), R848, and poly(I:C) which activate TLR4, TLR1/2, TLR7/8, and TLR3, respectively. Using our ONT-based R2C2 method, we then generated a total of ∼15 million full-length cDNA reads at a median accuracy >99% (Q20) and processed this data into isoforms which we characterized in depth and provide alongside deep Smart-seq2 short-read RNA-seq data as a UCSC Genome Browser session for easy exploration. To generate a comprehensive isoform-level transcriptome atlas of TLR-dependent macrophage activation, we collected peripheral blood mononuclear cells from two individuals (Rep1 and Rep2) from which we isolated monocytes. From these monocytes, we generated monocyte-derived macrophages (MDMs). We treated these MDMs with TLR ligands LPS, PAM, R848, or poly(I:C) and included a no treatment (NoStim) control. After 6 h, we collected the stimulated and nonstimulated MDMs and proceeded to extract RNA from each sample. We reverse transcribed the poly(A) fraction of this RNA using a modified oligo(dT) primer and a template switch oligo to generate full-length cDNA with known sequences on both ends. We then amplified this cDNA using PCR and used the resulting double-stranded full-length cDNA as input for both Illumina-based Smart-seq2 (13Picelli S. Faridani O.R. Björklund A.K. Winberg G. Sagasser S. Sandberg R. Full-length RNA-seq from single cells using Smart-seq2.Nat. Protoc. 2014; 9: 171-181Crossref PubMed Scopus (1952) Google Scholar) and ONT-based R2C2 (14Volden R. Palmer T. Byrne A. Cole C. Schmitz R.J. Green R.E. Vollmers C. Improving nanopore read accuracy with the R2C2 method enables the sequencing of highly multiplexed full-length single-cell cDNA.Proc. Natl. Acad. Sci. U. S. A. 2018; 115: 9726-9731Crossref PubMed Scopus (90) Google Scholar) sequencing protocols (Fig. 1). To identify genes differentially expressed upon TLR activation following treatment with LPS, PAM, R848, or poly(I:C), we performed Illumina-based Smart-seq2 (15Picelli S. Björklund A.K. Reinius B. Sagasser S. Winberg G. Sandberg R. Tn5 transposase and tagmentation procedures for massively scaled sequencing projects.Genome Res. 2014; 24: 2033-2040Crossref PubMed Scopus (378) Google Scholar) sequencing as previously described (16Cole C. Byrne A. Adams M. Volden R. Vollmers C. Complete characterization of the human immune cell transcriptome using accurate full-length cDNA sequencing.Genome Res. 2020; 30: 589-601Crossref PubMed Scopus (17) Google Scholar, 17Byrne A. Supple M.A. Volden R. Laidre K.L. Shapiro B. Vollmers C. Depletion of hemoglobin transcripts and long-read sequencing improves the transcriptome annotation of the polar bear (Ursus maritimus).Front. Genet. 2019; 10: 643Crossref PubMed Scopus (11) Google Scholar) (Fig. S1, see Experimental procedures). We generated approximately 15 to 30 million reads per sample (Table S1) and processed the resulting data using a standard workflow which includes STAR (18Dobin A. Davis C.A. Schlesinger F. Drenkow J. Zaleski C. Jha S. Batut P. Chaisson M. Gingeras T.R. Star: Ultrafast universal RNA-seq aligner.Bioinformatics. 2013; 29: 15-21Crossref PubMed Scopus (19120) Google Scholar), featureCounts (19Liao Y. Smyth G.K. Shi W. FeatureCounts: An efficient general purpose program for assigning sequence reads to genomic features.Bioinformatics. 2014; 30: 923-930Crossref PubMed Scopus (8883) Google Scholar), and DEseq2 (20Love M.I. Huber W. Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.Genome Biol. 2014; 15: 550Crossref PubMed Scopus (32294) Google Scholar). Taking advantage of our biological replicates, we individually compared the LPS, PAM, R848, and poly(I:C) conditions to the NoStim control. Each ligand caused the differential expression (DE) of 1000 to 2000 genes (Tables S2 and S3). Between all conditions, 454 genes were shared, a varying number overlapped between three and two stimulants, and some were exclusive to one stimulant (Fig. 2A, Table S3). Using Panther Gene Ontology analysis (21Thomas P.D. Campbell M.J. Kejariwal A. Mi H. Karlak B. Daverman R. Diemer K. Muruganujan A. Narechania A. PANTHER: A library of protein families and subfamilies indexed by function.Genome Res. 2003; 13: 2129-2141Crossref PubMed Scopus (2184) Google Scholar, 22Mi H. Dong Q. Muruganujan A. Gaudet P. Lewis S. Thomas P.D. PANTHER version 7: Improved phylogenetic trees, orthologs and collaboration with the Gene Ontology Consortium.Nucleic Acids Res. 2010; 38: D204-D210Crossref PubMed Scopus (466) Google Scholar), we observed that genes shared between all stimulants were strongly enriched for biological processes including "response to cytokine" and inflammatory response" (Table S4). Notably, the genes exclusively responding to R848 were enriched for the "response to organic cyclic compound" biological process, likely because R848 is an organic cyclic compound. To supplement this gene-level DE analysis, we sequenced the same full-length double-stranded cDNA using our ONT-based R2C2 protocol (14Volden R. Palmer T. Byrne A. Cole C. Schmitz R.J. Green R.E. Vollmers C. Improving nanopore read accuracy with the R2C2 method enables the sequencing of highly multiplexed full-length single-cell cDNA.Proc. Natl. Acad. Sci. U. S. A. 2018; 115: 9726-9731Crossref PubMed Scopus (90) Google Scholar, 17Byrne A. Supple M.A. Volden R. Laidre K.L. Shapiro B. Vollmers C. Depletion of hemoglobin transcripts and long-read sequencing improves the transcriptome annotation of the polar bear (Ursus maritimus).Front. Genet. 2019; 10: 643Crossref PubMed Scopus (11) Google Scholar, 23Volden R. Vollmers C. Highly multiplexed single-cell full-length cDNA sequencing of human immune cells with 10X genomics and R2C2.bioRxiv. 2020; ([preprint])https://doi.org/10.1101/2020.01.10.902361Crossref Scopus (0) Google Scholar) (Fig. 1). R2C2 circularizes cDNA and then amplifies the resulting circular DNA to generate long linear DNA molecules containing concatemeric copies of the initial cDNA. The resulting long DNA is then sequenced and computationally separated into subreads. We then combine these subreads to generate accurate consensus reads of the initial cDNA molecules. To make the creation of this macrophage isoform atlas feasible, we introduced several improvements to the R2C2 method. First, we enabled multiplexing of cDNA samples by introducing 10 nt sample barcodes into the oligo(dT) primers as well as using highly distinct DNA splints for cDNA circularization (Fig. 1). Throughout the study, we used these two different indexing strategies to separate technical and biological replicates (Table S5). Indexing samples allowed us to sequence samples in pools on the same ONT flow cells, thereby achieving equal sequencing coverage between samples and minimizing batch effects. Including pilot experiments that sequenced regular and size-selected cDNA of NoStim and LPS samples, we generated 14,961,450 R2C2 reads at a median length of 942 nt across multiple ONT MinION flow cells (Fig. 2B). Second, after performing the pilot experiments, we improved R2C2 per read accuracy by increasing the raw read length of R2C2 libraries. We accomplished this by developing a gentle agarose gel extraction protocol that doubled the raw read length of our R2C2 libraries from ∼5 kb to ∼11 kb. Combined with a new ONT basecaller, this increased the median per base accuracy of our R2C2 libraries. Previously, this base accuracy was 97.9% (16Cole C. Byrne A. Adams M. Volden R. Vollmers C. Complete characterization of the human immune cell transcriptome using accurate full-length cDNA sequencing.Genome Res. 2020; 30: 589-601Crossref PubMed Scopus (17) Google Scholar). Here, our most recent sequencing runs show an increased base accuracy of 99.45% (Fig. S1). Overall, and including less accurate pilot experiments, the ∼15 million reads generated for this study had a median accuracy of 99.19% (Q21). Third, to take advantage of this improved accuracy and refine the identification of isoforms from the R2C2 reads we generated, we developed a new version of our Mandalorion pipeline (Episode 3.5 - Rogue Isoform) that, among several changes, includes improved consensus generation using the Medaka polishing tool and improved handling of isoform ends. When applying this pipeline on the combined ∼15 million read data set, we identified 29,637 high confidence isoforms with a median length of 1554 nt and a per base accuracy of 99.94% (Q32) which matches the current state-of-the-art for ONT-only consensus accuracy (24Shafin K. Pesout T. Lorig-Roach R. Haukness M. Olsen H.E. Bosworth C. Armstrong J. Tigyi K. Maurer N. Koren S. Sedlazeck F.J. Marschall T. Mayes S. Costa V. Zook J.M. et al.Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes.Nat. Biotechnol. 2020; 38: 1044-1053Crossref PubMed Scopus (134) Google Scholar) (Fig. 2B). Next, we determined which genes these 29,637 isoforms were transcribed from and to what extent isoform identification was dependent on gene expression levels. Our Smart-seq2–based analysis identified 20,915 expressed genes (Average reads per million [RPM] across all conditions and replicates >0.05). Our R2C2-based analysis identified at least one isoform for 9688 (46%) of these genes. Our ability to identify at least one isoform for a gene increased if the gene was expressed at higher levels, i.e., had a higher average RPM (Fig. 2C). We identified at least one isoform for 13% of genes between 0.05 and 3 RPM. This percentage increased with increasing RPM to 49% (3–5 RPM), 64% (5–10 RPM), 80% (10–20 RPM), and 91% (>20 RPM). Because differentially expressed genes tended to have higher average expression (Fig. 2C), we identified at least one isoform for 1986 of the 2873 (69%) genes differential expressed in any condition, and 363 of 454 (80%) genes differential expressed in all conditions. The number of isoforms identified per gene also increased with average RPM (Fig. 2D). However, the median number of isoforms per gene did not exceed three even in very highly expressed genes. This is likely because of the Mandalorion filtering settings we used which discarded isoforms with less than 1% of all the R2C2 reads at a locus and thereby excluded minority isoforms. Next, we established a pipeline to—in addition to gene-level DE—detect genes whose isoforms were differentially expressed between conditions (Fig. 3A). While DE pipelines like DEseq2 could be applied to this problem, they would likely just detect isoforms from differentially expressed genes. To detect genes whose relative isoform usage differed between conditions (e.g., Gene A isoform 1 decreases, while Gene A isoform 2 increases following stimulation), we applied a Chi-squared contingency table test. For each gene, we first determined all the isoforms transcribed from that gene. Then, we calculated the relative usage (%) for each isoform in each condition by dividing the number of R2C2 reads associated with that isoform by the number of R2C2 reads associated with all the isoforms in that condition. We then applied the Chi-squared contingency test to the resulting isoform-by-condition relative usage table. To reduce noise, we only tested the 1872 genes with at least 50 R2C2 reads in at least two experimental conditions. After Bonferroni multiple testing correction, this stringent test produced 47 genes with differential isoform usage with α = 0.05 (Table S6). Twenty-five of these 47 (53%) genes were not identified as differentially expressed by the Smart-seq2 short-read workflow. The 47 genes are likely to contain isoforms whose usage varies strongly between conditions. To confirm this, we determined the standard deviation of this isoform usage between conditions for each isoform in each gene (Fig. 3B). By sorting the 1872 genes we tested by the largest standard deviation among their isoforms, we showed that this Chi-square contingency table test did indeed identify genes with isoforms that have highly variable usage between conditions. To further evaluate the robustness of this approach, we validated the changes in the relative usage of alternative individual exons for the top ten genes with significant differential isoform usage following LPS treatment using macrophages from an additional donor. Using RT-qPCR, we found that in all ten cases we tested, relative exon usage in the third donor changed in the same direction (upregulated or downregulated), with many showing fold change similar to that determined by R2C2 in the two original donors (Fig. S2, Table S7). The majority of the 47 genes featured alternative TSSs which in several cases were located in different and sometimes unannotated first exons (Fig. 3C, Table S6). To evaluate how frequent the use of unannotated exons is across all isoforms and how this affects their coding potential, we first categorized the 29,637 isoforms we identified. To this end, we used the SQANTI (25Tardaguila M. de la Fuente L. Marti C. Pereira C. Pardo-Palacios F.J. Del Risco H. Ferrell M. Mellado M. Macchietto M. Verheggen K. Edelmann M. Ezkurdia I. Vazquez J. Tress M. Mortazavi A. et al.SQANTI: Extensive characterization of long-read transcript sequences for quality control in full-length transcriptome identification and quantification.Genome Res. 2018; 28: 396-411Crossref PubMed Scopus (124) Google Scholar) algorithm which associates isoforms with genes and categorizes them as full-splice matches (FSM), novel in catalog (NIC), novel not in catalog (NNC), incomplete splice-matches (ISM), and other less abundant categories (Fig. 4A). FSM and ISM isoforms are defined as fully (FSM) or incompletely (ISM) matching the splice-junction chain of an annotated GENCODE transcript. NIC are defined as isoforms that use annotated splice sites in unannotated configurations. NNCs are defined as isoforms that use at least one unannotated splice site (Fig. 4A). FSM isoforms represented 65% (19,161), NIC isoforms represented 19% (5576), NNC isoforms represented 7% (2136), and ISM isoforms represented 6% (1878) of the 29,637 isoforms we identified (Table S8). If they were associated with a protein-coding gene, isoforms of different categories had different likelihoods to contain a full coding sequence (CDS) of the gene they were associated with. 94% of FSM, 58% of NIC, 47% of NNC, and 36% of ISM isoforms, which contained more than one exon, contained a full CDS of the protein-coding gene they were transcribed from (Fig. 4B). FSM isoforms which did not contain a full CDS likely had to differ from the GENCODE transcript they matched in their TSS and polyA site positions. Indeed, the 5' and 3' ends of FSM isoforms varied in their distance to the annotated TSS and polyA sites of the GENCODE transcript they were associated with (Fig. 4C). Taking into account that 5' and 3' ends of a FSM isoform may be closer to the TSS or polyA site of another GENCODE transcript in their respective gene, we determined that of the 19,161 FSM isoforms we identified, 276 had 5' end and 2068 had a 3' end more than 500 nt away from any annotated TSS or polyA site. The larger number of distant 3' ends compared with 5' ends in FSM isoforms could at least in part be explained by last exons being much longer on average than first exons (Fig. 4D). NIC isoforms, which did not contain a full CDS of the gene they were associated with, likely contained a new splice junction between Start and Stop codons which would modify a CDS. NNC isoforms which did not contain a full CDS of the gene they were associated with might differ from the CDS by a few base pairs to encode a slightly different splice-junction or contain entirely new exons (Fig. 4A). Next, we focused on NNC isoforms to annotate new exons. We defined a new exon as a part of a transcript whose genomic location does not overlap with a known exon at all (Fig. 4A). In the 2136 NNC isoforms we identified, 721 new exons of which 203 were first and 294 were last exons. If new exons were distributed equally among the exons of the 29,637 isoforms we identified, we would expect 89 new first and last exons, indicating that first and last exons are overrepresented in this set of new exons (Fig. 4D, left). Further, these new exons were shorter than exons in the GENCODE annotation (v34), or all exons identified by Mandalorion (Fig. 4D, right). Importantly, the length of new first, middle, and last exons followed the trend of annotated exons with last exons being 2 to 3× longer than first and middle exons. Finally, the vast majority of new exons could be validated with short read Smart-seq2 data. Splice junctions leading into these exons (one for first/last exons, two for internal exons) were present in Smart-seq2 reads generated from the same cDNA pool in 665 of 721 (92%) exons. This established that these new exons are highly likely to be present in the cDNA we generated. In addition to identifying new exons, NNC isoforms were particularly helpful in redefining long noncoding RNA (lncRNA) loci. 7% of NNC isoforms were associated with lncRNA genes as compared with 3% for both FSM and NIC isoforms. This is likely because of the fact that lncRNA are often expressed at lower levels in a more tissue specific manner than protein-coding genes which complicates their comprehensive annotation (26Derrien T. Johnson R. Bussotti G. Tanzer A. Djebali S. Tilgner H. Guernec G. Martin D. Merkel A. Knowles D.G. Lagarde J. Veeravalli L. Ruan X. Ruan Y. Lassmann T. et al.The GENCODE v7 catalog of human long noncoding RNAs: Analysis of their gene structure, evolution, and expression.Genome Res. 2012; 22: 1775-1789Crossref PubMed Scopus (3626) Google Scholar). Our data set enables the investigation of these lncRNAs and their role in macrophage activation. In addition to NNC isoforms, isoforms falling into the SQANTI "intergenic" category are also likely to represent noncoding transcripts since protein-coding genes have been exhaustively mapped in the human genome. In total, 184 isoforms were categorized as "intergenic" defined as not overlapping any locus in the Gencode (v34) annotation (Table S8). Of these 184 isoforms, 57 contained more than one exon. These 57 multiexon isoforms in turn grouped into distinct 38 loci. Only six of these 38 loci overlapped with putative lncRNA loci assembled from deep short-read data (27Cabili M.N. Trapnell C. Goff L. Koziol M. Tazon-Vega B. Regev A. Rinn J.L. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses.Genes Dev. 2011; 25: 1915-1927Crossref PubMed Scopus (2576) Google Scholar). This showed that investigating specific cell types under different treatments has the potential to identify new previously unobserved lncRNA loci. While annotations like GENCODE are indispensable for any genomic experiment, they are insufficient when aiming to experimentally follow-up potential hits from a screen or an RNA-seq experiment. Our data set addresses this by providing information on which of the potentially numerous isoforms present in (or absent from) the Gencode annotation is actually expressed by a gene of interest and at what level it is expressed compared with other isoforms of that gene. To enable this type of exploration of the data set is available as a custom UCSC genome browser session (https://genome.ucsc.edu/s/vollmers/IAMA). This session contains gene expression information, isoform models, and quantification, as well as R2C2 and Smart-seq2 read tracks. To demonstrate the user interface of the Isoform-level transcriptome Atlas of Macrophage Activation browser session, we highlight an overlapping pair of lncRNA—LINC01181 and BAALC-AS—which both appear to be differentially upregulated by all TLR ligands based on short-read RNA-seq data. Inspecting these overlapping loci in the genome browser shows that only a very small number of R2C2 reads align to BAALC-AS and that short nondirectional RNA-seq reads assigned to BAALC-AS are therefore likely derived from LINC01181 cDNA. Inspecting the isoforms based on these R2C2 reads also shows no match for BAALC-AS but several for LINC01181 (Fig. 5). Of the eight spliced isoforms in the LINC01181 locus, none match but some fully contain the annotated LINC01181 transcript. Hovering over the isoforms shows that Isoform_136603_75 is expressed the highest in the LPS condition, taking up 48% of R2C2 reads in that condition. The reference-corrected sequence of Isoform_136603_75 can then be retrieved for downstream analysis by clicking its model. For investigators interested in the in-depth analysis of a gene's expression and function in a specific cell-type under defined experimental conditions, current efforts to annotate and quantify transcriptomes fall short. This is because of two major limitations of these efforts: (1) the use of short-read RNA-seq methods which precludes the identification and quantification of isoforms and (2) efforts using long-read methods are mostly limited to the analysis of a limited number of cell-types, almost always at baseline, which will miss isoforms specific to a different cell-type and under different experimental conditions. The data set and exploration options we present here will be of real-world use to researchers investigating human macrophages at both baseline and after TLR activation and could provide a blueprint for future studies combining short and long-read transcriptome analysis. The analysis of the transcriptome we generated shows that differential isoform expression between conditions exists but is limited and most often associated with the differential usage of TSSs which is similar to observations we have previously made in mouse and human macrophages (9Robinson E.K. Jagannatha P. Covarrubias S. Cattle M. Safavi R. Song R. Viswanathan K. Shapleigh B. Abu-Shumays R. Jain M. Cloonan S.M. Wakeland E. Akeson M. Brooks A.N. Carpenter S. Inflammation drives alternative first exon usage to regulate immune genes including a novel iron regulated isoform of Aim2.bioRxiv. 2020; ([preprint])https://doi.org/10.1101/2020.07.06.190330Crossref Scopus (0) Google Scholar). This shows that the splicing of genes itself is very similar between the conditions we investigated. We further show that most isoforms we identify match the splice-junction chain of an annotated isoform exactly but often not its TSS and polyA sites. We also detect hundreds of new exons, enriched for first and last exons. The absence of these exons from the Gencode annotation may be caused by technical or biological limitations of previous studies and annotation efforts. Short-read RNA-seq, which is most often used as the foundation for annotation, is characteristically struggling with capturing transcript ends. Further, these new exons might only be expressed in naive or activated macrophages and even then at fairly low levels. While we believe the entire data set presents a unique window into macrophage biology, the main purpose in creating it was to enable researchers to better understand their gene or genes of interest. We hope that researchers that, for example, might be interested in a particular gene expressed in macrophages after LPS activation will select the isoform with the highest expression in that condition from our data set to synthesize its exact sequence for follow-up studies or to, for example, locate its promoter for CRISPR interference experiments.
Referência(s)