ChIP-Atlas: a data-mining suite powered by full integration of public ChIP-seq data
2018; Springer Nature; Volume: 19; Issue: 12; Language: English
10.15252/embr.201846255
ISSN 1469-3178
Authors: Shinya Oki, Tazro Ohta, Go Shioi, Hideki Hatanaka, Osamu Ogasawara, Yoshihiro Okuda, Hideya Kawaji, Ryo Nakaki, Jun Sese, Chikara Meno
Topic(s): Molecular Biology Techniques and Applications
Summary
Resource | 9 November 2018 | Open Access | Transparent process
ChIP-Atlas: a data-mining suite powered by full integration of public ChIP-seq data
Shinya Oki *,1, Tazro Ohta 2, Go Shioi 3, Hideki Hatanaka 4, Osamu Ogasawara 5, Yoshihiro Okuda 5, Hideya Kawaji 6,7, Ryo Nakaki 8,9, Jun Sese 10,11 and Chikara Meno *,1
Author Information
1 Department of Developmental Biology, Graduate School of Medical Sciences, Kyushu University, Fukuoka, Japan
2 Database Center for Life Science, Joint-Support Center for Data Science Research, Research Organization of Information and Systems, Mishima, Shizuoka, Japan
3 Genetic Engineering Team, RIKEN Center for Life Science Technologies, Kobe, Japan
4 National Bioscience Database Center, Japan Science and Technology Agency, Tokyo, Japan
5 DNA Data Bank of Japan, National Institute of Genetics, Mishima, Shizuoka, Japan
6 Preventive Medicine and Applied Genomics Unit, RIKEN Center for Integrative Medical Sciences, Kanagawa, Japan
7 RIKEN Preventive Medicine and Diagnosis Innovation Program, Saitama, Japan
8 Genome Science Division, Research Center for Advanced Science and Technology, The University of Tokyo, Tokyo, Japan
9 Rhelixa Inc., Tokyo, Japan
10 Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan
11 Humanome Lab Inc., Tokyo, Japan
ORCID: Shinya Oki, orcid.org/0000-0002-4767-3259; Tazro Ohta, orcid.org/0000-0003-3777-5945; Hideki Hatanaka, orcid.org/0000-0002-0587-2460; Osamu Ogasawara, orcid.org/0000-0001-6001-3397; Hideya Kawaji, orcid.org/0000-0002-0575-0308; Jun Sese, orcid.org/0000-0003-3495-4382; Chikara Meno, orcid.org/0000-0002-0869-6642
*Corresponding author. Tel: +81 92 642 6259; E-mail: [email protected]
*Corresponding author. Tel: +81 92 642 6259; E-mail: [email protected]
EMBO Reports (2018) 19: e46255
https://doi.org/10.15252/embr.201846255
Abstract
We have fully integrated public chromatin immunoprecipitation sequencing (ChIP-seq) and DNase-seq data (n > 70,000) derived from six representative model organisms (human, mouse, rat, fruit fly, nematode, and budding yeast), and have devised a data-mining platform designated ChIP-Atlas (http://chip-atlas.org). ChIP-Atlas is able to show alignment and peak-call results for all public ChIP-seq and DNase-seq data archived in the NCBI Sequence Read Archive (SRA), which encompasses data derived from GEO, ArrayExpress, DDBJ, ENCODE, Roadmap Epigenomics, and the scientific literature. All peak-call data are integrated to visualize multiple histone modifications and binding sites of transcriptional regulators (TRs) at given genomic loci. The integrated data can be further analyzed to show TR–gene and TR–TR interactions, as well as to examine enrichment of protein binding for given multiple genomic coordinates or gene names. ChIP-Atlas is superior to other platforms in terms of the number of data sets and the functionality for data mining across thousands of ChIP-seq experiments, and it provides insight into gene regulatory networks and epigenetic mechanisms.

Synopsis
ChIP-Atlas (http://chip-atlas.org) is an easy-to-use Web service for visualization and data mining of genome-wide binding data for transcriptional regulators and modified histones that is based entirely on public ChIP-seq data. This service allows:
- To browse protein binding at gene loci and genomic regions of interest.
- To identify potential target genes of, and factors that colocalize with, given transcriptional regulators.
- To search for proteins that are enriched at given sets of genes or genomic loci.
Introduction
Chromatin immunoprecipitation sequencing (ChIP-seq) is a powerful method to analyze genome-wide binding of modified histones, RNA polymerases, and other proteins involved in transcription or regulation of gene expression such as transcription factors that recognize specific DNA sequences [1], chromatin-remodeling factors, and histone modification enzymes (collectively referred to as transcriptional regulators [TRs] in this paper). A large amount of data for both ChIP-seq and DNase-seq (a method for profiling regions of open chromatin accessible to DNase [2, 3]) has been compiled by a panel of ENCODE consortia for representative model organisms (human, mouse, fruit fly, and nematode) and has served as a global resource for understanding gene regulatory mechanisms and epigenetic modifications [4-7]. In addition, several tools and databases have been developed for visualization and analysis of public ChIP-seq data in a manner largely dependent on the ENCODE project data for human and mouse [8-11].

On the other hand, a substantial amount of ChIP-seq data has been presented by various smaller projects (Fig 1A). Although such data are publicly available from the Sequence Read Archive (SRA) of NCBI (https://www.ncbi.nlm.nih.gov/sra), the research community has used them less than the ENCODE data for several reasons: (i) unlike the ENCODE data, only the raw sequence data are archived in most cases, necessitating extensive bioinformatics analysis; (ii) metadata such as antigen and cell type names are often ambiguous as a result of the use of orthographic variants such as abbreviations and synonyms; and (iii) integrative analysis of such data requires skills for data mining and abundant computational resources.

Figure 1.
Overview of the ChIP-Atlas data set and computational processing
A. Numbers of ChIP-seq and DNase-seq experiments recorded in ChIP-Atlas (as of May 2018), indicating the proportion of the data for each species derived from ENCODE, Roadmap Epigenomics, and other projects.
B. Cumulative number of SRX-based experiments recorded in ChIP-Atlas. Data published before and after the launch of ChIP-Atlas in December 2015 are shown in gray and black, respectively.
C. Numbers of experiments according to antigen (top) or cell type (bottom) classes for human, fruit fly, and nematode data. PSC, pluripotent stem cell; CDV, cardiovascular.
D. Overview of data processing. Raw sequence data are downloaded from NCBI SRA, aligned to a reference genome, and subjected to peak calling, all of which can be monitored with the genome browser IGV. All peak-call data are then integrated for browsing via the "Peak Browser" function, and they can be analyzed for TR–gene ("Target Genes") or TR–TR ("Colocalization") interactions as well as subjected to enrichment analysis ("Enrichment Analysis"). All of the results are tagged with curated sample metadata such as antigen and cell type names. In the diagrams, gray components (circles, TRs; arrows, genes) indicate queries by the user, with colored components representing the returned results.

Given this background, we launched a project in 2014 to fully exploit and reuse the ChIP-seq and DNase-seq data in the public domain, and we released to the public in December 2015 an easy-to-use database and associated data-mining tools that we designated ChIP-Atlas (http://chip-atlas.org; Fig EV1A). We here present the data content and features of ChIP-Atlas as compared with existing relevant tools published in the same period.

Figure EV1. Web pages of ChIP-Atlas
A, B. A snapshot of the ChIP-Atlas top page is shown in (A).
From this page, users are able not only to access the four main functions of ChIP-Atlas but also to search for data of interest with a given SRX ID (A, top right) or with keywords such as antigen and cell type names (B).
C. Snapshot of the Web page for the ChIP-Atlas "Peak Browser" function. Results for the settings shown are presented in Fig 2.
D. Detailed information for SRX187209, including the sample metadata described by ChIP-Atlas curators and the original data submitter, processing logs, and read quality from DBCLS SRA (http://sra.dbcls.jp). Blue buttons at the top are controllers for showing the alignment and peak-call data in IGV ("View on IGV"), for downloading these data ("Download"), for viewing the analyzed data by ChIP-Atlas "Target Genes" and "Colocalization" ("View Analysis"), and for opening external pages showing details for the experimental conditions and materials ("Link Out"). This type of Web page appears on clicking the bars in the "Peak Browser" view (Fig 2) as well as by clicking SRX IDs shown in Web pages for a keyword search (B) or for "Target Genes" (Fig 3A), "Colocalization" (Fig 3B), or "Enrichment Analysis" (Fig EV3) results.

Results and Discussion

Overview of the data set and design of analyses
The NCBI SRA is a large repository that collects all public raw sequencing data from high-throughput sequencing experiments including ChIP-seq and DNase-seq. It thus covers all sequence data presented in academic papers; data produced by large consortia (such as ENCODE and Roadmap Epigenomics [4-7, 12]) and some other grant-in-aid projects; and data deposited in NCBI GEO (https://www.ncbi.nlm.nih.gov/gds), EBI ArrayExpress (https://www.ebi.ac.uk/arrayexpress), and DDBJ (https://www.ddbj.nig.ac.jp).
ChIP-Atlas collects all ChIP-seq and DNase-seq data archived in NCBI SRA, with samples derived not only from human and mouse, which are also covered in other existing databases (ReMap, Cistrome DB, and GTRD [8, 9, 11]), but also from four other model organisms (rat, fruit fly, nematode, and budding yeast; Fig 1A and Table 1). In NCBI SRA, each experiment is assigned an ID with a prefix of SRX, DRX, or ERX (hereafter collectively referred to as SRX), and these IDs are also adopted in ChIP-Atlas for unified management of the records. The number of SRXs collected in ChIP-Atlas is 76,217 for the six organisms, which corresponds to 89.9% of the total number of ChIP-seq and DNase-seq SRXs in NCBI SRA for all organisms (n = 84,826 as of May 2018). Since the public release of ChIP-Atlas, the data have been updated monthly, in step with the monthly update of NCBI SRA (Fig 1B).

We manually curate the names of antigens and cell types according to commonly or officially adopted nomenclature. The antigens and cell types are further sorted into "antigen classes" and "cell type classes", allowing categorization and extraction of data for given classes (Figs 1C and EV1B). To complete the monthly curation in an expeditious and precise manner, we developed a database and conversion tool that are specialized to return controlled vocabularies from given synonyms of TRs and cell lines or other keywords (such as catalog numbers of antibodies and abbreviations of cell or tissue names) described in SRA sample metadata by original data submitters.

The sequence data are aligned to a reference genome with Bowtie2 [13] and subjected to peak calling with MACS2 [14], and the results can be readily downloaded and browsed in the genome browser IGV [15] (Figs 1D and 2 top) by entering the SRX ID or a given keyword (or keywords) in the corresponding search page of ChIP-Atlas (Fig EV1A, B and D).

Table 1.
Comparison of ChIP-Atlas with other ChIP-seq databases

Feature | ChIP-Atlas | Cistrome DB | ReMap | GTRD
--- | --- | --- | --- | ---
Data source | NCBI SRA fully encompassing GEO, ArrayExpress, DDBJ, ENCODE, and Roadmap Epigenomics data | GEO, ENCODE, and Roadmap Epigenomics | GEO, ArrayExpress, and ENCODE | GEO, ENCODE, and a part of SRA
Experiments | ChIP-seq and DNase-seq | ChIP-seq, DNase-seq, and ATAC-seq | ChIP-seq for TRs | ChIP-seq for TRs
Filtering of data for quality control | No | Yes | Yes | No
Number of experiments | 76,217 | 20,535 | 3,180 | 12,168
Organism (a) | Hs, Mm, Rn, Dm, Ce, and Sc | Hs and Mm | Hs | Hs and Mm
Genome assembly | hg19, mm9, rn6, dm3, ce10, and sacCer3 | hg38 and mm10 | hg38 and hg19 | hg38 and mm10
Peak caller | MACS2 | MACS2 | MACS2 | MACS, SISSRs, GEM, and PICS
Display format for each experiment | Alignment and peaks | Alignment and peaks | Peaks | Peaks
Browsing assembled peaks | Possible | None | Possible | Possible
Genome browser | IGV and UCSC (b) | UCSC | UCSC, ENSEMBL, and IGV | Self-developed
Integrative analysis tools | Search tool for target genes and colocalizing factors of given TR, and enrichment analysis tool for given genes and genomic coordinates | Search tool for target genes of given single experiment | Enrichment analysis tool for given genes and genomic coordinates relative to random background | Search tool for target genes of given TR

(a) Hs, Homo sapiens; Mm, Mus musculus; Rn, Rattus norvegicus; Dm, Drosophila melanogaster; Ce, Caenorhabditis elegans; Sc, Saccharomyces cerevisiae.
(b) Track hub URL = http://fantom.gsc.riken.jp/5prim/external/ChIP-Atlas/current/hub.txt.

Figure 2. Example of processed data visualized with "Peak Browser" of ChIP-Atlas
ChIP-Atlas peak-call data for TRs around the mouse Foxa2 locus are shown in the IGV genome browser for settings of the "Peak Browser" Web page shown in Fig EV1C. Bars represent the peak regions, with the curated names of the antigens and cell types being shown below the bars and their color indicating the score calculated with the peak-caller MACS2 (−log10[Q-value]).
Detailed sample information (yellow window) appears on placing the cursor over each bar. Clicking on the bars (asterisks) enables display of the alignment data (top) and detailed information about the experiments (Fig EV1D).

Furthermore, ChIP-Atlas has two notable features: it allows browsing of peak-call data of the entire data set with IGV, and it supports integrative analysis not only to reveal TR–gene and TR–TR interactions but also to allow enrichment analysis for given genomic intervals based on global protein–genome binding data (the four functions shown in Fig 1D), as is described below with some examples.

Visualization of assembled peak-call data
All peak-call data recorded in ChIP-Atlas can be graphically displayed with the "Peak Browser" function at any genomic regions of interest (ROIs). To implement this function, we integrated a large amount of peak-call data (499, 334, 1.27, 1.39, 3.87, and 0.59 million peaks for the human, mouse, rat, fruit fly, nematode, and budding yeast genomes, respectively), indexed them for IGV, and constructed a Web interface that externally controls IGV preinstalled on the user's machine (Mac, Windows, or Linux platforms). For instance, on specification of ChIP-seq data for mouse TRs on the Web page (Fig EV1C), the corresponding results are streamed into IGV as shown in Fig 2, suggesting that the mouse Foxa2 gene promoter is bound by multiple TRs in the liver (Fig 2, center), that expression of the gene is suppressed by Polycomb group 2 proteins such as Suz12 and Ezh2 in embryonic stem cells (Fig 2, left), and that the upstream region of Foxa2 may possess insulator activity due to Ctcf binding in multiple cell types (Fig 2, right). The colors of peaks indicate the statistical significance values calculated by the peak-caller MACS2 (MACS2 scores), and the names of antigen and cell types are shown beneath the peaks.
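The external control of a locally running IGV mentioned above can be illustrated with a minimal sketch. It assumes that the Web page talks to IGV through IGV's HTTP command port (60151 by default) and its `load` command with `file`, `genome`, and `locus` parameters; whether ChIP-Atlas builds its requests in exactly this way is not stated in the text. The track URL and the locus below are placeholders, not real ChIP-Atlas addresses or the Foxa2 coordinates.

```python
from urllib.parse import urlencode

IGV_PORT = 60151  # IGV's default port for external commands


def igv_load_url(track_url, genome, locus, host="localhost", port=IGV_PORT):
    """Build the URL that asks a running IGV to load a track at a locus."""
    query = urlencode({"genome": genome, "file": track_url, "locus": locus})
    return f"http://{host}:{port}/load?{query}"


# Placeholder inputs: a hypothetical BED track and an arbitrary mm9 interval.
url = igv_load_url("https://example.org/peaks.bed", "mm9", "chr2:1000-2000")
print(url)
```

Opening such a URL in a browser (or fetching it from a script) only has an effect if IGV is already running and its port is enabled, which matches the requirement above that IGV be preinstalled on the user's machine.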
Clicking on a peak opens a Web page containing detailed information including sample metadata, library description, and read quality (Fig EV1D) as well as controllers to display the alignment data in IGV (Fig 2, top). Assembled peak-call data can also be browsed via the "My hubs" function of the UCSC Genome Browser (http://genome-asia.ucsc.edu/cgi-bin/hgHubConnect) by entering a URL for the ChIP-Atlas track hub (http://fantom.gsc.riken.jp/5prim/external/ChIP-Atlas/current/hub.txt) [16, 17]. ChIP-Atlas thus allows not only visualization of the data for each experiment but also browsing of an integrative landscape of multiple chromatin-profiling results, potentially providing insight into the location of functional regions (enhancers, promoters, and insulators) and the corresponding regulatory factors (TRs and histone modifications).

TR–gene and TR–TR interactions
The large number of peak sets is further subjected to integrative analyses for data mining (Fig 1D). All TR peaks are examined for whether they are located around (±1, 5, or 10 kb) transcription start sites (TSSs) of RefSeq coding genes, with the summarized results being provided by the "Target Genes" function of ChIP-Atlas. For example, on selection of Drosophila Pc (also known as Polycomb) as a query TR, and TSS ± 1 kb as the target range (Fig EV2A), this service displays genes with TSS ± 1 kb regions bound by Pc. As the default, the potential target genes are sorted by MACS2 score averaged over all the Pc ChIP-seq data (n = 36; shown in the "Pc: Average" column of Fig 3A). The results can be re-sorted for an SRX of interest. For example, selection of SRX681823 (ChIP-seq data for Pc in 16- to 18-h embryos) (Fig 3A) re-sorts potential target genes such as alpha-Man-IIb, JYalpha, genes encoding Histone H4s, ap, dpr16, and lbe in order of MACS2 score.
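The "Target Genes" tabulation described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the ChIP-Atlas implementation: peaks are treated as closed intervals, the highest-scoring peak is kept when several peaks of one experiment fall in the same window, and an experiment without a peak contributes a score of 0 to the average. All gene names, experiment IDs, and coordinates used with it are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def target_gene_table(peaks, tss_list, window=1000):
    """Map each gene to {experiment: best peak score within TSS +/- window}.

    peaks:    iterable of (experiment, chrom, start, end, score)
    tss_list: iterable of (gene, chrom, tss_position)
    """
    table = defaultdict(dict)
    for gene, gene_chrom, tss in tss_list:
        lo, hi = tss - window, tss + window
        for experiment, chrom, start, end, score in peaks:
            # Closed-interval overlap test between the peak and the window.
            if chrom == gene_chrom and start <= hi and end >= lo:
                best = table[gene].get(experiment, 0)
                table[gene][experiment] = max(best, score)
    return dict(table)


def rank_by_average(table, experiments):
    """Default ordering: mean score over all experiments (no peak = 0)."""
    averages = {
        gene: mean(scores.get(e, 0) for e in experiments)
        for gene, scores in table.items()
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


def rank_by_experiment(table, experiment):
    """Re-sort the genes by their score in a single experiment."""
    scores = {gene: s.get(experiment, 0) for gene, s in table.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Changing `window` to 5000 or 10000 corresponds to the ±5 kb and ±10 kb target ranges, and `rank_by_experiment` mirrors the re-sorting by a single SRX shown in Fig 3A.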
Of note, multiple ChIP-seq data can be compared in a single view as shown in Fig 3A, where ap and lbe loci both appear to be bound by Pc at various stages of embryonic development. It should be noted, however, that the genes listed by "Target Genes" are not necessarily functional targets of a given TR and that actual regulation of potential target genes would need to be confirmed experimentally, for example by analysis of cells deficient in the TR.

Figure EV2. Web pages for integrative analyses in ChIP-Atlas
A, B. Snapshots of Web pages for ChIP-Atlas "Target Genes" (A) and "Colocalization" (B) functions. Results for the settings shown are presented in Fig 3A and B, respectively.
C–E. Snapshots of Web pages for the ChIP-Atlas "Enrichment Analysis" function with submission of genomic coordinates or gene symbols are shown in (C) and (D), respectively. Results for the settings shown are presented in Fig 4A–C and D–F, respectively. At the Web page for "Enrichment Analysis", a user can submit two sets of genomic intervals in BED format (C) or gene symbols (D): data of interest in the orange area and background data for comparison in the gray area. It is also possible to filter the results according to antigen and cell type classes as well as to set a threshold for the MACS2 score. On clicking the "submit" button, the data are sent to an NIG supercomputer server, which performs the enrichment analysis as shown in (E). For example, on submission of BED-formatted genomic regions for hepatocyte enhancers (orange) or enhancers activated in other tissues (gray), the computational server counts the overlaps with the peaks of all SRXs (E, left). After evaluation of the significance of enrichment with Fisher's exact test (E, right), the analyzed data are returned within several minutes to the machine of the user as shown in Fig EV3.

Figure 3.
Examples of analysis with "Target Genes" and "Colocalization" of ChIP-Atlas
A. Potential target genes of Drosophila Pc are listed on the left with ChIP-seq data. The colors of the cells of the matrix indicate the MACS2 scores for Pc ChIP-seq peaks (columns) within TSS ± 1 kb regions of each potential target gene (rows). As the default, the matrix is sorted according to the average of MACS2 scores in each row ("Pc: Average" at top left). Re-sorting is also possible by clicking the triangles under the SRX of interest at the top (sorted result for SRX681823 is shown). This table was obtained with the queries shown in Fig EV2A.
B. TRs that potentially colocalize with Drosophila Pc are listed on the left with their ChIP-seq information. Each cell of the matrix indicates the similarity between the ChIP-seq data for Pc (columns) and those for its potential colocalizing partners (rows) as shown by heat colors and calculated with CoLo. As the default, the matrix is sorted according to the average of CoLo scores in each row ("Pc: Average" at top left as shown here). Sorting by an SRX of interest is possible by clicking the triangles at the top. This table was obtained with the queries shown in Fig EV2B.
C. IGV snapshots showing the alignment data (BigWig format) around the Drosophila ap and lbe gene loci for ChIP-seq experiments listed on the left in (B). The results suggest that both genes might be regulated by Pc together with its colocalization partners (ph-d, Scm, and pho). The y-axes range from 0 to 10 RPM units.

Integrative analysis is also applied to search for sets of TRs that potentially colocalize in a genome-wide manner. Pairwise comparisons of all SRXs are thus performed with the CoLo algorithm (https://github.com/RyoNakaki/CoLo; R. Nakaki, in preparation), and the similarity scores are precomputed for all combinations.
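Because CoLo itself is described only as being in preparation, the idea of precomputing a similarity score for every pair of experiments is illustrated here with a deliberately simple stand-in: the fraction of peaks in either experiment that overlap a peak in the other. This is not the CoLo algorithm and will not reproduce its scores; the experiment names and coordinates are hypothetical, and peaks are assumed to be half-open intervals as in BED files.

```python
from itertools import combinations


def peaks_overlap(a, b):
    """True if two (chrom, start, end) half-open intervals overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def overlap_similarity(peaks_a, peaks_b):
    """Fraction of all peaks that overlap a peak of the other experiment."""
    if not peaks_a or not peaks_b:
        return 0.0
    hits_a = sum(any(peaks_overlap(p, q) for q in peaks_b) for p in peaks_a)
    hits_b = sum(any(peaks_overlap(q, p) for p in peaks_a) for q in peaks_b)
    return (hits_a + hits_b) / (len(peaks_a) + len(peaks_b))


def pairwise_similarity(experiments):
    """Precompute the score for every unordered pair of experiments.

    experiments: dict mapping an experiment ID to its list of peaks.
    """
    return {
        (x, y): overlap_similarity(experiments[x], experiments[y])
        for x, y in combinations(sorted(experiments), 2)
    }
```

The nested loops are quadratic in the number of peaks, which is acceptable for a toy example; a comparison across tens of thousands of experiments would need sorted or indexed intervals.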
For instance, selection of Drosophila Pc protein in embryos as a query on the "Colocalization" Web page of ChIP-Atlas (Fig EV2B) reveals that the ChIP-seq data are similar to those for other Polycomb group proteins such as ph-d, Scm, and pho (Fig 3B). These TRs are shown to be colocalized with Pc around its target genes such as ap and lbe gene loci (Fig 3C), suggestive of cooperative repression of these genes at embryonic stages. This function is thus useful to examine TR–TR interactions and genome-wide colocalization among public ChIP-seq data. In addition, this function can compare multiple ChIP-seq data for the same TR in a single view, which is helpful to assess read quality, the applied antibodies, and other experimental conditions associated with similar or different binding profiles.

Enrichment analysis for given loci and genes
"Enrichment Analysis" of ChIP-Atlas is a tool that allows a search for histone modifications and TRs enriched at a batch of genomic ROIs. On submission of two sets of genomic regions (ROIs and background regions), this service evaluates all SRXs to count the overlaps between the peaks and submitted regions, before returning enrichment analysis data including SRX IDs, antigens, cell types, and P-values (Fig EV2C–E). As a proof of principle, we selected human hepatocyte-specific enhancers as the ROIs (n = 286) and those activated in other tissues as the background (n = 20,509; these enhancers were obtained from FANTOM5 "predefined enhancer data" [18]), and we applied these selections to the Enrichment Analysis. Significantly enriched TRs included HNF4A/G and FOXA1/2 (P < 1 × 10^−21; Figs 4A and EV3A), which are required for liver development and are able to directly reprogram skin fibroblasts into hepatocyte-like cells [19-21]. Furthermore, TRs for the top 15 ranked SRXs included SP1, RXRA, CEBPB, and JUND, all of which function in the liver–biliary system according to Mouse Genome Informatics (MGI) phenotype collections (MP:0005370).
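The statistical step of the enrichment analysis can be sketched with the standard library alone. The sketch assumes a 2 × 2 table of regions (ROIs or background) that do or do not overlap a peak of one experiment, and a one-sided Fisher's exact test for over-representation in the ROIs; the text does not state the sidedness or the exact table layout used by ChIP-Atlas, so the P-value computed here need not match the reported one. The function and argument names are illustrative.

```python
from math import exp, lgamma


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_enrichment(roi_hits, roi_total, bg_hits, bg_total):
    """Return (one-sided P-value, fold enrichment) for a 2 x 2 table.

    Rows are ROIs vs. background regions; columns are regions that
    overlap at least one peak of the experiment vs. regions that do not.
    """
    hits = roi_hits + bg_hits
    total = roi_total + bg_total
    log_denominator = log_choose(total, roi_total)
    # Hypergeometric tail: probability of seeing roi_hits or more
    # overlapping regions among the ROIs by chance.
    p_value = sum(
        exp(log_choose(hits, k)
            + log_choose(total - hits, roi_total - k)
            - log_denominator)
        for k in range(roi_hits, min(roi_total, hits) + 1)
    )
    bg_rate = bg_hits / bg_total
    fold = (roi_hits / roi_total) / bg_rate if bg_rate > 0 else float("inf")
    return min(p_value, 1.0), fold


# Counts quoted for EP300 (SRX100544) in the Figure EV3 legend.
p_value, fold = fisher_enrichment(80, 286, 1147, 20509)
print(f"fold enrichment = {fold:.2f}, P = {p_value:.3g}")
```

With the counts quoted in the Figure EV3 legend (80 of 286 ROIs and 1,147 of 20,509 background regions), the fold enrichment evaluates to about 5.00, in agreement with the value reported there. Repeating the call for every experiment and sorting by P-value gives a ranking of the kind summarized in Fig 4.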
In addition, the predominant cell type from which enriched SRXs were derived was Hep G2 in the Liver class, even though the number of experiments in this class is relatively small for human (Fig 1C). On submission of coordinates for FANTOM5 enhancers specifically activated in blood vessel endothelial cells, all cell types of the top 14 ranked SRXs were endothelial cells in the Cardiovascular class (Fig 4B), with the enriched TRs (TAL1, JUN, EP300, GATA2, and YAP1) being related to blood vessel morphology phenotypes (MP:0001614). In the case of FANTOM5 macrophage-specific enhancers, STAT1 in monocytes and SPI1 (also known as PU.1) in macrophages were significantly enriched (Fig 4C), consistent with the fact that SPI1 is able to reprogram fibroblasts into macrophages [22]. Enrichment analysis for other cell type-specific FANTOM5 enhancers is shown in Fig EV4, with enrichment of ChIP-seq data of the Blood class being apparent for enhancers of other blood cell types such as dendritic cells, monocytes, and T cells.

Figure 4. Analysis of TR enrichment at tissue-specific enhancers and genes with "Enrichment Analysis" of ChIP-Atlas
A–F. The top 15 ChIP-seq experiments enriched for enhancers (A–C) or genes (D–F) specifically activated in hepatocytes (A and D), blood vessel endothelial cells (B and E), or macrophages (C and F) relative to all other FANTOM5 enhancers (A–C) or RefSeq coding genes (D–F) are shown. The bar charts indicate P-values for enrichment, with the colors indicating the cell types examined in the experiments according to the palette shown in Fig EV4, where the top 50 ChIP-seq experiments enriched for the above and other enhancers are also presented. Asterisks next to SRX IDs indicate that the ChIP-seq data originated from projects other than ENCODE or Roadmap Epigenomics.

Figure EV3. Examples of "Enrichment Analysis"
A, B.
Snapshots of the results for enrichment analysis of hepatocyte-specific enhancers with the ChIP-Atlas "Enrichment Analysis" function, for which other FANTOM5 enhancers (A) or randomly permutated regions (B) were set as background, are shown. The first row in (A), for example, indicates EP300 ChIP-seq data (SRX100544) for Hep G2 cells. The total number of peaks for EP300 is 24,334, of which 80 peaks overlap with hepatocyte-specific enhancers (n = 286) and 1,147 peaks overlap with other enhancers (n = 20,509), yielding a P-value of 1 × 10^−32.1 (Fisher's exact probability test), Q-value of 1 × 10^−28.3 (Benjamini and Hochberg method), and fold enrichment of 5.00. The table is sorted according to P-value, with HNF4A/G and FOXA1/2 in Hep G2 being ranked 3rd, 7th, 8th, 10th, and 12th. The table is also graphically summarized in Figs 4A and EV4 (top 15 and 50 experiments, respectively), in which each row of the table is represented by a bar to indicate the P-value. Note that TR peaks overlapp
Reference(s)