Article | Open access | Peer reviewed

From chemoproteomic‐detected amino acids to genomic coordinates: insights into precise multi‐omic data integration

2021; Springer Nature; Volume: 17; Issue: 2; Language: English

10.15252/msb.20209840

ISSN

1744-4292

Authors

Maria F. Palafox, Heta S. Desai, Valerie A. Arboleda, Keriann M. Backus

Topic(s)

Genomics and Phylogenetic Studies

Summary

Article, 18 February 2021, Open Access

Author Information

Maria F Palafox (1,2,3; orcid.org/0000-0002-4752-953X), Heta S Desai (2,4), Valerie A Arboleda* (1,3,4,5,6; orcid.org/0000-0002-9687-9122) and Keriann M Backus* (2,4,5,6,7,8; orcid.org/0000-0001-8541-1404)

1 Department of Human Genetics, David Geffen School of Medicine, UCLA, Los Angeles, CA, USA
2 Department of Biological Chemistry, David Geffen School of Medicine, UCLA, Los Angeles, CA, USA
3 Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, UCLA, Los Angeles, CA, USA
4 Molecular Biology Institute, UCLA, Los Angeles, CA, USA
5 Jonsson Comprehensive Cancer Center, UCLA, Los Angeles, CA, USA
6 Eli and Edythe Broad Center of Regenerative Medicine and Stem Cell Research, UCLA, Los Angeles, CA, USA
7 Department of Chemistry and Biochemistry, College of Arts and Sciences, UCLA, Los Angeles, CA, USA
8 DOE Institute for Genomics and Proteomics, UCLA, Los Angeles, CA, USA

*Corresponding author: Tel: +1 310 983 3358; E-mail: [email protected]
*Corresponding author: Tel: +1 310 206 8617; E-mail: [email protected]

Mol Syst Biol (2021) 17: e9840. https://doi.org/10.15252/msb.20209840

Abstract

The integration of proteomic, transcriptomic, and genetic variant annotation data will improve our understanding of genotype–phenotype associations. Due, in part, to challenges associated with accurate inter-database mapping, such multi-omic studies have not extended to chemoproteomics, a method that measures the intrinsic reactivity and potential "druggability" of nucleophilic amino acid side chains. Here, we evaluated mapping approaches to match chemoproteomic-detected cysteine and lysine residues with their genetic coordinates. Our analysis revealed that database update cycles and reliance on stable identifiers can lead to pervasive misidentification of labeled residues.
Enabled by this examination of mapping strategies, we then integrated our chemoproteomics data with computational methods for predicting genetic variant pathogenicity, which revealed that codons of highly reactive cysteines are enriched for genetic variants that are predicted to be more deleterious and allowed us to identify and functionally characterize a new damaging residue in the cysteine protease caspase-8. Our study provides a roadmap for more precise inter-database mapping and points to untapped opportunities to improve the predictive power of pathogenicity scores and to advance prioritization of putative druggable sites.

SYNOPSIS

Multi-omic data integration maps Chemoproteomic Detected (CpD) amino acids to genomic-level predictions of variant pathogenicity.

Highly reactive cysteine and lysine residues are enriched for high pathogenicity (CADD) scores and disease-causing pathogenic variants.
A comparison of multi-omic mapping strategies identifies common issues with data integration, including those that result from reference sequence updates and redundancy.
Chemoproteomic-detected cysteines show no significant enrichment of disease-associated variants in CLINVAR, whereas lysines show significant enrichment for pathogenic variants.
Proof-of-concept functional validations reveal that chemoproteomics measures of cysteine reactivity complement genetic scores to accurately predict functional residues in the cysteine protease caspase-8.

Introduction

Understanding how proteins work is the bedrock of functional biology and drug development. The identification of amino acids that directly regulate a protein's activity (e.g., catalytic residues, residues that drive interactions, or residues important for folding or stability) is an essential step to functionally characterize a protein. Delineation of amino acid-specific functions is typically accomplished using site-directed mutagenesis (Hemsley et al, 1989; Starita et al, 2015).
While such studies can identify functional hotspots in human proteins, they are typically limited in scope and largely restricted to proteins easily expressed in vitro. With the advent of next-generation sequencing and CRISPR-based mutagenesis, deep mutational analysis can now be scaled to individual genes (e.g., TP53 and BRCA1) (Starita et al, 2015; Boettcher et al, 2019), but such studies have not been extended genome-wide.

This problem of identifying the functional properties of a specific amino acid parallels one of the central challenges of modern genetics: interpreting the pathogenicity of the millions of genetic variants found in an individual's genome. Many computational methods, such as M-CAP (Jagadeesh et al, 2016), Combined Annotation Dependent Depletion (CADD) (Kircher et al, 2014), PolyPhen (Adzhubei et al, 2010), and SIFT (Vaser et al, 2016), integrate data such as sequence conservation, metrics of sequence constraint, and other functional annotations to provide a quantitative assessment of variant deleteriousness. In the absence of experimental data, these scores provide a metric to rank genetic variants for their effect on a phenotype, something particularly important in the era of genome-wide association and sequencing studies.

Beyond genetic variation, a frequently overlooked parameter that defines functional hotspots in the proteome is amino acid side chain reactivity, which can fluctuate depending on the residue's local and 3-dimensional protein microenvironment. Mass spectrometry-based chemoproteomics methods have been developed that can assay the intrinsic reactivity of thousands of amino acid side chains in native biological systems (Weerapana et al, 2010; Backus et al, 2016; Hacker et al, 2017). Using these methods, previous studies, including our own, revealed that "hyper-reactive" or pKa-perturbed cysteine and lysine residues are enriched in functional pockets.
These chemoproteomics methods can even be extended to measure the targetability or "druggability" of amino acid side chains, which has revealed that a surprising number of cysteine and lysine side chains can also be irreversibly labeled by small drug-like molecules (Weerapana et al, 2010; Backus et al, 2016; Hacker et al, 2017). Complicating matters, for the vast majority of these chemoproteomic-detected amino acids (CpDAA), the functional impact of a missense mutation or chemical labeling remains unknown.

Integrating chemoproteomics data with genomic-based annotations represents an attractive approach to stratify CpDAA functionality and to identify therapeutically relevant disease-associated pockets in human proteins. Such multi-omic studies require mapping a protein's sequence back to genomic coordinates, through the transcript isoforms, in essence reverse engineering the central dogma of molecular biology. Accurate mapping between amino acid positions and genomic coordinates remains particularly challenging, due in part to the diversity of cell type-specific transcript and protein isoforms and the non-linear relationship between gene, transcript, and protein sequences. One approach to address these challenges is through proteogenomics (Ruggles et al, 2017), where custom FASTA files are generated from whole exome or RNA-sequencing data. However, such approaches are not scalable or cost-effective. Furthermore, many proteomic datasets, particularly previously acquired and public datasets, lack matched genomic data, precluding proteogenomic analysis.
Many computational tools have been developed for inter-database mapping, including using unique identifiers (Durinck et al, 2009; Smith et al, 2019; Agrawal & Prabakaran, 2020), methods to map genomic coordinates to protein sequences and structures (David & Yip, 2008; Sehnal et al, 2017; Sivley et al, 2018; Stephenson et al, 2019), and tools for codon-centric annotation of genetic variants (Gong et al, 2014; Schwartz et al, 2019). One key application of these tools is the improved prediction of variant pathogenicity (Guo et al, 2017). However, while many predictive genetic scores are built on the GRCh37 genome assembly (frozen in 2014), the UniProt Knowledge Base (UniProtKB) (McGarvey et al, 2019) proteomic reference is based on genome assembly GRCh38. Further complicating data integration, unsynchronized and frequent updates to widely used databases, such as UniProtKB and Ensembl, result in a constantly evolving landscape of genome-, transcriptome- and proteome-level sequences and annotations, which confounds multi-omic data integration, particularly for residue-level analyses.

Focusing initially on previously identified CpDAAs (Weerapana et al, 2010; Backus et al, 2016; Hacker et al, 2017), we first assess how the choice of databases, including release dates, and the use of isoform-specific, versioned or stable identifiers impact residue-coordinate mapping and the fidelity of data integration. We then apply an optimized mapping strategy to annotate CpDAA positions with predictions of genetic variant pathogenicity, for both previously published and newly generated chemoproteomic analyses of amino acid reactivity. Our study uncovers key sources of inaccurate mapping and provides fundamental guidelines for multi-omic data integration.
We also reveal that highly reactive cysteines, including those identified previously (Weerapana et al, 2010) and newly identified CpDAAs, are enriched for genetic variants that have high predicted pathogenicity (high deleteriousness), which supports both the utility of predictive scores to further power proteomics datasets and the use of chemoproteomics to add another layer of interpretation to missense genetic variants. As many databases move to GRCh38, we anticipate that our findings will provide a roadmap for more precise inter-database comparisons, which will have wide-ranging applications for both the proteomics and genetics communities.

Results

Characterizing the dynamic mapping landscape relevant to CpDAA data integration

Our first step to achieve high-fidelity multi-omic data integration was to establish a comprehensive set of test data. For this, we aggregated publicly available cysteine and lysine chemoproteomics datasets (Weerapana et al, 2010; Backus et al, 2016; Hacker et al, 2017), resulting in a total of 6,510 CpD cysteines and 9,327 CpD lysines detected in 4,119 unique proteins. These 15,837 CpDAAs are further sub-categorized by the residues labeled by cysteine- or lysine-reactive probes (iodoacetamide alkyne [IAA] or pentynoic acid sulfotetrafluorophenyl ester [STP], respectively) and those residues with additional measures of intrinsic reactivity (categorized as high-, medium-, and low-reactive residues; Dataset EV1). As our overarching objective was to characterize CpDAAs using functional annotations based on different versions of protein, transcript, and DNA sequences (Fig 1A), our next step was to develop a high-fidelity data analysis pipeline for intra- and inter-database mapping.
To guide our analyses, we first referenced established methods for such data mapping, including ID mapping (Huang et al, 2008; Meyer, Geske, & Yu, 2016; Xin et al, 2016), residue–residue mapping (Martin, 2005; David & Yip, 2008; Dana et al, 2019), and residue–codon mapping (Zhou et al, 2015; Li et al, 2016) (see Appendix Table S1 for detailed descriptions of each type of mapping).

Figure 1. Landscape of sequence annotation information updates
A. Schematic representation of mapping chemoproteomic-detected amino acids (CpDAAs) to pathogenicity scores.
B. Timeline of gene annotation database release dates and project-specific datasets, including Ensembl releases tested for compatibility (Fig 2) to CpDAA coordinates based on canonical UniProtKB protein sequences and the database reference corresponding to the genomic pathogenicity scores (Fig 3).
C. Average database release cycle length for releases between August 2013 and July 2019. All values are mean ± SD. A total of 25 Ensembl, 13 GENCODE, six CCDS (Homo sapiens only), and five NCBI releases were counted. The UniProtKB value was calculated by taking the average of release cycle lengths reported on the UniProt website.

We suspected that the frequent and unsynchronized update cycles of independent databases (Fig 1B; Dataset EV2) might complicate accurate residue-level mapping. Supporting this hypothesis, quantification of the average update cycle for each database across this time period revealed that UniProtKB has the shortest mean update cycle (~ 4 weeks; Fig 1C). In contrast, NCBI is only updated yearly. These different update cycles can create a lag between versions of databases used to create identifier cross-reference (a.k.a. External Reference [xref]) files (Appendix Table S1). For example, ID mapping files provided by Ensembl for UniProtKB proteins may not share identical sequences if not used within the short 4-week window between UniProtKB updates.
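As a rough illustration of how a mean ± SD release cycle length could be derived from a list of release dates, the following minimal Python sketch may help. The function names and the example dates are placeholders chosen for illustration; they are not taken from the study's code or from any real database release calendar.

```python
from datetime import date
from statistics import mean, stdev


def cycle_lengths_days(release_dates):
    """Return the gaps, in days, between consecutive release dates."""
    ordered = sorted(release_dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]


def summarize_cycles(release_dates):
    """Return (mean, sample SD) of the release cycle length in days."""
    gaps = cycle_lengths_days(release_dates)
    return mean(gaps), stdev(gaps)


# Placeholder dates for illustration only; these are NOT real release dates.
example_releases = [
    date(2018, 1, 1),
    date(2018, 1, 29),
    date(2018, 3, 5),
    date(2018, 4, 2),
]

avg, sd = summarize_cycles(example_releases)
print(f"{avg:.1f} ± {sd:.1f} days")
```

Comparing such summaries across databases gives a quick sense of how far out of step two resources can drift between the creation and the use of a cross-reference file.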
To enable further characterization of how database update cycles and mapping strategy impact the fidelity of data integration, we collected a test set of Ensembl releases (Appendix Fig S1 and Dataset EV3). Specific releases were prioritized that (i) represented reference releases based on the GRCh37 or GRCh38 reference genome, (ii) were compatible with the latest Consensus Coding Sequence (CCDS) update for the human genome (release 22), (iii) were used in the database for nonsynonymous functional predictions (dbNSFP) v4.0a and CADD v1.4, two resources that integrate functional annotations for all possible nonsynonymous single nucleotide variants (SNV) (Kircher et al, 2014; Liu et al, 2016; Rentzsch et al, 2019), and (iv) were associated with a commonly used version of the Ensembl Variant Effect Predictor (VEP) (McLaren et al, 2016).

With these prioritized datasets in hand, we next tracked the loss of CpDAA-containing protein IDs during intra-database mapping of UniProtKB releases and inter-database ID mapping to different Ensembl releases. Gratifyingly, only a handful of the original 4,119 protein IDs were lost due to database updates, both for Ensembl (e.g., 37 IDs for the v97 release of Ensembl) and for UniProtKB (e.g., 26 IDs for 2012 UniProtKB; Appendix Fig S1, Datasets EV1 and EV4). The greatest identifier loss was observed from mapping UniProtKB-based legacy data to the 2018 UniProtKB-SwissProt CCDS cross-referenced curation of the human proteome, with 119 IDs not found in the 2018 dataset. We ascribe this identifier loss to both UniProtKB updates and to the higher level of curation for proteins in the 2018 dataset, which includes only Swiss-Prot canonical protein sequences with a cross-referenced ("xref") entry term in the CCDS database. Of note, CCDS gene IDs are manually reviewed and linked to UniProtKB-SwissProt.
The TrEMBL database consists of automatically generated protein IDs and, as a result, comprises a substantially larger set of UniProtKB IDs than the manually curated SwissProt CCDS subset (Appendix Fig S2). From these analyses, we concluded that using the CCDS UniProtKB release was optimal for integrating functional annotations with chemoproteomics datasets.

Updates to canonical sequences assigned to UniProtKB stable identifiers can lead to intra-database mismapping of CpDAAs

Proteomics datasets, including published CpDAA datasets, are routinely searched against FASTA files containing only canonical UniProtKB proteins (Appendix Table S1), for two main reasons. First, canonical proteins reduce the redundancy and complexity of proteome search databases. Second, these sequences are identified by stable identifiers (also known as the UniProtKB primary accessions) and offer the seeming advantage of remaining constant through database update cycles. However, one particularly confusing aspect of the stable identifier is that the word "stable" in this context does not mean permanent or immutable. Specifically, the sequence linked to a stable identifier can change over database releases. Therefore, we next assessed whether and to what extent updates to the canonical sequences assigned to UniProtKB stable identifiers resulted in mismapping.

To confirm the integrity of our CpDAA dataset, we started this process by validating that over 99% of the CpDAA protein IDs and residue positions matched with those found in a 2012 UniProt FASTA file, corresponding to the reference proteome originally used to process the datasets (see Materials and Methods and Dataset EV1). The small fraction of data lost was due to missing stable identifiers and mismatched CpDAA positions, which likely stem from slight inconsistencies between the original processing pipeline and our current workflow.
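The kind of validation described above, checking that each reported protein ID exists in a reference proteome and that the reported position carries the expected amino acid, can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: the function name, the status labels, and the toy accessions and sequences are all hypothetical, and positions are assumed to be 1-based.

```python
from collections import Counter


def check_residue(sequences, accession, position, expected):
    """Classify one detected residue against a reference proteome.

    sequences: dict mapping accession -> protein sequence
    position:  1-based residue index (assumed convention)
    expected:  one-letter amino acid code, e.g. "C" or "K"
    """
    seq = sequences.get(accession)
    if seq is None:
        return "missing_id"
    if position < 1 or position > len(seq):
        return "out_of_range"
    return "match" if seq[position - 1] == expected else "mismatch"


# Toy proteome and detections; accessions and sequences are made up.
toy_proteome = {"TOY_1": "MKTCAYK"}
detected = [
    ("TOY_1", 4, "C"),   # cysteine at position 4
    ("TOY_1", 7, "K"),   # lysine at position 7
    ("TOY_1", 5, "C"),   # position 5 is alanine in the toy sequence
    ("TOY_2", 3, "K"),   # accession absent from the toy proteome
]

status_counts = Counter(
    check_residue(toy_proteome, acc, pos, aa) for acc, pos, aa in detected
)
print(dict(status_counts))
```

Tallying the status labels over a full dataset gives the fraction of residues that validate against a given release.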
We then mapped the 6,404 CpD cysteines and 9,213 CpD lysines from 4,084 canonical proteins identified in the 2012 dataset to the 2018 UniProtKB CCDS canonical sequence subset of the human proteome. Mapping to CCDS sequences enabled us to take advantage of the extensive array of tools that facilitate forward and reverse annotation between gene, transcript, and protein sequences and would allow for residue-specific mapping to genomic functional annotations (Dataset EV5) (Zhou et al, 2015; Meyer, Geske, & Yu, 2016; McGarvey et al, 2019). Updating to the 2018 release was a requisite step for using these tools, as they overwhelmingly require recent cross-reference files using the newest reference genome GRCh38.

For all CpDAA positions, we performed residue–residue mapping (defined as a one-to-one correspondence between amino acids in proteins from different databases or release dates) to match the 2012 canonical UniProtKB sequences with their 2018 counterparts (Dataset EV4). This dataset mapping resulted in the loss of 121 protein IDs, with 108 simply not found in the 2018 reference file and the remaining 13 found to have different canonical sequences, resulting in mismapping or loss of the originally identified CpDAA residues. The high concordance between these two UniProtKB releases, separated by 6 years, indicates that for the vast majority of UniProtKB updates, differences in release date should not complicate re-mapping legacy proteomics data to more recently released gene, transcript, and protein sequences. However, we were surprised to find that several widely studied proteins, including protein arginine N-methyltransferase 1 (PRMT1 or ANM1, Q99873), serine/threonine protein kinase (SIK3, Q9Y2K2) (Walkinshaw et al, 2013), and tropomyosin alpha-3 chain (TPM3, P06753), had canonical protein sequence differences that caused all or nearly all CpDAA positions to be missed using the 2018 position index (Dataset EV4).
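One simple way to picture residue–residue mapping between two releases keyed by the same stable identifier is sketched below. It first sorts accessions into missing, changed, and identical, and then tries to relocate a residue in a changed sequence by searching for its flanking sequence window. The flanking-window search is an illustrative heuristic chosen here, not the method used in the study; the function names, the `flank` parameter, and the toy sequences are hypothetical.

```python
def compare_releases(old, new):
    """Sort accessions from an old release by their fate in a new release."""
    report = {"missing": [], "changed": [], "identical": []}
    for accession, seq in old.items():
        if accession not in new:
            report["missing"].append(accession)
        elif new[accession] != seq:
            report["changed"].append(accession)
        else:
            report["identical"].append(accession)
    return report


def remap_residue(old_seq, new_seq, position, flank=7):
    """Relocate a 1-based position from old_seq in new_seq.

    Returns the new 1-based position, or None when the flanking window
    is absent from new_seq or occurs more than once (ambiguous).
    """
    if old_seq == new_seq:
        return position
    idx = position - 1
    start = max(0, idx - flank)
    window = old_seq[start:idx + flank + 1]
    first = new_seq.find(window)
    if first == -1 or new_seq.find(window, first + 1) != -1:
        return None
    return first + (idx - start) + 1


# Toy example: the new sequence gains a short N-terminal extension.
old_seq = "MKTCAYKLLDEGHW"
new_seq = "MAAPG" + old_seq
new_pos = remap_residue(old_seq, new_seq, 4)  # cysteine at old position 4
print(new_pos, new_seq[new_pos - 1])
```

A position-only lookup would report the wrong residue in this toy case, whereas the window search recovers the shifted cysteine; real data would also need handling for substitutions inside the window.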
We observed two main reasons for these losses: (i) changes to the canonical sequence associated with the UniProtKB stable ID and (ii) changes to which isoform is assigned as the canonical sequence. While both 2012 and 2018 sequences of PRMT1 are associated with UniProtKB stable ID Q99873, the 2018 sequence contains an additional short N-terminal sequence, not present in the 2012 sequence (Fig 2A). As a result, all 13 PRMT1 CpDAAs failed to map to the 2018 UniProtKB release. In the 2012 release of UniProtKB, the canonical sequence of the peptidyl-prolyl cis-trans isomerase FKBP7 is associated with the versioned (isoform) ID Q9Y680-1, whereas in the 2018 release, the canonical sequence is associated with the versioned (isoform) ID Q9Y680-2, which lacks a short sequence (AAΔ125:162) in the middle of the protein. For FKBP7, this update fortuitously does not result in loss of CpD Lys83, as it is located N-terminal to the deletion. These updates to the protein sequence are, in essence, masked by the stable IDs, which do not flag sequence updates or changes to which isoform sequence is assigned as the canonical. Exemplifying this problem, we identified 45 stable identifiers with non-identical canonical protein sequences in the 2012 and 2018 UniProtKB releases (Dataset EV4).

Figure 2. Challenges with residue-level mapping and UniProtKB canonical protein sequences
A. Schematic depiction of mapping scenarios from updating chemoproteomic-detected protein sequences using stable or versioned identifiers.
B. Distribution of number of isoforms per stable UniProtKB ID for 3,953 detected proteins.
C. Frequency of specific isoform name for 2,487 multi-isoform UniProtKB canonical proteins.
D. Schematic depiction of glucose-6-phosphate dehydrogenase (G6PD, UniProtKB ID P11413) cross-referencing both identical and non-identical sequences of Ensembl stable IDs from five releases.
E. Heatmap of protein sequence distance scores for detected UniProtKB and cross-referenced Ensembl proteins from five releases. Each gene name corresponds to one unique stable Ensembl protein ID.

To further understand how the presence or absence of protein isoforms impacts the fidelity of data mapping during intra-database (UniProtKB) mapping, we identified all isoforms associated with CpDAA stable protein IDs. Analysis of this dataset revealed that 58% of protein stable IDs have between 2 and 5 associated isoform sequences (Fig 2B). Catenin delta-1 protein (CTNND1, O60716) had 32 isoforms, which was the greatest number of isoforms in our dataset (Dataset EV6). Protein isoforms are identified by the "-X" after the UniProtKB ID, where X represents the isoform name. A common assumption of most mapping tools and proteomics databases is that the "-1" sequence is the canonical sequence. However, a key finding from our isoform analysis is that the canonical sequence does not always correspond to the "-1" isoform ID provided by UniProtKB. In fact, for 288 proteins in the UniProtKB 2018 release, a non-"-1" entry corresponds to the canonical isoform, and for 55 CpDAA-containing proteins in our dataset (~ 2%), the canonical sequence is not the "-1" isoform (Fig 2C and Dataset EV7). Strikingly, the canonical sequence can even be the "-10" isoform, as is the case for the Ras-associated and pleckstrin homology domains-containing protein (RAPH1, Q70E73). In the context of database mapping, all of these non-"-1" canonical proteins will likely result in mismapping using established tools.

Accurate residue-level inter-database mapping between UniProtKB and Ensembl is dependent on database update cycles

To investigate how sequence versions impact inter-database mapping, we next turned to ID cross-reference files (Dataset EV3) that are released by Ensembl and UniProtKB.
Cross-reference files can be used to convert between UniProtKB and Ensembl ID types. Three major challenges arise with ID cross-referencing: (i) cross-referenced stable IDs that match but whose corresponding sequences are not identical, (ii) multi-mapping, where a UniProtKB ID maps to many Ensembl protein (ENSP), transcript, and gene IDs, and (iii) the dependence of mapping accuracy on the origin of the cross-reference files, meaning both the release dates and the database that provided them. Glucose-6-phosphate dehydrogenase (G6PD, P11413) exemplifies how sequence updates associated with a stable ID can lead to mismapping of gene-, transcript-, and protein-level annotations for CpDAAs (Fig 2D). For G6PD, the same UniProtKB ID maps to four unique ENSP IDs with identical sequences (see first row in "Identical") as well as four different ENSP IDs with non-identical sequences (see second row in "Non-identical"). For G6PD, this significant redundancy is also observed at the gene and transcript level, both for stable and versioned IDs (Fig EV1A; Dataset EV8). Overall, genes undergo the highest frequency of sequence re-annotation due to continual refinement of the reference genome. In contrast, protein IDs remain largely fixed across releases (Fig EV1B; Dataset EV9).

Figure EV1. Mapping of Ensembl IDs to UniProtKB shows heterogeneity at gene, transcript, and protein levels
A. Number of stable and versioned Ensembl gene, transcript, and protein IDs for G6PD across all five Ensembl releases.
B. Cumulative sequence re-annotations for Ensembl gene, transcript, and protein IDs since the v85 release.
C, D. Average number of Ensembl gene, transcript, and protein IDs for (C) single isoform (n = 1,466) and (D) multi-isoform (n = 2,487) CpDAA UniProt entries. Bar plots represent mean values ± SD for the number of Ensembl IDs per stable UniProtKB ID. Statistical significance was calculated using an unpaired Student's t-test, ****P-value < 0.0001.
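The multi-mapping and sequence-identity issues described above can be illustrated with a small Python sketch that groups cross-referenced Ensembl protein IDs by UniProtKB accession and splits them by whether the sequences are identical. This is a hypothetical illustration: the function names, the three bucket labels, and all identifiers and sequences in the toy data are placeholders rather than real database entries or the study's own code.

```python
from collections import defaultdict


def summarize_xrefs(xref_pairs, uniprot_seqs, ensembl_seqs):
    """Group cross-referenced ENSP IDs per accession by sequence identity."""
    summary = defaultdict(
        lambda: {"identical": set(), "non_identical": set(), "unresolved": set()}
    )
    for accession, ensp in xref_pairs:
        u_seq = uniprot_seqs.get(accession)
        e_seq = ensembl_seqs.get(ensp)
        if u_seq is None or e_seq is None:
            bucket = "unresolved"      # a sequence is missing on either side
        elif u_seq == e_seq:
            bucket = "identical"
        else:
            bucket = "non_identical"
        summary[accession][bucket].add(ensp)
    return dict(summary)


def multi_mapped(summary):
    """Return accessions cross-referenced to more than one ENSP ID."""
    return sorted(
        acc for acc, buckets in summary.items()
        if sum(len(ids) for ids in buckets.values()) > 1
    )


# Toy cross-reference table; all IDs and sequences are placeholders.
pairs = [("TOY_A", "ENSP_TOY_1"), ("TOY_A", "ENSP_TOY_2"), ("TOY_B", "ENSP_TOY_3")]
uniprot = {"TOY_A": "MKTC", "TOY_B": "MCK"}
ensembl = {"ENSP_TOY_1": "MKTC", "ENSP_TOY_2": "MKTCA", "ENSP_TOY_3": "MCK"}

summary = summarize_xrefs(pairs, uniprot, ensembl)
print(multi_mapped(summary))
```

Restricting downstream residue-level annotation to the "identical" bucket is one conservative option, since positions in non-identical sequences cannot be assumed to correspond.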
To assess how pervasive multi-mapping is across the entire CpDAA dataset, we quantified the mean number of Ensembl IDs per UniProtKB ID. We counted both versioned and stable Ensembl ID types (gene, transcript, and protein IDs), for all CpD UniProtKB proteins grouped by single (Fig EV1C) or multi-isoform (Fig EV1D; Dataset EV10) associated stable IDs. We suspected that database updates for all data types (gene, transcript, and protein) and the presence of UniProtKB isoforms would contribute to the observed multi-mapping of CpD protein IDs in our dataset. Of note, Ensembl versioned IDs indicate changes to the associated sequence rather than the presence of isoforms. For example, for protein tropomyosin alpha-4 chain (TPM4, P67936), during the update from v96 to v97, the stable protein identifier showed a version change from ".3" to ".4" (ENSP00000300933.3 to ENSP00000300933.4), which corresponds to a difference of 165 amino acids in the primary sequence caused by the update (Dataset EV11). Not surprisingly, we found that UniProtKB stable identifiers with multiple associated protein isoforms have a higher average number of cross-referenced Ensembl ID types per UniProtKB stable identifier, when compared to UniProtKB stable IDs associated with only one protein isoform. In addition, single isoform UniProtKB stable IDs are more likely to cross-reference identical ENSPs, when compared to multi-isoform UniProtKB stable IDs (Appendix Figs S3 and S4).

One last challenge we identified is that the origin of the cross-reference file (whether it was created by UniProtKB or by Ensembl) affected the outcome of our mapping procedures. Acros

Reference(s)