High‐throughput discovery of functional disordered regions: investigation of transactivation domains
2018; Springer Nature; Volume: 14; Issue: 5; Language: English
10.15252/msb.20188190
ISSN: 1744-4292
Authors: Charles N. J. Ravarani, Tamara Y. Erkina, Greet De Baets, Daniel C. Dudman, Alexandre M. Erkine, M. Madan Babu
Topic(s): 14-3-3 protein interactions
Summary
Article, 14 May 2018, Open Access, Transparent process
High-throughput discovery of functional disordered regions: investigation of transactivation domains
Charles NJ Ravarani (corresponding author), orcid.org/0000-0003-0952-3396, MRC Laboratory of Molecular Biology, Cambridge, UK
Tamara Y Erkina, Butler University, Indianapolis, IN, USA
Greet De Baets, MRC Laboratory of Molecular Biology, Cambridge, UK
Daniel C Dudman, Butler University, Indianapolis, IN, USA
Alexandre M Erkine (corresponding author), orcid.org/0000-0002-1880-4854, Butler University, Indianapolis, IN, USA
M Madan Babu (corresponding author), orcid.org/0000-0003-0556-6196, MRC Laboratory of Molecular Biology, Cambridge, UK
Author Information
Charles NJ Ravarani*,1, Tamara Y Erkina2, Greet De Baets1, Daniel C Dudman2, Alexandre M Erkine*,2 and M Madan Babu*,1
1 MRC Laboratory of Molecular Biology, Cambridge, UK
2 Butler University, Indianapolis, IN, USA
*Corresponding author. Tel: +44 1223 267836
*Corresponding author. Tel: +1 317 940 8569
*Corresponding author. Tel: +44 1223 267066
Molecular Systems Biology (2018) 14: e8190
https://doi.org/10.15252/msb.20188190

Abstract
Over 40% of proteins in any eukaryotic genome encode intrinsically disordered regions (IDRs) that do not adopt defined tertiary structures. Certain IDRs perform critical functions, but discovering them is non-trivial as the biological context determines their function. We present IDR-Screen, a framework to discover functional IDRs in a high-throughput manner by simultaneously assaying large numbers of DNA sequences that code for short disordered sequences. Functionality-conferring patterns in their protein sequence are inferred through statistical learning. Using a yeast HSF1 transcription factor-based assay, we discovered IDRs that function as transactivation domains (TADs) by screening a random sequence library and a designed library consisting of variants of 13 diverse TADs. Using machine learning, we find that segments devoid of positively charged residues but with redundant short sequence patterns of negatively charged and aromatic residues are a generic feature for TAD functionality.
We anticipate that investigating defined sequence libraries using IDR-Screen for specific functions can facilitate discovering novel and functional regions of the disordered proteome as well as understanding the impact of natural and disease variants in disordered segments.

Synopsis
IDR-Screen is a high-throughput experimental and computational approach for discovering functional disordered regions in a biologically relevant context and identifying features of functional sequences through statistical learning. IDR-Screen allows discovering functional disordered regions and learning what makes them functional. Several new transactivation domains (TADs) are discovered from a library of random sequences and the effect of mutations in 13 naturally occurring TADs from various transcription factors is analyzed. Machine learning algorithms allow identification of features associated with transactivation function and can be used to design new TAD sequences. IDR-Screen can facilitate identification of functional disordered regions in naturally occurring proteins and help analyze the effect of natural variation and disease mutations within disordered protein regions.

Introduction
Understanding how the amino acid sequence of a protein contributes to its function (sequence–function relationship) is a problem of long-standing interest. The work of Anfinsen and colleagues in the 1960s (Anfinsen, 1973) together with the elucidation of protein structures established the sequence–structure–function paradigm (Fersht, 2008). With the availability of genomes, it has become clear that a large fraction of any eukaryotic proteome encodes protein segments that do not autonomously fold into a defined tertiary structure although they may contain secondary structural elements (van der Lee et al, 2014). Proteins typically use their intrinsically disordered regions (IDRs) to perform their function by mediating transient protein interactions (Tompa et al, 2014; Van Roey et al, 2014).
Such regions can tolerate mutations; hence, they evolve rapidly and acquire functionality through both convergent evolution and divergent evolution (van der Lee et al, 2014; Tompa et al, 2014; Davey et al, 2015). Although computational approaches have estimated that there could be up to a million functional IDRs in the human proteome (Tompa et al, 2014), only a small fraction of them have been characterized so far (Gouw et al, 2017), limiting our understanding of the disordered proteome. In vitro technologies such as phage display are powerful tools for identifying short disordered linear motifs (three to seven residues within IDRs) that can mediate interactions with specific protein domains in vitro, as well as for discovering strong binders (Ivarsson et al, 2014; Garrido-Urbani et al, 2016; Davey et al, 2017). Such approaches require screening short peptides against specific interaction partners, thus constraining the mechanism by which they mediate function (Jones et al, 2006; Ivarsson et al, 2014; Garrido-Urbani et al, 2016; Davey et al, 2017). The screening occurs outside of the relevant cellular/biological context during the selection experiment and hence does not explicitly consider cellular specificity for binding, i.e., selection against promiscuous binding with other molecules in the cell (negative selection). Thus, there is a need for a complementary and systematic high-throughput approach to study the sequence–function relationship of IDRs in a biologically relevant cellular context. We present a framework called IDR-Screen that allows mechanism-independent discovery of disordered regions that are functional in a cellular context (Fig 1).
It leverages various techniques, including mutational scanning of pooled sequences, genetic screens, and machine learning (ML; Boucher et al, 2014; Fowler & Fields, 2014; Jordan & Mitchell, 2015; Geffen et al, 2016; Nim et al, 2016; Rocklin et al, 2017), and consists of the following modular steps: (i) designing and generating libraries of sequences that code for short peptide segments, (ii) generating a cell population carrying the different sequences and screening them for a function of interest using a selection system (e.g., based on cell viability), (iii) sequencing the population before and after selection and determining functional and non-functional sequences, (iv) describing all sequences by calculating a feature vector quantifying their molecular properties, and (v) applying data analysis approaches such as ML to highlight the molecular basis of functionality of the short disordered peptides in the library (Fig 1). Here, we study transcription initiation as a model biological process to discover and learn what makes certain disordered regions functional (Appendix Figs S1–S3).

Figure 1. Outline of IDR-Screen
IDR-Screen consists of a modular set of stages that can broadly be grouped into the experimental and computational phases. A library of random or designed sequences is transformed into a cell population and expressed as a part of a protein that is used for selection (survival or other readouts such as fluorescence). In this manner, the library is screened to discover sequences that are functional/non-functional based on the designed assay. Upon data processing, this dataset of experimentally validated functional and non-functional sequences is analyzed to learn the rules of functionality using machine-learning (ML) approaches.
Results

High-throughput screening of random sequence library for transactivation domain discovery
In addition to the DNA binding domain that binds to the promoter DNA, transcription factors (TFs) also harbor transactivation domains (TADs), which are typically fewer than 20 residues long and intrinsically disordered (Sigler, 1988). The current mechanistic model is that the TAD mediates interactions to recruit the transcriptional machinery, which is critical for transcription initiation (Ptashne & Gann, 1997). Early investigations of individual TADs of TFs as well as screens of random DNA sequences and Escherichia coli genomic fragments have revealed that TADs tend to be disordered (i.e., unstructured; Sigler, 1988), enriched for acidic (Ma & Ptashne, 1987; Erkine & Gross, 2003) and hydrophobic residues (Cress & Triezenberg, 1991; Regier et al, 1993; Drysdale et al, 1995; Lu et al, 2000; Erkine & Gross, 2003), have a propensity to form alpha helices (i.e., intrinsic helicity) upon binding to their interaction partner (Uesugi et al, 1997; Lee et al, 2010) and may contain distinct sequence motifs that mediate interactions with specific components of the transcriptional machineries (Kussie et al, 1996; Radhakrishnan et al, 1997; Jonker et al, 2005; Piskacek et al, 2007). This has led to TADs being referred to as "acid blobs and negative noodles" (Sigler, 1988). While most TADs are enriched for these properties, the features above do not robustly define a TAD sequence when considered individually (Abedi et al, 2001; Bhaumik & Green, 2001; Mapp & Ansari, 2007; Hahn & Young, 2011; Warfield et al, 2014; Erkina & Erkine, 2016). To discover which sequences can function as TADs, we investigated a library of random DNA sequences (60 bp, ≤ 20 aa; random library).
Since the encoded peptide sequences are ≤ 20 aa, such segments may contain secondary structures of varying degrees but are unlikely to form defined tertiary structures (Murzin et al, 1995) and hence more likely to be disordered. Different selection assays can be designed to discover TADs. We developed an assay to discover functional sequences using the yeast heat shock factor 1 (HSF1) transcription factor as our model. HSF1 has several functional regions including a DNA binding domain, a trimerization domain, and a disordered segment containing a C-terminal TAD (Morimoto, 1998) and regulates the expression of several genes to launch a heat shock response (Hahn et al, 2004; Appendix Fig S1A). Deletion of the disordered C-terminal TAD (HSF1-ΔTAD) results in cell death when grown at 37°C (Erkine & Gross, 2003; Sorger, 1990; Appendix Fig S1B). We then fused the library of sequences to HSF1-ΔTAD and subjected them to the selection experiment. We inferred that sequences that confer survival at 37°C have the potential to function as TADs in this biological context. On the other hand, sequences that mediate promiscuous interactions or fail to initiate transcription efficiently will eventually drop out of the screen. Thus, non-functional sequences will negatively affect growth of cells harboring them or result in cell death (Appendix Fig S1C and D). In this manner, the assay design incorporates the relevant cellular context and negative selection.

Functional sequences display sequence property enrichments compared to non-functional ones
Using this assay, we obtained robust measurements for 67,263 random sequences (i.e., transformed and detected in at least two replicate experiments; Materials and Methods; Table EV1). Using stringent criteria to ensure a low false-positive rate (Materials and Methods), we identified 739 sequences (~1%) that confer survival and hence could function as TADs.
Representative sequences from this screen were independently sequenced and confirmed to confer TAD functionality through spot-dilution assay experiments (Appendix Fig S4). An advantage of the IDR-Screen approach is that in addition to discovering functional sequences, the non-functional sequences that are experimentally validated through the screen (with negative selection considerations) provide a more appropriate control set of sequences to compare against. Functional sequences show enrichment for negatively charged residues (D, E) as well as aromatic amino acids (F, W, Y), compared to the non-functional sequences (Fig 2A). Furthermore, functional sequences were depleted in positively charged residues (R, K, and H). In terms of sequence properties, the functional sequences tend to be longer (median length: 18 residues), have a lower isoelectric point (median pI: 5.57), higher hydrophobicity (median % hydrophobicity: 0.33), intermediate propensity to be disordered (median probability: 0.57), and display some helical propensity (median % helicity: 0.14). The functional sequences are also enriched for the occurrence of certain linear peptide motifs (9-amino acid TAD; Piskacek et al, 2007; Fig 2B–G).

Figure 2. Analysis of functional and non-functional sequences from the random library
A. Enrichment and depletion of different amino acids in the random library (log2 of frequencies of functional over non-functional sequences).
B–G. Boxplots of the distribution of the values of length (B), pI (C), hydrophobicity (D), disorder content (E), and helicity (F) for sequences that are functional (green) and non-functional (red). In the boxplots, the central line shows the median. Statistical significance was assessed using the Wilcoxon test; n values (sample size) and P-values are provided on the right. (G) Enrichment of the 9-aa TAD motif in functional versus non-functional sequences; ratio of with-to-without 9-aa TAD in functional-to-non-functional sequences (219/520)/(13384/50001).
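The two quantities named in the Figure 2 caption (a log2 ratio of amino acid frequencies, and the with-to-without motif ratio) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the pseudocount, and the toy peptide lists are assumptions made here. Only the four counts passed to `motif_enrichment` come from the caption.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_frequencies(sequences, pseudocount=1.0):
    """Relative frequency of each amino acid pooled over all sequences.

    The pseudocount is an assumption added here so that residues absent
    from one set do not produce a division by zero or log2(0).
    """
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts[a] for a in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}

def log2_enrichment(functional, non_functional, pseudocount=1.0):
    """log2(frequency in functional / frequency in non-functional) per residue."""
    f = aa_frequencies(functional, pseudocount)
    n = aa_frequencies(non_functional, pseudocount)
    return {a: math.log2(f[a] / n[a]) for a in AMINO_ACIDS}

def motif_enrichment(func_with, func_without, nonfunc_with, nonfunc_without):
    """Ratio of with-to-without motif counts, functional over non-functional."""
    return (func_with / func_without) / (nonfunc_with / nonfunc_without)

# Toy peptides (hypothetical, for illustration only).
functional = ["DDWEFDLLEW", "EEFDWYDDSL"]
non_functional = ["KRKSTGRKAH", "RSKKPTGHRA"]
enrichment = log2_enrichment(functional, non_functional)

# Counts taken from the Figure 2G caption.
ratio = motif_enrichment(219, 520, 13384, 50001)
print(round(ratio, 2))  # 1.57
```

With the caption's counts the ratio comes out at roughly 1.57, that is, the motif is about one and a half times as common among functional sequences as among non-functional ones.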
We then assessed the predictive power of these individual sequence properties in discriminating functional from non-functional sequences. Given the imbalance in our dataset (739 functional: 63,385 non-functional; imbalance ratio: 0.0117), we used sub-sampling when training the models and assessed the performance using precision–recall curves (Materials and Methods). Using logistic regression models, we find that the aforementioned properties such as length, pI, hydrophobicity, disorder content, intrinsic helicity, and the occurrence of a 9-aa TAD motif poorly discriminate functional and non-functional sequences in the random library when considered individually (Appendix Fig S5A and Materials and Methods). In other words, several sequences that do not function as TADs are frequently erroneously predicted to be functional when only these properties are considered, and a number of functional sequences will often be incorrectly predicted to be non-functional (see Appendix Fig S5B–G for examples). Among all the properties tested, the pI of the sequence appears to have the most discriminative power. We then combined these sequence properties in our model (rather than consider them individually), which marginally affected the ability to discriminate the sequences (Appendix Fig S6A and B). Thus, the properties described above do not exhaustively describe TAD functionality, suggesting that a more exhaustive set of features could increase the predictive power to discriminate functional and non-functional sequences.

Machine learning provides a robust approach to assess sequence feature importance
We therefore developed several different features that more comprehensively describe every sequence in the library in addition to the previously described ones (Appendix Fig S2; Table EV2).
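As an aside on the class imbalance quoted above (739 functional versus 63,385 non-functional sequences), the sketch below reproduces the imbalance ratio and shows one common way to sub-sample. The 1:1 down-sampling of the majority class is an assumption made for illustration; the authors' actual scheme is described in their Materials and Methods and is not reproduced here. The helper names and toy records are hypothetical.

```python
import random

def prevalence(n_pos, n_neg):
    """Fraction of positives; the expected PR-AUC of a random classifier."""
    return n_pos / (n_pos + n_neg)

def balanced_subsample(positives, negatives, seed=0):
    """Draw equally many items from each class (assumed 1:1 down-sampling)."""
    rng = random.Random(seed)
    k = min(len(positives), len(negatives))
    sample = rng.sample(positives, k) + rng.sample(negatives, k)
    rng.shuffle(sample)
    return sample

# Counts from the text.
n_functional, n_non_functional = 739, 63385
imbalance_ratio = n_functional / n_non_functional         # about 0.0117
pr_baseline = prevalence(n_functional, n_non_functional)  # about 0.0115

# Toy records (hypothetical): (label, identifier).
positives = [("pos", i) for i in range(5)]
negatives = [("neg", i) for i in range(100)]
sample = balanced_subsample(positives, negatives)
```

The prevalence works out to about 0.0115, which matches the random-performance figure the article reports for the precision–recall AUC on the random library. This is why precision–recall curves, rather than accuracy, are a sensible yardstick for such skewed data.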
Analysis of the functional sequences showed prevalence of short, highly degenerate motifs (~2–5 residues in length involving negative and aromatic residues; Appendix Fig S7). It is known that tryptophan residues can stabilize local conformations of protein segments (Cochran et al, 2001) and are important to mediate interactions with other proteins as in the case of EIF3 (Marcotrigiano et al, 1997). We designed several new features that captured chemical properties as well as patterns of spacing such as the combinations of amino acids of defined properties (e.g., aromaticity, aliphaticity, hydrophobicity, and presence of positively and negatively charged residues) that are separated by defined distances in the sequences (degenerate mini-motifs). In this manner, we computed 146 different features that were broadly grouped into eight general feature sets (Materials and Methods; Table EV2). We then minimized feature redundancy by retaining one of the highly correlated features after clustering them based on similarity (Materials and Methods). To ensure robust analysis and effective interpretation of patterns in the data, we used different algorithms that rely on distinct principles and provide feature importance for interpretation of the models. Specifically, we used ML algorithms that assume linear (penalized logistic regression model; lasso and ridge) and nonlinear (boosted tree model) relationships between the features for classification (Materials and Methods). Both of these approaches have intrinsic feature ranking capacities. Using the features described above, we trained the different models, each scanning over a broad range of parameters (James et al, 2013). We also employed a stacked model that combines the best models from these approaches (Materials and Methods) and considers them together when identifying the feature importance.
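To make the idea of a degenerate mini-motif concrete, the sketch below counts pairs of residue classes separated by a fixed number of intervening positions. It is an illustrative reading of the description above, not the authors' feature definition: the three classes, the gap range, the feature naming scheme, and the example peptide are all assumptions.

```python
from itertools import product

# Residue classes (assumed grouping; H is counted as positive, as in the text).
CLASSES = {
    "negative": set("DE"),
    "positive": set("KRH"),
    "aromatic": set("FWY"),
}

def count_spaced_pairs(seq, first, second, gap):
    """Count positions i where seq[i] is in class `first` and the residue
    `gap` positions further along (seq[i + gap + 1]) is in class `second`."""
    step = gap + 1
    return sum(
        1
        for i in range(len(seq) - step)
        if seq[i] in CLASSES[first] and seq[i + step] in CLASSES[second]
    )

def mini_motif_features(seq, max_gap=3):
    """Feature dictionary over every ordered class pair and gap 0..max_gap."""
    features = {}
    for first, second in product(CLASSES, repeat=2):
        for gap in range(max_gap + 1):
            name = f"{first}_x{gap}_{second}"
            features[name] = count_spaced_pairs(seq, first, second, gap)
    return features

# Hypothetical peptide, for illustration only.
features = mini_motif_features("DDWEFDLLEW")
print(features["negative_x0_aromatic"])  # 3 (DW, EF, EW)
```

Because each feature only asks for a residue class at a given spacing, many different concrete sequences score identically, which is what makes such motifs degenerate.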
Although the predictive power of the ML models is not high, they identify features that make sequences functional (Appendix Fig S8A and B; for the best performing models: precision–recall (PR) area under the curve (AUC): 0.0563, random performance: 0.0115; receiver operating characteristic (ROC) AUC: 0.6875, random performance: 0.5). Such models also offer a way to test different hypotheses through the importance of a specific feature. Since all features are tested together while training the model, the ranking based on feature importance highlights their relative importance in discriminating functional from non-functional sequences. We highlight features that contributed the most to predict a functional sequence in our random library in the different ML models (Fig 3; Table EV3). One of the key features among the top 10 is a degenerate mini-motif with a prevalence of negatively charged residues (enriched: D, E) in proximity to aromatic (enriched: F, W, Y) amino acids. Other feature sets that are important include the pI, single amino acid composition (enriched: W, D, N; depleted: S), and grouped amino acid composition (enriched: aromatic, hydrophobic, and negative; depleted: polar). We also tested the 9-aa TAD motif and helicity among others, which were not among the top 10 features. This highlights that functional sequences need not be restricted to contain specific sequence motifs or show a tendency to form specific secondary structure elements. These observations suggest that IDRs that contain multiple degenerate mini-motifs of negatively charged and aromatic residues, and are devoid of positively charged residues, are a generic descriptor of functional sequences in the random library.
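For readers less familiar with the two summary statistics quoted above, the following sketch computes them from scratch on toy data. It uses the step-wise average-precision estimate as a stand-in for the PR-AUC and the rank-based definition of the ROC-AUC. Both are standard estimators, but whether they match the exact integration used by the authors is an assumption. The labels and scores are hypothetical.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    Items are ranked by decreasing score; precision is averaged over the
    ranks at which a positive is retrieved. Ties are broken by input order.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / n_pos

def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative
    (ties count as one half)."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example (hypothetical): 2 positives among 8 items.
labels = [1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
ap = average_precision(labels, scores)   # (1/1 + 2/3) / 2
auc = roc_auc(labels, scores)            # 11 of 12 positive-negative pairs
```

An uninformative scorer has an expected ROC-AUC of 0.5 regardless of class balance, whereas its expected PR-AUC equals the fraction of positives. This explains why the two "random performance" values reported in the text differ so widely.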
Consistent with this, an analysis of available structures of TADs in complex with their interaction partner in the Protein Data Bank (Rose et al, 2017) revealed a common theme for the known TAD interacting domains, which tend to contain a positively charged patch and a hydrophobic binding pocket (Appendix Fig S9).

Figure 3. The top 10 most important features of the machine-learning models trained on the random library
Schematic describing the sequence space explored by the random library (left). Table listing the top 10 most important features. The relative feature importance is given as relative percentages in the last four columns. The size of the circles is scaled per method (lasso, ridge, xgboost, stacked). The direction column denotes the direction of enrichment of the given feature for functional sequences compared to non-functional sequences (up, positive direction and down, negative direction). This figure provides a simplified description of the actual features, which are available in Table EV3.

Studying naturally occurring TADs and their variants provides insights into functional features
The IDR-Screen approach can also be used to investigate libraries of naturally occurring sequences and designed variants to probe and learn about naturally occurring TADs. To highlight this, we investigated 25 transactivation domains of TFs and focused on 13 known TADs from diverse organisms ranging from yeast to human that were functional in our HSF1-based selection assay. These include TADs from human KLF4 and ESX, yeast Pdr1 and Oaf1, plant HSFA2, and viral VP16 and EBNA2, as well as artificial TADs identified from previous studies (Table EV4). In addition, we created variants of these reference TAD sequences to investigate their functionality (~90 bp, < 30 aa; 962 variants; design library; Table EV5).
Guided by the observations from the random library, we performed mutational scanning of all positions of the reference TAD sequence to study the importance of each residue in the natural sequence and its ability to tolerate positive charge (K and R scanning), conformational changes (P and G scanning) as well as alanine (A scanning). Some variants in this library also include known single nucleotide polymorphisms in the human population and disease mutations (variants seen in cancer genomes) in the human TADs (Table EV6). A detailed analysis of this mutagenesis data involving 962 sequences/variants revealed that the introduction of a positive charge instead of any residue in the reference TAD sequence was the least tolerated mutation (Materials and Methods). Introductions of G and A were the most tolerated mutations (Fig 4A; see rows) except if the reference TAD residue is a W, Q, I, L, or Y. Introduction of a P, which typically constrains the conformation of a peptide bond, is less tolerated at positions occupied by aromatic residues, whereas positions occupied by polar and negatively charged residues appear to tolerate P better. These findings suggest that aromatic and bulky hydrophobic residues are critical for naturally occurring TAD sequences and negatively charged residues can be substituted possibly due to their redundancy (D and E are among the most frequently occurring amino acids in wild-type TADs). These observations are in line with what we observed in terms of amino acid enrichments among the functional sequences in the random library.

Figure 4. Mutational scanning of naturally occurring TADs and the top 10 features of the machine-learning models trained on the design library
A. Heatmap of the tolerance to amino acid substitutions in WT transactivation domain sequences. The tolerance of a mutation is defined as the fraction of functional sequences over all the sequences when a specific substitution was performed.
The columns (amino acid in a WT TAD that is substituted) are ordered according to decreasing tolerance (from left to right), and the rows (amino acid into which a residue in the WT TAD is substituted) are ordered according to decreasing tolerance (from bottom to top). The cells are colored on a green to red gradient for high to low tolerance, respectively. Empty tiles represent data points not detected in the library.
B. Schematic describing the sequence space explored by the design library (left). Table listing the top 10 most important features. The relative feature importance is given as relative percentages in the last four columns. The size of the circles is scaled per method (lasso, ridge, xgboost, stacked). The direction column denotes the direction of enrichment of the given feature for functional sequences compared to non-functional sequences (up, positive direction and down, negative direction). This figure provides a simplified description of the actual features, which are available in Table EV7.

Some of the sequence variants that represent polymorphisms in the natural human population and in cancer genomes do not confer survival in our assay for TAD functionality. For instance, the W30R in EKLF4 (allele frequency in the human population of 1.3 × 10⁻⁵; gnomAD database) and the E135K mutation in ESX, which is prevalent in esophageal adenocarcinoma (allele frequency of 0.37 from cBioPortal), may lead to loss of TAD activity in these transcription factors (Appendix Fig S10). Thus, IDR-Screen can be a powerful framework to screen and infer the functional impact of a large number of naturally occurring SNPs and mutations observed in disease genomes. We then investigated the individual TADs in terms of their ability to tolerate mutations. To this end, we computed the tolerance score for every TAD in our design library. Tolerance is defined as the number of variants that confer survival over all variants tested for that TAD.
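Both tolerance definitions given above (per substitution type in the Figure 4 heatmap, and per TAD) are simple fractions and can be sketched as follows. The record layout, function names, and toy variants are assumptions introduced for illustration. The TAD names are placeholders and the outcomes are not experimental results.

```python
from collections import defaultdict

def substitution_tolerance(variants):
    """Fraction of functional variants per (wild-type residue, new residue).

    `variants` is an iterable of (tad, wt_residue, new_residue, functional).
    """
    counts = defaultdict(lambda: [0, 0])  # pair -> [functional, total]
    for _tad, wt, new, functional in variants:
        counts[(wt, new)][0] += int(functional)
        counts[(wt, new)][1] += 1
    return {pair: f / n for pair, (f, n) in counts.items()}

def tad_tolerance(variants):
    """Fraction of variants that confer survival, per TAD."""
    counts = defaultdict(lambda: [0, 0])  # tad -> [functional, total]
    for tad, _wt, _new, functional in variants:
        counts[tad][0] += int(functional)
        counts[tad][1] += 1
    return {tad: f / n for tad, (f, n) in counts.items()}

# Toy variants (hypothetical names and outcomes).
variants = [
    ("TAD_A", "W", "R", False),
    ("TAD_A", "D", "A", True),
    ("TAD_A", "D", "A", True),
    ("TAD_A", "E", "G", True),
    ("TAD_B", "D", "A", False),
    ("TAD_B", "W", "R", False),
]
by_substitution = substitution_tolerance(variants)
by_tad = tad_tolerance(variants)
print(by_tad["TAD_A"])  # 0.75
```

Substitution pairs that never occur in the input are simply absent from the returned dictionary, which mirrors the empty tiles described in the heatmap caption.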
An analysis of the distribution of the tolerance scores of the TAD sequences revealed a unimodal distribution where a majority of the TAD sequences have intermediate tolerance scores (Appendix Fig S11). VP16 and Oaf1 are at the most tolerant end of the spectrum, whereas Gln3 is at the least tolerant end of the spectrum. This suggests that most TADs are tolerant to mutations to a certain extent and this is likely to be determined by the nature of the substitution (Fig 4A). It also suggests that naturally occurring wild-type TADs emerged during evolution to be more or less tolerant to different kinds of mutational perturbations with implications for fine-tuning of function via sequence polymorphisms. ML-based learning of the design library data using the different approaches and by a stacked model shows improvement in predictive capacity (for the best performing models: precision–recall AUC: 0.7972, random performance: 0.5626; ROC-AUC: 0.7602, random performance: 0.5) and allowed the identification of the key features that are important in naturally occurring TADs (Fig 4B; Appendix Fig S12A and B; Table EV7). The top 10 important features include the 9-aa TAD motif, pI of the sequence, single amino acid (enriched: D, F; depleted: K) and grouped amino acid composition (enriched: hydrophobic; depleted: polar) and lower disorder probability score.

Combining libraries can train models that are more general and guide design of new sequences
To develop a more general predictor and identify features that are important to discriminate sequences that have the potential to function as a TAD in our system, we combined the sequences from the random and design libraries to train machine-learning algorithms (Fig 5).
Using this combined library, we trained models that strike a balance between not being able to pick up discernible patterns due to a broad and sparse sequence space (random library) versus being biased by picking up patterns from a dense and narrow sequence space (design library; Appendix Fig S13A and B; for the best performing models: precision–recall AUC: 0.2001, random performance: 0.0206; ROC-AUC: 0.7735, random performance: 0.5). An investigation of the features that contribute the most to prediction revealed several features such as the degenerate mini-motifs (also seen in the analysis of the random library) as well as the 9-aa TAD motif (also seen in the analysis of the design library) with variations. The most consistent feature that appears to be important in all three libraries is the pI of the sequence (Fig 5; Table EV8).

Figure 5. The top 10 most important features of the machine-learning models trained on the combined library
Schematic describing the sequence space explored by the combined library (left). Table listing the top 10 most important features. The relative feature importance is given as relative percentages in the last four columns. The size of the circles is scaled per method (lasso, ridge, xgboost, stacked). The direction column denotes the direction of enrichment of the given feature for functional sequences compared to non-functional sequences (up, positive direction and down, negative direction). This figure provides a simplified description of the actual features, which are available in Table EV8.

To explore this further, and considering that the degenerate mini-motifs emerge as an important feature to make a functional sequence, we generated new sequences and tested their ability to be functional in our assay. More specifically, we tested seq