Article Open access Peer-reviewed

Abundance and Distributions of Eukaryote Protein Simple Sequences

2002; Elsevier BV; Volume: 1; Issue: 12 Language: English

10.1074/mcp.m200032-mcp200

ISSN

1535-9484

Authors

Kim Lan Sim, Trevor P. Creamer

Topic(s)

Protein Structure and Dynamics

Abstract

Protein simple sequences are a subclass of low complexity regions of sequence that are highly enriched in one or a few residue types. Such sequences are common in transcription regulatory proteins, in structural proteins, in proteins involved in nucleic acid interactions, and in mediating protein-protein interactions. Simple sequences of 10 or more residues containing ≥50% of a single residue type are surveyed in this work. Both eukaryote and prokaryote proteomes are investigated with emphasis on the eukaryotes. Very large numbers of such sequences are found in all organisms surveyed. It is found that eukaryotes possess far more simple sequences per protein than do the prokaryotes. Prokaryotes display a linear relationship between number of proteins containing simple sequences and proteome size, whereas it is not clear that such a relationship holds for eukaryotes. Strikingly, it is found that each eukaryote possesses its own unique distribution of simple sequences. Within those distributions it is found that simple sequences enriched in certain residue types are clearly favored, whereas others are just as clearly discriminated against. The preferences observed are not correlated with residue occurrence. An analysis of classes of proteins of known function suggests that simple sequence occurrence and distribution may be related to protein function. Based upon this analysis, the large number of simple sequences found above that which would be expected from a simple statistical model, plus the known functional importance of numerous such sequences, it is postulated that eukaryotes have evolved to not only tolerate large numbers of simple sequences but also to require them.

Protein simple sequences are stretches of sequence highly enriched in one or a few residue types. These sequences form a major subclass of low complexity sequences (1. Wootton J.C. Federhen S. Analysis of compositionally biased regions in sequence databases. Methods Enzymol. 1996; 266: 554-571).
Such sequences are common in transcription regulatory proteins where they are often enriched in glutamine, proline, or charged residues and tend to be highly conserved (2. Brendel V. Karlin S. Association of charge clusters with functional domains of cellular transcription factors. Proc. Natl. Acad. Sci. U. S. A. 1989; 86: 5698-5702, 3. Gerber H.P. Seipel K. Georgiev O. Hofferer M. Hug M. Rusconi S. Schaffner W. Transcriptional activation modulated by homopolymeric glutamine and proline stretches. Science. 1994; 263: 808-811, 4. Kashi Y. King D. Soller M. Simple sequence repeats as a source of quantitative genetic variation. Trends Genet. 1997; 13: 74-78, 5. Katti M.V. Sami-Subbu R. Ranjekar P.K. Gupta V.S. Amino acid repeat patterns in protein sequences: their diversity and structural-functional implications. Protein Sci. 2000; 9: 1203-1209). Glutamine-enriched sequences are thought to be the most common simple sequences (5, 6. Green H. Wang N. Codon reiteration and the evolution of proteins. Proc. Natl. Acad. Sci. U. S. A. 1994; 91: 4298-4302) and have been associated with a number of human neurological disorders such as Huntington's disease (7. Cummings C.J. Zoghbi H.Y. Trinucleotide repeats: mechanisms and pathophysiology. Annu. Rev. Genomics Hum. Genet. 2000; 1: 281-328, 8. Karlin S. Burge C. Trinucleotide repeats and long homopeptides in genes and proteins associated with nervous system disease and development. Proc. Natl. Acad. Sci. U. S. A. 1996; 93: 1560-1565, 9. Michelitsch M.D. Weissman J.S. A census of glutamine/asparagine-rich regions: implications for their conserved function and the prediction of novel prions. Proc. Natl. Acad. Sci. U. S. A. 2000; 97: 11910-11915, 10. Karlin S. Brocchieri L. Bergman A. Mrazek J. Gentles A.J. Amino acid runs in eukaryotic proteomes and disease associations. Proc. Natl. Acad. Sci. U. S. A. 2002; 99: 333-338). Proline-rich sequences are known to have important roles as structural elements and in mediating protein-protein interactions (11. Kay B.K. Williamson M.P. Sudol M. The importance of being proline: the interaction of proline-rich motifs in signaling proteins with their cognate domains. FASEB J. 2000; 14: 231-241, 12. Williamson M.P. The structure and function of proline-rich regions in proteins. Biochem. J. 1994; 297: 249-260). Sequences enriched in charged residues have been associated with DNA and RNA processing, chromatin structure, ion binding, and protein-protein interactions (10, 13. Karlin S. Statistical significance of sequence patterns in proteins. Curr. Opin. Struct. Biol. 1995; 5: 360-371). Various simple sequences have been implicated as protein domain linkers (14. Wootton J.C. Drummond M.H. The Q-linker: a class of interdomain sequences found in bacterial multidomain regulatory proteins. Protein Eng. 1989; 2: 535-543) or as markers for disordered proteins (15. Romero P. Obradovic Z. Li X. Garner E.C. Brown C.J. Dunker A.K. Sequence complexity of disordered protein. Proteins. 2001; 42: 38-48, 16. Dunker A.K. Obradovic Z. Romero P. Garner E.C. Brown C.J. Intrinsic disorder in complete genomes. Genome Inform. Ser. Workshop Genome Inform. 2000; 11: 161-171). Clearly there are numerous instances where such sequences play important functional roles. In addition, Kashi et al. (4) have noted that DNA simple sequences are a potential source of genetic variation. Some of these DNA sequences fall within coding regions, leading to variation at the protein level.

The recent explosion in available genomic, and consequently proteomic, data has provided the opportunity to examine the occurrence and distribution of protein simple sequences at a level of detail not previously possible. Here we present a survey of the occurrence and distribution of protein simple sequences highly enriched in a single residue type in the proteomes of four eukaryotes whose genomes have been fully sequenced. The occurrence of eukaryote simple sequences is compared with the occurrence of such sequences in the proteomes of 26 prokaryotes.

Some previous studies of protein simple sequences have used somewhat limited protein databases and have not necessarily compared organisms (5, 6, 17. Saqi M. An analysis of structural instances of low complexity segments. Protein Eng. 1995; 8: 1069-1073, 18. Meyer E.F. Tollet Jr., W.J. Why does nature stutter? A survey of strands of repeated amino acids. Acta Crystallogr. Sect. D Biol. Crystallogr. 2001; 57: 181-186). Other surveys have considered whole proteomes but often remove sequences considered redundant (19. Huntley M. Golding G.B. Evolution of simple sequence in proteins. J. Mol. Evol. 2000; 51: 131-140, 20. Golding G.B. Simple sequence is abundant in eukaryotic proteins. Protein Sci. 1999; 8: 1358-1361).
There are a number of surveys where simple sequences enriched in a particular residue type or associated with a particular function have been examined (2, 3, 9, 14). Some recent studies have focused on comparisons between organisms (10, 21. Marcotte E.M. Pellegrini M. Yeates T.O. Eisenberg D. A census of protein repeats. J. Mol. Biol. 1999; 293: 151-160, 22. Katti M.V. Ranjekar P.K. Gupta V.S. Differential distribution of simple sequence repeats in eukaryotic genome sequences. Mol. Biol. Evol. 2001; 18: 1161-1167, 23. Nishizawa K. Nishizawa M. Kim K.S. Tendency for local repetitiveness in amino acid usages in modern proteins. J. Mol. Biol. 1999; 294: 937-953) but have mostly considered only homopolymeric sequences. Our current study differs from prior work in that we use only intact proteomes from fully sequenced genomes, including sequences annotated as hypothetical proteins.
We focus solely on non-overlapping simple sequences, of 10 or more residues in length, highly enriched in a single residue type (≥50% composition). This approach provides a non-biased view of the distribution of this set of protein simple sequences as well as allowing for ready comparison of their occurrence in the organisms examined. The eukaryotes surveyed, namely a yeast, worm, fruit fly, and plant, comprise a diverse sample of members of the eukaryote kingdom. We have chosen not to include the human proteome given the current uncertain state of its completion. In addition, for comparison we have surveyed 26 prokaryotes, including 12 Archaea, two cyanobacteria, and six Gram-negative and six Gram-positive bacteria.

We find that highly enriched simple sequences are remarkably common in all of the organisms examined. Eukaryotes are found to possess more simple sequences per protein than do the prokaryotes, in keeping with the findings of other groups (19, 21, 23). The occurrence of prokaryote proteins containing simple sequences is linearly correlated with proteome size. Given the limited number of organisms examined, it is not clear that this is the case for the eukaryotes. Perhaps most notably, each organism examined possesses its own unique distribution of simple sequences. We find that simple sequences display surprising length dependences, with some residues preferentially populating long simple sequence regions, while others clearly prefer short simple sequences. There is no discernible correlation with residue occurrence.
For example, leucine-enriched sequences appear to be discriminated against despite leucine being the most common residue in most organisms. Some observed length dependences can be explained in structural and functional terms, although many remain enigmatic. We have also found that simple sequence distributions vary according to functional groupings. For example, leucine-rich regions, despite being discriminated against in the overall distributions, are among the most common simple sequences found in membrane-associated proteins. It is clear from the sheer number found that all organisms examined, particularly eukaryotes, tolerate, and perhaps even require, large numbers of protein simple sequences. The data presented here will provide the basis for future studies of these ubiquitous and potentially extremely important sequences.

Complete proteomes from the fully sequenced genomes of four eukaryotes and 26 prokaryotes were used in our studies (Table I). Sequences were obtained as FASTA format files from the European Bioinformatics Institute (www.ebi.ac.uk/genomes/). We use the entire proteome for each organism, including all proteins marked "hypothetical," "putative," or "probable" as well as all proteins that have no annotation. The one exception to this is the proteome of Arabidopsis thaliana (AT), in which 782 of the protein sequences were found to be incomplete (3% of the proteome). We therefore used only the 26,496 complete sequences in the AT proteome.

The abbreviations used are: AT, A. thaliana; AF, A. fulgidus; AgT, A. tumefaciens C58; AP, A. pernix K1; BH, B. halodurans; BM, B. melitensis 16M chr1; BS, B. subtilis; CA, C. acetobutylicum ATCC824; CE, C. elegans; DM, D. melanogaster; DR, D. radiodurans chr1; EC, E. coli K-12; HI, H. influenzae; HP, H. pylori 26695; HS, Halobacterium sp. NRC-1; MG, M. genitalium; MJ, M. jannaschii; MP, M. pneumoniae; MT, M. thermoautotrophicum; Nos, Nostoc sp. PCC7120; PA, P. abyssi; PAe, P. aerophilum; PH, P. horikoshii; SC, S. cerevisiae; SS, Synechocystis sp. PCC6803; SSol, S. solfataricus; ST, S. tokodaii; TA, T. acidophilum; TV, T. volcanium; VC, V. cholerae chr1.

Table I. Organisms surveyed for protein simple sequences, the number of proteins in each proteome, total number of simple sequences found (SSTot), and the number of proteins containing at least one simple sequence (ProtSS)

Organism | Two-letter code | Type | Number of proteins in proteome | SSTot | ProtSS | SSTot/ProtSS
--- | --- | --- | --- | --- | --- | ---
Saccharomyces cerevisiae | SC | Eukaryote | 6,203 | 7,177 | 3,293 | 2.18
Caenorhabditis elegans | CE | Eukaryote | 21,962 | 23,295 | 11,125 | 2.09
Drosophila melanogaster | DM | Eukaryote | 13,608 | 24,725 | 7,989 | 3.09
A. thaliana (a) | AT | Eukaryote | 26,496 | 27,542 | 14,637 | 1.88
Synechocystis sp. PCC6803 | SS | Cyanobacteria | 3,169 | 1,493 | 1,034 | 1.44
Nostoc sp. PCC7120 | Nos | Cyanobacteria | 5,368 | 2,497 | 1,762 | 1.42
Escherichia coli K-12 | EC | Gram-negative bacteria | 4,289 | 2,064 | 1,636 | 1.26
Haemophilus influenzae | HI | Gram-negative bacteria | 1,709 | 614 | 476 | 1.29
Vibrio cholerae chr1 | VC | Gram-negative bacteria | 2,736 | 1,219 | 881 | 1.38
Helicobacter pylori 26695 | HP | Gram-negative bacteria | 1,566 | 699 | 500 | 1.40
Brucella melitensis 16M chr1 | BM | Gram-negative bacteria | 2,059 | 1,067 | 756 | 1.41
Agrobacterium tumefaciens C58 | AgT | Gram-negative bacteria | 2,722 | 1,610 | 1,078 | 1.49
Bacillus subtilis | BS | Gram-positive bacteria | 4,367 | 1,723 | 1,270 | 1.36
Bacillus halodurans | BH | Gram-positive bacteria | 4,066 | 1,597 | 1,182 | 1.35
Mycoplasma pneumoniae | MP | Gram-positive bacteria | 688 | 360 | 242 | 1.49
Mycoplasma genitalium | MG | Gram-positive bacteria | 480 | 251 | 170 | 1.48
Deinococcus radiodurans chr1 | DR | Gram-positive bacteria | 2,579 | 2,274 | 1,311 | 1.73
Clostridium acetobutylicum ATCC824 | CA | Gram-positive bacteria | 3,672 | 1,550 | 1,132 | 1.37
Archaeoglobus fulgidus | AF | Archaea | 2,421 | 990 | 747 | 1.32
Aeropyrum pernix K1 | AP | Archaea | 2,694 | 1,827 | 1,176 | 1.55
Methanobacterium thermoautotrophicum | MT | Archaea | 1,869 | 691 | 531 | 1.30
Methanococcus jannaschii | MJ | Archaea | 1,715 | 875 | 641 | 1.36
Pyrococcus abyssi | PA | Archaea | 1,765 | 847 | 613 | 1.38
Pyrococcus horikoshii | PH | Archaea | 2,064 | 973 | 730 | 1.33
Halobacterium sp. NRC-1 | HS | Archaea | 2,058 | 1,785 | 1,060 | 1.68
Thermoplasma acidophilum | TA | Archaea | 1,478 | 511 | 395 | 1.29
Thermoplasma volcanium | TV | Archaea | 1,526 | 452 | 376 | 1.20
Pyrobaculum aerophilum | PAe | Archaea | 2,605 | 1,220 | 866 | 1.41
Sulfolobus tokodaii | ST | Archaea | 2,826 | 1,203 | 866 | 1.39
Sulfolobus solfataricus | SSol | Archaea | 2,994 | 1,335 | 962 | 1.39

(a) Some AT protein sequences were incomplete and were not included in the analysis. The number of proteins listed for AT corresponds to the number used.

We arbitrarily define simple sequences as stretches of sequence that 1) are at least 10 residues in length, 2) are composed of ≥50% of a single type of residue, 3) begin and end with the residue of interest, and 4) do not possess gaps (runs without the residue of interest) of more than 5 residues in length. We represent a protein sequence of length L as a string, a_1 a_2 a_3 a_4 ... a_L, where a_i is the residue at position i. When searching for a simple sequence enriched in a certain residue type, the numerical positions in the protein string for that residue are first generated as a string of i values. Putative simple sequences are extracted based on the positions of the i values given that gaps of 6 or more residues in length are not allowed within a simple sequence. Putative simple sequences of many lengths are identified with all i values corresponding to the residue of interest being output. Since only the residue of interest is selected, the process automatically generates only sequences that begin and end with the residue of interest. Subsequent filtering removes sequences that are less than 10 residues long. Remaining sequences are tested to satisfy the ≥50% threshold for the residue of interest. Sequences that do not satisfy the criteria are further analyzed to determine whether shorter simple sequences satisfying our criteria are within them. The entire process results in the identification of all non-overlapping simple sequences within the proteomes that satisfy all four of the above criteria.
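The search procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' program: the function names and constants are hypothetical, positions are 0-based with an exclusive end, and the way a failing candidate is "further analyzed" (take the longest qualifying sub-span, then search what remains on either side) is an assumed strategy, since the text does not spell that step out.

```python
from typing import List, Tuple

MIN_LEN = 10        # criterion 1: minimum length in residues
MIN_FRACTION = 0.5  # criterion 2: minimum fraction of the residue of interest
MAX_GAP = 5         # criterion 4: longest allowed run lacking the residue


def _split_on_gaps(hits: List[int]) -> List[List[int]]:
    """Group sorted hit positions; a gap longer than MAX_GAP starts a new group."""
    groups: List[List[int]] = []
    for pos in hits:
        if groups and pos - groups[-1][-1] - 1 <= MAX_GAP:
            groups[-1].append(pos)
        else:
            groups.append([pos])
    return groups


def _spans_in_group(hits: List[int]) -> List[Tuple[int, int]]:
    """Pick the longest qualifying span, then recurse to its left and right.

    Spans always start and end on a hit, so criterion 3 holds by construction.
    The longest-first recursion is an assumption, not taken from the paper.
    """
    best = None  # (i, j) indices into hits
    for i in range(len(hits)):
        for j in range(len(hits) - 1, i, -1):
            length = hits[j] - hits[i] + 1
            if length < MIN_LEN:
                break
            if (j - i + 1) / length >= MIN_FRACTION:
                if best is None or length > hits[best[1]] - hits[best[0]] + 1:
                    best = (i, j)
                break  # longest qualifying span for this start position
    if best is None:
        return []
    i, j = best
    return (_spans_in_group(hits[:i])
            + [(hits[i], hits[j] + 1)]
            + _spans_in_group(hits[j + 1:]))


def find_simple_sequences(seq: str, residue: str) -> List[Tuple[int, int]]:
    """Return non-overlapping (start, end) spans enriched in `residue`."""
    hits = [k for k, aa in enumerate(seq) if aa == residue]
    spans: List[Tuple[int, int]] = []
    for group in _split_on_gaps(hits):
        spans.extend(_spans_in_group(group))
    return spans
```

Calling `find_simple_sequences(protein, "Q")` once per residue type and per protein would give the per-residue counts that feed SSTot and ProtSS; slicing `protein[start:end]` recovers each simple sequence.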
The computer programs used to identify simple sequences were written in Python/C++ and executed on a Silicon Graphics work station.

We use the Poisson distribution (9, 24. Soper H.E. Tables of Poisson's exponential limit. Biometrika. 1914; 10: 25-35) to model the probability of random occurrence of simple sequences containing a given residue type in the eukaryote proteomes. This is given by

$$f(n) = \frac{e^{-m}\, m^{n}}{n!} \quad \text{(Eq. 1)}$$

where f(n) is the probability of an event happening n times. In our studies l is the length of the simple sequence, n is the threshold value, and m is derived from

$$m = \frac{l \times (\%\ \text{occurrence of residue})}{100} \quad \text{(Eq. 2)}$$

The expected number of simple sequences of length l in a proteome is then

$$SS_{expect} = f(n) \times T_{l} \quad \text{(Eq. 3)}$$

where T_l is the total number of sequence windows of length l in the proteome. The difference between the actual number of simple sequences, SSTot, of length l found and the number expected from the Poisson distribution is then

$$\Delta = SS_{Tot} - SS_{expect} \quad \text{(Eq. 4)}$$

For simple sequences longer than about 25 residues, SSexpect is essentially zero, in which case Δ is equal to the number of simple sequences found. Finally, to compare the occurrence of simple sequences among organisms, we define ΔR as follows:

$$\Delta_{R} = \frac{\Delta}{\text{Number of proteins in proteome}} \quad \text{(Eq. 5)}$$

Our criteria for identifying protein simple sequences ensure that we find sequences that would satisfy any definition of simple sequences, such as the low complexity measures of Wootton and co-workers (1) or the definition used by Golding (20).
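As a rough illustration of Eqs. 1-5, the Poisson model can be written out as follows. This is a sketch under assumptions rather than the authors' code: the function names are hypothetical, the threshold value n is taken to be the smallest residue count that meets the ≥50% criterion for a window of length l, and T_l is computed by counting every window of length l in every protein.

```python
import math
from typing import Iterable


def poisson_pmf(n: int, m: float) -> float:
    """Eq. 1: probability of exactly n events when m are expected."""
    if m <= 0:
        return 1.0 if n == 0 else 0.0
    # Evaluated in log space so that long sequences do not overflow.
    return math.exp(n * math.log(m) - m - math.lgamma(n + 1))


def window_count(protein_lengths: Iterable[int], l: int) -> int:
    """T_l: number of windows of length l over all proteins (assumed definition)."""
    return sum(max(0, length - l + 1) for length in protein_lengths)


def expected_simple_sequences(l: int, residue_percent: float,
                              protein_lengths: Iterable[int],
                              threshold: float = 0.5) -> float:
    """Eqs. 2 and 3: expected number of simple sequences of length l."""
    m = l * residue_percent / 100.0   # Eq. 2
    n = math.ceil(threshold * l)      # assumed reading of "threshold value"
    return poisson_pmf(n, m) * window_count(protein_lengths, l)  # Eq. 3


def delta(ss_tot: float, ss_expect: float) -> float:
    """Eq. 4: observed minus expected count."""
    return ss_tot - ss_expect


def delta_r(delta_value: float, n_proteins: int) -> float:
    """Eq. 5: excess normalized by proteome size."""
    return delta_value / n_proteins
```

With per-length observed counts in hand, `delta(observed, expected_simple_sequences(l, pct, lengths))` gives Δ for one residue type and one length, and `delta_r` scales it for comparison between proteomes.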
We chose to use this definition since it is relatively straightforward to apply, and the sequences identified are unambiguous in nature. The allowable gap used (5 or fewer residues) was chosen because this is the largest gap possible in a 10-residue sequence, the shortest considered, while still satisfying our ≥50% threshold requirement. The ≥50% threshold ensures that even the shorter sequences identified are relatively unlikely to have occurred as a result of randomness in protein sequences. As will be demonstrated below, for many residues the Δ values obtained tend to be large and positive, indicating that we did indeed identify many more sequences than would be expected were sequences random in nature. If the threshold is decreased to ≥30%, we find significantly more simple sequences at all lengths; however, many of these, particularly short sequences, are accounted for by the number expected using the Poisson distribution model (data not shown). If the threshold is increased to ≥70% we find relatively few sequences (data not shown).

We have chosen to include all complete protein sequences in the proteomes that we have examined. This includes those marked hypothetical, putative, or probable and those proteins that have not as yet been annotated. Redundant sequences have also been included. This choice was made so as to be able to perform a more complete analysis of the proteomes, leading to an "unbiased" view. It is possible that some of the simple sequences found come from sequences that are not expressed as proteins. Bork and Copley (25. Bork P. Copley R. Filling in the gaps. Nature. 2001; 409: 818-820) have pointed out that the identification of genes in sequenced genomes is difficult. It is particularly difficult for eukaryote genes where the identification of exons is error-prone. Ideally the analyses presented below should be repeated leaving out those proteins marked hypothetical or not annotated.
This is, however, extremely difficult due to the wide variety of annotations used to denote such putative protein sequences. We have thus chosen to present the analyses of the complete proteomes with the caveat that some of the results may be slightly skewed by the presence of incorrect protein sequences. All of the organisms surveyed possess a remarkable number of simple sequences in their proteomes (Table I). The number found ranges from 251 in the small proteome of MG (480 proteins) up to 27,542 in the proteome of AT (26,496 protein sequences surveyed). Furthermore, a remarkable fraction of proteins in each proteome possess at least one simple sequence. Fig. 1a is a plot of the number of proteins possessing one or more simple sequences, ProtSS, against the number of proteins in each proteome. At first glance one might deduce that there is a linear relationship between the number of simple sequence-containing proteins and the total number of proteins. The line of best fit drawn in Fig. 1a has a correlation coefficient of 0.99. However, the eukaryotes possess significantly larger proteomes than do the prokaryotes and consequently far more simple sequences. In effect, the fit to the data is reduced to a fit to five points, the four eukaryotes plus the prokaryotes essentially as a single point. If one considers just the four eukaryotes surveyed, a line of best fit through the data in Fig. 1a would yield a correlation coefficient of 0.99. Note, however, this is just a four-point fit and that it may well be that there is not a linear relationship between eukaryote proteome size and ProtSS. Clearly the complete proteomes of more eukaryotes need to be examined, once they become available, to gain a better understanding of this relationship. What can be concluded from this figure, and the data in Table I, is that a remarkable number of the proteins in the eukaryote proteomes surveyed possess at least one simple sequence as defined in this work. 
The individual amounts are 53% of the proteins in SC, 51% in CE, 59% in DM, and 55% in AT. Why DM would possess a significantly higher fraction of proteins with at least one simple sequence is not clear. Karlin et al. (10), in a recent survey of homopolymeric runs in proteins ≥200 residues in size, found that DM possessed far more than other eukaryotes. They also found that human proteins possessed more of these runs than proteins from CE despite there being more CE proteins surveyed. Such data suggest that the human proteome may also possess a larger fraction of proteins containing simple sequences than the average observed in this work.

Fig. 1b is a plot of ProtSS against the number of proteins in each proteome for the 26 prokaryotes surveyed. There is a clear linear correlation with the line of best fit having a correlation coefficient of 0.92. Two prokaryotes, the archaeon HS and the bacterium DR, appear to be outliers. Excluding these from the fit results in a correlation coefficient of 0.96. The strong linear correlation observed for the prokaryotes might suggest that these simple sequences have arisen via random events, leading to random distributions that depend only upon the number of proteins in each proteome. As will be demonstrated below, however, our data suggest the opposite, that the occurrence and distributions of simple sequences are not random in nature and that many of these sequences may possess biological significance.

Fig. 1c, a bar plot of the ratio of number of simple sequences found, SSTot, to ProtSS for each organism surveyed, illustrates the difference in occurrence of protein simple sequences in prokaryotes and eukaryotes. Prokaryotes have far fewer simple sequences per protein than do the eukaryotes.
In all cases, the prokaryotes have fewer simple sequences than the total number of proteins in their proteomes, whereas the eukaryotes possess more (Table I). The prokaryotes average 1.40 simple sequences per protein possessing at least one simple sequence (the dashed line on Fig. 1c). Once again, HS and DR are clear outliers among the prokaryotes, possessing SSTot/ProtSS ratios of 1.68 and 1.73, respectively, both values greater than 2 standard deviations from the mean for prokaryotes. The eukaryotes have ratios that range from 1.88 in AT through 2.09 in CE and 2.18 in SC up to 3.09 simple sequences per protein possessing at least one simple sequence in DM. Eukaryotes clearly not only tolerate a significantly higher occurrence of these sequences than do the prokaryotes, they are also more likely to possess multiple simple sequences in each protein. The ratio SSTot/ProtSS is of course dependent upon our definition of protein simple sequences. One can imagine that increasing the size of the allowable gap (currently set at 5 or fewer residues) will result in some of the simple sequences merging, resulting in fewer overall but an increase in the number of longer sequences. The result will be lower values of SSTot/ProtSS for each proteome.

A number of groups have examined the occurrence of homopolymeric runs of sequence and noted that eukaryotes possess more per protein than do prokaryotes (19, 21, 23). Nishizawa et al. (23) note that "modern" tissue-specific proteins have a higher tendency to possess homopolymeric stretches of up to 20 residues in length as compared with ancient proteins. They go on to postulate that this repetitiveness enhances the chance for intermolecular interactions. This hypothesis is supported by observations that simple sequences enriched in glutamine, proline, or charged residues are often found in protein interaction domains of transcription regulatory proteins (2, 3, 4, 5. Katti M.V. Sami-Subbu R. Ranjekar P.K. Gupta V.S. Amino acid

Reference(s)