Article Open access Peer-reviewed

Increased Frequency of Cysteine, Tyrosine, and Phenylalanine Residues Since the Last Universal Ancestor

2002; Elsevier BV; Volume: 1; Issue: 2; Language: English

10.1074/mcp.m100001-mcp200

ISSN

1535-9484

Authors

Dawn J. Brooks, Jacques R. Fresco

Topic(s)

Polyamine Metabolism and Applications

Abstract

Analysis of extant proteomes has the potential of revealing how amino acid frequencies within proteins have evolved over biological time. Evidence is presented here that cysteine, tyrosine, and phenylalanine residues have substantially increased in frequency since the three primary lineages diverged more than three billion years ago. This inference was derived from a comparison of amino acid frequencies within conserved and non-conserved residues of a set of proteins dating to the last universal ancestor in the face of empirical knowledge of the relative mutability of these amino acids. The under-representation of these amino acids within last universal ancestor proteins relative to their modern descendants suggests their late introduction into the genetic code. Thus, it appears that extant ancient proteins contain evidence pertaining to early events in the formation of biological systems.
By remaining unchanged over the long course of molecular evolution, conserved residues of ancient proteins might possess significant information regarding early ancestral proteins. 
We sought to determine whether amino acid frequencies within conserved positions of proteins dating to the last universal ancestor (LUA) of all life indicate that any of the 20 amino acids occurred more or less frequently within early proteins than within their modern descendants. (The abbreviations used are: LUA, last universal ancestor; COG, clusters of orthologous groups.) In part, we were motivated by the idea that the amino acid composition of proteins within the LUA might have reflected the order of addition of amino acids to the genetic code, i.e. that compared to modern proteins, the composition was relatively richer in amino acids added to the code early and poorer in those added late. Our approach is based on the insight that the amino acid composition of conserved residues of modern-day proteins has been determined by two factors: the composition of the ancestral proteins that gave rise to the extant proteins and the relative mutability of the various amino acids over the course of evolution of the sequences. Therefore, based solely on knowledge of the composition of conserved (i.e. unchanged) residues of extant sequences and the relative mutability of each amino acid, it may be possible to make inferences regarding the composition of early ancestral proteins. The mutability of each amino acid has been determined empirically through pair-wise comparison of aligned homologous protein sequences; mutability is defined as the number of times an amino acid differs at analogous sites of two aligned sequences divided by the total occurrence of that amino acid within the pair of sequences (1: Dayhoff M.O., Schwartz R.M., Orcutt B.C. In: Dayhoff M.O. (ed.) Atlas of Protein Sequence and Structure. Vol. 5, Suppl. 3. National Biomedical Research Foundation, Washington, D.C. 1978: 345-352). 
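The mutability definition above can be illustrated with a minimal Python sketch. It is not the procedure of Dayhoff et al.: the function name `relative_mutability`, the handling of gap characters, and the use of a single sequence pair are assumptions made only for illustration.

```python
from collections import Counter


def relative_mutability(seq_a: str, seq_b: str, gap: str = "-") -> dict[str, float]:
    """Per-residue mutability for one pair of aligned sequences.

    Mutability = (times a residue sits at a site where the two sequences
    differ) / (total occurrences of that residue in the pair).
    Assumption: columns containing a gap are skipped entirely.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    changed = Counter()  # occurrences at sites where the pair differs
    total = Counter()    # all occurrences in the pair
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue
        total[a] += 1
        total[b] += 1
        if a != b:
            changed[a] += 1
            changed[b] += 1
    return {aa: changed[aa] / total[aa] for aa in total}


# Hypothetical toy alignment (not data from the study)
m = relative_mutability("AAS-", "ATSG")
print(m)  # A differs at 1 of its 3 occurrences; T always differs; S never does
```

In practice such counts would be accumulated over many sequence pairs before taking the ratio; the single-pair version is shown only to make the definition concrete.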
Thus, an amino acid that has mutated relatively frequently over the course of evolution is assigned a high mutability, whereas an amino acid that has mutated relatively infrequently is assigned a low mutability. Amino acids differ in mutability according to the ease with which each particular amino acid may be structurally or functionally replaced by any other within proteins. This depends on the size, shape, hydrophobicity, and charge of each amino acid side chain and its ability to form various types of weak bonds, as well as the structure of the genetic code. Our approach is based upon the following premise. An amino acid with relatively low mutability is by definition less likely to change over the course of sequence evolution than other amino acids. Therefore, as an original set of ancestral sequences gives rise to successive generations of descendants, the frequency of such an amino acid within conserved positions of those descendants (i.e. residues that are unchanged between ancestral and descendant sequences) will increase relative to its frequency within the entire ancestral sequence set. Consequently, the frequency of an amino acid with low mutability within conserved sequence positions of descendant sequences provides an upper limit on its frequency within the ancestral sequences, i.e. it must have occurred with a lower frequency within the ancestral sequences as a whole than within the conserved positions of descendant sequences. On the other hand, the frequency of an amino acid with relatively high mutability will decrease over evolution within conserved positions of descendant sequences relative to the entire ancestral sequence set; thus, its frequency within conserved positions provides a lower limit on its frequency within the ancestral sequences. 
It is important to recognize that these inferences regarding the upper and lower limits of amino acid frequencies within ancestral sequences are completely independent of substitution events occurring within non-conserved sequence positions. As a consequence of the limits specified above, two general types of observations (Table I) would suggest that a change in frequency of an amino acid over evolution within a set of proteins had occurred. If an amino acid with low mutability occurs less frequently within conserved than within non-conserved residues of the extant protein set, its frequency must have increased over evolution, because its frequency within ancestral sequences can be inferred to have been lower than that within conserved residues. Conversely, if an amino acid with high mutability occurs with greater frequency within conserved than non-conserved residues, its frequency can be inferred to have decreased over evolution, because its frequency within ancestral sequences can be inferred to have been higher than that within conserved residues. It is worth remarking that, based on this approach, no inferences regarding changing amino acid frequencies may be made in cases in which an amino acid with low mutability occurs more frequently, or an amino acid with high mutability occurs less frequently, within conserved than non-conserved residues. Nonetheless, this approach may identify some amino acids that have changed in frequency over deep evolutionary time and thereby provide novel insights regarding early proteins. Guided by this rationale, we determined the frequency of each amino acid in conserved and non-conserved sequence elements of a set of extant proteins dating to the LUA in 26 species spanning the three primary lineages.

Table I. Summary of rationale for inferring change in frequency of an amino acid over the course of evolution
Mutability | Frequency in conserved / frequency in non-conserved | Change in frequency over evolution
Low | <1 | Increased
Low | >1 | ?
High | >1 | Decreased
High | <1 | ?
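The decision rule summarized in Table I can be written as a small function. This is a sketch of the stated rationale only; the function name `infer_change`, the string labels, and the example numbers are hypothetical, and no test of statistical significance is included.

```python
def infer_change(mutability: str, freq_conserved: float, freq_nonconserved: float) -> str:
    """Apply the Table I rationale to one amino acid.

    mutability: "low", "high", or anything else for "no consensus".
    Returns "increased", "decreased", or "undetermined".
    """
    ratio = freq_conserved / freq_nonconserved
    if mutability == "low" and ratio < 1:
        # Conserved frequency is an upper limit on the ancestral frequency,
        # yet it is already below the non-conserved frequency.
        return "increased"
    if mutability == "high" and ratio > 1:
        # Conserved frequency is a lower limit on the ancestral frequency,
        # yet it is already above the non-conserved frequency.
        return "decreased"
    return "undetermined"


# Hypothetical frequencies, chosen only to exercise each branch
print(infer_change("low", 0.010, 0.020))   # increased
print(infer_change("high", 0.080, 0.070))  # decreased
print(infer_change("low", 0.130, 0.060))   # undetermined
```

The two "?" rows of Table I, and any amino acid without an agreed mutability class, fall through to "undetermined".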
Although the nature of the LUA has been the subject of debate (2: Woese C.R. The universal ancestor. Proc. Natl. Acad. Sci. U. S. A. 1998; 95: 6854-6859), for the present work it is sufficient that the LUA was a hetero- or homogeneous population that diverged to form the three primary lineages. Consistent with this view, a set of proteins was selected that can be inferred to have been present within the LUA. The clusters of orthologous groups (COG) database (3: Tatusov R.L., Natale D.A., Garkavtsev I.V., Tatusova T.A., Shankavaram U.T., Rao B.S., Kiryutin B., Galperin M.Y., Fedorova N.D., Koonin E.V. The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res. 2001; 29: 22-28), which groups proteins into families based on pairwise comparisons of the protein complements of fully sequenced genomes, was used to assist in the choice of proteins to include in the analysis. Twenty-six major lineages (19 eubacteria, six archaea, and one eukaryote) are represented in the COG database. Not all species contribute members to all families in the database; on the other hand, some species contribute more than one member to a particular family. Our first requirement was that a member of a protein family be present in at least one species of each of the three primary lineages, because this criterion is used to infer that an ancestor of that family was present in the LUA (4: Kyrpides N.C., Overbeek R., Ouzounis C.A. Universal protein families and the functional content of the last universal common ancestor. J. Mol. Evol. 1999; 49: 413-423). In fact, we required that for any protein family to be included in the study, at least one member had to be present in all 26 species selected from the COG database (for the list of species, see the legend to Table V). This made it possible to assemble a set containing members from the same protein families for each of these species. 
Although only one eukaryote, Saccharomyces cerevisiae, was included in the analysis, this did not in any way limit the ability to identify conserved sequence positions within the protein set or to draw conclusions based on the data obtained. In fact, the very wide phylogenetic representation of both eubacteria and archaea was more than sufficient to identify conserved residues, allowing inferences to be drawn regarding the frequency of certain amino acids within ancestral sequences in the LUA. The inclusion of proteins that have been laterally transferred between the eubacterial and the archaeal/eukaryotic lineages would confound our ability to identify residues conserved since the LUA. The protein set was therefore chosen so as to minimize inclusion of laterally transferred proteins. The phylogenetic grouping of the archaea and eukaryotes within a lineage distinct from that of the eubacteria, originally based upon the small subunit rRNA tree (5: Olsen G.J., Woese C.R., Overbeek R. The winds of (evolutionary) change: breathing new life into microbiology. J. Bacteriol. 1994; 176: 1-6), has been supported by whole genome analysis (6: Fitz-Gibbon S.T., House C.H. Whole genome-based phylogenetic analysis of free-living microorganisms. Nucleic Acids Res. 1999; 27: 4218-4222). Therefore, for any protein family to be included in the analysis, it was required that the family member from the one eukaryotic species, S. cerevisiae, and the members from the six archaeal species form a cluster that is separate from the members contributed by the eubacterial species on the phylogenetic tree provided with each COG (suggesting that proteins within this family have not been laterally transferred between the eubacterial and the archaeal/eukaryotic lineages). 
Finally, for the purpose of sequence reconstruction (see below), it had to be assumed that species and protein trees are congruent, an assumption potentially violated by inclusion of paralogs (homologs arising through gene duplication) that arose prior to speciation. Therefore, for inclusion of any COG family in the analysis, it had to have one homolog within each species, whether an ortholog or paralog, that did not invalidate the assumption of species and protein tree congruence. After these requirements were fulfilled, our protein set consisted of 59 COG families (Table II). Forty-five of these proteins play some role in translation (many are ribosomal proteins), and another seven play a role in transcription, replication, or DNA repair. These all are classified as informational proteins (7: Rivera M.C., Jain R., Moore J.E., Lake J.A. Genomic evidence for two functionally distinct gene classes. Proc. Natl. Acad. Sci. U. S. A. 1998; 95: 6239-6344), because they function in replication, transcription, or translation. The remaining seven proteins are classified as operational proteins (7), which perform metabolic and other housekeeping roles within the cell. Informational proteins have been found to be less likely to be laterally transferred than operational proteins (7), and because one of the goals in choosing the set was to avoid laterally transferred proteins, the high proportion of informational proteins in the set was both expected and reassuring.

Table II. COG protein families included in the LUA protein set
COG0013 | alanyl-tRNA synthetase
COG0030 | dimethyladenosine transferase
COG0060 | isoleucyl-tRNA synthetase
COG0495 | leucyl-tRNA synthetase
COG0143 | methionyl-tRNA synthetase
COG0016 | phenylalanyl-tRNA synthetase α-subunit
COG0072 | phenylalanyl-tRNA synthetase β-subunit
COG0442 | prolyl-tRNA synthetase
COG0081 | ribosomal protein L1
COG0244 | ribosomal protein L10
COG0080 | ribosomal protein L11
COG0102 | ribosomal protein L13
COG0093 | ribosomal protein L14
COG0200 | ribosomal protein L15
COG0197 | ribosomal protein L16/L10E
COG0256 | ribosomal protein L18
COG0090 | ribosomal protein L2
COG0091 | ribosomal protein L22
COG0089 | ribosomal protein L23
COG0087 | ribosomal protein L3
COG0088 | ribosomal protein L4
COG0094 | ribosomal protein L5
COG0097 | ribosomal protein L6
COG0051 | ribosomal protein S10
COG0100 | ribosomal protein S11
COG0048 | ribosomal protein S12
COG0099 | ribosomal protein S13
COG0184 | ribosomal protein S15P/S13E
COG0186 | ribosomal protein S17
COG0185 | ribosomal protein S19
COG0052 | ribosomal protein S2
COG0092 | ribosomal protein S3
COG0522 | ribosomal protein S4 and related proteins
COG0098 | ribosomal protein S5
COG0049 | ribosomal protein S7
COG0096 | ribosomal protein S8
COG0103 | ribosomal protein S9
COG0172 | seryl-tRNA synthetase
COG0441 | threonyl-tRNA synthetase
COG0532 | translation initiation factor 2 (GTPase)
COG0180 | tryptophanyl-tRNA synthetase
COG0525 | valyl-tRNA synthetase
COG0202 | DNA-directed RNA polymerase α-subunit/40-kDa subunit
COG0085 | DNA-directed RNA polymerase β-subunit/140-kDa subunit
COG0250 | transcription antiterminator
COG0258 | 5′-3′ exonuclease (including N-terminal domain of Pol I)
COG0592 | DNA polymerase III β-subunit
COG0468 | RecA/RadA recombinase
COG0550 | topoisomerase IA
COG0459 | chaperonin GroEL (HSP60 family)
COG0533 | metal-dependent proteases with possible chaperone activity
COG0201 | preprotein translocase subunit SecY
COG0541 | signal recognition particle GTPase
COG0552 | signal recognition particle GTPase
COG0112 | glycine hydroxymethyltransferase
COG0125 | thymidylate kinase
COG0237 | dephospho-CoA kinase
COG0575 | CDP-diglyceride synthetase
COG0012 | predicted GTPase

The next step was to identify residues within the 59 proteins from each of the 26 species that have been conserved since the LUA. Sequences were aligned using ClustalW (8: Thompson J.D., Higgins D.G., Gibson T.J. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994; 22: 4673-4680). Two approaches were then used to identify conserved residues within each of the descendant sequences. The first was to identify positions in which the amino acid residues in all 26 descendants are identical. We refer to such positions as "identical sites" to distinguish them from conserved residues identified using the second method described below. Identical sites are rare (∼2% of sequence sites) and exclude many residues actually conserved between an ancestral sequence and any given descendant sequence. To identify conserved residues more accurately, maximum parsimony (9: Eck R.V., Dayhoff M.O. In: Dayhoff M.O. (ed.) Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Silver Spring, MD. 1966: 166-169) was used to partially reconstruct the ancestral protein sequences in the LUA that gave rise to each family of aligned descendants. The protein parsimony software "protpars" included in the PHYLIP phylogenetic package (10: Felsenstein J. PHYLIP (Phylogeny Inference Package), Version 3.2. Cladistics. 1989; 5: 164-166) was used to partially reconstruct ancestral sequences, assuming the phylogenetic tree indicated by small subunit rRNA data (5). Using the inferred ancestral sequence, conserved and non-conserved sites within the descendant sequence of each species were identified. Because these ancient sequences have diverged to a great extent, only slightly more than a third (∼37%) of the sites within the ancestral sequence could be reconstructed. At sequence positions for which no ancestral residue could be assigned, it was assumed that residues within none of the descendant sequences were conserved. The frequency of each amino acid within conserved and non-conserved residues of the sequence set in each species could then be determined. Conserved sequence elements for the 26 species were pooled to determine frequencies of each amino acid in those positions; the same was done for the non-conserved sequence elements. 
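The two ways of identifying conserved residues described above, and the pooling of counts across species, can be sketched in Python. This sketch assumes that the alignment and a partially reconstructed ancestral sequence are already available (here as plain strings, with "?" marking sites where no ancestral residue could be assigned). It does not call ClustalW or protpars, and all function names and the toy data are hypothetical.

```python
from collections import Counter


def identical_sites(alignment: list[str], gap: str = "-") -> list[int]:
    """Column indices at which every aligned sequence has the same residue."""
    return [
        i
        for i, column in enumerate(zip(*alignment))
        if column[0] != gap and len(set(column)) == 1
    ]


def split_by_conservation(ancestor: str, descendant: str,
                          unknown: str = "?", gap: str = "-"):
    """Count a descendant's residues at conserved and non-conserved sites.

    A site is conserved when the descendant residue equals the reconstructed
    ancestral residue. Sites with no reconstructed ancestor are treated as
    non-conserved, following the assumption stated in the text.
    """
    conserved, nonconserved = Counter(), Counter()
    for anc, res in zip(ancestor, descendant):
        if res == gap:
            continue
        if anc != unknown and anc == res:
            conserved[res] += 1
        else:
            nonconserved[res] += 1
    return conserved, nonconserved


def frequencies(counts: Counter) -> dict[str, float]:
    """Convert residue counts to fractional frequencies."""
    n = sum(counts.values())
    return {aa: c / n for aa, c in counts.items()}


# Hypothetical toy data: one ancestral reconstruction, three descendants
ancestor = "MK?GL"
descendants = ["MKAGL", "MRSGL", "MK-GI"]

pooled_cons, pooled_non = Counter(), Counter()
for seq in descendants:
    cons, non = split_by_conservation(ancestor, seq)
    pooled_cons.update(cons)   # pool counts across species
    pooled_non.update(non)

print(identical_sites(descendants))  # [0, 3]
print(frequencies(pooled_cons))
print(frequencies(pooled_non))
```

Pooling counts before converting to frequencies weights each residue equally, which is one reading of "pooled among the 26 species"; averaging per-species frequencies would be a different choice.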
Six amino acids (glycine, histidine, leucine, proline, arginine, and tryptophan) were more frequent in conserved than non-conserved sequence elements; the remaining 14 amino acids were more frequent in non-conserved sequence elements (Table III).

Table III. Frequency of each amino acid in conserved and non-conserved sequence residues and in the entire protein set, pooled among the 26 species (relative mutability: H, high; L, low; ?, undetermined)
Amino acid | Conserved | Non-conserved | Protein set | Conserved/Non-conserved | Relative mutability
Ala | 0.0814 | 0.0821 | 0.0820 | 0.99 | H
Cys | 0.0039 | 0.0085 | 0.0074 | 0.45 | L
Asp | 0.0561 | 0.0551 | 0.0553 | 1.02 | ?
Glu | 0.0779 | 0.0784 | 0.0782 | 0.99 | ?
Phe | 0.0331 | 0.0388 | 0.0374 | 0.85 | L
Gly | 0.1320 | 0.0562 | 0.0738 | 2.35 | L
His | 0.0208 | 0.0192 | 0.0195 | 1.09 | ?
Ile | 0.0593 | 0.0697 | 0.0673 | 0.85 | H
Lys | 0.0611 | 0.0799 | 0.0755 | 0.77 | ?
Leu | 0.1079 | 0.0847 | 0.0901 | 1.27 | ?
Met | 0.0128 | 0.0265 | 0.0233 | 0.48 | ?
Asn | 0.0241 | 0.0402 | 0.0365 | 0.60 | H
Pro | 0.0629 | 0.0360 | 0.0423 | 1.74 | L
Gln | 0.0167 | 0.0375 | 0.0327 | 0.45 | H
Arg | 0.0710 | 0.0592 | 0.0620 | 1.20 | L
Ser | 0.0261 | 0.0563 | 0.0493 | 0.46 | H
Thr | 0.0385 | 0.0520 | 0.0488 | 0.74 | H
Val | 0.0801 | 0.0783 | 0.0787 | 1.02 | H
Trp | 0.0113 | 0.0097 | 0.0100 | 1.17 | L
Tyr | 0.0231 | 0.0317 | 0.0297 | 0.73 | L

The relative mutability of the 20 amino acids has been determined empirically by several investigators, starting with Dayhoff et al. (1). Jones et al. (11: Jones D.R., Taylor W.R., Thornton J.M. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 1992; 8: 275-282) later updated the mutability estimates of Dayhoff et al. (1) for a much larger set of protein sequences, whereas Gonnet et al. (12: Gonnet G.H., Cohen M.A., Benner S.A. Exhaustive matching of the entire protein sequence database. Science. 1992; 256: 1443-1445) based their mutability estimates on a set of sequences similar to that of Jones et al. (11) but using a modification of the Dayhoff approach. Depending on the data set and approach, some variations occur in the relative mutability ranking (Table IV). Nonetheless, seven amino acids (valine, glutamine, isoleucine, threonine, alanine, serine, and asparagine) consistently fall within the top half, and seven (tryptophan, cysteine, phenylalanine, tyrosine, glycine, proline, and arginine) fall within the bottom half of the ranking. There is, however, a lack of consensus as to whether the remaining amino acids (leucine, lysine, histidine, methionine, and glutamic and aspartic acids) are of high or low mutability. Accordingly, amino acids are assigned high, low, or undetermined mutability in Table III.

Table IV. Rank order of relative mutability of the amino acids (from most to least mutable) based on empirical data of Dayhoff et al. (1), Jones et al. (11), and Gonnet et al. (12)
Rank | Dayhoff | Jones | Gonnet
1 (most mutable) | Asn | Ser | Ser
2 | Ser | Thr | Ala
3 | Asp | Asn | Thr
4 | Glu | Ile | Gln
5 | Ala | Ala | Lys
6 | Thr | Val | Val
7 | Ile | Met | Glu
8 | Met | His | Asn
9 | Gln | Asp | Ile
10 | Val | Gln | Leu
11 | His | Arg | Met
12 | Arg | Glu | Asp
13 | Pro | Lys | Arg
14 | Lys | Pro | His
15 | Gly | Leu | Gly
16 | Tyr | Phe | Phe
17 | Phe | Gly | Pro
18 | Leu | Tyr | Tyr
19 | Cys | Cys | Cys
20 (least mutable) | Trp | Trp | Trp

Based on both their relative mutabilities and their relative frequencies in conserved and non-conserved sequence elements, the following three amino acids may be inferred to have changed in frequency in the protein set since the LUA: cysteine, tyrosine, and phenylalanine (Table III). Because all three of these amino acids are of low mutability and are more abundant in non-conserved than conserved residues, they must have increased in frequency over time (Table I). Although valine, being of high mutability and occurring more frequently in conserved than non-conserved sequence elements, also satisfies the criteria summarized in Table I, its difference in frequency between these subsets is not statistically significant as determined using a chi-square test. The remaining amino acids either lack consensus regarding their relative mutability (see above) or fall into one of the two categories in Table I for which no inferences may be made: glycine, proline, arginine, and tryptophan are of low mutability and are more frequent in conserved than non-conserved residues, whereas alanine, isoleucine, asparagine, glutamine, serine, and threonine are of high mutability and are less frequent in conserved than non-conserved residues. The frequencies of cysteine, tyrosine, and phenylalanine within conserved residues are 0.0039, 0.0231, and 0.0331, respectively (Table III). Because of their low mutability, the frequencies of these amino acids within conserved residues provide an upper limit on their frequencies within this protein set in the LUA. By comparison, the frequencies of cysteine, tyrosine, and phenylalanine within the protein set as a whole are 0.0074, 0.0297, and 0.0374, respectively. 
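Because the conserved-residue frequency is an upper limit on the ancestral frequency for these three amino acids, dividing the whole-set frequency by the conserved frequency gives a lower bound on the fold increase since the LUA. The short Python sketch below performs that arithmetic using the Table III values quoted above; the function name `minimum_fold_increase` is an assumption for illustration.

```python
# (frequency in conserved residues, frequency in whole protein set), Table III
TABLE_III = {
    "Cys": (0.0039, 0.0074),
    "Tyr": (0.0231, 0.0297),
    "Phe": (0.0331, 0.0374),
}


def minimum_fold_increase(freq_conserved: float, freq_whole_set: float) -> float:
    """Lower bound on the fold change since the ancestor.

    Valid only for low-mutability amino acids, for which the conserved
    frequency is an upper limit on the ancestral frequency.
    """
    return freq_whole_set / freq_conserved


for aa, (conserved, whole) in TABLE_III.items():
    fold = minimum_fold_increase(conserved, whole)
    print(f"{aa}: at least {fold:.2f}-fold ({(fold - 1) * 100:.0f}% increase)")
# Cys: at least 1.90-fold (90% increase)
# Tyr: at least 1.29-fold (29% increase)
# Phe: at least 1.13-fold (13% increase)
```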
It can therefore be inferred that the frequency of cysteine has approximately doubled within this protein set between the LUA and today, whereas that of tyrosine has increased at least 29% and phenylalanine at least 13%. Given these findings, we sought to determine whether the frequency of these three amino acids increased to an even greater extent within the modern whole-genome protein sets (i.e. proteomes) than within the ancient protein set. To this end, the mean frequency of each amino acid within the ancient protein set and within the proteomes was compared. Data on the proteomic frequency of these amino acids were taken from the Proteome Analysis Database (13: Apweiler R., Biswas M., Fleischmann W., Kanapin A., Karavidopoulou Y., Kersey P., Kriventseva E.V., Mittard V., Mulder N., Phan I., Zdobnov E. Proteome analysis database: online application of InterPro and CluSTr for the functional classification of proteins in whole genomes. Nucleic Acids Res. 2001; 29: 44-48). The mean frequency of cysteine within the ancient protein set is 0.0074 compared with 0.0099 in the proteomes, the frequency of tyrosine is 0.0297 versus 0.0335, and the frequency of phenylalanine is 0.0375 versus 0.0437. It is apparent, therefore, that the frequency of these three amino acids within modern proteomes has increased even more than within the set of ancient proteins itself. To gain insight on whether cysteine, tyrosine, and phenylalanine might still be increasing in frequency today, we determined whether they are present in modern proteomes at frequencies predicted by neutral evolution. The neutral theory of molecular evolution predicts that an amino acid within a proteome should eventually reach an equilibrium frequency determined primarily by the number of codons assigned to that amino acid, adjusted for the nucleotide composition of its codons and the nucleotide composition of the genomic coding sequences (14: King J.L., Jukes T.H. Non-Darwinian evolution. Science. 1969; 164: 788-798). The probability of observing amino acid j in a specific genome is given by $p_j = \lambda \sum_i x_i y_i z_i$, where i ranges over the codons assigned to amino acid j; $x_i$, $y_i$, and $z_i$ represent the frequency of occurrence of the first, second, and third nucleotides, respectively, of codon i within coding sequences of that genome; and $\lambda$ is a constant such that the sum of $p_j$ over all amino acids is equal to one. The normalization constant $\lambda$ compensates for probabilities assigned to stop codons. Using genomic coding sequence nucleotide frequency data derived from the Codon Usage Database (15: Nakamura Y., Gojobori T., Ikemura T. Codon usage tabulated from international DNA sequence databases: status for the year 2000. Nucleic Acids Res. 2000; 28: 292), the frequencies of cysteine, tyrosine, and phenylalanine in the proteome of each species predicted by neutral evolution were determined (Table V). The observed frequency of cysteine is significantly less than that predicted in all 26 species (p < 0.01), the mean over all species being one-third of that predicted. In contrast, the observed frequency of tyrosine is less than predicted in only 15 of the species (p = 0.28, which is not statistically significant), and the mean observed frequency of tyrosine, 0.0335, is close to that predicted, 0.0358. For phenylalanine, the observed frequency is higher than predicted in 25 species (p < 0.01), the mean observed frequency, 0.0437, being ∼40% higher than predicted. 
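The neutral-evolution prediction, in which each amino acid's probability is the normalized sum over its codons of the product of the first-, second-, and third-position nucleotide frequencies, can be sketched in Python as follows. The sketch assumes the standard genetic code and position-specific nucleotide frequencies supplied as dictionaries; whether the original analysis used position-specific or overall nucleotide frequencies is not stated, and the function name and the uniform example composition are hypothetical.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering ("*" = stop)
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def neutral_frequencies(pos1: dict, pos2: dict, pos3: dict) -> dict[str, float]:
    """Expected amino acid frequencies from nucleotide composition alone.

    pos1, pos2, pos3 map each base (T, C, A, G) to its frequency at the
    first, second, and third codon position of coding sequences.
    """
    raw: dict[str, float] = {}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # stop codons carry no amino acid
        weight = pos1[codon[0]] * pos2[codon[1]] * pos3[codon[2]]
        raw[aa] = raw.get(aa, 0.0) + weight
    lam = 1.0 / sum(raw.values())  # renormalize for mass lost to stop codons
    return {aa: lam * p for aa, p in raw.items()}


# Hypothetical uniform composition: every sense codon is equally likely,
# so each amino acid's frequency is (number of codons) / 61.
uniform = {b: 0.25 for b in BASES}
freqs = neutral_frequencies(uniform, uniform, uniform)
for aa in "CYF":
    print(aa, round(freqs[aa], 4))  # each has 2 of 61 sense codons: 0.0328
```

With real data, the three dictionaries would be filled from a genome's coding-sequence nucleotide frequencies, and the resulting values for C, Y, and F compared with the observed proteome frequencies.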
Therefore, the observed frequency of cysteine is less than, and of phenylalanine is greater than, that predicted by neutral evolution, whereas that of tyrosine agrees with the prediction of neutral evolution.

Table V. Frequency (expressed as a fraction) of cysteine, tyrosine, and phenylalanine observed and predicted by neutral evolution in proteomes of 26 species
Species | Cys observed | Cys predicted | Tyr observed | Tyr predicted | Phe observed | Phe predicted
Aae | 0.0079 | 0.0266 | 0.0415 | 0.0357 | 0.0516 | 0.0260
Afu | 0.0118 | 0.0309 | 0.0365 | 0.0296 | 0.0459 | 0.0251
Ape | 0.0094 | 0.0311 | 0.0335 | 0.0222 | 0.0275 | 0.0209
BAC | 0.0080 | 0.0301 | 0.0348 | 0.0376 | 0.0449 | 0.0321
CLA | 0.0160 | 0.0328 | 0.0326 | 0.0457 | 0.0474 | 0.0463
Cje | 0.0122 | 0.0296 | 0.0368 | 0.0593 | 0.0600 | 0.0535
Dra | 0.0067 | 0.0267 | 0.0230 | 0.0136 | 0.0316 | 0.0127
ENT | 0.0117 | 0.0332 | 0.0286 | 0.0298 | 0.0389 | 0.0293
HPY | 0.0110 | 0.0303 | 0.0368 | 0.0449 | 0.0542 | 0.0394
Hbs | 0.0075 | 0.0263 | 0.0255 | 0.0148 | 0.0312 | 0.0128
Hin | 0.0104 | 0.0321 | 0.0315 | 0.0479 | 0.0447 | 0.0456
MYC | 0.0075 | 0.0292 | 0.0323 | 0.0440 | 0.0559 | 0.0385
Mja | 0.0128 | 0.0277 | 0.0438 | 0.0516 | 0.0426 | 0.0401
Mth | 0.0121 | 0.0283 | 0.0322 | 0.0285 | 0.0365 | 0.0225
Mtu | 0.0088 | 0.0295 | 0.0208 | 0.0148 | 0.0296 | 0.0152
Nme | 0.0103 | 0.0288 | 0.0298 | 0.0275 | 0.0412 | 0.0237
Pae | 0.0100 | 0.0271 | 0.0254 | 0.0144 | 0.0356 | 0.0136
Pyr | 0.0063 | 0.0303 | 0.0384 | 0.0387 | 0.0460 | 0.0322
Rpr | 0.0110 | 0.0285 | 0.0389 | 0.0603 | 0.0488 | 0.0536
SPI | 0.0073 | 0.0270 | 0.0429 | 0.0608 | 0.0619 | 0.0511
Sce | 0.0130 | 0.0285 | 0.0380 | 0.0453 | 0.0450 | 0.0385
Ssp | 0.0100 | 0.0335 | 0.0291 | 0.0343 | 0.0401 | 0.0347
Tac | 0.0060 | 0.0294 | 0.0464 | 0.0330 | 0.0470 | 0.0272
Tma | 0.0071 | 0.0286 | 0.0358 | 0.0332 | 0.0519 | 0.0262
Vch | 0.0105 | 0.0334 | 0.0296 | 0.0352 | 0.0407 | 0.0347
Xfa | 0.0119 | 0.0340 | 0.0262 | 0.0277 | 0.0347 | 0.0291
Mean | 0.0099 | 0.0298 | 0.0335 | 0.0358 | 0.0437 | 0.0317

It is generally assumed that those amino acids believed to have been absent from the prebiotic environment were added to the genetic code later, as enzymes for their biosynthesis evolved (16: Wong J.T. A co-evolution theory of the genetic code. Proc. Natl. Acad. Sci. U. S. A. 1975; 72: 1909-1912). Thus, very early versions of the code would have included only prebiotically available amino acids. 
Because cysteine, tyrosine, and phenylalanine are absent from simulations of the prebiotic environment of the Earth (17: Miller S.L. Which organic compounds could have occurred on the prebiotic earth? Cold Spring Harbor Symp. Quant. Biol. 1987; 52: 17-27), they are commonly held to be late additions to the genetic code. Although we do not propose a specific mechanism for addition of these amino acids to the evolving primitive code, we do make the assumption that codon reassignments would have occurred in a fashion that introduced them into proteins gradually, because the impact upon protein structure of introducing these amino acids en masse was more likely to be detrimental than beneficial (18: Osawa S., Jukes T.H., Watanabe K., Muto A. Recent evidence for evolution of the genetic code. Microbiol Rev. 1992; 56: 229-264). Specifically, these amino acids most likely adopted codons that occurred infrequently within coding sequences. This idea is consistent with the fact that both cysteine and tyrosine share four-codon blocks with at least one stop codon; it is quite possible that the code had only recently evolved to use those codons to specify other amino acids (through modification of existing tRNAs) when cysteine and tyrosine "captured" them. Consequently, we propose that upon their introduction into the code, these three amino acids would have gone from being non-existent to being rare within early coded proteins. Furthermore, because of the distinct physicochemical properties of these amino acids, the majority of subsequent coding sequence mutations introducing them into proteins presumably would have been deleterious, causing their increase in frequency to be gradual (that of cysteine especially so). Because our data indicate that these three amino acids increased in frequency between the LUA and today, they must not have reached their equilibrium frequencies by the time of the LUA. 
According to this scenario, the under-representation of these amino acids in the LUA relative to today is consistent with their late addition to the genetic code. It has conventionally been assumed that the time between the origin of proteins and today has been sufficient for all amino acids to reach their equilibrium frequencies and, therefore, that an observed frequency of an amino acid distinct from that predicted by neutral evolution is evidence of some strict requirement of protein structure or function that places unusual selection on that amino acid (14). However, because our findings suggest that at the time of the LUA, cysteine, tyrosine, and phenylalanine had yet to reach equilibrium frequencies, change of amino acid composition toward that predicted by neutral evolution may be a process requiring very long time periods. Indeed, the observation that the frequency of cysteine is so much lower than that predicted by neutral evolution in modern proteomes may be evidence that the increase in usage of this particular amino acid has been especially gradual over evolution. Consequently, the possibility that even today cysteine continues to move toward its equilibrium frequency through neutral evolution, as the vast range of all possible sequence space is gradually searched, cannot be ruled out. On the other hand, over time phenylalanine has become more frequent in proteins than predicted by neutral evolution. In fact, it is possible that the frequency of phenylalanine, too, will increase further with evolution. In any case, positive selection for phenylalanine has caused any initial rarity of this amino acid in the earliest proteins to be overcome. The same may be argued for tyrosine, the observed frequency of which does not differ significantly from that predicted by neutral evolution. 
Although our approach did not produce evidence for a change in frequency of any of the other 17 amino acids over the course of evolution, this does not imply that no other amino acids have changed in frequency. Using our rationale, it is not possible to reach a definite conclusion regarding the change in frequency (or lack thereof) of those amino acids of high mutability that are less frequent in conserved than non-conserved positions and those of low mutability that are more frequent in conserved than non-conserved positions. Moreover, our ability to make inferences was limited by the lack of consensus on the relative mutability of six amino acids (see Table III and Table IV). It is therefore possible that amino acids other than cysteine, tyrosine, and phenylalanine have increased in frequency since the LUA. With the increase in frequency of these three (and perhaps other) amino acids, there must have been a concomitant decrease in frequency of at least one other amino acid. Because valine is of low mutability and is present at greater frequency in conserved than non-conserved sequence elements (although not to a statistically significant extent), it may indeed have decreased in frequency over time. An alternative approach will be required to determine with certainty which amino acids other than cysteine, tyrosine, and phenylalanine have in fact changed in frequency over evolution.
It is not immediately evident how amino acid composition and structure have co-evolved in the ancient protein set investigated. Studies of protein evolution suggest that structure and function can be well conserved even as protein sequence diverges extensively (see Ref. 19: Chothia C., Lesk A.M. The relation between the divergence of sequence and structure in proteins. EMBO J. 1986; 5: 823-826; but see Ref. 20: Wood T.C., Pearson W.R. Evolution of protein sequences and structures. J. Mol. Biol. 1999; 291: 977-995, for a contrary view). However, evolution of amino acid composition may have affected structure in newly arising proteins of the proteome. Each amino acid has a specific predisposition to occur in different secondary structures, i.e. in α-helices, β-sheets, or random coils (Ref. 21: Chou P.Y., Fasman G.D. Conformational parameter for amino acids in helical, β-sheet, and random coil regions calculated from proteins. Biochemistry. 1974; 13: 211-217; Ref. 22: King R.D., Sternberg M.J. Identification and application of the concepts important for accurate and reliable protein secondary structure prediction. Protein Sci. 1996; 5: 2298-2310), and negative selection preserving structure would have been relatively relaxed in this later protein set. Further investigation will be required to elucidate the structural consequences of changes in proteomic amino acid composition.
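The rationale used in this discussion for deciding when a change in frequency can be inferred can be summarized as a small decision rule. The sketch below is one reading of that rationale, not the authors' procedure. The function name, the labels, and the example frequencies are hypothetical; the "decreased" branch is assumed by symmetry with the case made in the text for cysteine, tyrosine, and phenylalanine; and statistical significance is ignored entirely.

```python
def infer_trend(mutability, freq_conserved, freq_nonconserved):
    """Classify the inferred change in frequency since the LUA.

    mutability        -- "low", "high", or None when there is no consensus
    freq_conserved    -- frequency among conserved residues
    freq_nonconserved -- frequency among non-conserved residues

    Hypothetical helper: an inference is drawn only when the observed
    difference runs against what mutability alone would produce.
    """
    if mutability not in ("low", "high"):
        return "inconclusive"  # no consensus on relative mutability
    if mutability == "low" and freq_conserved < freq_nonconserved:
        # A rarely substituted residue should be enriched at conserved
        # positions; finding it depleted suggests it was rarer in the LUA.
        return "increased"
    if mutability == "high" and freq_conserved > freq_nonconserved:
        # Mirror-image case, assumed here by symmetry.
        return "decreased"
    # The difference is what mutability alone predicts, so no conclusion.
    return "inconclusive"


if __name__ == "__main__":
    # Made-up frequencies, chosen only to exercise each branch.
    print(infer_trend("low", 0.010, 0.020))
    print(infer_trend("low", 0.080, 0.060))
    print(infer_trend("high", 0.050, 0.070))
    print(infer_trend(None, 0.030, 0.040))
```

The second call corresponds to the pattern described for valine (low mutability, more frequent at conserved positions), which this rule leaves inconclusive, in line with the cautious wording of the text.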

Reference(s)