Review Open access Peer reviewed

Marrying structure and genomics

1998; Elsevier BV; Volume: 6; Issue: 3 Language: English

10.1016/s0969-2126(98)00029-x

ISSN

1878-4186

Authors

Burkhard Rost

Topic(s)

Enzyme Structure and Function

Abstract

Large-scale genome sequencing is filling up the catalogue of natural proteins at a breathtaking speed. Today, we have available not just a large number of sequences, but also glimpses of the inventory of entire organisms. This information will soon improve our understanding of cells and of life in general. Three means will contribute to this expanding body of knowledge: sequencing genomes (genomics); the determination of protein structures; and the determination of protein function. Protein structure is interwoven with function (e.g. [1: Murzin A.G. Structural classification of proteins: new superfamilies. Curr. Opin. Struct. Biol. 1996; 6 (96397941): 386-394. 2: Holm L., Sander C. New structure – novel fold? Structure. 1997; 5: 165-171. 3: Lima C.D., Klein M.G., Hendrickson W.A. Structure-based analysis of catalysis and substrate definition in the HIT protein family. Science. 1997; 278: 286-290]). Sequence analysis and determination of function are also routinely combined (e.g. [4: Warbrick E. Two's company, three's a crowd: the yeast two hybrid system for mapping molecular interactions. Structure. 1997; 5 (97169440): 13-17]).

What about the relationship between structure determination and genomics, however? Structural genomics, the marriage between protein structure determination and genomics, is already beginning. Attempts are made here to illustrate the likely direction this marriage will take. Structure determination will be pushed by, and profit from, genomics. Furthermore, basing research and technical developments, such as drug design, on all three pillars (sequence, structure and function) will provide a large step towards the understanding of life.

Structure determination will benefit from genomics in two ways (Figure 1).
Firstly, the mass of available sequences will facilitate the quick determination of structure for most existing folds. Secondly, the availability of sequences for the entire genome of an organism will not only help us to unravel missing links in functional pathways, but also to explore alternative pathways and to widen our understanding of principal mechanisms and evolutionary cross-links.

The first sequence of an entire genome of an organism was published in 1995. Two years on, another ten complete genome sequences have been published (Table 1). Nucleotide databases have grown twice as much over the last two years as in the previous 20 years (Figure 2). The growth of these databases now outpaces even the development of computers (Figure 2). This is merely the beginning.

Table 1. Completely sequenced genomes∗.

Genome                                  Date     Reference
Haemophilus influenzae                  8/95     [22]
Mycoplasma genitalium                   10/95    [23]
Saccharomyces cerevisiae                1/96     [24]
Methanococcus jannaschii                8/96     [25]
Synechocystis sp. PCC6803               9/96     [26]
Mycoplasma pneumoniae                   11/96    [27]
Escherichia coli                        1/97     [28]
Methanobacterium thermoautotrophicum    5/97     [29]
Archaeoglobus fulgidus                  6/97     [30]
Helicobacter pylori                     6/97     [31]
Borrelia burgdorferi                    7/97     [32]
Treponema pallidum                      10/97    †
Bacillus subtilis                       11/97    [33]
Pyrococcus horikoshii                   1/98     ‡
Aquifex aeolicus                        2/98     [34]

∗ List obtained from T Gaasterland, http://www.mcs.anl.gov/home/gaasterl/genomes.html
† Sequences publicly available (CM Fraser et al.).
‡ Sequences partially publicly available [35].

[22] Fleischmann R.D., Venter J.C. et al. Whole-genome random sequencing and assembly of Haemophilus influenzae rd. Science. 1995; 269 (95350630): 496-512.
[23] Fraser C.M., Venter J.C. et al. The minimal gene complement of Mycoplasma genitalium. Science. 1995; 270 (96026346): 397-403.
[24] Goffeau A., Oliver S.G. et al. Life with 6000 genes. Science. 1996; 274: 546-567.
[25] Bult C.J., Geoghagen N.S.M. et al. Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii. Science. 1996; 273 (96337999): 1058-1073.
[26] Kaneko T., Tabata S. et al. Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. Strain PCC6803. II. Sequence determination of the entire genome and assignment of potential protein-coding regions. DNA Res. 1996; 3 (97061201): 109-136.
[27] Himmelreich R., Hilbert H., Plagens H., Pirkl E., Li B.C., Herrmann R. Complete sequence analysis of the genome of the bacterium Mycoplasma pneumoniae. Nucleic Acids Res. 1996; 24: 4420-4449.
[28] Blattner F.R., Shao Y. et al. The complete genome sequence of Escherichia coli K-12. Science. 1997; 277: 1453-1474.
[29] Smith D.R., Reeve J.N. et al. Complete genome sequence of Methanobacterium thermoautotrophicum delta H: functional analysis and comparative genomics. J. Bacteriol. 1997; 179: 7135-7155.
[30] Klenk H.-P., Venter C. et al. The complete genome sequence of the hyperthermophilic, sulphate-reducing archaeon Archaeoglobus fulgidus. Nature. 1997; 390: 364-370.
[31] Tomb J.-F., Venter J.C. et al. The complete genome sequence of the gastric pathogen Helicobacter pylori. Nature. 1997; 388: 539-547.
[32] Fraser C.M., Venter J.C. et al. Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi. Nature. 1997; 390: 580-586.
[33] Kunst F., Danchin A. et al. The complete genome sequence of the Gram-positive bacterium Bacillus subtilis. Nature. 1997; 390: 249-256.
[34] Deckert G., Swanson R. et al. The Aquifex aeolicus genome. Nature. 1998; in press.
[35] Kawarabayasi Y., Kikuchi H. et al. Complete genome sequences of Pyrococcus horikoshii. DNA Res. 1998; 5: in press.

Structure determination has now become almost routine [5: Lattman E.E. Protein crystallography for all. Proteins. 1994; 18 (94211752): 103-106]. Currently, as many structures are determined every ten days as in the first ten years of crystallography (Figure 2). Hitting on a novel fold, however, still resembles unearthing a nugget [1, 2]. Each novel fold can contribute towards understanding the functional details of entire protein families. How does the rate of determination of new structures compare to the rate of determination of sequence data? Considering the organisms for which the entire genome sequence is known (Figure 3), we have structural knowledge for one in every ten protein sequences.
Three projects have recently been initiated to solve structures systematically for all the proteins within an organism (http://www.mcs.anl.gov/home/gaasterl/sgreview.html): Haemophilus influenzae (J Moult, Centre for Advanced Research in Biotechnology [CARB], in collaboration with the Institute for Genomic Research [TIGR]); Pyrobaculum aerophilum (T Terwilliger, Los Alamos National Laboratory [LANL]; D Eisenberg and J Miller, University of California Los Angeles [UCLA]); and Methanococcus jannaschii (S-H Kim, Lawrence Berkeley National Laboratory [LBNL]).

With all this wealth of information, what are the objectives of structural genomics? Already we know approximately 500 [1] of the estimated 1000 protein folds [6: Finkelstein A.V., Ptitsyn O.B. Why do globular proteins fit the limited set of folding patterns? Prog. Biophys. Molec. Biol. 1987; 50: 171-190. 7: Chothia C. One thousand protein families for the molecular biologist. Nature. 1992; 357 (92301549): 543-544]. Thus, only about half the ‘blank spots’ in structure space remain to be filled (Figure 1a). Optimistic or not, the first objective for structural genomics will be to determine most water-soluble native folds.

Genomics can facilitate finding the blank spots. The recipe is simple: find proteins common to different organisms; exclude those with structural homologues (10%; Figure 3); exclude integral membrane proteins (20–30%; Figure 4); and exclude all proteins for which threading detects known folds (< 10%) [8: Fischer D., Eisenberg D. Assigning folds to the proteins encoded by the genome of Mycoplasma genitalium. Proc. Natl. Acad. Sci. USA. 1997; 94: 11929-11934. 9: Rost B., O'Donoghue S.I. Sisyphus and prediction of protein structure. CABIOS. 1997; 13: 345-356]. Arriving at the final list requires a large repository of sequences and some skills in bioinformatics. The mass of sequences yielded by genomics will help surmount essential problems in structure determination (e.g. protein expression, purification, and, for crystallography, growth of crystals). For each blank spot candidate, research groups can select the homologue in their favourite organism (e.g. from thermophilic bacteria, where proteins have the advantage of remaining stable at high temperatures).

How likely is a structure, thus selected, to have a novel fold? Today, the specific goal to find novel folds is not driving structure determination. Nevertheless, 10–30% of the structures added to the Protein Data Bank (PDB) constitute novel folds [1]. A large-scale structure determination enterprise could easily yield 2000 (additional) new structures annually. Thus, we shall have at least one representative structure for every fold in less than a decade (assuming an initial 10% yield of novel folds, this yield then decaying exponentially).

Most pairs of similar structures have < 15% pairwise sequence identity (Figure 5). Thus, filling all the blank spots does not yield all families populating the respective island (Figure 1). The enormous sequence variation within islands is often associated with functional divergence (or convergence). In order to use structure determination to help further our understanding of protein function, the next goal of structural genomics will be to determine structures for all sequence families (and preferably for more than one representative per family). How many structures would it take to fill the map with such detail?
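The fold-discovery estimate given earlier in this passage (2000 new structures a year, an initial 10% yield of novel folds, the yield then decaying exponentially) can be made concrete with a rough sketch. The text does not state how fast the yield decays, so the `yearly_retention` parameter below is purely an assumption introduced for illustration, as are the function name and the figure of 500 remaining folds (taken as the estimated 1000 folds minus the roughly 500 already known).

```python
from typing import Optional


def years_to_fill(remaining_folds: float,
                  structures_per_year: float,
                  initial_yield: float,
                  yearly_retention: float,
                  max_years: int = 50) -> Optional[int]:
    """Return the first year in which cumulative novel folds reach the target.

    yearly_retention is an ASSUMED parameter (not given in the text): the
    fraction of the previous year's novel-fold yield that carries over to the
    next year, i.e. a discrete form of exponential decay.
    Returns None if the target is not reached within max_years.
    """
    found = 0.0
    current_yield = initial_yield
    for year in range(1, max_years + 1):
        # Novel folds contributed by this year's structures.
        found += structures_per_year * current_yield
        if found >= remaining_folds:
            return year
        # The yield shrinks as fold space fills up.
        current_yield *= yearly_retention
    return None


REMAINING_FOLDS = 1000 - 500  # estimated folds minus folds already known

for retention in (1.0, 0.8, 0.5):
    print(retention, years_to_fill(REMAINING_FOLDS, 2000, 0.10, retention))
```

Under these assumptions the "less than a decade" figure is not very sensitive to the decay rate: it holds as long as the yield falls by less than about 40% from one year to the next, whereas a yield that halves every year would never reach the target.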
Currently, the structures are known for 1145 proteins of unique sequence (the set used in Figure 5) [10: Holm L., Sander C. Touring protein fold space with DALI/FSSP. Nucleic Acids Res. 1998; 26: 318-321], and these represent about 10% of known genomes (Figure 3). Thus, about 10,000 additional structures are required to provide one structure per sequence family. This second phase, however, yields 100% coverage in a large-scale structure determination project (the recipe described above only selects candidates representing a single sequence family). Thus, assuming a moderate production of 2000 structures annually, approximately twofold coverage should be obtained within a decade.

Knowing all the protein sequences for entire organisms, we can start to map them to pathways (e.g. metabolic, regulatory, signalling, pathogenic), or particular mechanisms (e.g. expression, transcription, replication, recombination) [11: Gaasterland T., Sensen C.W. Fully automated genome analysis that reflects user needs and preferences – a detailed introduction to the MAGPIE system architecture. Biochimie. 1996; 78 (97061102): 302-310]. Suppose we miss one (or a couple) of the proteins essential for a particular pathway. Can we conclude that this pathway is missing in the organism, or should we try harder to find it? The answer provides the second objective for structural genomics: to find functionally missing links (Figure 1b). Initially this objective will aim to determine the missing structures for all major pathways and mechanisms. The first step, to complete the structural knowledge for all pathways and mechanisms for which we know the associated proteins, is straightforward. The second step, however, appears hopeless: how can we determine structures for unknown proteins? In reality this may not prove to be as difficult as it would at first seem.
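Returning to the coverage estimate earlier in this passage, the quoted figures can be checked with simple arithmetic. The sketch below assumes that the 1145 known unique structures correspond to exactly 10% of all sequence families, and that every structure produced in the second phase falls into a family not yet covered; the variable names are illustrative only.

```python
# Figures quoted in the text.
known_structures = 1145      # proteins of unique sequence with known structure
known_fraction = 0.10        # "about 10%" of known genomes (taken as exact here)
structures_per_year = 2000   # assumed output of a large-scale enterprise
years = 10                   # "within a decade"

# If 1145 structures are 10% of all families, the total follows directly.
total_families = known_structures / known_fraction
additional_needed = total_families - known_structures

# Structures produced in a decade, relative to what is still missing.
produced = structures_per_year * years
coverage = produced / additional_needed

print(f"estimated sequence families: {total_families:.0f}")
print(f"additional structures needed: {additional_needed:.0f}")
print(f"coverage after {years} years: {coverage:.2f}-fold")
```

This gives roughly 11,450 families, roughly 10,300 structures still to be determined, and a little under twofold coverage after ten years, which matches the rounded figures in the text.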
It is likely that many of the candidates selected to find all the blank spots in structural space will turn out to be representatives of most major functional protein classes. In addition, in the course of large-scale structure determination, cross-links will be uncovered that complete the catalogue of proteins participating in certain functions (e.g. the corresponding mechanisms identified in fragile histidine triad protein (FHIT) and protein kinase C interacting protein (PKCI) and the implications of their structural similarity to galactose-1-phosphate uridylyl-transferase (GalT) [3]).

After determining structures for all major functional elements, we shall have to complete the functional map (i.e. it will be necessary to determine structures for representatives of all pathways and mechanisms). Candidates for those structures to be determined will be found by structure-based comparative genome analysis, focusing on particular sites (e.g. active sites and binding sites) or uncovering ‘motifs’ [1]. For example, the goal could be to find the scaffold containing the common features of all amino hydrolases [12: Brannigan J.A., Murzin A.G. et al. A protein catalytic framework with an N-terminal nucleophile is capable of self-activation. Nature. 1995; 378 (96077117): 416-419]. Furthermore, alternative pathways will be searched, as well as proteins with particular biochemical ‘fingerprints’ (the structures of such proteins will be crucial to correctly define the motifs).
Finally, unknown functions could be searched for specifically by classifying families of determined and homology-modelled structures into functional groups based on electrostatic properties [13: Blomberg N., Nilges M. Functional diversity of PH domains: an exhaustive modelling study. Fold. Des. 1997; 2: 343-355], or based on simple combinations of sequence alignment and structure analysis [14: Lichtarge O., Bourne H.R., Cohen F.E. An evolutionary trace method defines binding surfaces common to protein families. J. Mol. Biol. 1996; 257 (96180020): 342-358].

The major objectives of structural genomics have been portrayed here: to find all natural structures; and to find missing links in all functional pathways and mechanisms (Figure 1). These objectives correspond to two aspects of genome sequencing: the mass of sequences produced; and the entirety of sequencing complete genomes from organisms. In order to attain the objectives outlined here a large-scale structure determination enterprise is required. A prerequisite for understanding the function of a protein is to know its structure. Furthermore, large-scale structure determination will enable us to uncover most major functional elements. The scaffolds of structures provide the elements for evolution. Most functional motifs known today are sequence motifs. In the absence of structural data, however, most functional motifs remain hidden. Structural genomics will help us to further understand evolution, and will also provide the knowledge necessary to improve the techniques used in processes such as drug design and discovery. Finally, entities defined by refined structural [8; 15: Gerstein M., Levitt M. A structural census of the current population of protein sequences. Proc. Natl. Acad. Sci. USA. 1997; 94: 11911-11916] and functional features [1, 2] will permit a more elaborate comparison of organisms than sequence analysis.

This review has focused on the description of structural modules, or domains. Clearly, domains are not enough to understand function. Instead, we need to study functional complexes composed of many proteins. Although a large-scale structure determination enterprise may trigger the study of such complexes by uncovering their elements, a comprehensive exploration of functional systems will be the next step.

Humans have about 100,000 different proteins. If we knew all these sequences today, through a combination of structure determination and prediction we would already have structural knowledge for more than 10,000 of these (Figure 3). The sequence of the human genome, however, will not be completed before the year 2004. With 2000 new structures determined annually by a large-scale enterprise, we shall have structural knowledge for about 70% of all human sequences by the year 2004; many of the remaining 30,000 will be membrane proteins (Figure 4).

The mass of sequences produced by genomics should enable most natural folds to be determined within less than a decade. Is this wishful thinking? Firstly, the strongly disputed assumption that there are only 1000 folds is not crucial.
Instead, the upper limit for the number is provided by the number of sequence families, and the estimate that there are 10,000–15,000 families is rather conservative (to date 1200 sequence families are known, corresponding to 8–18% of all families; Figure 3). To determine one structure for each family is just a matter of a large-scale structure determination enterprise. Secondly, 2000 structures were added to the PDB in 1997, and structure determination techniques continue to improve. Thus, the assumption that 2000 new structures will be determined annually is a rather conservative estimate. What remains is the uncertainty as to how difficult the unknown folds will be to determine. Here we can only be guided by past experience, which shows that most structure determination problems can be solved, eventually. Of course there is no easy answer; we just have to try.

Supplementary material available with the internet version of this paper contains a diagram showing the percentage of protein structures annually deposited in the PDB from a particular organism.

Thanks to Alfonso Valencia (CNB Madrid), John Moult (CARB Washington) and Alexei Murzin (MRC Cambridge) for discussions; to Sean O'Donoghue (EMBL Heidelberg) and Terry Gaasterland (University of Chicago/Argonne) for proof-reading and discussions; to the GeneQuiz consortium (Miguel Andrade, Nigel Brown, Christophe Leroy and Chris Sander, EBI Hinxton) for permission to use their unpublished data for Figure 3; and to Chris Sander (Millennium Boston) and Matti Saraste (EMBL Heidelberg) for financial support.

B Rost, European Molecular Biology Laboratory, 69012 Heidelberg, Germany. e-mail: [email protected]
