Defining the Mandate of Proteomics in the Post-Genomics Era: Workshop Report
2002; Elsevier BV; Volume: 1; Issue: 10; Language: English
DOI: 10.1016/s1535-9476(20)34374-7
ISSN: 1535-9484
Authors: George L. Kenyon, David M. DeMarini, Elaine Fuchs, David J. Galas, Jack F. Kirsch, Thomas S. Leyh, Walter H. Moos, Gregory A. Petsko, Dagmar Ringe, Gerald M. Rubin, Laura C. Sheahan
Topic(s): Genetics, Bioinformatics, and Biomedical Research
Abstract: Research in proteomics is the next step after genomics in understanding life processes at the molecular level. In the largest sense proteomics encompasses knowledge of the structure, function and expression of all proteins in the biochemical or biological contexts of all organisms. Since that is an impossible goal to achieve, at least in our lifetimes, it is appropriate to set more realistic, achievable goals for the field. Up to now, primarily for reasons of feasibility, scientists have tended to concentrate on accumulating information about the nature of proteins and their absolute and relative levels of expression in cells (the primary tools for this have been 2D gel electrophoresis and mass spectrometry). Although these data have been useful and will continue to be so, the information inherent in the broader definition of proteomics must also be obtained if the true promise of the growing field is to be realized. Acquiring this knowledge is the challenge for researchers in proteomics, and the means to support these endeavors need to be provided. An attempt has been made to present the major issues confronting the field of proteomics, and two clear messages come through in this report. The first is that the mandate of proteomics is and should be much broader than is frequently recognized. The second is that proteomics is much more complicated than sequencing genomes. This will require new technologies, but it is highly likely that many of these will be developed. Looking back 10 to 20 years from now, the question is: Will we have done the job wisely or wastefully? This report summarizes the presentations made at a symposium at the National Academy of Sciences on February 25, 2002.
Due to the rising interest in proteomics research worldwide, a symposium entitled “Defining the Mandate of Proteomics in the Post-Genomics Era” was held at the National Academy of Sciences on February 25, 2002, in Washington, D.C. Most of the attendees were invited because of their strong interest in proteomics, proteins, or drug discovery. They came from industry (both large and small companies), academia, and government. Most were from the United States, but an effort was made to invite people from abroad: four of the 10 speakers came from outside the United States, and six young scientists from around the world received travel fellowships to attend the meeting. The attendees heard about recent advances in the field that will greatly accelerate the accumulation and interpretation of much of the additional data and information that are needed. The planning committee selected the speakers (see Table I) and designed the symposium in the hope that one outcome of the meeting would be to help set the field on as wise a path as possible for the future. After the presentations, attendees took part in breakout sessions on a variety of topics, including
•protein separation and identification
•protein structure and function
•metabolic pathways and post-translational modifications
•implementation: necessary policy and infrastructure conditions for collaboration
•platforms: emerging technologies
•computational methods and bioinformatics
•clinical proteomics

Table I. Symposium speakers and affiliations
Ruedi Aebersold, Institute for Systems Biology, Seattle, WA
Cheryl Arrowsmith, University of Toronto, Canada
Marvin Cassman, NIGMS, National Institutes of Health, Bethesda, MD
Julio Celis, Institute of Cancer Biology and Danish Center for Human Genome Research, Copenhagen, Denmark
Brian Chait, Rockefeller University, New York, NY
Francis Collins, NHGRI, National Institutes of Health, Bethesda, MD
Denis Hochstrasser, University of Geneva, Geneva University Hospital, Switzerland
Joshua LaBaer, Harvard Medical School, Boston, MA
Scott Patterson, Celera Genomics Corp., Rockville, MD
John E. Walker, Medical Research Council, Cambridge, UK

The thoughts and ideas of the speakers and those expressed in the breakout sessions were captured by recorders to assist in the preparation of this report. While other organizations and meetings have addressed many of the issues facing proteomics, we hope that participants and readers of this report will look back on this meeting as the field progresses and find that it helped define the current efforts and applications, as well as give direction to the advancing state of the art. Now that the DNA sequences of the human genome and of the genomes of dozens of other organisms are essentially known, the biomedical and biological communities are placing increased emphasis on proteomics, the study of the proteins that are the gene products. Proteomics, a word derived from “protein” and “genomics,” needs further definition, as do proteomics initiatives, especially since many in the scientific community are asking for a human proteome project. Historically, one can point back to meetings and articles from over 20 years ago, when scientists began to think about mapping the entire set of human proteins (see, for example, B. F. C. Clark, “Towards a Total Human Protein Map” (1. Clark, B. F. Towards a total human protein map. Nature 1981; 292: 491-492)).
Indeed, Congress was considering a project called the “Human Protein Index” long before the Human Genome Project had been conceived. The Human Protein Index project was developed in the late 1970s by Norman G. Anderson and N. Leigh Anderson at the Department of Energy's Argonne National Laboratory (2. Anderson, N. G., and Anderson, N. L. Behring Inst. Mitt. 1979; 63: 169-210). Its objective was to enumerate the human proteins (what would now be called the human proteome) by separation on 2-D gels and thus to define their genes from the protein end, the only approach available in those days, before large-scale DNA sequencing was possible. But this effort was perhaps ahead of its time, given the lack of suitable technologies and shifting political sands. Instead, the rise of genomics took center stage. Marc Wilkins, then a young Australian researcher, is often credited with coining the term “proteomics” in 1994 (3. http://www.signalsmag.com, November 2, 1999), at a time when only one proteomics company existed (Large Scale Biology Corporation). Today many proteomics initiatives are under way in industry and elsewhere, such as the Human Proteomics Initiative (HPI), an effort begun in 2000 by the Swiss Institute of Bioinformatics and the European Bioinformatics Institute. The goal of the HPI is to annotate each known protein, providing information that includes a description of protein function, domain structure, subcellular location, post-translational modifications, splice variants, and similarities to other mammalian proteins (4. http://us.expasy.org/sprot/hpi/). Another major proteomics effort is led by the Human Proteome Organization (HUPO), a worldwide organization that engages in scientific and educational activities to encourage the spread of proteomics technologies and to disseminate knowledge pertaining to the human proteome and that of model organisms (5. http://www.hupo.org/).

On which goals should these national and international efforts focus? Should they be limited to human proteomics or, like the Human Genome Project, include key model organisms? Perhaps the proteomes of human pathogens should be included as well (e.g., the malaria parasite and other infectious microorganisms), and if so, in what order of priority? Should the development of more efficient instrumentation (e.g., mass spectrometers, X-ray diffractometers, nuclear magnetic resonance spectrometers) and improved computational methodologies (e.g., high-speed computers and software useful in bioinformatics) be emphasized? What should be the role of the major federal funding agencies (e.g., the National Institutes of Health, the National Science Foundation, the U.S. Environmental Protection Agency, and the U.S. Department of Agriculture)? What should be the role of academic laboratories? Should projects be supported mostly by individual research grants or by program project (group effort) grants? What should be the role of the private sector, particularly those companies, large and small, that have a major stake in exploiting the results of the various genome projects and proteomics initiatives? How can all of these stakeholders cooperate most effectively while still maintaining proprietary information where appropriate? Should the overall goal be to understand the structure and function of all known proteins, or should only those known to be involved in diseases be emphasized?
After all, one must first understand function if one is to fully understand dysfunction. Is enough emphasis being given to the functional aspects of proteomics? Are studies on post-translational modifications of proteins, and their subsequent functional consequences, included in “proteomics”? Hence the interest in organizing the one-day symposium reported herein.

Beginning with a definition of the term “proteomics,” Marvin Cassman, former director of the National Institute of General Medical Sciences and now at the University of California, San Francisco and the Institute for Quantitative Biomedical Research, was one of many speakers to offer an opinion on this subject, and it was clear that proteomics means many (or at least different) things to different people. Some definitions include “high-throughput” and some do not. Obviously proteomics is not merely protein chemistry. Symposium chair and Dean of the University of Michigan College of Pharmacy George Kenyon commented, “Proteomics is not just a mass spectrum of a spot on a gel.” Perhaps the most useful definition of proteomics for our purposes is the broadest: proteomics represents the effort to establish the identities, quantities, structures, and biochemical and cellular functions of all proteins in an organism, organ, or organelle, and how these properties vary in space, time, or physiological state. Somewhat more limited operational definitions of proteomics were offered by some of the speakers. “In one sense it makes no difference at all: why should you call something proteomics or call it something else?” Dr. Cassman continued, “What we call things often conditions how we organize our thinking and our efforts.” He explained that genome-driven target selection coupled to high-throughput technologies is what he believes structural genomics means. “It means you are using the genomes as the primary source for target selection.” Structural proteomics, however, uses these features “plus the additive feature of full coverage of protein space, that is, completeness,” stated Dr. Cassman. The goal of completeness is not meant to suggest, however, that smaller scale experiments, including high-throughput analysis of specific tissues or subsets of proteins, would not be considered part of proteomics. Of course there are many “-omics” alongside proteomics, including genomics, metabolomics, transcriptomics, interactomics, and so on, which all bear on the mandate of defining proteomics. However, we will refrain from commenting on the other “-omics.” Functional genomics and functional proteomics (which can encompass other “-omics,” as mentioned) are closely juxtaposed on a continuum along the path of discovering the detailed secrets of life and life processes.
The general topics covered at the symposium included
•Perspectives (including a genomics perspective and the relationship of the proteome to the genome)
•Source of proteins (including, among other things, organism, sample storage, etc.)
•Protein separation (including subcellular purification)
•Protein identification (largely mass spectrometry)
•Protein function (including localization, protein:protein interactions, structure determination, structure-function relationships, and post-translational modifications)
•Applications (including drug discovery and diagnostics)
•Informatics (including homology modeling, databases, analysis software, and standardization)
•Other topics (including international collaboration, ethical considerations, and collaboratories*)

*Collaboratories are distributed research centers in which scientists in two or more locations are able to work together with the assistance of various forms of communications and collaborative technologies.

Dr. Cassman defined proteomics as a set of related options: “the analysis of complete complements of proteins present in defined cell or tissue environments (i.e., context-dependent) and their variation in space and time” (with credit given to Stan Fields for his contributions to this definition). One example of a proteomic effort is the Protein Structure Initiative of the National Institute of General Medical Sciences (NIGMS), which has as its goal the generation of the complete complement of protein structures in nature through a combination of direct structure determination and homology modeling. Although it requires high-throughput technology and uses genomic data for target determination, the goal of “completeness” is what distinguishes the effort as proteomics, according to Dr. Cassman. The second part of his definition is exemplified by the use of microarrays to identify characteristic markers for cancer progression in specific tissue samples. These studies involve image and pattern recognition tools, which yield large-scale visualization of specific cell-dependent, context-dependent proteomic outputs (a simple illustrative sketch of such marker ranking is given at the end of this passage). The third part of the definition involves examining proteomic outputs in time and space. This requires not only the application of bioinformatics tools but also computational biology, that is, the use of modeling and simulation. Complex systems analysis could be considered an important element in the larger picture of defining a proteome, and such analysis will require theoretical modeling of systems. Several examples of NIGMS initiatives that focus on mathematical modeling of complex biological systems were provided. One example of this is the Protein Structure Initiative, or structural genomics as some call it, which is discussed later in this report.

While we may be far from defining a complete human proteome, approaching proteomics on an organellar basis provides goals that are perhaps achievable in our lifetimes. Recall that the first DNA genomes sequenced, in the 1970s, were those of bacteriophages, followed in 1981 by the sequencing of the human mitochondrial genome. Consider also that the mitochondrion, which is estimated to comprise about 2,000 proteins, presents a considerably more manageable problem and a microcosm of whole-cell proteomics.
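As a concrete, if greatly simplified, illustration of the kind of marker identification just described, the sketch below ranks candidate marker proteins by comparing abundances between two groups of tissue samples. It is not any speaker's actual pipeline: the protein names, intensity values, and the choice of log2 fold change plus a Welch t statistic are assumptions made purely for illustration.

```python
import math

# Toy abundance table: protein -> (intensities in normal samples, intensities in tumor samples).
# All values are invented; real inputs would come from 2-D gels or array-based measurements.
abundance = {
    "protein_A": ([10.2, 9.8, 10.5, 10.1], [20.4, 19.7, 21.0, 20.2]),
    "protein_B": ([15.0, 14.6, 15.3, 14.9], [15.2, 14.8, 15.1, 15.0]),
    "protein_C": ([5.1, 4.9, 5.3, 5.0], [2.4, 2.6, 2.5, 2.3]),
}

def welch_t(xs, ys):
    """Welch's t statistic for two independent samples."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    return (my - mx) / math.sqrt(vx / nx + vy / ny)

def rank_markers(data):
    """Rank proteins by the magnitude of their log2 fold change (tumor vs. normal)."""
    scored = []
    for name, (normal, tumor) in data.items():
        fold = math.log2((sum(tumor) / len(tumor)) / (sum(normal) / len(normal)))
        scored.append((name, fold, welch_t(normal, tumor)))
    return sorted(scored, key=lambda rec: abs(rec[1]), reverse=True)

if __name__ == "__main__":
    for name, fold, t in rank_markers(abundance):
        print(f"{name}: log2 fold change = {fold:+.2f}, t = {t:+.1f}")
```

Real studies would add normalization, multiple-testing correction, and validation on independent samples; the point here is only the shape of the computation that turns a context-dependent expression matrix into a ranked list of candidate markers.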
With this organellar focus in mind, Nobel laureate Sir John Walker, head of the Dunn Medical Research Council Unit in Cambridge, UK, discussed his proteomic studies of mitochondria directed at resolving specific biological issues. Dr. Walker's work includes the definition of the protein complement assembled in the respiratory enzyme known as complex I, the identification of the biochemical functions of a family of transport proteins found only in mitochondria, and the discovery of phosphorylation-dephosphorylation pathways in mitochondria. These studies rely not only on mass spectrometric and bioinformatics tools but also on biochemistry and genetics. In Dr. Walker's view, such an integrated approach is proving more rewarding, both for understanding the biology of mitochondria and for the technical development of new methods, than attempts to analyze the global complement of proteins in the organelle. It is also possible to focus on subcompartments of mitochondria, such as the inner mitochondrial membrane of so much interest to bioenergeticists.

In this report we have tried to avoid being constrained by a narrow definition of proteomics (e.g., merely quantitating protein levels) and have used the broad definition given earlier to allow a wide-ranging discussion of goals, techniques, opportunities, and challenges.

Francis Collins, director of the National Human Genome Research Institute, spoke about lessons learned from the Human Genome Project that might be applicable to the discussion of a public large-scale proteomics initiative (see Table II). He began his presentation by taking issue with the term “post-genomics era.” He queried whether this means that from the beginning of the universe until 2001 we were in the “pre-genome era,” and then suddenly, “bang,” we moved into the post-genome era (leading one to wonder what happened to the genome era). He suggested that it was presumptuous to say that the Human Genome Project is already behind us. He pointed out that proteomics is a subset of genomics, and genomics is more than sequencing genomes, which will be ongoing for decades to come. His comments are especially relevant given that the human genome was still only about 69 percent complete at the time of the meeting.

Table II. Lessons learned from the Human Genome Project: Comments from Francis Collins
•A high-level planning process with broad input from the scientific community is crucial to setting ambitious but achievable and realistic goals.
•A focus on completeness is important, even though this is extremely difficult when dealing with proteins. This is what distinguishes proteomics from the study of individual proteins, or from the fields of biochemistry and physiology. Without completeness as a goal of proteomics, much of the same research would be duplicated at a later time.
•Technology must be developed and validated before attempting to scale up. Technology development includes the range of activities from proof of principle, to pilot projects, to scaling up, to high throughput. The Human Genome Project sequenced model organisms and generated the necessary infrastructure prior to actually sequencing the human genome, which did not start until six years into the project and was initiated first with pilot projects.
•Public availability of data and resources is absolutely critical if the benefits to the scientific community are going to be realized. The rapid release of pre-publication data was a key to the success of the Human Genome Project.
•Interdisciplinary research needs to be fostered, including the participation of experts in automation, chemistry, and bioinformatics.
•International participation and coordination are essential to bring the best minds to the problem, to avoid duplication, and to share costs.
•Centralized databases that allow for integration and visualization of the data are an essential resource and are needed to transfer all these data into the hands of those who want to use them. They are expensive and need to be nurtured.
•Public-private partnerships should be sought whenever feasible, especially for the generation of pre-competitive data sets. (Successful examples include the single nucleotide polymorphism consortium and mouse genomic sequencing.) Characteristics of successful public-private partnerships include a compelling scientific opportunity, pre-competitive data sets, simultaneous availability of data to all users, production facilities already in place, firm milestones and deliverables, affordability, and well-defined endpoints.

Dr. Collins concurred with other participants in delivering the sobering message that a large-scale proteomics effort is orders of magnitude more complicated and difficult than the sequencing of the human genome. (As if 100 trillion cells making up an organism and billions of base pairs in genomes were not enough complexity already!) The concept of a complete dataset of all human proteins is therefore very difficult to imagine. The many challenges include
•the wide dynamic range of protein expression
•protein modifications
•the physical handling of proteins, which is more difficult than working with nucleic acids
•the need for multiple technologies, many of which are not optimized or even invented
•the fact that, unlike DNA data, protein data are more analog than digital, making data integration and analysis very challenging
•intellectual property rights and claims

Dr. Collins said that the most important area for investment in proteomics right now is technology development, so that these methods can be moved toward tackling a mammalian proteome without enormous costs and problems with data quality. A number of resources for genomics research continue to be generated that may help inform a proteomics effort, including multiple coverage of certain genomes. More specifically:
•Multiple genomic sequences from mouse (6x coverage), rat (3x coverage), puffer fish, zebrafish, a sea squirt, and close relatives of C. elegans (10x coverage) and D. melanogaster will be forthcoming. Comparative genomics will be helpful in understanding gene models and gene function.
•Full-length human cDNA sequencing efforts are ongoing in Germany and Japan.
•Full-length cDNAs for human and mouse are being generated through the National Institutes of Health (NIH) Mammalian Gene Collection (6. http://mgc.nci.nih.gov/).

Multiple NIH institutes plan to support a central database of protein sequence and function through a new initiative (7. http://grants.nih.gov/grants/guide/rfa-files/RFA-HG-02-001.html). Dr. Collins referred to one publication, “Global Analysis of Protein Activities Using Proteome Chips” (8. Zhu, H., Bilgin, M., Bangham, R., Hall, D., Casamayor, A., Bertone, P., Lan, N., Jansen, R., Bidlingmaier, S., Houfek, T., Mitchell, T., Miller, P., Dean, R. A., Gerstein, M., and Snyder, M. Global analysis of protein activities using proteome chips. Science 2001; 293: 2101-2105).
He finished his presentation with a particular recommendation, not from a scientist but from a famous athlete, the hockey star Wayne Gretzky. When asked how it was that he was so good at playing hockey, and why he always seemed to score the key goals, Gretzky said, “It is very simple. You have got to skate where the puck is going to be.” In the field of proteomics, Dr. Collins said, he was not sure exactly where the puck was going to be, but there were a lot of “Wayne Gretzkys” at the meeting, and he was glad to get a chance to listen to them.

By definition, any proteomics effort aims at “completeness” of information. This part of the symposium addressed primarily the comprehensiveness or completeness of any assembled library of proteins and the quality of the materials. It was noted that protein expression in a given cell varies from none to abundant. Historically, for practical reasons, the abundant proteins have been investigated most extensively; however, some of the rarely expressed proteins, and proteins that appear only in disease states, may be among the more interesting. Joshua LaBaer, Harvard Medical School, noted that the function of any protein can be studied regardless of its in vivo level once a copy of the gene and adequate expression vectors are available. Ideally it would be desirable to have a repository or library containing one clone for every spliced variant in the proteome. The size of that library will not be known for some time, but an intermediate, realizable objective would be a repository consisting of one clone for every gene. These clones should be “expression ready”; that is, they should contain only the cDNA from the initiation codon through the stop codon (a toy illustration of extracting such a reading frame is sketched below). It seems likely that we should have “some idea of all the different cDNAs” in the genome in the near future. The expressed proteins could be studied functionally and often identified by mass spectrometry. In general it is fairly easy to produce large quantities of proteins in insect cells or bacteria, but in certain cases it may be necessary to express them in their native cells in order to address such problems as localization or post-translational modifications.

Dr. LaBaer compared the complexities of studying mammalian systems with those of yeast. There are approximately 6,000 genes in yeast compared with a much larger number in humans. Moreover, the genome in yeast is relatively simple; for example, there are only about 220 intron-containing genes in yeast, whereas a much larger fraction of mammalian genes contain introns, and alternative splicing substantially increases the number of expressed proteins. To this end Dr. LaBaer described the FLEX Gene repository (“FLEX” stands for Full-Length EXpression-ready), which is currently being assembled by a consortium of about 20 different public and private research laboratories. This repository will enable scientists to move several genes simultaneously from the master vector to any expression vector, which will allow researchers to screen for function by high-throughput experimentation. It is the intention of this consortium to make this collection of all human genes broadly available without restrictions on its use. The four self-defined objectives of the consortium are (1) identification of the genes, (2) assembly of clones, (3) sequence validation, and (4) distribution to the scientific community.
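To make the notion of an “expression-ready” clone concrete, here is a minimal, purely illustrative sketch of pulling the open reading frame (first ATG through the first in-frame stop codon) out of a cDNA sequence. The toy sequence, the function name, and the simple first-ATG heuristic are assumptions made for illustration only; real clone construction must also contend with splice variants, initiation context, epitope tags, and vector-specific recombination sites.

```python
# Illustrative sketch only (not the consortium's actual procedure): extract the
# span an "expression-ready" clone would carry, i.e. the open reading frame from
# the first ATG through the first in-frame stop codon.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def expression_ready_orf(cdna: str) -> str:
    """Return the subsequence from the first ATG through the first in-frame stop codon."""
    seq = cdna.upper()
    start = seq.find("ATG")
    if start == -1:
        raise ValueError("no initiation codon found")
    for pos in range(start, len(seq) - 2, 3):
        if seq[pos:pos + 3] in STOP_CODONS:
            return seq[start:pos + 3]
    raise ValueError("no in-frame stop codon found")

if __name__ == "__main__":
    toy_cdna = "ggcacgATGGCTGCAAAGGTTTTCTAAttcgga"  # invented example sequence
    print(expression_ready_orf(toy_cdna))  # prints ATGGCTGCAAAGGTTTTCTAA
```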
One example of the success of the consortium's effort is the identification of two new genes that are likely involved in the migration of breast cancer cells through a membrane. The collaboration of public and private research groups raises certain legal issues, which include consideration of antitrust law. Recombination-based cloning was presented as a high-throughput technology that enables the ready transfer of cDNAs from the supplied vector to one's own preferred expression vector. Dr. LaBaer also described a protein purification scheme developed by a graduate student in his laboratory, Pascal Braun. “In the case of human proteins,” Dr. LaBaer explained, “where it is not easy to produce these proteins in human cells, [the availability of large numbers of purified proteins] will require the use of heterologous [expression] systems such as bacteria.” “To develop these methods,” continued Dr. LaBaer, “Braun transferred a collection of 30 cancer genes into four different expression vectors, each one adding a different epitope tag. [Braun] then developed a two-hour automated protocol for purifying 96 proteins in parallel [and] has now purified over 330 different proteins using this approach.” Braun and Yanhui Hu of the lab created a database that correlates the success of purification with various features of the proteins, such as pI, GO annotation, subcellular localization, and domain structure. Dr. LaBaer said they found that the presence of certain domains, such as SH2 or SH3 domains, can predict success in purification.

Dr. LaBaer concluded with a description of a database derived from a computer program that searches the primary literature for abstracts that mention both a gene and a disease. The assumption is that a significant number of such co-occurrences may identify groups of genes associated with a given disease. This effort was presented as a work in progress, and interested scientists were invited to experiment with the database (9. http://hipseq.med.harvard.edu/MEDGENE/).

Brian T. Chait of Rockefeller University described a proteomics approach to understanding cellular function. His group is interested in the mechanisms by which materials enter and exit the nucleus, in the isolation of multiprotein complexes, and in the determination of their cellular localization. The basic concept is to introduce a particular affinity tag onto one of the proteins at its natural location in the chromosome, which is done by replacing the endogenous gene with a gene that will code for a protein with a tag on it, or, as he termed it, “a piece of molecular Velcro.” So long as the mu