MitoMiner, an Integrated Database for the Storage and Analysis of Mitochondrial Proteomics Data
2009; Elsevier BV; Volume: 8; Issue: 6; Language: English
10.1074/mcp.m800373-mcp200
ISSN 1535-9484
Authors: Anthony C. Smith, Alan J. Robinson
Topic(s): Advanced Proteomics Techniques and Applications
Abstract: Mitochondria are a vital component of eukaryotic cells with functions that extend beyond energy production to include metabolism, signaling, cell growth, and apoptosis. Their dysfunction is implicated in a large number of metabolic, degenerative, and age-related human diseases. Therefore, it is important to characterize and understand the mitochondrion. Many experiments have attempted to define the mitochondrial proteome, resulting in large and complex data sets that are difficult to analyze. To address this, we developed a new public resource for the storage and investigation of these mitochondrial proteomics data, called MitoMiner, that uses a model to describe the proteomics data and associated biological information. The proteomics data of 33 publications from both mass spectrometry and green fluorescent protein tagging experiments were imported and integrated with protein annotation from UniProt and genome projects, metabolic pathway data from Kyoto Encyclopedia of Genes and Genomes, homology relationships from HomoloGene, and disease information from Online Mendelian Inheritance in Man. We demonstrate the strengths of MitoMiner by investigating these data sets and show that the number of different mitochondrial proteins that have been reported is about 3700, although the number of proteins common to both animals and yeast is about 1400, and membrane proteins appear to be underrepresented. Furthermore, analysis indicated that enzymes of some cytosolic metabolic pathways are regularly detected in mitochondrial proteomics experiments, suggesting that they are associated with the outside of the outer mitochondrial membrane. The data and advanced capabilities of MitoMiner provide a framework for further mitochondrial analysis and future systems-level modeling of mitochondrial physiology.

Mitochondria have a varied and critical role in many aspects of eukaryotic metabolism and are implicated in a large number of metabolic, degenerative, and age-related human diseases, including cancer and aging itself (1. DiMauro S. Schon E.A. Mitochondrial respiratory-chain diseases. N. Engl. J. Med. 2003; 348: 2656-2668; 2. Trifunovic A. Wredenberg A. Falkenberg M. Spelbrink J.N. Rovio A.T. Bruder C.E. Bohlooly Y.M. Gidlof S. Oldfors A. Wibom R. Tornell J. Jacobs H.T. Larsson N.G. Premature ageing in mice expressing defective mitochondrial DNA polymerase. Nature. 2004; 429: 417-423; 3. Fliss M.S. Usadel H. Caballero O.L. Wu L. Buta M.R. Eleff S.M. Jen J. Sidransky D. Facile detection of mitochondrial DNA mutations in tumors and bodily fluids. Science. 2000; 287: 2017-2019; 4. Wallace D.C. A mitochondrial paradigm of metabolic and degenerative diseases, aging, and cancer: a dawn for evolutionary medicine. Annu. Rev. Genet. 2005; 39: 359-407). About 1500 different proteins are estimated to be present in the mammalian mitochondrion (5. Taylor S.W. Fahy E. Ghosh S.S. Global organellar proteomics. Trends Biotechnol. 2003; 21: 82-88), and many of these proteins are tissue and development state-specific (6. Pagliarini D.J. Calvo S.E. Chang B. Sheth S.A. Vafai S.B. Ong S.E. Walford G.A. Sugiana C. Boneh A. Chen W.K. Hill D.E. Vidal M. Evans J.G. Thorburn D.R. Carr S.A. Mootha V.K. A mitochondrial protein compendium elucidates complex I disease biology. Cell. 2008; 134: 112-123), but despite intense interest in this organelle, the mitochondrial proteome has yet to be fully defined and characterized. Efforts to identify mitochondrial proteins and their post-translational modifications (7. Carroll J. Fearnley I.M. Skehel J.M. Runswick M.J. Shannon R.J. Hirst J. Walker J.E. The post-translational modifications of the nuclear encoded subunits of complex I from bovine heart mitochondria. Mol. Cell. Proteomics. 2005; 4: 693-699; 8. Carroll J. Fearnley I.M. Walker J.E. Definition of the mitochondrial proteome by measurement of molecular masses of membrane proteins. Proc. Natl. Acad. Sci. U. S. A. 2006; 103: 16170-16175), ranging from proteomics studies of purified mitochondrial organelles to in-depth analyses of protein complexes, have resulted in the publication of various data sets. The number, size, and complexity of these data sets, coupled with a lack of common standards for proteomics data, are a major challenge to their use and integration with resources such as the public protein databases. However, understanding the mitochondrial proteome and modeling mitochondrial physiology and molecular pathology at a systems level requires a fully defined and searchable catalog of mitochondrial proteins that is cross-referenced with relevant data.

Ten Web-accessible resources are currently available that store data on the mitochondrial proteome (Table I). Among these, there is a large variation in the number of data sets included, the way the data are stored, and the sophistication of the query interface. Each resource has its own strengths and weaknesses, but some limitations are common. First, many do not appear to be actively maintained. Although their experimental data remain valid, they have been integrated with information from public databases that is subject to revision, which undermines confidence in the resource. This emphasizes that even small resources can become difficult to maintain without careful design. Second, many resources are limited to a single species or have no protein homology data, which hinders cross-species comparisons and the use of orthology to annotate related proteins. Third, many resources do not cite experimental references for individual proteins. Yet provenance is needed to assess whether a protein has been identified correctly as mitochondrial.
Fourth, the sophistication of the query interfaces varies considerably. For some, the data are presented as a text file with queries limited to a single identifier, whereas others use relational databases, which allow more searchable fields and constraints on attributes. A few resources have query interfaces with multiple options and constraints that are combined to build complex queries. However, their flexibility and ease of use could be improved.

Table I. Species and evidence types cataloged in public databases reporting the mitochondrial localization of proteins

                  Species (a)
Database          Hs  Mm  Rn  Dm  Ce  Nc  Sc  At   Evidence (b)  Ref.
MitoP2            +   +   −   −   −   +   +   −    M, G, A       61
MiGenes           +   +   +   +   +   −   +   −    A             62
MitoRes (c)       +   +   +   +   +   −   −   −    A             63
MitoProteome      +   −   −   −   −   −   −   −    M, A          64
HMPDb             +   −   −   −   −   −   −   −    A
AMPDb             −   −   −   −   −   −   −   +    M, A          65
AMPP              −   −   −   −   −   −   −   +    M             66
ORMD              −   +   −   −   −   −   −   −    M, G          29
YMP               −   −   −   −   −   −   +   −    M             38
YDPM              −   −   −   −   −   −   +   −    M             40
MitoMiner         +   +   +   +   +   −   +   −    M, G, A

(a) Species: Hs, H. sapiens; Mm, M. musculus; Rn, R. norvegicus; Dm, D. melanogaster; Ce, Caenorhabditis elegans; Nc, Neurospora crassa; Sc, S. cerevisiae; and At, A. thaliana.
(b) Evidence type reported for mitochondrial protein localization: identification from mass spectrometry of purified mitochondria (M), localization from GFP tagging (G), or curated annotation from public databases and literature (A).
(c) MitoRes includes metazoan species from UniProt.

References cited in Table I:
29. Foster L.J. de Hoog C.L. Zhang Y. Zhang Y. Xie X. Mootha V.K. Mann M. A mammalian organelle map by protein correlation profiling. Cell. 2006; 125: 187-199
38. Ohlmeier S. Kastaniotis A.J. Hiltunen J.K. Bergmann U. The yeast mitochondrial proteome, a study of fermentative and respiratory growth. J. Biol. Chem. 2004; 279: 3956-3979
40. Prokisch H. Scharfe C. Camp II, D.G. Xiao W. David L. Andreoli C. Monroe M.E. Moore R.J. Gritsenko M.A. Kozany C. Hixson K.K. Mottaz H.M. Zischka H. Ueffing M. Herman Z.S. Davis R.W. Meitinger T. Oefner P.J. Smith R.D. Steinmetz L.M. Integrative analysis of the mitochondrial proteome in yeast. PLoS Biol. 2004; 2: e160
61. Andreoli C. Prokisch H. Hortnagel K. Mueller J.C. Munsterkotter M. Scharfe C. Meitinger T. MitoP2, an integrated database on mitochondrial proteins in yeast and man. Nucleic Acids Res. 2004; 32: D459-D462
62. Basu S. Bremer E. Zhou C. Bogenhagen D.F. MiGenes: a searchable interspecies database of mitochondrial proteins curated using gene ontology annotation. Bioinformatics. 2006; 22: 485-492
63. Catalano D. Licciulli F. Turi A. Grillo G. Saccone C. D'Elia D. MitoRes: a resource of nuclear-encoded mitochondrial genes and their products in Metazoa. BMC Bioinformatics. 2006; 7: 36
64. Cotter D. Guda P. Fahy E. Subramaniam S. MitoProteome: mitochondrial protein sequence database and annotation system. Nucleic Acids Res. 2004; 32: D463-D467
65. Heazlewood J.L. Millar A.H. AMPDB: the Arabidopsis Mitochondrial Protein Database. Nucleic Acids Res. 2005; 33: D605-D610
66. Kruft V. Eubel H. Jansch L. Werhahn W. Braun H.P. Proteomic approach to identify novel mitochondrial proteins in Arabidopsis. Plant Physiol. 2001; 127: 1694-1710

Given the limitations of the other resources, we developed a new public resource for the storage and analysis of data about the mitochondrial proteome, called MitoMiner. The foundation of this resource is a model that describes cellular localization by GFP tagging and mass spectrometry of purified organelles as well as associated biological information and that formalizes the relationship among these different data. (The abbreviations used are: GFP, green fluorescent protein; KEGG, Kyoto Encyclopedia of Genes and Genomes; OMIM, Online Mendelian Inheritance in Man; XML, Extensible Markup Language; MGI, Mouse Genome Informatics; PIR, Protein Information Resource; ID, identifier; BLAST, Basic Local Alignment Search Tool; DAVID, Database for Annotation, Visualization, and Integrated Discovery.) In developing MitoMiner, we addressed the four major limitations common to other resources. First, to ease the long term maintenance and continuity of the MitoMiner infrastructure beyond the original developers, we built it using the InterMine data warehouse (9. Lyne R. Smith R. Rutherford K. Wakeling M. Varley A. Guillier F. Janssens H. Ji W. McLaren P. North P. Rana D. Riley T. Sullivan J. Watkins X. Woodbridge M. Lilley K. Russell S. Ashburner M. Mizuguchi K. Micklem G. FlyMine: an integrated database for Drosophila and Anopheles genomics. Genome Biol. 2007; 8: R129) rather than develop a bespoke system. InterMine is easier to maintain because it is an open source system with documentation, tutorials, and an active user and development community. To ease maintenance of the data, the underlying data sources in MitoMiner can be updated with minimal manual intervention by using automated Perl scripts, and we aim for the resource to be updated every 4–6 months. Furthermore, new types of data sources can be added by extending the data model and then using InterMine to generate a new relational database schema.
Data files in an XML format that is compatible with the new schema can then be easily loaded. Second, the model is not species-specific, and MitoMiner currently includes data sets from six species. Furthermore, by incorporating protein orthology in the model, it is possible to compare data among these species. Third, with regard to data provenance, MitoMiner records all the evidence for the classification of each individual protein as mitochondrial. This creates a comprehensive provenance for each protein, and a user can evaluate the evidence for the cellular localization of a protein and use this as a constraint in queries. Fourth, InterMine provides a user-friendly query interface for simple data browsing and querying as well as powerful and flexible methods to facilitate complex analyses incorporating multiple resources and search constraints.

We demonstrate the advantages of defining a data model and the variety of data imported in MitoMiner by using the flexible query interface of the InterMine system to report (i) the annual growth in the number of studies and the mitochondrial proteins they identify, (ii) the number of proteins (of a particular species) that are annotated as mitochondrial or have experimental evidence of mitochondrial localization, (iii) the evidence for the mitochondrial localization of proteins in metabolic pathways, and (iv) the union, intersection, and subtraction of mitochondrial proteins among data sets from different studies or organisms. When all the mitochondrial data currently loaded were considered, about 3700 different proteins had been reported as mitochondrial, and about 1400 proteins were common to yeast and animals.
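The set comparisons listed in (iv) can be illustrated with a minimal Python sketch. It is not MitoMiner or InterMine code: the accession strings, the group labels, and the helper names `compare_sets` and `to_groups` are hypothetical, and real lists would be exported from the resource. For cross-species comparisons, the sketch assumes that each protein is first mapped to a homology group label so that orthologs from different organisms compare as equal.

```python
from typing import Dict, Set


def compare_sets(a: Set[str], b: Set[str]) -> Dict[str, Set[str]]:
    """Return the union, intersection, and both subtractions of two sets."""
    return {
        "union": a | b,
        "intersection": a & b,
        "a_minus_b": a - b,
        "b_minus_a": b - a,
    }


def to_groups(accessions: Set[str], group_of: Dict[str, str]) -> Set[str]:
    """Map accessions to homology group labels, skipping unmapped entries."""
    return {group_of[acc] for acc in accessions if acc in group_of}


# Hypothetical data: one study per species, plus an invented group mapping.
yeast_study = {"Y1", "Y2", "Y3"}
mouse_study = {"M1", "M2"}
group_of = {"Y1": "G1", "Y2": "G2", "M1": "G1", "M2": "G3"}  # Y3 is unmapped

result = compare_sets(
    to_groups(yeast_study, group_of),
    to_groups(mouse_study, group_of),
)
print(sorted(result["intersection"]))  # groups reported in both species
```

Comparing group labels rather than raw accessions is the design choice that makes a cross-species intersection meaningful; entries without a group are silently dropped here, which is an assumption of the sketch.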
Combining the data from multiple studies showed that the identification of transmembrane proteins remains difficult and that these proteins are likely to be underrepresented in the data. Furthermore, some cytosolic proteins, such as those of glycolysis, may be co-localized with the mitochondrion through interactions with the outer mitochondrial membrane. The analyses also highlighted known differences in the mitochondrial physiology of organisms, such as fermentation in yeast and apoptosis in animals.

MitoMiner was built using the InterMine open source data warehouse system (9), and version 11.0 was installed and configured. The functionality of InterMine relies upon an object model that describes each data type, its attributes, and the relationships among these different data types, which are defined by the use of shared identifiers. The core object model of InterMine includes definitions of genes, proteins, publications, and the Gene Ontology (10. Ashburner M. Ball C.A. Blake J.A. Botstein D. Butler H. Cherry J.M. Davis A.P. Dolinski K. Dwight S.S. Eppig J.T. Harris M.A. Hill D.P. Issel-Tarver L. Kasarskis A. Lewis S. Matese J.C. Richardson J.E. Ringwald M. Rubin G.M. Sherlock G. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genet. 2000; 25: 25-29). This object model was extended to incorporate data types and attributes for describing cellular localization, protein homology, metabolic pathways, genetic phenotypes, and post-translational modifications as well as GFP tagging and mass spectrometry data.
The MitoMiner object model was not normalized because it was designed for optimal query performance and ease of navigation in the InterMine Query Builder. The relational database schema of MitoMiner was generated automatically from the object model by the InterMine system.

MitoMiner was populated with data downloaded from the Web sites of several public resources. To allow the cross-referencing and integration of data, protein identifiers in all data sets were unified to UniProt (11. Wu C.H. Apweiler R. Bairoch A. Natale D.A. Barker W.C. Boeckmann B. Ferro S. Gasteiger E. Huang H. Lopez R. Magrane M. Martin M.J. Mazumder R. O'Donovan C. Redaschi N. Suzek B. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 2006; 34: D187-D191) accession numbers by using the on-line conversion tools of the Mouse Genome Informatics (MGI) (12. Bult C.J. Eppig J.T. Kadin J.A. Richardson J.E. Blake J.A. The Mouse Genome Database (MGD): mouse biology and model systems. Nucleic Acids Res. 2008; 36: D724-D728) for proteins from Mus musculus and the Protein Information Resource (PIR) ID program (13. Wu C.H. Yeh L.S. Huang H. Arminski L. Castro-Alvear J. Chen Y. Hu Z. Kourtesis P. Ledley R.S. Suzek B.E. Vinayaka C.R. Zhang J. Barker W.C. The Protein Information Resource. Nucleic Acids Res. 2003; 31: 345-347) for other species. In many cases a protein was mapped to more than one UniProt identifier because these programs can associate separate entries for fragments, isoforms, and duplicates with the original identifier.

The literature was searched with PubMed for publications that reported large scale data sets on the mitochondrial localization of proteins. Each data set of these publications was downloaded and imported into Microsoft Excel.
Recorded from each publication were the type of experiment, the tissues or cell lines from which proteins had been isolated, and the PubMed identifier. Recorded for each protein of the mass spectrometry data sets were, where available, the original protein identifier, subcellular location, sequence of identified peptides, sequence coverage, and the experimental techniques that had been used for the purification, separation, and identification of the protein. If the original protein identifier could not be mapped to a UniProt primary accession number by PIR ID or MGI, then the protein was compared with proteins in UniProt by using BLASTP (14. Altschul S.F. Madden T.L. Schaffer A.A. Zhang J. Zhang Z. Miller W. Lipman D.J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997; 25: 3389-3402). If there was a significant match, then the UniProt primary accession number was assigned to the protein. Those proteins without a significant match were discarded. By using the PIR ID and the MGI identifier conversion tools, the evidence of mitochondrial localization for a protein was linked to many of the UniProt entries representing it. Identifiers of proteins encoded in the mitochondrial genome of organisms were taken from the Organelle database of the European Molecular Biology Laboratory-European Bioinformatics Institute and used to annotate the appropriate proteins in MitoMiner. The source of protein sequences, related features, and annotation was UniProt (11).
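The identifier unification described above (conversion tools first, a BLASTP comparison as a fallback, and discarding of proteins with no significant match) can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the authors' Perl scripts: the conversion tables are plain dictionaries standing in for PIR ID or MGI output, `best_blast_hit` is a hypothetical callable wrapping a BLASTP search, and `max_evalue` is a placeholder because the text does not state the significance threshold used at this step.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

BlastHit = Tuple[str, float]  # (UniProt accession, expect value)


def map_to_uniprot(
    original_id: str,
    sequence: str,
    lookup_tables: Sequence[Dict[str, List[str]]],
    best_blast_hit: Callable[[str], Optional[BlastHit]],
    max_evalue: float,
) -> List[str]:
    """Return UniProt accessions for a protein, or [] if it is discarded."""
    # Step 1: try each identifier conversion table in turn. One original
    # identifier may map to several entries (fragments, isoforms, duplicates).
    for table in lookup_tables:
        accessions = table.get(original_id)
        if accessions:
            return list(accessions)
    # Step 2: fall back to a sequence comparison against UniProt.
    hit = best_blast_hit(sequence)
    if hit is not None and hit[1] <= max_evalue:
        return [hit[0]]
    # Step 3: no significant match, so the protein is discarded.
    return []


# Hypothetical stand-ins for a conversion table and a BLASTP wrapper.
conversion = {"IPI001": ["P11111", "Q22222"]}


def fake_blast(seq: str) -> Optional[BlastHit]:
    return ("P33333", 1e-50) if seq == "MKV" else None


by_table = map_to_uniprot("IPI001", "AAA", [conversion], fake_blast, 1e-10)
by_blast = map_to_uniprot("UNKNOWN", "MKV", [conversion], fake_blast, 1e-10)
discarded = map_to_uniprot("UNKNOWN", "GGG", [conversion], fake_blast, 1e-10)
```

Returning a list rather than a single accession reflects the one-to-many mappings noted in the text.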
All UniProt entries were downloaded for the six species with mitochondrial localization data sets. The literature citations in each UniProt entry were retrieved from PubMed by using an InterMine parser. Additional Gene Ontology annotation on the biological process, molecular function, and cellular component of each protein was taken from UniProt (15. Camon E. Magrane M. Barrell D. Lee V. Dimmer E. Maslen J. Binns D. Harte N. Lopez R. Apweiler R. The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res. 2004; 32: D262-D266) and individual genome projects of M. musculus (12), Rattus norvegicus (16. Twigger S.N. Shimoyama M. Bromberg S. Kwitek A.E. Jacob H.J. The Rat Genome Database, update 2007 - easing the path from disease to data and back again. Nucleic Acids Res. 2007; 35: D658-D662), Drosophila melanogaster (17. Wilson R.J. Goodman J.L. Strelets V.B. FlyBase: integration and improvements to query tools. Nucleic Acids Res. 2008; 36: D588-D593), and Saccharomyces cerevisiae (18. Hong E.L. Balakrishnan R. Dong Q. Christie K.R. Park J. Binkley G. Costanzo M.C. Dwight S.S. Engel S.R. Fisk D.G. Hirschman J.E. Hitz B.C. Krieger C.J. Livstone M.S. Miyasato S.R. Nash R.S. Oughtred R. Skrzypek M.S. Weng S. Wong E.D. Zhu K.K. Dolinski K. Botstein D. Cherry J.M. Gene Ontology annotations at SGD: new data sources and annotation methods. Nucleic Acids Res. 2008; 36: D577-D581). Finally, lists of human genes and the descriptions of their associated disease phenotypes were taken from OMIM (19. Hamosh A. Scott A.F. Amberger J.S. Bocchini C.A. McKusick V.A. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 2005; 33: D514-D517), the definitions of groups of homologous proteins were taken from HomoloGene (20. Wheeler D.L. Barrett T. Benson D.A. Bryant S.H. Canese K. Chetvernin V. Church D.M. DiCuccio M. Edgar R. Federhen S. Geer L.Y. Kapustin Y. Khovayko O. Landsman D. Lipman D.J. Madden T.L. Maglott D.R. Ostell J. Miller V. Pruitt K.D. Schuler G.D. Sequeira E. Sherry S.T. Sirotkin K. Souvorov A. Starchenko G. Tatusov R.L. Tatusova T.A. Wagner L. Yaschenko E. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2007; 35: D5-D12), and data on the reactions, enzymes, and compounds of metabolic pathways were taken from KEGG (21. Kanehisa M. Goto S. Hattori M. Aoki-Kinoshita K.F. Itoh M. Kawashima S. Katayama T. Araki M. Hirakawa M. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 2006; 34: D354-D357). The EC numbers of proteins in UniProt were used to define the cross-reference between proteins and metabolic pathways.

The data files for UniProt and Gene Ontology were loaded into MitoMiner by using InterMine parsers. The other data sources were converted into XML data files compatible with the MitoMiner object model by using Perl scripts that use BioPerl (22. Stajich J.E. Block D. Boulez K. Brenner S.E. Chervitz S.A. Dagdigian C. Fuellen G. Gilbert J.G. Korf I. Lapp H. Lehvaslaiho H. Matsalla C. Mungall C.J. Osborne B.I. Pocock M.R. Schattner P. Senger M. Stein L.D. Stupka E. Wilkinson M.D. Birney E. The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002; 12: 1611-1618) modules, and then these were loaded into MitoMiner.
These scripts were designed to allow the sources to be updated quickly and with minimal manual intervention. A simplified data flow for MitoMiner is shown in Fig. 1.

InterMine provides access to the data by using Apache Tomcat to create a configurable Web interface. This interface allows sophisticated cross-resource queries to be created with the integral Query Builder and executed by the InterMine query engine. With the exception of analyses involving BLAST searches and DAVID functional classifications (23. Dennis Jr., G. Sherman B.T. Hosack D.A. Yang J. Gao W. Lane H.C. Lempicki R.A. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 2003; 4: P3), the analyses reported in the results were done using queries written with the Query Builder.

The UniProt database contains redundancy because the same protein can be represented by multiple entries. Therefore, the number of UniProt entries reported as mitochondrial in MitoMiner is not the same as the number of mitochondrial proteins. This redundancy was reduced by incorporating HomoloGene into MitoMiner and using it to cluster duplicate entries. However, it should be noted that HomoloGene does cluster some highly similar paralogs. The number of HomoloGene clusters was given for analyses that reported the number of proteins, unless stated otherwise, for example, when the number of proteins with evidence for mitochondrial localization was evaluated. To prevent double counting, proteins were excluded from analyses if they were not members of a HomoloGene cluster, because many of these were fragments that were of insufficient size to have been clustered with their corresponding full-length counterparts. HomoloGene was also used in MitoMiner to identify orthologous proteins among different species.
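The counting rule described above, in which clusters are counted instead of database entries and entries without a cluster are set aside, can be sketched in a few lines of Python. This is a hedged illustration only: the entry names, the cluster labels, and the function `count_by_cluster` are hypothetical, and the dictionary stands in for cluster membership taken from HomoloGene.

```python
from typing import Dict, Iterable, List, Tuple


def count_by_cluster(
    entries: Iterable[str], cluster_of: Dict[str, str]
) -> Tuple[int, List[str]]:
    """Count distinct clusters among entries and list the excluded entries."""
    clusters = set()
    excluded = []
    for entry in entries:
        cluster = cluster_of.get(entry)
        if cluster is None:
            # Not a member of any cluster (often a fragment): excluded
            # so that it cannot be counted as a separate protein.
            excluded.append(entry)
        else:
            clusters.add(cluster)
    return len(clusters), excluded


# Hypothetical data: four database entries, two of which share a cluster.
entries = ["P1", "P1-frag", "P2", "P3"]
cluster_of = {"P1": "HG1", "P2": "HG1", "P3": "HG2"}

n_proteins, excluded = count_by_cluster(entries, cluster_of)
print(n_proteins, excluded)
```

In this toy input, four entries collapse to two counted proteins, which mirrors why entry counts and protein counts differ. The sketch inherits the caveat from the text that a cluster may also merge highly similar paralogs.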
As HomoloGene appeared to be too stringent in its criteria for homology for some analyses, more distant orthologs were identified by using BLASTP with an expect value cutoff of 10^-35. The BLAST searches were done on lists of proteins exported in FASTA format from MitoMiner.

Queries considered to meet the most common requirements of users were written by using the integral Query Builder tool of InterMine. These template queries were made available on the relevant data category Web pages, as well as together on a single searchable Web page. The user interface of the InterMine Web application was customized, and the service was deployed.

To determine which Gene Ontology annotation terms were significantly overrepresented (p < 0.001) in lists of proteins compared with a background population, the DAVID Functional Annotation Clustering tool (23) was used; it uses a modified version of the Fisher exact p value. The DAVID analyses were done using lists of UniProt identifiers exported from MitoMiner. If the list contained identifiers from more than one species, then the identifiers from each species were analyzed separately.

MitoMiner is publicly accessible. For ease of navigation in the Web interface, the data in MitoMiner were divided into separate data categories, and these are available from the MitoMiner home page. The data categories are mass spectrometry data, GFP tagging data, homology information (from HomoloGene), protein annotation (from UniProt and others), metabolic pathways (from KEGG), proteomics publications (from PubMed), and genetic phenotypes and disease (from OMIM). The data category pages provide background information on each source, access to bulk data sets, relevant template queries, and pertinent starting points for the Query Builder.
For example, the protein data category page (Fig. 2) provides (i) a description of what protein annotation is available and from where it was taken, (ii) the option to download all proteins that have experimental evidence of mitochondrial localization, and (iii) template queries for the most common searches with regard to protein annotation, such as showing all proteins of a particular species that have experimental evidence of mitochondrial localization. Report pages specify the information in the database related to entries of a data category and provide cross-references to the relevant entries in external public resources. For example, the report page of a protein (Fig. 3) lists attributes of the protein and a link to the entry at the UniProt Web site as well as tabulating and cross-referencing the available data on GFP tagging, mass spectrometry, publications
Reference(s)