STRENDA DB: enabling the validation and sharing of enzyme kinetics data
2018; Wiley; Volume: 285; Issue: 12; Language: English
DOI: 10.1111/febs.14427
ISSN: 1742-4658
Authors: Neil Swainston, Antonio Baici, Barbara M. Bakker, Athel Cornish‐Bowden, Paul F. Fitzpatrick, Peter J. Halling, Thomas S. Leyh, Claire O’Donovan, Frank M. Raushel, Udo Reschel, Johann M. Rohwer, Santiago Schnell, Dietmar Schomburg, Keith F. Tipton, Ming‐Daw Tsai, Hans V. Westerhoff, Ulrike Wittig, Roland Wohlgemuth, Carsten Kettner
Topic(s): Plant biochemistry and biosynthesis
Abstract: Standards for reporting enzymology data (STRENDA) DB is a validation and storage system for enzyme function data that incorporates the STRENDA Guidelines. It provides authors who are preparing a manuscript with a user-friendly, web-based service that automatically checks that the enzymology data sets entered in the submission form are complete and valid before they are submitted to a journal as part of a publication.

Enzyme kinetics is important to many fields within the biological sciences and is a discipline practiced by a large number of researchers. The study of enzyme functions has led to important developments for the sustainable production of a wide variety of compounds in the food, pharmaceutical, flavour and fragrance, agro- and chemical industries [1], and the discovery of novel enzyme functions. These activities cross frontiers for both fundamental and applied research. If biology is to be understood as a dynamical process, then researchers need quantitative data on the regulation and energetics of enzymes. To date, enzymology data are available in repositories such as BRENDA [2] and SABIO-RK [3]. While these resources are extensively curated by experts, the quality and completeness of the data depend on the quality of data available in the scientific literature. All too often, however, essential metadata about the conditions under which kinetic parameters were obtained (e.g. temperature, pH, ionic strength, enzyme and substrate concentrations, presence of activators and inhibitors) are not comprehensively reported in papers. Such omissions make compiling the necessary metadata, and therefore reusing and comparing datasets, difficult [4, 5]. These difficulties become even more acute for those wishing to use published data to model the behaviour of metabolic systems, cellular behaviour or the interaction of cells within tissues and organs.
This is the case in particular for systems biologists, who require reliable data for enzymes from many enzyme classification (EC) classes to be able to produce accurate predictive models. Specialized repositories for specific enzyme classes, such as the CAZy database [6], which focuses on structural and functional information about enzymes that assemble, modify and break down oligo- and polysaccharides, limit their data sets to their own topics, while systems biocatalysis and systems biology approaches need to collect data in different formats from specialized repositories and the scientific literature.

Standards for reporting enzymology data (STRENDA) DB, freely available at http://www.strenda-db.org, is an online validation and storage system for functional enzyme data that aims to be integrated into the publication practices of the scientific community and into the publication processes of journals. It provides a simple-to-use web submission tool and searchable database allowing the sharing, comparison and accurate reporting of enzyme kinetics data. The submission tool incorporates the STRENDA Guidelines, which specify the minimum information requested in the reporting of enzyme function data, including kinetic parameter values and the full experimental conditions under which they were acquired. STRENDA DB checks the manuscript data entered by the author for compliance with the STRENDA Guidelines. If data are submitted prior to or during the publication process, the submission tool aids the author of a manuscript in the submission of kinetic parameters, ensuring that all required data and metadata are supplied. Data sets compliant with the Guidelines are assigned a STRENDA Registry Number and registered with a Digital Object Identifier (DOI), which provides a perennial and resolvable identifier for each data set. The data will normally be publicly available in STRENDA DB only after the corresponding article has been peer-reviewed and published in a journal.
Data can also be submitted after publication. Promoting the practice of simultaneously submitting articles to journals and kinetics data to STRENDA DB will benefit reviewers of journal articles, as well as authors and consumers of data, in multiple ways through the availability of standardized data.

To mitigate the problem of incomplete reporting, the standards for reporting enzymology data (STRENDA) Guidelines were developed [7, 8], following a community-based discussion of the currently accepted best approaches for data reporting in enzyme research. The goal of these guidelines is to improve the quality of data reporting in the scientific literature, enabling readers and reviewers to interpret, evaluate and corroborate the experimental findings. Since their approval, more than 50 biochemistry journals have recommended that their authors follow the STRENDA Guidelines when reporting functional enzymology data (see http://www.beilstein-institut.de/en/projects/strenda/journals). However, despite the existence of the STRENDA Guidelines, many publications still do not describe the experimental conditions and results in sufficient detail to allow the experiment to be reproduced, a topic that has recently attracted considerable attention [9, 10]. Furthermore, it is clear that not only could researchers benefit from having a resource that indicates best practices for the reporting of enzyme kinetics data, but the value and impact of published work in biocatalysis could also be increased, thereby promoting increased citations and further growth of applications [11]. It is now common practice for scientists to submit experimental data to public repositories as a result of policies established by journals and funding agencies. There are numerous databases and repositories for ‘omics data, such as ArrayExpress [12], PRIDE [13], MetaboLights [14] and PDB [15].
These resources provide user interfaces enabling researchers to share transcriptomics, proteomics, metabolomics and protein structure data. However, to date, there is no similar resource to encourage the user submission of enzyme kinetics data for biological molecules. This paper describes a functional enzyme database, STRENDA DB. In contrast to the available enzyme resources such as BRENDA and SABIO-RK, STRENDA DB has been designed specifically to accept data submissions directly from the research community, ensuring that newly acquired enzyme kinetics data are collected with appropriate metadata as they enter the literature. STRENDA DB implements the STRENDA Guidelines in an intuitive and easy-to-use web-based form, facilitating the submission and sharing of data, aiding the literature review process, and increasing the visibility, accessibility and impact of enzyme kinetics publications. The system provides a community-driven and continually updated enzyme kinetics resource supporting enzymology research. Currently, more than 10 journals recommend that their authors use STRENDA DB both to validate the completeness of their manuscript data and to deposit these data in the database. A related initiative, BioCatNet [16], also accepts kinetic data from authors, particularly raw data on reaction progress and initial rates. It uses an Excel sheet for data entry, and handles some complications found in applied biocatalysis.

The STRENDA DB web-based interface, hosted by the Beilstein-Institut, is freely available at http://www.strenda-db.org, and offers two tools: (a) data submission; and (b) data query. The user interface follows responsive design principles, allowing the user to submit and query data from any device connected to the Internet.
The web application has been implemented using PrimeFaces 4.0 (Primetek Informatics, Ankara, Turkey) and JSF (JavaServer Faces) 2.1 (Oracle Corporation, Redwood, CA, USA), and the data are stored in an Oracle 12c database.

The data submission tool collects data and metadata from users. The goal is to collect data during the preparation of conventional journal submissions to improve the quality of enzyme data reported in the literature. On the basis of the STRENDA Guidelines, the data are collected in a common and standardized format. This will simplify the review process and improve both the reproducibility of enzyme assays and the accessibility of information to the community. Authors enter the relevant functional enzyme data from their manuscript into the data submission system. The data entry requires the description of the minimum information on materials, methods and assay conditions, as well as the experimental results based on the corresponding experimental conditions. The minimum information is defined by the STRENDA Guidelines, and determines the compulsory fields in the entry section of STRENDA DB. The system automatically validates the data entered in the compulsory fields for completeness and formal correctness (e.g. pH range, defined temperature range). When required information is missing, the user receives detailed warning information. After successful finalization of the data input, the author receives a STRENDA Registry Number (SRN) for each data set, providing an unambiguous identifier comparable to the UniProt AC for protein data sets [17]. In addition, each data set is assigned a DOI that allows data referencing and access. The data become publicly available in the database only after the corresponding article has been peer-reviewed and accepted for publication in a journal.

The STRENDA Guidelines require a full description of the identity of the catalytic or binding entity (enzyme, protein, nucleic acid or other molecule).
This information should include the origin or source of the molecule, its purity, composition and other characteristics, such as post-translational modifications (PTMs), mutations and any modifications made to facilitate expression or purification. The assay methods and exact experimental conditions of the assay must be fully described if it is a new assay or provided as a reference to previously published work, with or without modifications. The temperature, pH and pressure (if other than atmospheric) of the assay must always be included, even if previously published. The data submission to STRENDA DB is possible only after registration and login into the submission system. This allows the user to interrupt the entry process without losing data already entered. In addition, it identifies researchers responsible for the data input to the database development and curation team. The data submission tool was designed to streamline the data collection process. The web-based tool allows simple navigation through the submission system, providing extensive help tooltips and hints, as well as an autofill functionality when specifying enzymes and small molecules by making use of UniProt [17] and PubChem [18] respectively. The overall concept of the submission system of STRENDA DB reflects the structure of a manuscript, that is, introduction, materials and methods, results, discussion and references. For the data input, the materials and methods as well as the results section are most relevant. The submission tool therefore acts as a structural support for the author guided by this general manuscript structure when entering data into STRENDA DB. In addition, the design around the STRENDA Guidelines allows authors to identify required data for entering in the database. In STRENDA DB, the top level of structure is a ‘Manuscript’, typically containing all the data that might ultimately appear in a published paper, specified by its title and authors. 
A Manuscript can contain data for one or more ‘Experiments’, each of which involves the study of one specific protein as the active enzyme (Fig. 1). This structure allows the user to enter data from the comparison of, for example, the activity of two isozymes, such as two mutant proteins, each of which would be a different Experiment. The core of the definition of an Experiment is the basic data on the protein, such as protein identification, sequence modifications (PSMs), PTMs, source and the typical reaction which it catalyses. For each Experiment, there will be one or more ‘Datasets’. Each Dataset consists of one defined assay condition linked to the experimental result(s), for example, the determination of kinetic parameters at a defined pH. The effects of changes in conditions that can be summarized by kinetic parameters, such as different substrate or inhibitor concentrations, are captured within a single Dataset. But changes in substrate identity or temperature, for example, would require a different Dataset. In the case of a pH profile, the Experiment will contain several Datasets each with different pH values but with the same assay components connected to the pH-dependent kinetic parameters. In consequence, when entering tabular data (pH profiles can be represented in tables), the author needs to enter the description of the enzyme assay only once and only varies the specific parameters for the subsequent assay conditions. The following examples may illustrate the concept: Again, methods used and techniques applied are described, followed by the description of the pyruvate kinase assayed (as the Experiment). The first assay starts at pH 3 and the corresponding kinetic results are added. This makes the first Dataset1. For each subsequent assay, most components of Dataset1 remain constant, with only the pH parameter being changed as the corresponding kinetic results are entered (Fig. 3). 
Similarly, such an approach is applicable to represent the kinetics at various assay temperatures. In principle, any modification in the assay conditions can affect the kinetic parameters and thus these data are kept in the ‘container’ defined as a Dataset. For the experiment that studies the pH profile of PYK1, the scheme reads as follows: Dataset1: components used in this assay (assay conditions) at pH 3 and corresponding kinetics parameters (Results). Dataset2: components used in the assay conditions from Dataset1 (just copied and pasted from here) but at pH 4 and corresponding kinetics parameters. An additional five Datasets can be similarly input (one for each pH step from 5 to 9). In addition, if inhibitors or activators are used in the experiment, the first Dataset would include the kinetic parameters without the inhibitor or activator. The subsequent Dataset provides the kinetics parameters that are dependent on the added inhibitor or activator. If several inhibitors are tested the number of Datasets corresponds to the number of inhibitors. It should be noted that the data input does not result in the simple completion of a checklist. Rather the details that must be included depend on the nature of the enzyme, the type of experiment performed and what results are to be reported. The STRENDA DB system already recognizes these complexities, in particular by providing expandable sections for details only required under particular circumstances. Thus, kinetic parameters for activators or inhibitors can be only entered if activators or inhibitors have been defined in the description of the assay conditions. As the system further develops, it is envisaged that more sophisticated automated validation steps will be introduced. Similarly, over time further expandable sections will be added to support more complex experiments. 
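The Manuscript, Experiment and Dataset hierarchy described above, together with the pH-profile example, can be sketched in a few lines of Python. This is only an illustration of the concept, not the STRENDA DB schema or implementation: the class and field names, the accepted pH and temperature ranges, the kinetic values and the UniProt placeholder are all assumptions made for the example.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Dataset:
    """One defined assay condition plus the kinetic results obtained under it."""
    ph: float
    temperature_c: float
    components: tuple                  # assay components (buffer, substrate, ...)
    inhibitors: tuple = ()             # effectors defined in the assay conditions
    km_mM: Optional[float] = None
    kcat_per_s: Optional[float] = None
    ki_mM: Optional[float] = None      # only allowed if an inhibitor is defined


@dataclass
class Experiment:
    """Study of one specific protein; holds one or more Datasets."""
    protein_name: str
    uniprot_ac: str
    datasets: List[Dataset] = field(default_factory=list)


@dataclass
class Manuscript:
    """Top-level container, identified by title and authors."""
    title: str
    authors: List[str]
    experiments: List[Experiment] = field(default_factory=list)


def validate(ds: Dataset) -> List[str]:
    """Return warnings for one Dataset (ranges below are assumed, not official)."""
    warnings = []
    if not 0.0 <= ds.ph <= 14.0:
        warnings.append(f"pH {ds.ph} is outside the accepted range 0-14")
    if not -20.0 <= ds.temperature_c <= 120.0:
        warnings.append(f"temperature {ds.temperature_c} C is outside the accepted range")
    if ds.km_mM is None and ds.kcat_per_s is None:
        warnings.append("no kinetic parameter reported")
    if ds.ki_mM is not None and not ds.inhibitors:
        warnings.append("Ki entered but no inhibitor defined in the assay conditions")
    return warnings


def ph_profile(base: Dataset, results: Dict[float, Dict[str, float]]) -> List[Dataset]:
    """Copy the base assay description once per pH step, varying only pH and results."""
    return [replace(base, ph=ph, **params) for ph, params in sorted(results.items())]


# Usage: a pH profile of PYK1 from pH 3 to 9 gives seven Datasets in one Experiment.
base = Dataset(ph=3.0, temperature_c=30.0, components=("buffer", "substrate"))
hypothetical_results = {float(ph): {"km_mM": 1.0, "kcat_per_s": 10.0} for ph in range(3, 10)}
experiment = Experiment("PYK1", "PLACEHOLDER_AC", ph_profile(base, hypothetical_results))
manuscript = Manuscript("Hypothetical title", ["A. Author"], [experiment])
```

Making `Dataset` immutable and deriving each pH step with `dataclasses.replace` mirrors the point made in the text: the assay description is entered once, and only the varied parameter and the corresponding results change from one Dataset to the next.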
Successful data input results in assignment of both the SRN and a DOI, which are identifiers for the data within an Experiment on the functional properties of a single enzyme. Thus, multiple SRNs and DOIs can be linked analogously to one manuscript containing one or more Experiments. The user can therefore subsequently query the database for a given publication using a PubMed identifier (PMID) and obtain the number of SRNs and Experiments along with the assay conditions and experimental results respectively. The DOIs are automatically registered with DataCite (https://www.datacite.org) to enable users not only to search the metadata of datasets but also to support the community by providing a perennial, resolvable identifier for each dataset in STRENDA DB (Fig. 5). The query interface is accessed via the ‘Query’ button in the menu. The interface has been kept straightforward and simple by following the search mask of major search engines such as Google. For querying STRENDA DB neither registration nor login is required. The user can search in the database using key terms such as protein name, EC number, UniProt accession number, organism, author name, PMID, SRN or DOI. For an initial overview the search mask can be left empty and all datasets published are displayed. The hit list consists of a table that displays entries for all the key terms mentioned above plus a column with hyperlinks that provide access to: (a) the experimental overview; (b) the fact sheet downloadable as a PDF file; and (c) an experimental XML file (Fig. 6). The experimental overview is accessed via the ‘Show’ button in the right hand column of the hit list table. This page displays the header data such as the manuscript title and the names of the authors as well as the identifiers of this data set (SRN and DOI) along with the most important data on the protein studied. 
The header data are followed by the list of Datasets, which include the assay conditions with the calculated kinetic parameters (Fig. 7). The fact sheet contains all input data in a human-readable format (Table S1) and contains far more information than the experimental overview page, including the sequence of the protein, identifiers of chemical compounds used in the assay, the concentration of enzyme in the assay and data on how this was measured. The fact sheet can be extended by additional data such as International Union of Pure and Applied Chemistry (IUPAC) names and the IUPAC International Chemical Identifier (InChI) of the compounds used in the assay. Authors are encouraged to submit the fact sheet to the journal as supplementary information along with the main manuscript. The supplementary information is not only considered for publication but also indicates that the reporting of the enzyme assays is in compliance with the STRENDA Guidelines; the SRN assigned indicates that all relevant information is provided in the manuscript or its supplementary information. Since all data sets are assigned a DOI and can be cited elsewhere, there is an alternative way to search and directly access datasets deposited in STRENDA DB: clicking on a hyperlinked DOI leads the user to the corresponding hit page, which is linked to both the Experiment overview page and the data fact sheet PDF (Fig. 8).

The STRENDA Commission strongly encourages the scientific community to incorporate STRENDA DB into the general publication workflow. Authors are invited to submit their enzyme function data to STRENDA DB, where these data are automatically validated for compliance with the STRENDA Guidelines. Successful formal compliance is confirmed by the awarding of an SRN and documented in a fact sheet (in PDF format) containing all input data, which can be submitted with the manuscript to the journal.
Once the corresponding article has been peer-reviewed and published in the journal, the bibliographic data, in the form of a PMID, are added and the experimental data are made publicly accessible in STRENDA DB (Fig. 9). The direct electronic submission of data by the authors prior to or during publication has proven to be the gold standard for comprehensive data acquisition for protein structures in PDB [19]. We expect that STRENDA DB will become the analogous tool to PDB for enzyme functional data.

Standards for reporting enzymology data DB is the first database adhering to community-based guidelines for ensuring reproducibility of enzyme kinetics data. It is designed to aid the data provider in publishing and sharing data, the manuscript reviewer in interpreting data during the review process, the data consumer in finding, comparing and utilizing publicly available kinetics data, and the funding agency in increasing research impact and availability of data. The checking for completeness and validation offered by the STRENDA DB system benefits all involved in the process of reporting and publishing. Authors will be assured that they have comprehensively recorded all essential details of the experiment, and hence reduce problems that currently can occur with data reproducibility. Journal reviewers and editors can be assured that the data and metadata underlying a publication have been reported fully and will eventually be available to the whole scientific community. Readers of a published paper will know that a comprehensive description of the experiments and results is available in a standardized format. Supporting the review process is a key consideration of STRENDA DB, although it will of course be for individual journals to decide if and how to incorporate STRENDA DB into their review and publication policies.
It is hoped the catalysis community and its journals will move towards a model in which authors would be required to submit the underlying data to STRENDA DB at the point of manuscript submission. This would be a logical extension of the current state in which journals request that authors follow the written STRENDA Guidelines in preparing a manuscript. Journals could also require that the dataset be made publicly available at the point of publication. This mirrors the approach taken with a range of ‘omics data types, including that of protein structure data and the PDB. However, validation of a data set as STRENDA compliant is not intended to replace the general review process. STRENDA DB merely checks that an enzyme function experiment has been comprehensively described, and makes no judgement on the scientific quality. Reviewers and editors will still need to evaluate the importance of the topic studied, the experimental design and the reliability of the results. The review process may be aided by access to the PDF summary fact sheet generated by STRENDA DB, which shows in a standardized format all data and metadata. As STRENDA DB develops, it may include additional automated checks on the submitted data based on appropriate validation criteria, but the final judgement on the integrity of the data will always be left to expert reviewers and editors. All data sets in STRENDA DB are assigned a persistent DOI, which allows for their direct access via web browsers. Authors will be able to quote these to allow readers immediate access to the data once a paper has been published. Such an approach will increase the accessibility of experimental data, in accordance with the general trend of increasing data reuse and ensuring reproducibility. 
Through the DOI it will also be possible for authors and others to track the use of their datasets, and hence support the trend of rewarding data providers for sharing data in addition to the traditional performance metrics based upon citations of publications. To facilitate the finding and reuse of datasets submitted to STRENDA DB, the system includes numerous cross-references to well-used, publicly available databases, such as UniProt for the definition of enzymes, ExplorENZ for the definition of EC numbers and reactions catalysed by the enzyme [20], and PubChem for the specification of any small-molecule compounds present in an assay mixture, such as substrates, products, buffers, salts and inhibitors. At the time of data entry, such links are provided by searchable fields in the submission tool to aid the user. Including such facilities in the interface reduces the amount of data that the submitter must supply manually and increases the accuracy of the supplied metadata. Furthermore, linking these metadata to external database identifiers facilitates data retrieval and integration with external applications and related data resources such as KEGG [21] and ChEBI [22, 23]. One such consumer of enzyme kinetics data is the systems biology community, which will be greatly aided by the availability of reliable enzyme activity data in a standardized and annotated format, from which realistic and predictive models of signalling and metabolic pathways may be built [24-26].

The STRENDA Commission is aware that the data entry process must be as simple as possible to minimize the burden on authors, in particular those who are first-time users of the database. In addition to a data input process that reflects the common structure of a manuscript, the user is guided through data entry by tool tips associated with most of the input fields.
During the data entry process, users are told which data are required and which steps to take to continue data input. In addition, a comprehensive and downloadable user guide is available online, which provides the reader with a description of both STRENDA DB and the step-by-step data input process. Finally, video tutorials (freely accessible at http://www.beilstein.tv/categories/strenda/) demonstrate the data entry process in STRENDA DB step by step. The Beilstein-Institut and the STRENDA Commission will support the upkeep and development of the database over the coming years, including the provision of data curation of entries submitted by the community. In time, it is hoped that STRENDA DB will provide access to kinetics data covering a multitude of enzymes from prokaryotic and eukaryotic proteomes.

It is recognized, however, that the current release of STRENDA DB is an initial version, and as such only handles the most common experimental procedures. Over time, and with the benefit of user feedback, STRENDA DB will improve its functionality and cover a broader range of experimental methods and data types. Additional features will be introduced in such a way that limits additional demands on the user. For example, fields specific to a particular experiment type will be hidden in expandable sections when not required. The current system already makes extensive use of such facilities; for example, it provides details of protein sequence and PTMs. Many future developments are envisaged, and their implementation will be prioritized in consultation with the user community, who are encouraged to provide their feedback.

The STRENDA Commission envisions a series of improvements for the database. The system could accept a more complete description of the kinetic equation to which data have been fitted to estimate parameters. This may be offered as a selection from a standard list, perhaps utilizing existing resources and ontologies.
Such enhanced definitions may incorporate methods to report rates of the formation of multiple products formed from the same substrate, such as two enantiomers of a given product. Specifying a kinetic equation would simplify the interface, limiting the parameter values required from the user, and would also allow validation algorithms to flag possible mistakes in the set of user-specified parameters. For example, a warning can be issued if a Km value entered falls outside the reported range of substrate concentrations studied, violating conditions for validity of the kinetic equation [27], or if a reported rate would convert all substrate present in a few seconds (1000-fold mistakes in units are not uncommon). Similarly, values entered for effector concentrations studied could be automatically compared with kinetic parameters reported for those effectors. A description of the software used for data analysis could be included along with calculated errors for all parameters. An extended system for the specification of macromolecular ingredients (other than the enzyme) in an assay mixture may be implemented, generating links to appropriate databases and utilizing existing ontologies where appropriate. This will be especially relevant in considering special cases of data, including multicomponent and multi-EC number enzymes. This may be extended to accommodate protein descriptions that differ from the wild-type description in UniProt, considering issues such as the presence of pro- and signal sequences and of zymogen peptides that have been cleaved in the actual protein studied, hetero-oligomer proteins made up of multiple UniProt entries, enzymes that are studied with tightly bound metal ions and prosthetic groups, especially where more than one variant is possible. (This may all be best solved by help text explaining how to describe these possibilities.) 
Automated cross-checking between the specification of PTMs against the protein sequence may also be introduced, ensuring that PTMs are only ever assigned to ‘allowed’ residues. The collection of additional metadata may be offered, including introduction of more structured fields to capture some items that currently go into the ‘Experimental Methods’ free text box. Such fields may include, ‘What compound was monitored to follow the reaction?’ and ‘What analytical/spectroscopic method was used to monitor it?’ Again, such details will be determined in consultation with the user community. In instances where catalytic activity or binding cannot be detected, an estimate of the limit of detection based on the sensitivity and error analysis of the assay could be asked for. Previous work has illustrated the feasibility of integrating the analysis of initial rate data or even progress curve data with the submission of enzyme kinetics data [28]. As such, introduction of more sophisticated data, including those on bisubstrate reactions, grid data sets or time-course data will also be investigated. Over time this may develop into a downloadable tool that can be used locally in labs at the time of experiment, incorporating analysis of raw experimental data, collection of appropriate metadata, performance of STRENDA validation and seamless data transfer to the database. Another consideration is the development of improved methods for data retrieval, including combined and complex queries, and the incorporation of a programmatically accessible API, allowing for both the submission and the extraction of data to be integrated with existing LIMS and electronic lab notebook systems. 
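Some of the envisaged automated checks are simple enough to sketch. The following Python example illustrates three of them: a Km value outside the range of substrate concentrations studied, a reported rate that would consume all substrate within seconds, and a PTM assigned to a residue that does not allow it. It is a hypothetical sketch, not part of STRENDA DB: the function names, the 10-second threshold and the (deliberately simplified) table of allowed residues are assumptions chosen for illustration.

```python
from typing import List, Optional, Sequence, Tuple

# Simplified, illustrative mapping of PTM type to one-letter residue codes.
# A real check would draw on a curated ontology rather than this table.
ALLOWED_PTM_RESIDUES = {
    "phosphorylation": {"S", "T", "Y"},
    "n-glycosylation": {"N"},
}


def km_range_warning(km_mM: float, substrate_concs_mM: Sequence[float]) -> Optional[str]:
    """Warn if Km lies outside the range of substrate concentrations studied."""
    if not substrate_concs_mM:
        return "No substrate concentrations reported"
    low, high = min(substrate_concs_mM), max(substrate_concs_mM)
    if not low <= km_mM <= high:
        return f"Km {km_mM} mM is outside the studied range {low}-{high} mM"
    return None


def rate_warning(rate_mM_per_s: float, substrate_mM: float,
                 min_seconds: float = 10.0) -> Optional[str]:
    """Warn if the reported rate would convert all substrate almost instantly.

    Such results often point to a 1000-fold mistake in units.
    """
    if rate_mM_per_s <= 0:
        return "Reported rate must be positive"
    seconds_to_depletion = substrate_mM / rate_mM_per_s
    if seconds_to_depletion < min_seconds:
        return (f"Rate would consume all substrate in {seconds_to_depletion:.1f} s; "
                "please check the units")
    return None


def ptm_warnings(sequence: str, ptms: Sequence[Tuple[int, str]]) -> List[str]:
    """Check each (1-based position, PTM type) pair against the protein sequence."""
    warnings = []
    for position, ptm_type in ptms:
        if not 1 <= position <= len(sequence):
            warnings.append(f"{ptm_type} at position {position} is outside the sequence")
            continue
        allowed = ALLOWED_PTM_RESIDUES.get(ptm_type.lower())
        residue = sequence[position - 1]
        if allowed is not None and residue not in allowed:
            warnings.append(f"{ptm_type} is not allowed on residue {residue}{position}")
    return warnings
```

Each function returns a warning message rather than raising an error, in keeping with the article's point that such checks would flag possible mistakes while leaving the final judgement to authors, reviewers and editors.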
NS acknowledges the funding from the Biotechnology and Biological Sciences Research Council (BBSRC) under grants BB/M017702/1, ‘Centre for synthetic biology of fine and speciality chemicals (SYNBIOCHEM)’, BB/K019783/1, ‘Continued development of ChEBI towards better usability for the systems biology and metabolic modelling community’ and BB/M006891/1, ‘Enriching Metabolic PATHwaY models with evidence from the literature (EMPATHY)’. This is a contribution from the Manchester Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM). SS acknowledges the funding from NIH/NIDDK under grant R25 DK088752. STRENDA and STRENDA DB are completely funded by the Beilstein-Institut. The authors declare no conflict of interest. CK initiated and drove the project. UR developed the STRENDA DB under guidance from CK. All authors contributed towards the design and testing of STRENDA DB, contributed initial data sets and wrote and approved the manuscript.