Review (peer reviewed)

Machine learning for Big Data analytics in plants

2014; Elsevier BV; Volume: 19; Issue: 12; Language: English

10.1016/j.tplants.2014.08.004

ISSN

1878-4372

Authors

Chuang Ma, Hao Helen Zhang, Xiangfeng Wang

Topic(s)

Genomics and Phylogenetic Studies

Abstract

• Use the Big Data technology to assist basic and translational research in plants.
• Basic concepts and procedures, pitfalls, and remedies of using machine learning.
• Demonstration of using machine learning for data-driven discovery of stress genes.

Rapid advances in high-throughput genomic technology have enabled biology to enter the era of ‘Big Data’ (large datasets). The plant science community not only needs to build its own Big-Data-compatible parallel computing and data management infrastructures, but also to seek novel analytical paradigms to extract information from the overwhelming amounts of data. Machine learning offers promising computational and analytical solutions for the integrative analysis of large, heterogeneous and unstructured datasets on the Big-Data scale, and is gradually gaining popularity in biology. This review introduces the basic concepts and procedures of machine-learning applications and envisages how machine learning could interface with Big Data technology to facilitate basic research and biotechnology in the plant sciences.
Glossary

a machine-learning approach that iteratively updates the training dataset by strategically selecting informative data, in order to obtain a classifier with high prediction performance.
a machine-learning approach that iteratively increases the weight of misclassified samples to boost weak classifiers into a stronger classifier.
a framework that allows the automated parallel storing and processing of data on a large cluster of computing nodes.
a project that aims to build a scalable machine-learning library running on Hadoop for Big Data analysis.
a set of numerical or categorical quantities used to describe an example.
a popular term describing large datasets with the features of high velocity, volume, and variety, which are difficult to process using traditional database management and analytical methods.
1 Exabyte (EB) = 1000 Petabytes (PB) = 1 000 000 Terabytes (TB) = 1 000 000 000 Gigabytes (GB).
a new type of use-on-demand data computing and storage paradigm that enables users to build time-consuming applications and manage large datasets on many commodity computing nodes.
a criterion used to measure the performance of a learned model.
the objects from which a model is learned or on which a model will be applied for prediction.
a measure that can be used to find the optimized threshold of machine-learning models with both high precision and recall.
a group of Hadoop-related Big Data storage, access, processing and analysis utilities, including HBase, Spark, Hive, Pig, Sqoop, and Mahout.
a distributed file system developed for accessing and processing the data stored with Hadoop in a parallel manner.
two techniques for handling missing data. Hot deck imputation replaces missing data with substituted values randomly selected from similar samples in the same dataset. By contrast, cold deck imputation selects values from other datasets.
machine-learning algorithms that transform the features of samples into a higher-dimensional space using kernel functions, such as the polynomial and radial basis functions.
a programming model that enables users to easily write programs supporting automated parallel processing distributed on multiple computing nodes.
a measure used in machine learning to evaluate the prediction performance of two-class classifiers by taking into account true positives, true negatives, false positives and false negatives.
a machine-learning algorithm that assigns an output to an example described with a set of attributes.
a technology that produces DNA or RNA fragments with the capacity of high throughput, scalability, speed, and resolution.
a new data management system that enables the storage and manipulation of data through the construction of highly reliable, scalable and distributed databases.
the outcome of a learning problem. The output can be a categorical label (qualitative) or a continuous value (quantitative).
a statistical technique that eliminates redundancy by converting data into a set of linearly uncorrelated variables (i.e., principal components).
a modern machine-learning algorithm that constructs an ensemble of decision trees for classification and regression problems.
an R package that provides an application programming interface (API) for running R scripts with Hadoop.
models built with the support vector machine algorithm to perform the classification of positive and negative samples in a high-dimensional space.
a set of examples used to learn the model.
a set of examples used to select and validate the model.
a set of examples used to assess the generalization performance of a learned model.
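
Two of the evaluation measures described above can be illustrated with a short sketch: the measure that balances precision and recall when choosing a threshold, and the two-class measure built from true positives, true negatives, false positives and false negatives. The sketch assumes these correspond to the conventional F-score and Matthews correlation coefficient (MCC); the function names, the 0/1 label coding and the toy data are hypothetical and are not taken from the review.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for labels coded as 1 (positive) and 0 (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def f_score(tp, fp, fn, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def best_f1_threshold(y_true, scores):
    """Scan each observed score as a cut-off and keep the one with the highest F1."""
    best_f1, best_thr = 0.0, None
    for thr in sorted(set(scores)):
        y_pred = [1 if s >= thr else 0 for s in scores]
        tp, tn, fp, fn = confusion_counts(y_true, y_pred)
        f1 = f_score(tp, fp, fn)
        if f1 > best_f1:
            best_f1, best_thr = f1, thr
    return best_f1, best_thr


if __name__ == "__main__":
    # Toy labels and classifier scores (hypothetical data).
    labels = [1, 1, 0, 1, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
    print(best_f1_threshold(labels, scores))
```

Scanning only the observed scores keeps the search exhaustive for a finite sample; in practice the threshold would be chosen on a validation set and the final F-score or MCC reported on a separate test set, in line with the training, validation and test definitions above.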

Reference(s)