Review (peer-reviewed)

Why Batch Effects Matter in Omics Data, and How to Avoid Them

2017; Elsevier BV; Volume: 35; Issue: 6; Language: English

10.1016/j.tibtech.2017.02.012

ISSN

0167-7799

Authors

Wilson Wen Bin Goh, Wei Wang, Limsoon Wong

Topic(s)

Gene Regulatory Network Analysis

Abstract

Effectively dealing with batch effects will be the next frontier in large-scale biological data analysis, particularly where different data sets must be integrated. Because batch-effect correction can exaggerate cross-validation outcomes, cross-validation is increasingly regarded as a less authoritative form of evaluation. Batch effect-resistant methods will therefore become important in the future, alongside existing batch effect-correction methods.

Effective integration and analysis of new high-throughput data, especially gene-expression and proteomic-profiling data, are expected to deliver novel clinical insights and therapeutic options. Unfortunately, technical heterogeneity, or batch effects (different experiment times, handlers, reagent lots, etc.), have proven challenging. Although batch effect-correction algorithms (BECAs) exist, we know little about effective batch-effect mitigation, and new batch effect-associated problems are still emerging. These include false effects due to misapplying BECAs and positive bias during model evaluations. Depending on the choice of algorithm and the experimental set-up, biological heterogeneity can be mistaken for batch effects and wrongfully removed. Here, we examine these emerging batch effect-associated problems, propose a series of best practices, and discuss some of the challenges that lie ahead.

Glossary

Batch-effect correction: a data-cleaning approach in which batch effects are estimated and removed from the data.
Batch effects: technical sources of variation, such as different processing times or different handlers, which may confound the discovery of the real explanatory variables in the data.
Biological network: a complex system comprising functional relationships among various biological entities. Almost all networks are abstractions; commonly studied ones include protein–protein interaction, signaling, and metabolic networks.
Class effects: biological sources of variation that discriminate between two phenotypes or classes, such as normal versus disease.
Classifier: a trained system that has established a set of classification rules derived from a set of predictor variables. Given observed values of the predictor variables for a sample, the classifier uses these rules to assign the sample to a class.
Confounded: mixed together and difficult to disambiguate. Confounding can be expressed quantitatively; for example, batch and class effects are perfectly confounded if all class A samples are in batch 1 and all class B samples are in batch 2 (illustrated in the sketch after this glossary).
Cross-validation: an evaluative technique used to infer how the outcome of a statistical analysis in one instance may generalize to another, provided that the instances are sampled subsets of one original dataset.
Differential expression: a continuous measurement indicating the extent of difference between a variable in one class and the same variable in another class.
Effect size: the magnitude of difference for a given variable between two classes.
False negative: a test result that declares a feature insignificant when in fact it is significant.
False positive: a test result that declares a feature significant when in fact it is not.
Feature selection: a statistical technique for simplifying models by identifying and retaining only the most relevant variables among all those measured.
Generalize: to make universally applicable. Here, 'generalizable' means that the findings of one study also apply to another.
Independent validation: an evaluative technique used to infer how the outcome of a statistical analysis in one instance may generalize to another, provided that the instances are not derived from one original dataset but are obtained independently.
Meta-analysis: an analytical technique in which data from each laboratory or each batch are analyzed independently, with the expectation that the analyses lead to mutually supportive findings.
Positive bias: the tendency to overestimate the likelihood of an outcome.
Power: the tendency of a statistical test to detect a true effect.
Relevance: a qualitative variable indicating whether a feature is pertinent to differentiating one class from another.
Reproducibility: the tendency to produce the same findings given independent resamplings from the same reference populations.
Sample classes: groups of phenotypically equivalent samples, such as normal or disease.
Subpopulations: genetic variants of the same disease (subtypes); also referred to as biological heterogeneity.
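To make the confounding example above concrete, the following Python sketch (not taken from the article; per-batch mean-centering is a deliberately crude stand-in for a real BECA such as ComBat, and all names are illustrative) shows that when every class A sample sits in batch 1 and every class B sample sits in batch 2, removing the estimated batch effect also removes the class effect, which is exactly how biological heterogeneity can be wrongfully discarded.

import numpy as np

# Simulated study with a perfectly confounded design:
# all class A samples are in batch 1, all class B samples are in batch 2.
rng = np.random.default_rng(0)
n_per_group, n_features = 20, 100

class_shift = np.zeros(n_features)      # true biology: class B is shifted
class_shift[:10] = 1.0                  # by +1 in the first 10 features
batch_shift = np.full(n_features, 2.0)  # technical artifact: batch 2 is shifted by +2 everywhere

class_a = rng.normal(size=(n_per_group, n_features))                              # batch 1
class_b = rng.normal(size=(n_per_group, n_features)) + class_shift + batch_shift  # batch 2

def mean_center_per_batch(*batches):
    # Toy "correction": subtract each batch's own per-feature means.
    return [b - b.mean(axis=0) for b in batches]

a_corrected, b_corrected = mean_center_per_batch(class_a, class_b)

# Before correction the apparent class difference is inflated by the batch shift;
# after correction it is erased entirely, because the correction cannot tell
# batch apart from class when the two are perfectly confounded.
print("mean A-vs-B difference over the 10 truly differential features")
print("  raw      :", round(float((class_b - class_a).mean(axis=0)[:10].mean()), 2))           # about 3.0
print("  corrected:", round(float((b_corrected - a_corrected).mean(axis=0)[:10].mean()), 2))   # about 0.0

The same leakage logic underlies the optimistic cross-validation results noted in the abstract: if a BECA is fitted on the complete dataset before the training/test split, the corrected test values already carry information about the training samples, so the measured accuracy is biased upward.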
