Dealing with Confounders in Omics Analysis
2018; Elsevier BV; Volume: 36; Issue: 5; Language: English
10.1016/j.tibtech.2018.01.013
ISSN: 0167-9430
Authors: Wilson Wen Bin Goh, Limsoon Wong
Topic(s): Biomedical Text Mining and Ontologies
Abstract: Advanced statistics for biomarker development should no longer comprise a pair of naive hypothesis statements. Permutation tests should see greater usage as a robust way to avoid issues with inappropriate null distributions. Standards for addressing confounders in applied statistics must be established.
The Anna Karenina effect is a manifestation of the theory–practice gap that exists when theoretical statistics are applied to real-world data. In the course of analyzing biological data for differential features such as genes or proteins, it derives from the situation where the null hypothesis is rejected for extraneous reasons (or confounders), rather than because the alternative hypothesis is relevant to the disease phenotype. The mechanics of applying statistical tests therefore must address and resolve confounders. It is inadequate to simply rely on manipulating the P-value. We discuss three mechanistic elements (hypothesis statement construction, null distribution appropriateness, and test-statistic construction) and suggest how they can be designed to foil the Anna Karenina effect to select phenotypically relevant biological features.
Glossary:
technical sources of variation, such as different processing times or different handlers, which may confound the discovery of real explanatory variables from data.
also known as sample stratification, a process where samples are ensured to be similar in all other respects except the variable of interest being examined.
biological source of variation that provides discrimination between two phenotypes or classes (e.g., normal and disease).
variables that can create spurious associations.
mixed together or difficult to disambiguate.
the extent of information inferable from one variable about another.
the number of variables, given the calculation of a test statistic, that are free to vary.
identifying the disease or disease subtype.
a test result that declares a feature as insignificant when, in fact, it is significant.
a test result that declares a feature as significant when, in fact, it is not significant.
a condition where the genetic material is constantly being rearranged or rewritten.
a test that evaluates the overrepresentation of a particular group in a sample.
caused by mutation in a single gene.
the probability distribution of the test statistic when the null hypothesis is true. The shapes of these distributions depend on the test used.
a random draw of sample instances where a significant effect is not expected (i.e., the null hypothesis is true).
caused by mutations in many potential genes, often with complex intergene relationships.
a statistical procedure that summarizes high-dimensional data (e.g., comprising tens of thousands of genes) where variables are potentially correlated into a lower-dimensional set of uncorrelated variables known as principal components (PCs).
a prediction of disease outcome.
errors incurred by chance when a randomly drawn sample is nonreflective of its population.
ability to get the same gene set in a signature.
multigene biomarkers.
ability to work across all independent datasets.
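The abstract recommends permutation tests as a way to avoid an inappropriate null distribution. A minimal sketch of the idea follows. It is an illustration only and not the authors' procedure. The function name `permutation_test`, the choice of difference in means as the test statistic, and the expression values are all assumptions made for the example. The null distribution is built empirically by shuffling class labels rather than taken from a theoretical distribution.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Returns the observed difference and an empirical P-value.
    (Hypothetical helper for illustration; not from the article.)
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        # Shuffling the pooled values breaks any link between label and value,
        # so each reshuffled split is one draw from the empirical null.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1

    # Add-one correction keeps the estimate away from an exact zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up expression values for one gene in two classes (assumption).
disease = [5.1, 4.9, 5.6, 5.8, 5.3, 5.7]
normal = [4.2, 4.4, 4.1, 4.6, 4.3, 4.0]
observed_diff, p = permutation_test(disease, normal)
print(f"difference in means = {observed_diff:.2f}, empirical P = {p:.4f}")
```

Shuffling labels across all samples assumes the samples are exchangeable under the null hypothesis. If a confounder such as a batch effect is present, that assumption may not hold, and one option in keeping with the blocking idea described in the glossary would be to shuffle labels only within each batch.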