Article Open access Peer reviewed

Improved Statistical Inference from DNA Microarray Data Using Analysis of Variance and A Bayesian Statistical Framework

2001; Elsevier BV; Volume: 276; Issue: 23 Language: English

10.1074/jbc.m010192200

ISSN

1083-351X

Authors

Anthony D. Long, Harry J. Mangalam, Bob Chan, Lorenzo Tolleri, G. Wesley Hatfield, Pierre Baldi

Topic(s)

Bioinformatics and Genomic Networks

Abstract

We describe statistical methods based on the t test that can be conveniently used on high density array data to test for statistically significant differences between treatments. These t tests employ either the observed variance among replicates within treatments or a Bayesian estimate of the variance among replicates within treatments based on a prior estimate obtained from a local estimate of the standard deviation. The Bayesian prior allows statistical inference to be made from microarray data even when experiments are only replicated at nominal levels. We apply these new statistical tests to a data set that examined differential gene expression patterns in IHF+ and IHF− Escherichia coli cells (Arfin, S. M., Long, A. D., Ito, E. T., Tolleri, L., Riehle, M. M., Paegle, E. S., and Hatfield, G. W. (2000) J. Biol. Chem. 275, 29672–29684). These analyses identify a more biologically reasonable set of candidate genes than those identified using statistical tests not incorporating a Bayesian prior. We also show that statistical tests based on analysis of variance and a Bayesian prior identify genes that are up- or down-regulated following an experimental manipulation more reliably than approaches based only on a t test or fold change. All the described tests are implemented in a simple-to-use web interface called Cyber-T that is located on the University of California at Irvine genomics web site.

Abbreviations: IHF, integration host factor; ORF, open reading frame.

The recent availability of complete genomic sequences and/or large numbers of cDNA clones from model organisms, coupled with technical advances in DNA arraying technology, has made it possible to study genome-wide patterns of gene expression. Most high density microarray experiments consist of one of two types: examining changes in gene expression over a temporal or treatment gradient (1. DeRisi J.L., Iyer V.R., Brown P.O. Science. 1997; 278: 680-686) or comparing gene expression between two different cell sample types or genotypes (2. Schena M., Shalon D., Davis R.W., Brown P.O. Science. 1995; 270: 467-470; 3. Schena M., Shalon D., Heller R., Chai A., Brown P.O., Davis R.W. Proc. Natl. Acad. Sci. U. S. A. 1996; 93: 10614-10619; 4. Lashkari D.A., DeRisi J.L., McCusker J.H., Namath A.F., Gentile C., Hwang S.Y., Brown P.O., Davis R.W. Proc. Natl. Acad. Sci. U. S. A. 1997; 94: 13057-13062; 5. Arfin S.M., Long A.D., Ito E.T., Tolleri L., Riehle M.M., Paegle E.S., Hatfield G.W. J. Biol. Chem. 2000; 275: 29672-29684). Fluorescently or isotopically labeled cDNA or RNA probes are hybridized to high density arrays of cDNA clones on glass supports (1-4; 6. DeRisi J.L., Penland L., Brown P.O., Bittner M.L., Meltzer P.S., Ray M., Chen Y., Su Y.A., Trent J.M. Nat. Genet. 1996; 14: 457-460; 7. Shalon D., Smith S.J., Brown P.O. Genome Res. 1996; 6: 639-645; 8. Heller R.A., Schena M., Chai A., Shalon D., Bedillon T., Gilmore J., Woolley D.E., Davis R.W. Proc. Natl. Acad. Sci. U. S. A. 1997; 94: 2150-2155), nylon membranes (9. Lennon G.G., Lehrach H. Trends Genet. 1991; 7: 314-317; 10. Gress T.M., Hoheisel J.D., Lennon G.G., Zehetner G., Lehrach H. Mamm. Genome. 1992; 3: 609-619; 11. Nguyen C., Rocha D., Granjeaud S., Baldit M., Bernard K., Naquet P., Jordan B.R. Genomics. 1995; 29: 207-216; 12. Takahashi N., Hashida H., Zhao N., Misumi Y., Sakaki Y. Gene (Amst.). 1995; 164: 219-227; 13. Zhao N., Hashida H., Takahashi N., Misumi Y., Sakaki Y. Gene (Amst.). 1995; 156: 207-213; 14. Pietu G., Alibert O., Guichard V., Lamy B., Bois F., Leroy E., Mariage-Samson R., Houlgatte R., Soularue P., Auffray C. Genome Res. 1996; 6: 492-503; 15. Rovere P., Trucy J., Zimmerman V.S., Granjeaud S., Rocha D., Nguyen C., Ricciardi-Castagnoli P., Jordan B.R., Davoust J. Adv. Exp. Med. Biol. 1997; 417: 467-473), or oligonucleotides directly synthesized on silica wafers (16. Fodor S.P., Read J.L., Pirrung M.C., Stryer L., Lu A.T., Solas D. Science. 1991; 251: 767-773; 17. Lipshutz R.J., Fodor S.P., Gingeras T.R., Lockhart D.J. Nat. Genet. 1999; 21: 20-24). Signals are quantified using phosphorimaging, photomultiplier tubes, or CCD imaging, and a data set is created that consists of expression measurements for all of the elements of the array.

Despite rapid technological developments, the statistical tools required to analyze these fundamentally different types of DNA microarray data are not in place. Data often consist of expression measures for thousands of genes, but experimental replication at the level of single genes is often low. This creates problems of statistical inference because many genes will show fairly large changes in gene expression by chance alone. Therefore, to interpret data from DNA microarrays it is necessary to employ statistical methods capable of distinguishing chance occurrences from biologically meaningful data.

The t test can be used to determine whether the observed difference between two means is statistically significant (18. Sokal R.R., Rohlf F.J. Biometry. W. H. Freeman and Co., New York, 1995: 219-227). The t test incorporates a measure of within treatment error into the statistical test; as a result only genes showing a large change in gene expression relative to the within treatment variance are considered to have significantly changed.
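
As a rough illustration of how the t test weighs a change in expression against the within treatment variance, the sketch below computes the fold change and a two-sample t statistic for two hypothetical genes. The expression values and function names are invented for illustration and are not taken from the IHF data set. The standard error uses each treatment's own sample variance, which is an assumption made here; with equal numbers of replicates it gives the same t as the pooled-variance form.

```python
import math
from statistics import mean, variance


def fold_change(control, experimental):
    """Ratio of the experimental mean to the control mean."""
    return mean(experimental) / mean(control)


def t_statistic(control, experimental):
    """Two-sample t statistic built from each treatment's sample variance."""
    n_c, n_e = len(control), len(experimental)
    standard_error = math.sqrt(variance(control) / n_c + variance(experimental) / n_e)
    return (mean(experimental) - mean(control)) / standard_error


# Hypothetical expression values, four replicates per treatment.
# Both genes have a control mean of 100 and an experimental mean of 200.
tight_control, tight_experimental = [98, 100, 102, 100], [198, 200, 202, 200]
noisy_control, noisy_experimental = [40, 160, 90, 110], [120, 330, 150, 200]

for label, c, e in [("tight", tight_control, tight_experimental),
                    ("noisy", noisy_control, noisy_experimental)]:
    print(f"{label}: fold change = {fold_change(c, e):.2f}, t = {t_statistic(c, e):.2f}")
```

Both hypothetical genes show the same twofold change, but the gene with tightly agreeing replicates yields a far larger t statistic than the gene with scattered replicates, which is the distinction a fold change threshold alone cannot make.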
In a perfect world, all DNA microarray experiments would be highly replicated. Such replication would allow accurate estimates of the variance within experimental treatments to be obtained, and the t test would perform well. However, samples may be available in limited supply, and DNA microarray experiments are expensive and time consuming to carry out. As a result, the level of replication within experimental treatments is often low. This results in poor estimates of variance and a correspondingly poor performance of the t test itself.

An alternative to the t test is to ignore the within treatment variance and only look at fold change as a proxy for statistical significance. Intuition suggests that larger observed fold changes can be more confidently interpreted as a stronger response to the experimental manipulation than smaller observed fold changes. However, an implicit assumption of this reasoning is that the variance among replicates within treatments is the same for every gene. In reality, the variance differs among genes (e.g. see Fig. 2), and it is critical to incorporate this information into a statistical test. These different approaches demonstrate the statistical problem faced when analyzing DNA microarray data. Ignoring the sampling variance is incorrect, yet incorporating it in the traditional manner may not be much better because the number of replicate experiments is often quite low.

Although there is no substitute for experimental replication, confidence in the interpretation of DNA microarray data with a low number of replicates can be improved by using a Bayesian statistical approach (19. Baldi P., Brunak S. Bioinformatics: The Machine Learning Approach. MIT Press, Cambridge, MA, 1998) that incorporates prior information about within treatment measurement error. The Bayesian prior assumes that genes of similar expression levels have similar measurement errors, amounting to parallel pseudo-replication of the experiment.
For example, the variance of any single gene can be estimated from the variance of a number of genes of similar expression level. More specifically, the variance of any gene within any given treatment can be estimated by the weighted average of the observed variance for that gene and a prior estimate of that variance. The weighting factor, or hyperparameter, is controlled by the experimenter and will depend on how confident the experimenter is that the background variance of a closely related set of genes approximates the variance of the gene under consideration. In the Bayesian approach employed in this study, the weight given to the within gene variance estimate is a function of the number of observations contributing to that value. This leads to the desirable property of the Bayesian approach converging to the t test as the experimenter carries out additional replications and thus becomes more confident in the observed estimate of within treatment variance.

Commonly used software packages are poorly suited for implementing the Bayesian statistical methods we develop in this work. However, we have created a program, Cyber-T, which accommodates this approach. Cyber-T is available for on-line use at the genomics web site at the University of California at Irvine. This program is ideally suited to experimental designs in which replicate control cDNA samples are being compared with replicate experimental cDNA samples.

In this study we use the statistical tools incorporated into Cyber-T to compare and analyze the gene expression profiles obtained from a wild-type strain of Escherichia coli and an otherwise isogenic strain lacking the gene for the global regulatory protein, integration host factor (IHF), previously reported by Arfin et al. (5).
We apply different statistical methods for identifying genes showing changes in expression to this data set and show that a Bayesian approach identifies a stronger set of genes as being significantly up- or down-regulated based on our biological understanding of IHF regulation. We show that commonly used approaches for identifying genes as being up- or down-regulated (i.e., simple t test or fold change thresholds) require more replication to approach the same level of reliability as Bayesian statistical approaches applied to data sets with more modest levels of replication. We further show that statistical tests identify a different set of genes than those based on fold change and argue that the set of genes identified by fold change is more likely to harbor experimental artifacts.

DISCUSSION

At present, the significance of high density array data is often judged solely on the basis of observed fold change in expression. An arbitrary fold change threshold is created, with genes showing greater change than that threshold declared significant and all others declared nonsignificant. It is apparent from Fig. 2 that fold change in expression may not be a good proxy for statistical significance. If analyses are carried out on nontransformed data, any given threshold of fold change in expression will be liberal for genes expressed at a high level and conservative for genes expressed at a low level. Conversely, if analyses are carried out on log-transformed data, the threshold will be conservative for genes expressed at a high level and liberal for genes expressed at a low level. In cases where either the control or experimental observations are replicated, it is possible to assess the significance of the difference between the control and experimental data relative to the observed level of within class variation.
This results in smaller fold changes being significant for genes whose expression levels are measured with great accuracy and large fold changes being nonsignificant for genes whose expression levels cannot be measured very accurately.

Ignoring the variation among replicates, or not carrying out replication and determining significance based solely on fold change, does not negate the problems discussed above; it amounts to assuming the variance among replicates is equal for all genes. The support for an expression level difference being meaningful relative to the observed variation within treatments is conveniently represented by the t statistic (see "Materials and Methods").

Observed differences between a single replication of a control versus experimental treatment can be due to inadequately controlled experimental factors as opposed to the experimental condition itself. Examples of such variables may include small differences in the time or method of harvesting cells, differences between tissue samples not related to the experiment, and variation induced by the RNA isolation or labeling protocol. For this reason, replicates of high density array experiments are particularly useful. Replication will increase the likelihood of detecting subtle changes in expression between treatments while decreasing the likelihood of false positives. In an ideal world, high density array experiments would be replicated a "large" number of times (e.g. >10), and the t statistic would measure the relative support for a difference between control and experimental treatments being due to chance alone. In practice, high density array experiments are rarely replicated this many times, and the sampling variance on the t statistic is therefore quite large. In this context, using only the t statistic (or a corresponding p value) as a measure of whether or not a gene is significant can be misleading.
Nonetheless, we have found it useful to sort genes on p values as an exploratory tool for identifying potentially interesting genes (5).

A Bayesian approach to estimating the within treatment variation among replicates has been implemented within Cyber-T. The use of the weighted average of the "local" standard deviation for genes with similar expression levels and the observed gene-specific standard deviation stabilizes within treatment variance estimates. Increasing the precision of variance estimates in both the control and experimental treatments results in more stable t statistics. This allows inferences to be drawn from high density array experiments that have been carried out with nominal levels of replication. This is demonstrated in Fig. 3, where statistical inference using the Bayesian approach with only two replicates approaches that normally achieved with more replication. There remains a possibility that different genes of similar expression levels have widely differing true variances. Under this possibility and the current prior, poorly replicable genes will be falsely declared significant, and intrinsically highly replicable genes will be falsely declared not significant. Although this would represent an undesirable outcome, when experiments show little replication the relative error in inference introduced by an incorrect prior is likely to be less than the error in inference introduced from very poor estimates of within treatment variance. Ultimately, it will be important to derive empirical guidelines for the determination of the correct hyperparameter to use in weighting the prior information (i.e. the local average standard deviation) relative to the observed within treatment estimate of variation.
It is possible that the best weighting will depend on factors such as the biological system being studied, experimental conditions employed, and high density array technology used.

Analyses in Cyber-T are performed on both log-transformed and nontransformed data. Log transformations are carried out for three reasons. First, in plots of raw data (Fig. 2 A) many of the data points are clustered at the low end of the values. Plots of log-transformed data tend to expand these low values and make them easier to examine visually (Fig. 2 B). Second, an assumption of the t test is that the variances of the two groups being tested are equal. Although the t test is fairly robust with respect to violations of this assumption (especially when the sample sizes of the two groups are equal), if the variances of the two treatments are widely different the statistical test for a difference between means may not be valid. Often, unequal variances between treatments result from the variance in a set of observations scaling with their mean. Log transformations often reduce or eliminate this dependence. It can be seen from C and D of Fig. 2 that the variance in raw expression level is a function of the mean, and in this case a log transformation may be appropriate. E and F of Fig. 2 show that the dependence of the variance on the mean is somewhat uncoupled following a log transformation. Interestingly, in these plots it appears that the variance in log-transformed expression levels is higher for genes expressed at lower rather than at higher levels. These plots suggest that genes expressed at low or near background levels may be good candidates for ignoring in expression analyses. The variance in the measurement of genes expressed at a low level is large enough that in many cases it is difficult to detect significant changes in expression for this class of loci. Third, statistical tests of log-transformed data have an intuitive appeal.
The base of the logarithm raised to the difference between the logs of two numbers is equal to the ratio of the two numbers (i.e. $a/b = e^{\ln a - \ln b}$). Thus a test of the significance of the difference between the log expression levels of two genes is equivalent to a test of whether or not their fold change is significantly different from 1.

We have shown that statistical tests for changes in genes that incorporate the within treatment variance and a Bayesian prior on the estimate of the within treatment variance have a number of desirable properties. They are generally more consistent than tests not employing a Bayesian prior, implying that they give similar results when high density array experiments are replicated. Tests incorporating a measure of experimental error into the test statistic do not identify genes showing large fold changes in expression that also show little correspondence over within treatment replicates. The IHF data presented in Fig. 3 suggest that a Bayesian statistical framework facilitates the identification of more true positives and fewer false positives with fewer replications. A primary deterrent to a more widespread adoption of statistical approaches incorporating a Bayesian prior for the analysis of high density array data is the lack of software that can easily be used to carry out such analyses. We have implemented the approaches described in this work and have created a simple-to-use web interface that makes these tools widely available and accessible.

In summary, Cyber-T provides an easily accessible interface that allows routine assessment of high density array data for statistical significance. The incorporation of a Bayesian prior into the commonly accepted t test allows statistical inferences to be drawn from high density array data that are not highly replicated.
Although it is often difficult to achieve the levels of statistical significance necessary to satisfy a stringent criterion for experiment-wide significance, the p values generated in Cyber-T can be used to rank genes and determine those differences most likely to be real. The recent availability of complete genomic sequences and/or large numbers of cDNA clones from model organisms coupled with technical advances in DNA arraying technology have made it possible to study genome-wide patterns of gene expression. Most high density microarray experiments consist of one of two types: examining changes in gene expression over a temporal or treatment gradient (1DeRisi J.L. Iyer V.R. Brown P.O. Science. 1997; 278: 680-686Crossref PubMed Scopus (3686) Google Scholar) or comparing gene expression between two different cell sample types or genotypes (2Schena M. Shalon D. Davis R.W. Brown P.O. Science. 1995; 270: 467-470Crossref PubMed Scopus (7564) Google Scholar, 3Schena M. Shalon D. Heller R. Chai A. Brown P.O. Davis R.W. Proc. Natl. Acad. Sci. U. S. A. 1996; 93: 10614-10619Crossref PubMed Scopus (1411) Google Scholar, 4Lashkari D.A. DeRisi J.L. McCusker J.H. Namath A.F. Gentile C. Hwang S.Y. Brown P.O. Davis R.W. Proc. Natl. Acad. Sci. U. S. A. 1997; 94: 13057-13062Crossref PubMed Scopus (530) Google Scholar, 5Arfin S.M. Long A.D. Ito E.T. Tolleri L. Riehle M.M. Paegle E.S. Hatfield G.W. J. Biol. Chem. 2000; 275: 29672-29684Abstract Full Text Full Text PDF PubMed Scopus (220) Google Scholar). Fluorescently or isotopically labeled cDNA or RNA probes are hybridized to high density arrays of cDNA clones on glass supports (1DeRisi J.L. Iyer V.R. Brown P.O. Science. 1997; 278: 680-686Crossref PubMed Scopus (3686) Google Scholar, 2Schena M. Shalon D. Davis R.W. Brown P.O. Science. 1995; 270: 467-470Crossref PubMed Scopus (7564) Google Scholar, 3Schena M. Shalon D. Heller R. Chai A. Brown P.O. Davis R.W. Proc. Natl. Acad. Sci. U. S. A. 
1996; 93: 10614-10619Crossref PubMed Scopus (1411) Google Scholar, 4Lashkari D.A. DeRisi J.L. McCusker J.H. Namath A.F. Gentile C. Hwang S.Y. Brown P.O. Davis R.W. Proc. Natl. Acad. Sci. U. S. A. 1997; 94: 13057-13062Crossref PubMed Scopus (530) Google Scholar, 6DeRisi J.L. Penland L. Brown P.O. Bittner M.L. Meltzer P.S. Ray M. Chen Y. Su Y.A. Trent J.M. Nat. Genet. 1996; 14: 457-460Crossref PubMed Scopus (1748) Google Scholar, 7Shalon D. Smith S.J. Brown P.O. Genome Res. 1996; 6: 639-645Crossref PubMed Scopus (877) Google Scholar, 8Heller R.A. Schena M. Chai A. Shalon D. Bedillon T. Gilmore J. Woolley D.E. Davis R.W. Proc. Natl. Acad. Sci. U. S. A. 1997; 94: 2150-2155Crossref PubMed Scopus (664) Google Scholar), nylon membranes (9Lennon G.G. Lehrach H. Trends Genet. 1991; 7: 314-317Abstract Full Text PDF PubMed Scopus (166) Google Scholar, 10Gress T.M. Hoheisel J.D. Lennon G.G. Zehetner G. Lehrach H. Mamm. Genome. 1992; 3: 609-619Crossref PubMed Scopus (131) Google Scholar, 11Nguyen C. Rocha D. Granjeaud S. Baldit M. Bernard K. Naquet P. Jordan B.R. Genomics. 1995; 29: 207-216Crossref PubMed Scopus (173) Google Scholar, 12Takahashi N. Hashida H. Zhao N. Misumi Y. Sakaki Y. Gene ( Amst. ). 1995; 164: 219-227Crossref PubMed Scopus (25) Google Scholar, 13Zhao N. Hashida H. Takahashi N. Misumi Y. Sakaki Y. Gene ( Amst. ). 1995; 156: 207-213Crossref PubMed Scopus (105) Google Scholar, 14Pietu G. Alibert O. Guichard V. Lamy B. Bois F. Leroy E. Mariage- Samson R. Houlgatte R. Soularue P. Auffray C. Genome Res. 1996; 6: 492-503Crossref PubMed Scopus (158) Google Scholar, 15Rovere P. Trucy J. Zimmerman V.S. Granjeaud S. Rocha D. Nguyen C. Ricciardi-Castagnoli P. Jordan B.R. Davoust J. Adv. Exp. Med. Biol. 1997; 417: 467-473Crossref PubMed Scopus (2) Google Scholar), or oligonucleotides directly synthesized on silica wafers (16Fodor S.P. Read J.L. Pirrung M.C. Stryer L. Lu A.T. Solas D. Science. 1991; 251: 767-773Crossref PubMed Scopus (2424) Google Scholar, 17Lipshutz R.J. 
Fodor S.P. Gingeras T.R. Lockhart D.J. Nat. Genet. 1999; 21: 20-24Crossref PubMed Scopus (1856) Google Scholar). Signals are quantified using phosphorimaging, photomultiplier tubes, or CCD imaging, and a data set is created that consists of expression measurements for all of the elements of the array. Despite rapid technological developments, the statistical tools required to analyze these fundamentally different types of DNA microarray data are not in place. Data often consist of expression measures for thousands of genes, but experimental replication at the level of single genes is often low. This creates problems of statistical inferences because many genes will show fairly large changes in gene expression purely by chance alone. Therefore, to interpret data from DNA microarrays it is necessary to employ statistical methods capable of distinguishing chance occurrences from biologically meaningful data. The t test can be used to determine whether the observed difference between two means is statistically significant (18Sokal R.R. Rohlf F.J. Biometry. W. H. Freeman and Co., New York1995: 219-227Google Scholar). Thet test incorporates a measure of within treatment error into the statistical test; as a result only genes showing a large change in gene expression relative to the within treatment variance are considered to have significantly changed. In a perfect world, all DNA microarray experiments would be highly replicated. Such replication would allow accurate estimates of the variance within experimental treatments to be obtained, and the t test would perform well. However, samples may be available in limited supply, and DNA microarray experiments are expensive and time consuming to carry out. As a result, the level of replication within experimental treatments is often low. This results in poor estimates of variance and a correspondingly poor performance of the t test itself. 
An alternative to the t test is to ignore the within treatment variance and only look at fold change as a proxy for statistical significance. Intuition suggests that larger observed fold changes can be more confidently interpreted as a stronger response to the experimental manipulation than smaller observed fold changes. However, an implicit assumption of this reasoning is that the variance among replicates within treatments is the same for every gene. In reality, the variance varies among genes (e.g. see Fig. 2), and it is critical to incorporate this information into a statistical test. These different approaches demonstrate the statistical problem faced when analyzing DNA microarray data. Ignoring the sampling variance is incorrect, yet incorporating it in the traditional manner may not be much better because the number of replicate experiments is often quite low. Although there is no substitute for experimental replication, confidence in the interpretation of DNA microarray data with a low number of replicates can be improved by using a Bayesian statistical approach (19Baldi P. Brunak S. Bioinformatics: The Machine Learning Approach. MIT Press, Cambridge, MA1998Google Scholar) that incorporates prior information of within treatment measurement. The Bayesian prior assumes that genes of similar expression levels have similar measurement errors, amounting to parallel pseudo-replication of the experiment. For example, the variance of any single gene can be estimated from the variance from a number of genes of similar expression level. More specifically, the variance of any gene within any given treatment can be estimated by the weighted average of a prior estimate of the variance for that gene. This weighting factor, or hyperparameter, is controlled by the experimenter and will depend on how confident the experimenter is that the background variance of a closely related set of genes approximates the variance of the gene under consideration. 
In the Bayesian approach employed in this study, the weight given to the within gene variance estimate is a function of the number of observations contributing to that value. This leads to the desirable property of the Bayesian approach converging to the t test as the experimenter carries out additional replications and thus becomes more confident in the observed estimate of within treatment variance. Commonly used software packages are poorly suited for implementing the Bayesian statistical methods we develop in this work. However, we have created a program, Cyber-T, which accommodates this approach. Cyber-T is available for on-line use at the genomics web site at the University of California at Irvine. This program is ideally suited to experimental designs in which replicate control cDNA samples are being compared with replicate experimental cDNA samples. In this study we use the statistical tools incorporated into Cyber-T to compare and analyze the gene expression profiles obtained from a wild-type strain of Escherichia coli and an otherwise isogenic strain lacking the gene for the global regulatory protein, integration host factor (IHF),1 previously reported by Arfin et al. (5Arfin S.M. Long A.D. Ito E.T. Tolleri L. Riehle M.M. Paegle E.S. Hatfield G.W. J. Biol. Chem. 2000; 275: 29672-29684Abstract Full Text Full Text PDF PubMed Scopus (220) Google Scholar). We apply different statistical methods for identifying genes showing changes in expression to this data set and show that a Bayesian approach identifies a stronger set of genes as being significantly up- or down-regulated based on our biological understanding of IHF regulation. We show that commonly used approaches for identifying genes as being up- or down-regulated (i.e., simple t test or fold change thresholds) require more replication to approach the same level of reliability as Bayesian statistical approaches applied to data sets with more modest levels of replication. 
We further show that statistical tests identify a different set of genes than those based on fold change and argue that the set of genes identified by fold change is more likely to harbor experimental artifacts. DISCUSSIONAt present, the significance of high density array data is often judged solely on the basis of observed fold change in expression. An arbitrary fold change threshold is created, with genes showing greater change than that threshold declared significant and all others declared nonsignificant. It is apparent from Fig. 2 that fold change in expression may not be a good proxy for statistical significance. If analyses are carried out on nontransformed data, any given threshold of fold change in expression will be liberal for genes expressed at a high level and conservative for genes expressed at a low level. Conversely, if analyses are carried out on log-transformed data, the threshold will be conservative for genes expressed at a high level and liberal for genes expressed at a low level. In cases where either the control or experimental observations are replicated, it is possible to assess the significance of the difference between the control and experimental data relative to the observed level of within class variation. This results in smaller fold changes being significant for genes whose expression levels are measured with great accuracy and large fold changes being nonsignificant for genes whose expression levels cannot be measured very accurately.Ignoring the variation among replicates or not carrying out replication and determining significance based solely on fold change do not negate the problems discussed above; it amounts to assuming the variance among replicates is equal for all genes. 
The support for an expression level difference being meaningful relative to the observed variation within treatments is conveniently represented by the t statistic (see "Materials and Methods").

Observed differences between a single replication of a control versus experimental treatment can be due to inadequately controlled experimental factors as opposed to the experimental condition itself. Examples of such variables may include small differences in the time or method of harvesting cells, differences between tissue samples not related to the experiment, and variation induced by the RNA isolation or labeling protocol. For this reason, replicates of high density array experiments are particularly useful. Replication will increase the likelihood of detecting subtle changes in expression between treatments while decreasing the likelihood of false positives. In an ideal world, high density array experiments would be replicated a "large" number of times (e.g. >10), and the t statistic would measure the relative support for a difference between control and experimental treatments being due to chance alone. In practice, high density array experiments are rarely replicated this many times, and the sampling variance on the t statistic is therefore quite large. In this context, using only the t statistic (or a corresponding p value) as a measure of whether or not a gene is significant can be misleading. Nonetheless, we have found it useful to sort genes on p values as an exploratory tool for identifying potentially interesting genes (5).

A Bayesian approach to estimating the within treatment variation among replicates has been implemented within Cyber-T.
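The general idea of such an estimate can be sketched as a weighted average of a background ("local") variance and the observed gene-specific variance. The function name, the hyperparameter `prior_weight`, the numbers, and the exact weighting below are illustrative assumptions, not the Cyber-T implementation; the expression actually used is given in "Materials and Methods" and may differ in detail.

```python
import statistics

def regularized_variance(replicates, prior_variance, prior_weight):
    """Weighted average of a prior (local) variance and the observed variance.

    prior_weight acts as a number of pseudo-observations backing the prior;
    the observed variance is weighted by its n - 1 degrees of freedom.
    Illustrative form only.
    """
    n = len(replicates)
    observed = statistics.variance(replicates)  # unbiased sample variance
    return (prior_weight * prior_variance + (n - 1) * observed) / (prior_weight + n - 1)

# Hypothetical values: a local variance estimated from genes of similar
# expression level, and a gene whose two replicates happen to agree closely.
local_variance = 0.25
two_replicates = [1.00, 1.02]
many_replicates = [1.00, 1.02] * 50  # the same pattern observed 100 times

est_two = regularized_variance(two_replicates, local_variance, prior_weight=10)
est_many = regularized_variance(many_replicates, local_variance, prior_weight=10)
print(est_two, est_many)
```

With two replicates the estimate stays close to the local variance, which guards against an implausibly small sample variance inflating the t statistic. As replicates accumulate, the observed variance dominates, so the resulting test moves toward the ordinary t test.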
The use of the weighted average of the "local" standard deviation for genes with similar expression levels and the observed gene-specific standard deviation stabilizes within treatment variance estimates. Increasing the precision of variance estimates in both the control and experimental treatments results in more stable t statistics. This allows inferences to be drawn from high density array experiments that have been carried out with nominal levels of replication. This is demonstrated in Fig. 3 where statistical inference using the Bayesian approach with only two replicates approaches that normally achieved with more replication. There remains a possibility that different genes of similar expression levels have widely differing true variances. Under this possibility and the current prior, poorly replicable genes will be falsely declared significant, and intrinsically highly replicable genes will be falsely declared not significant. Although this would represent an undesirable outcome, when experiments show little replication the relative error in inference introduced by an incorrect prior is likely to be less than the error in inference introduced from very poor estimates of within treatment variance. Ultimately, it will be important to derive empirical guidelines for the determination of the correct hyperparameter to use in weighting the prior information (i.e. the local average standard deviation) relative to the observed within treatment estimate of variation. It is possible that the best weighting will depend on factors such as the biological system being studied, experimental conditions employed, and high density array technology used.

Analyses in Cyber-T are performed on both log-transformed and nontransformed data. Log transformations are carried out for three reasons. First, in plots of raw data (Fig. 2A) many of the data points are clustered at the low end of the values.
Plots of log-transformed data tend to expand these low values and make them easier to examine visually (Fig. 2B). Second, an assumption of the t test is that the variances of the two groups being tested are equal. Although the t test is fairly robust with respect to violations of this assumption (especially when the sample sizes of the two groups are equal), if the variances of the two treatments are widely different the statistical test for a difference between means may not be valid. Often, unequal variances between treatments result from the variance in a set of observations scaling with their mean. Log transformations often reduce or eliminate this dependence. It can be seen from C and D of Fig. 2 that the variance in raw expression level is a function of the mean, and in this case a log transformation may be appropriate. E and F of Fig. 2 show that the dependence of the variance on the mean is somewhat uncoupled following a log transformation. Interestingly, in these plots it appears that the variance in log-transformed expression levels is higher for genes expressed at lower rather than at higher levels. These plots suggest that genes expressed at low or near background levels may be good candidates for ignoring in expression analyses. The variance in the measurement of genes expressed at a low level is large enough that in many cases it is difficult to detect significant changes in expression for this class of loci. Third, statistical tests of log-transformed data have an intuitive appeal. Raising the base of the logarithm to the difference between the logs of two numbers gives the ratio of the two numbers (i.e. $a/b = e^{\ln a - \ln b}$).
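The second and third reasons can be checked numerically with an idealized sketch. The values below are hypothetical and constructed so that the spread scales exactly with the mean; real array data, as noted above, retain some dependence of variance on expression level after the transformation.

```python
import math
import statistics

# Hypothetical replicate measurements for a weakly expressed gene; the
# strongly expressed gene is the same pattern scaled 100-fold, so its
# standard deviation scales with its mean.
low = [8.0, 10.0, 12.5]
high = [x * 100 for x in low]

raw_sd_low = statistics.stdev(low)
raw_sd_high = statistics.stdev(high)
log_sd_low = statistics.stdev([math.log(x) for x in low])
log_sd_high = statistics.stdev([math.log(x) for x in high])

print(raw_sd_low, raw_sd_high)  # raw spread differs 100-fold
print(log_sd_low, log_sd_high)  # log spread is essentially identical

# A difference of logs is the log of the ratio (the fold change).
a, b = 1300.0, 1000.0
ratio_from_logs = math.exp(math.log(a) - math.log(b))
print(a / b, ratio_from_logs)
```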
Thus a test of the significance of the difference between the log expression levels of two genes is equivalent to a test of whether or not their fold change is significantly different.

We have shown that statistical tests for changes in genes that incorporate the within treatment variance and a Bayesian prior on the estimate of the within treatment variance have a number of desirable properties. They are generally more consistent than tests not employing a Bayesian prior, implying that they give similar results when high density array experiments are replicated. Tests incorporating a measure of experimental error into the test statistic do not identify genes showing large fold changes in expression that also show little correspondence over within treatment replicates. The IHF data presented in Fig. 3 suggest that a Bayesian statistical framework facilitates the identification of more true positives and fewer false positives with fewer replications. A primary deterrent to a more widespread adoption of statistical approaches incorporating a Bayesian prior for the analysis of high density array data is the lack of software that can easily be used to carry out such analyses. We have implemented the approaches described in this work and have created a simple-to-use web interface that makes these tools widely available and accessible.

In summary, Cyber-T provides an easily accessible interface that allows routine assessment of high density array data for statistical significance. The incorporation of a Bayesian prior into the commonly accepted t test allows statistical inferences to be drawn from high density array data that is not highly replicated. Although it is often difficult to achieve the levels of statistical significance necessary to satisfy a stringent criterion for experiment-wide significance, the p values generated in Cyber-T can be used to rank genes and determine those differences most likely to be real.
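The distinction between an experiment-wide criterion and a simple ranking can be sketched as follows. The gene names, p values, and array size are hypothetical, and the Bonferroni correction is used here only as one familiar example of a stringent experiment-wide criterion; the text does not prescribe a particular correction.

```python
# Hypothetical p values for four genes out of an assumed 4000 tested.
p_values = {"geneA": 0.0004, "geneB": 0.03, "geneC": 0.00001, "geneD": 0.2}
n_genes_tested = 4000  # assumed array size, for illustration only
alpha = 0.05

# Bonferroni: divide the desired experiment-wide error rate by the number of tests.
bonferroni_cutoff = alpha / n_genes_tested

# Rank genes from smallest to largest p value.
ranked = sorted(p_values.items(), key=lambda item: item[1])
passing = [gene for gene, p in ranked if p < bonferroni_cutoff]

print(ranked)
print(passing)
```

In this made-up example only one gene survives the experiment-wide cutoff, yet the ranked list still orders the remaining genes by how likely their differences are to be real, which is the exploratory use described above.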

Suzanne Sandmeyer and the members of the Functional Genomics group of the University of California at Irvine Institute of Genomics and Bioinformatics provided helpful input and data during the development of the programs described here.

Reference(s)