Global Gene Expression Profiling in Escherichia coli K12
2003; Elsevier BV; Volume: 278; Issue: 32; Language: English
10.1074/jbc.m213060200
ISSN 1083-351X
Authors: Kirsty Salmon, She‐pin Hung, Kathy Mekjian, Pierre Baldi, G. Wesley Hatfield, Robert P. Gunsalus
Topic(s): Microbial Metabolic Engineering and Bioproduction
Abstract: The work presented here is a first step toward a long term goal of systems biology, the complete elucidation of the gene regulatory networks of a living organism. To this end, we have employed DNA microarray technology to identify genes involved in the regulatory networks that facilitate the transition of Escherichia coli cells from an aerobic to an anaerobic growth state. We also report the identification of a subset of these genes that are regulated by a global regulatory protein for anaerobic metabolism, FNR. Analysis of these data demonstrated that the expression of over one-third of the genes expressed during growth under aerobic conditions is altered when E. coli cells transition to an anaerobic growth state, and that the expression of 712 (49%) of these genes is either directly or indirectly modulated by FNR. The results presented here also suggest interactions between the FNR and the leucine-responsive regulatory protein (Lrp) regulatory networks. Because computational methods to analyze and interpret high dimensional DNA microarray data are still at an early stage, and because basic issues of data analysis are still being sorted out, much of the emphasis of this work is directed toward the development of methods to identify differentially expressed genes with a high level of confidence. In particular, we describe an approach for identifying gene expression patterns (clusters) obtained from multiple perturbation experiments based on a subset of genes that exhibit high probability for differential expression values.
The enteric bacterium Escherichia coli, like many commensal and pathogenic microorganisms, thrives in the gastrointestinal tract of humans and other warm-blooded animals. In this environment, oxygen required for respiration and energy generation is in limited supply. Thus, the cell must derive energy from anaerobic respiration with alternative electron acceptors such as nitrate and fumarate or by fermentation of simple sugars. Metabolic transitions between aerobic and anaerobic growth states occur when E. coli cells enter an animal host and colonize the gastrointestinal tract, and when individual cells reposition themselves in new microenvironments inside the host. Each of these transitions is accompanied by fluctuations in oxygen tension.
The cell responds to these fluctuations by modulating its central metabolic pathways for carbon and energy flow (Ref. 1: Gunsalus R.P., Park S.J. Res. Microbiol. 1994; 145: 437-450). Depending on the availability of oxygen, the cell can transition to the utilization of a variety of small carbon compounds as electron donors and/or acceptors for respiration (Ref. 2: Gennis R., Stewart V. In: Neidhardt F.C. (ed.) Escherichia coli and Salmonella typhimurium: Cellular and Molecular Biology. Vol. 1. American Society for Microbiology, Washington, D.C., 1996: 217-261). In addition, E. coli cells respond to these fluctuations in oxygen availability by altering the expression of a number of membrane-associated nutrient uptake or excretion systems, as well as a number of metabolic pathways such as those required for heme and quinone synthesis (Ref. 1). E. coli controls many of these systems in response to oxygen by altering gene expression levels. For example, expression of genes involved in oxygen utilization is switched off as oxygen is fully depleted from the environment. In a reciprocal fashion, expression of genes encoding alternative anaerobic electron transport pathways or genes needed for fermentation is switched on. Many of these metabolic transitions are controlled at the transcriptional level by the activities of a global regulatory protein, FNR, and a two-component regulatory system, ArcAB (Ref. 3: Guest J.R., Attwood M.M., Machado R.S., Matqi K.Y., Shaw J.E., Turner S.L. Microbiology. 1997; 143: 457-466; Ref. 4: Lynch A.S., Lin E.C.C. In: Lin E.C.C., Lynch A.S. (eds.) Regulation of Gene Expression in Escherichia coli. Chapman & Hall, New York, 1996: 362-381). (Footnote 1: The abbreviations used are: FNR, fumarate nitrate reduction regulatory protein; PPDE, posterior probability of differential expression; ORF, open reading frame; PCA, principal component analysis; Lrp, leucine-responsive regulatory protein.) FNR is a CAP (catabolite activator protein) homologue that contains an oxygen labile iron-sulfur center as a sensor element for anaerobiosis (Ref. 5: Guest J.R., Green J., Irvine A.S., Spiro S. In: Lin E.C.C., Lynch A.S. (eds.) Regulation of Gene Expression in Escherichia coli. Chapman & Hall, New York, 1996: 317-342; Ref. 6: Kiley P.J., Beinert H. FEMS Microbiol. Lett. 1998; 22: 341-352). Mutations in the fnr gene are known to affect the synthesis of nitrite, nitrate, and fumarate reductases (Ref. 7: Lambden P.R., Guest J.R. J. Gen. Microbiol. 1976; 97: 145-160), as well as fermentation pathway genes. Over 70 genes in 31 operons are currently recognized as members of the FNR gene regulatory network. The ArcAB (aerobic respiratory control) two-component regulatory system is composed of a classical OmpR-like receiver regulator, ArcA, and a membrane-associated sensor transmitter, ArcB (Ref. 8: Lynch A.S., Lin E.C.C. In: Neidhardt F.C. (ed.) Escherichia coli and Salmonella typhimurium: Cellular and Molecular Biology. 2nd Ed. Vol. 1. American Society for Microbiology, Washington, D.C., 1996: 1526-1538). Examples of ArcA-regulated genes include genes for the Krebs cycle (sdhCDAB, icd, fumA, mdh, gltA, acnA, and acnB), for pyruvate metabolism and superoxide dismutase (pfl and sodA), and genes for the cytochrome o oxidase (cyoABCDE) and cytochrome d oxidase (cydAB) (Ref. 1). The purpose of this genome-based study is to identify additional genes differentially expressed in response to oxygen availability and to define further the network of genes controlled by the global regulatory protein, FNR.
To identify the global changes and adjustments of gene expression patterns that facilitate a change from aerobic to anaerobic growth conditions, we used DNA microarrays to analyze E. coli gene expression profiles of cells cultured at steady state growth rates under aerobic or anaerobic growth conditions (+O2 or –O2). To identify the genes controlled by FNR, we analyzed gene expression profiles of cells cultured under anaerobic growth conditions in the presence or absence of FNR (–O2, +FNR or –O2, –FNR) in otherwise isogenic strains.
Chemicals and Reagents—Avian myeloblastosis virus (AMV)-reverse transcriptase and Sephadex G-25 Quickspin Columns were obtained from Roche Applied Science. Phenol and the DNA-free Kit were purchased from Ambion Inc. Ribonuclease Inhibitor III was purchased from Panvera/Takara. Ultrapure deoxynucleoside triphosphates were purchased from Amersham Biosciences. Random hexamer oligonucleotides and T4 polynucleotide kinase were obtained from New England Biolabs, and [α-33P]dCTP (2–3000 Ci/mmol) was obtained from PerkinElmer Life Sciences. DNA filter arrays (Panorama E. coli Gene Arrays) were obtained from Sigma-Genosys Biotechnologies. SYBR Gold was purchased from Molecular Probes. All other chemicals were obtained from Sigma. All reagents and baked glassware used in RNA manipulations were treated with diethylpyrocarbonate.
Bacterial Strains and Growth Conditions—E. coli strains MC4100 (F– araD139 Δ(argF-lac)U169 rpsL150 relA1 flb-5301 deoC1 ptsF25 rbsR) (Ref. 9: Casadaban M.J. J. Mol. Biol. 1976; 104: 541-555) and PC2 (MC4100 Δfnr-2) (Ref. 10: Cotter P.A., Gunsalus R.P. J. Bacteriol. 1989; 171: 3817-3823) were used in this study. Aerobic cultures were grown in 125-ml Erlenmeyer flasks with constant aeration. Anaerobic cultures were grown in 15-ml anaerobic tubes fitted with butyl rubber stoppers (Ref. 10). The medium was made anaerobic by flushing with O2-free N2 gas for 20 min and then dispensed anaerobically into N2-flushed tubes. Cultures of the indicated strain were inoculated from overnight cultures grown under identical conditions (Ref. 10).
Total RNA Isolation, cDNA Synthesis, and Target Labeling Conditions—Total RNA was isolated from 10-ml cultures; cDNA was synthesized and labeled with [α-33P]dCTP, and filters were hybridized exactly as described by Hung et al. (Ref. 11: Hung S.P., Baldi P., Hatfield G.W. J. Biol. Chem. 2002; 277: 40309-40323). Stripping and reusing filters four times as described here results in a less than 3% increase in variance (Ref. 12: Baldi P., Hatfield G.W. DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modeling. Cambridge University Press, Cambridge, UK, 2002; footnote 2: S. P. Hung, G. W. Hatfield, S. Sundaresh, and P. Baldi, unpublished results).
Data Acquisition—A commercial software package obtained from Research Imaging Inc. (DNA ArrayVision) was used to grid the 16-bit image file, obtained from the PhosphorImager, to record the pixel density of each of the 18,432 addresses on each filter and to perform the background subtractions. Of these, 8,580 addresses on each filter are spotted with duplicate copies of each of the 4,290 E. coli ORFs. The remaining 9,852 empty addresses were used for background measurements. Because the backgrounds were quite constant, a global average background measurement was subtracted from each experimental measurement, although local background calculations are possible. Greater than 4 logs of linearity for the PhosphorImager derived data were observed.
Experimental Design—The experimental design for the experiments reported here is diagrammed in Fig. 1.
In Experiment 1, filters 1 and 2 were hybridized with 33P-labeled, random hexamer-generated cDNA fragments complementary to each of three RNA preparations (RNA 1–3) obtained from the cells of three individual cultures of the FNR+ strain, MC4100, grown under aerobic conditions. These three 33P-labeled cDNA target preparations were pooled prior to hybridization. Equal aliquots were hybridized to the duplicate E. coli Sigma-Genosys Panorama™ nylon filter arrays (Experiment 1, replicate 1, filters 1 and 2). Following PhosphorImager analysis, these filters were stripped and again hybridized with pooled 33P-labeled cDNA target fragments complementary to each of another three independently prepared RNA preparations (RNA 4–6) from the same strain (MC4100; Experiment 1, replicate 2). This procedure was repeated two more times with filters 1 and 2 using two more independently prepared pools of cDNA targets (Experiment 1, replicates 3 and 4; RNA 7–9 and RNA 10–12). In Experiment 2, filters 3 and 4 were hybridized with 33P-labeled, random hexamer-generated cDNA fragments complementary to each of three RNA preparations (RNA 13–15) obtained from the cells of three individual cultures of the FNR+ strain MC4100 grown under anaerobic conditions. As for Experiment 1, these three 33P-labeled cDNA target preparations were pooled prior to hybridization to the full-length ORF probes on the filters (Experiment 2, replicate 1, filters 3 and 4). Following PhosphorImager analysis, these filters were stripped and again hybridized with pooled, 33P-labeled cDNA target fragments complementary to each of another three independently prepared RNA preparations (RNA 16–18) from the same strain (MC4100; Experiment 2, replicate 2). This procedure was repeated two more times with filters 3 and 4 using two more independently prepared pools of cDNA targets (Experiment 2, replicates 3 and 4; RNA 19–21 and RNA 22–24). 
In Experiment 3, filters 5 and 6 were hybridized with 33P-labeled, random hexamer-generated cDNA fragments complementary to each of three RNA preparations (RNA 25–27) obtained from the cells of three individual cultures of the FNR– strain PC2 grown under anaerobic conditions. These three 33P-labeled cDNA target preparations were pooled prior to hybridization to the full-length ORF probes on the filters (Experiment 3, replicate 1, filters 5 and 6). Following PhosphorImager analysis, these filters were stripped and again hybridized with pooled, 33P-labeled cDNA target fragments complementary to each of another three independently prepared RNA preparations (RNA 28–30) from the same strain (PC2; Experiment 3, replicate 2). This procedure was repeated one more time with filters 5 and 6 with another independently prepared pool of cDNA targets (Experiment 3, replicate 3; RNA 31–33). The data for the fourth replicate of this experiment were lost. This experimental design produces duplicate filter data for four replicates (three for Experiment 3) performed with cDNA targets complementary to independent sets of pooled RNA preparations for each experiment. Thus, because each filter contains duplicate spots for each ORF and duplicate filters were used for each experiment, 4 measurements were obtained for each ORF in each replicate: a total of 16 measurements from the 4 replicates of each wild-type experiment, and 12 measurements from the 3 replicates of the FNR– experiment.
Data Analysis—For each target signal, a background-subtracted estimate of expression level was obtained and scaled to total counts on the membrane by dividing each individual gene expression value by the total of all target signals on the membrane. Thus, each normalized gene level is expressed as a fraction of the total mRNA hybridized to each DNA array. For any given measurement, a value greater than zero (indicating an expression level) or a zero (indicating an expression level lower than background) is obtained.
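The normalization just described can be illustrated with a minimal sketch. It assumes that signals at or below background are recorded as zero and that the membrane total is the sum of the background-subtracted signals; the function names and the example gene labels are hypothetical and are not taken from the software used in the study.

```python
def normalize_membrane(raw_signals, background):
    """Scale background-subtracted signals to fractions of the membrane total.

    raw_signals: dict mapping a gene label to its raw pixel density.
    background:  global average background (assumed constant per membrane).
    """
    # Signals at or below background are recorded as zero (assumption).
    subtracted = {gene: max(value - background, 0.0)
                  for gene, value in raw_signals.items()}
    total = sum(subtracted.values())
    if total <= 0:
        raise ValueError("no signal above background on this membrane")
    # Each gene level becomes a fraction of the total hybridized signal.
    return {gene: value / total for gene, value in subtracted.items()}


def expressed_in_all_replicates(levels_by_gene):
    """Return genes whose level is greater than zero in every measurement."""
    return sorted(gene for gene, levels in levels_by_gene.items()
                  if all(value > 0 for value in levels))


if __name__ == "__main__":
    membrane = {"geneA": 110.0, "geneB": 40.0, "geneC": 5.0}
    print(normalize_membrane(membrane, background=10.0))
```

Expressing each gene as a fraction of the membrane total makes measurements from different filters and hybridizations comparable before any statistical test is applied.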
Only those genes exhibiting an expression level greater than zero in all replicates were used for statistical analysis. These gene expression level measurements were analyzed by a regularized t test based on a Bayesian statistical framework (Ref. 11: Hung S.P., Baldi P., Hatfield G.W. J. Biol. Chem. 2002; 277: 40309-40323; Ref. 12: Baldi P., Hatfield G.W. DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modeling. Cambridge University Press, Cambridge, UK, 2002; Ref. 13: Hatfield G.W., Hung S.P., Baldi P. Mol. Microbiol. 2003; 47: 871-877; Ref. 14: Long A.D., Mangalam H.J., Chan B.Y., Tolleri L., Hatfield G.W., Baldi P. J. Biol. Chem. 2001; 276: 19937-19944; Ref. 15: Baldi P., Long A.D. Bioinformatics. 2001; 17: 509-519). For the analysis of the data reported here, we ranked the mean gene expression levels of the replicate experiments in ascending order, used a sliding window of 101 genes, and assigned the average standard deviation of the 50 genes ranked below and above each gene as the Bayesian standard deviation for that gene. The p values for each gene measurement based on a regularized t test with a confidence value of 10 are reported in the Supplemental Material. A comprehensive discussion of the use of a regularized t test and the modifications applicable to the analysis of DNA microarray data of the type presented here is given elsewhere (Ref. 12). Gene measurements containing zero expression values in one or more replicates were set aside.
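The sliding-window regularization described above can be sketched as follows. This is an illustration under stated assumptions, not the Cyber-T implementation: the window is assumed to include the gene itself and to be truncated at the ends of the ranking, and the regularized variance is assumed to take the form (ν0·σ0² + (n − 1)·s²) / (ν0 + n − 2), with ν0 the confidence value and σ0 the window-derived background standard deviation. All function names are hypothetical, and no p value is computed.

```python
from statistics import mean, stdev


def background_sds(replicates, half_window=50):
    """Average the SDs of genes that are neighbours in mean-expression rank.

    replicates: list of per-gene lists of replicate measurements.
    Returns one background SD per gene, in the input order.
    """
    means = [mean(values) for values in replicates]
    sds = [stdev(values) for values in replicates]
    order = sorted(range(len(replicates)), key=lambda i: means[i])
    background = [0.0] * len(replicates)
    for rank, gene in enumerate(order):
        lo = max(0, rank - half_window)                 # truncated at the ends
        hi = min(len(order), rank + half_window + 1)    # window includes the gene
        background[gene] = mean(sds[order[j]] for j in range(lo, hi))
    return background


def regularized_variance(sample_sd, background_sd, n, conf=10):
    """Blend the empirical variance with the background variance (assumed form)."""
    return (conf * background_sd ** 2 + (n - 1) * sample_sd ** 2) / (conf + n - 2)


def regularized_t(x, y, bg_x, bg_y, conf=10):
    """t statistic computed from regularized variances for two conditions."""
    var_x = regularized_variance(stdev(x), bg_x, len(x), conf)
    var_y = regularized_variance(stdev(y), bg_y, len(y), conf)
    return (mean(x) - mean(y)) / (var_x / len(x) + var_y / len(y)) ** 0.5
```

The point of the regularization is that, with only a few replicates, a gene's own standard deviation is a poor estimate; borrowing information from genes of similar expression level stabilizes the denominator of the test.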
Among the genes set aside because of zero values, those with zero expression values for all replicates in one experiment and values greater than zero for all measurements of another experiment were identified. Because these gene measurements could not be analyzed with a t test, the significance of these results was evaluated by ranking these genes in ascending order according to the coefficients of variation of the four greater-than-zero measurements of each experiment.
To interpret the results of a high dimensional DNA array experiment, it is necessary to determine the global false-positive and -negative levels inherent in the data set being analyzed. We have implemented a mixture model-based method described by Allison et al. (Ref. 16: Allison D.B., Gadbury G.L., Heo M., Fernandez J.R., Lee C.K., Prolla T.A., Weindruch R. Comput. Statist. Data Anal. 2002; 39: 1-20) for the computation of the global false-positive and -negative levels inherent in a DNA microarray experiment (Ref. 11: Hung S.P., Baldi P., Hatfield G.W. J. Biol. Chem. 2002; 277: 40309-40323; Ref. 12: Baldi P., Hatfield G.W. DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modeling. Cambridge University Press, Cambridge, UK, 2002). The basic idea is to consider the p values as a new data set and to build a probabilistic model for these new data. When control data sets are compared with one another (i.e. no differential gene expression), it is easy to see that the p values ought to have a uniform distribution between zero and one. In contrast, when data sets from different genotypes or treatment conditions are compared with one another, a non-uniform distribution will be observed in which p values will tend to cluster more closely to zero than one (Fig. 2), i.e. there will be a subset of differentially expressed genes with "significant" p values. The computational method of Allison et al. (Ref. 16) is used to model this mixture of uniform and non-uniform distributions to determine the probability, PPDE(p), ranging from 0 to 1, that a gene at any given p value is differentially expressed, i.e. that it is a member of the non-uniform (differentially expressed) rather than the uniform (not differentially expressed) distribution. With this method, we can estimate the rates of false positives and false negatives as well as true positives and true negatives at any given p value threshold, PPDE(<p). In other words, we can obtain a posterior probability of differential expression PPDE(p) value for each gene measurement and a PPDE(<p) value at any given p value threshold based on the experiment-wide global false-positive level and the p value exhibited by that gene (Refs. 11 and 12). It should also be emphasized that this information allows us to infer the genome-wide number of genes that are differentially expressed, i.e. the fraction of genes in the non-uniform distribution (differentially expressed) and the fraction of genes in the uniform distribution (not differentially expressed). The PPDE(<p) and PPDE(p) values plotted against p values for the gene measurements of the +O2 versus –O2 and –O2, +FNR versus –O2, –FNR experiments are shown in Fig. 3. In Fig. 3A we see that at a p value less than 1, which includes all gene measurements, the PPDE(<p) is 0.63. This means that 63% of the 2,820 genes expressed above background in all of four replicate experiments are inferred to be differentially expressed between growth in the presence and absence of oxygen.
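The relationship between PPDE(p) and PPDE(<p) can be illustrated with a small sketch. The text does not state the parametric form of the non-uniform component, so this example assumes, purely for illustration, a mixture of a uniform density (weight `lam0`) and a single Beta(r, 1) density with r < 1, whose density is r·p^(r−1) and whose cumulative distribution is p^r. The parameter values in the usage lines are hypothetical apart from the 0.63 fraction quoted above.

```python
def ppde_at(p, lam0, r):
    """Posterior probability that a gene with p value exactly p is differentially expressed.

    lam0: mixture weight of the uniform (not differentially expressed) component.
    r:    shape of the assumed Beta(r, 1) component, 0 < r < 1.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    alt_density = (1.0 - lam0) * r * p ** (r - 1.0)
    return alt_density / (lam0 + alt_density)   # uniform density is 1 on [0, 1]


def ppde_below(p, lam0, r):
    """Posterior probability of differential expression for all genes with p value below p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    alt_mass = (1.0 - lam0) * p ** r             # Beta(r, 1) mass below p
    return alt_mass / (lam0 * p + alt_mass)      # uniform mass below p is p


if __name__ == "__main__":
    lam0, r = 0.37, 0.3                          # hypothetical fitted values
    print(ppde_below(1.0, lam0, r))              # fraction of all genes in the non-uniform component
    print(ppde_at(0.05, lam0, r), ppde_below(0.05, lam0, r))
```

Under this assumption PPDE(<p) evaluated at p = 1 reduces to 1 − `lam0`, which is why the value read at the right-hand edge of a curve such as Fig. 3A can be interpreted as the overall fraction of differentially expressed genes.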
In most instances, PPDE(<p) values are reported in the text and tables of this article. However, both PPDE(p) and PPDE(<p) values are given for each gene in the Supplemental Material.
Fig. 3. PPDE(p) or PPDE(<p) versus p values. PPDE(p), black dots, is the posterior probability for differential expression for any gene of a given p value. PPDE(<p), gray dots, is the posterior probability for differential gene expression of the group of genes below any given p value. A, +O2, +FNR versus –O2, +FNR. B, –O2, +FNR versus –O2, –FNR.
The statistical methods described above are implemented in the Cyber-T software package available for on-line use at the website of the Institute for Genomics and Bioinformatics at the University of California, Irvine (www.igb.uci.edu). The clustering methods used to determine the regulatory patterns reported below are those implemented in the GeneSpring™ (Silicon Genetics, Redwood City, CA) software package.
Differential Gene Expression in the Presence or Absence of Oxygen—In the following discussions we often simply refer to the fold change for differentially expressed genes. However, it is important to emphasize that reporting fold changes is incomplete and can be misleading (Ref. 12: Baldi P., Hatfield G.W. DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modeling. Cambridge University Press, Cambridge, UK, 2002). For this reason, the mean expression levels, standard deviations, p values, and PPDE(<p) values for all differentially expressed genes are included in the Supplemental Material. However, in the tables of this article we report only p values, PPDE(<p) values, and fold changes. A comparison of the gene expression levels between cells grown in the presence and absence of oxygen revealed 2,820 genes that exhibited expression levels above background for all replicates of Experiments 1 and 2 (+O2, +FNR versus –O2, +FNR; Fig. 1).
If two data sets for which no differences are expected (e.g. +O2 versus +O2) were compared, then the p values would be uniformly distributed between 0 and 1.0 (dashed lines in Fig. 2). On the other hand, if differences among measurement levels of some genes are present (e.g. +O2 versus –O2; Fig. 2A), then the p values for those genes will be low and cluster toward 0. In Fig. 2A, the p values for all 2,820 gene measurements are distributed into 100 bins ranging from 0 to 1.0 and plotted against the number of genes in each bin. It is evident from an examination of the p value distribution in Fig. 2A that about one-half of the genes expressed during aerobic growth are modulated during the transition to anaerobic growth. Whereas the demarcation between differentially and non-differentially expressed genes is arbitrary, the data in Fig. 2A suggest that a threshold of p = 0.05, which corresponds to a PPDE(<p) value of 0.96 as described under "Materials and Methods," is reasonable. Thus, of the 1,445 differentially expressed genes that meet this threshold, 58 are expected to be false positives (1,445 × (1 – 0.96) ≈ 58). Furthermore, it must be kept in mind that the remaining genes classified as not differentially expressed contain false negatives. The complete computational method to determine the fraction of differentially expressed genes and the fraction of falsely identified differentially expressed genes at any given PPDE(<p) value has been described by Hung et al. (Ref. 11: Hung S.P., Baldi P., Hatfield G.W. J. Biol. Chem. 2002; 277: 40309-40323). The p values and PPDE(<p) values, as well as additional statistical data, for all genes are contained in the Supplemental Material.
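The two calculations in this section, binning p values as in Fig. 2 and converting a PPDE(<p) value into an expected number of false positives, can be sketched in a few lines. The function names are hypothetical; the figures 1,445 and 0.96 are the ones quoted in the text above.

```python
def bin_p_values(p_values, n_bins=100):
    """Count p values in equal-width bins over [0, 1]; p = 1.0 goes in the last bin."""
    counts = [0] * n_bins
    for p in p_values:
        if not 0.0 <= p <= 1.0:
            raise ValueError("p values must lie in [0, 1]")
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts


def expected_false_positives(n_selected, ppde_below):
    """Expected number of non-differentially expressed genes among those selected."""
    return n_selected * (1.0 - ppde_below)


if __name__ == "__main__":
    # 1,445 genes below the threshold, with PPDE(<p) = 0.96 for that group.
    print(round(expected_false_positives(1445, 0.96)))
```

If no gene were differentially expressed, every bin would be expected to hold roughly the same number of genes (the dashed line in Fig. 2), so an excess in the lowest bins is what signals differential expression.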
Differential Gene Expression in the Absence of Oxygen in the Presence and Absence of the FNR Global Regulatory Protein—A comparison of the gene expression levels between cells grown in the absence of oxygen and in the presence or absence of FNR revealed 2,402 genes that exhibited expression levels above background for all replicates of Experiments 2 and 3 (–O2, +FNR versus –O2, –FNR; Fig. 1). Again, about one-half of the gene expression levels are modulated by this treatment condition. An examination of the distribution of p values, shown in Fig. 2B, suggests that the expression levels of 1,256 genes with p values less than 0.05 are modulated, either directly or indirectly, by FNR during growth under anaerobic conditions. Again the PPDE(<p) value for this group of genes is 0.96; thus, 50 false positives are expected among this set of differentially expressed genes (1,256 × (1 – 0.96) ≈ 50). The individual p values and PPDE values, as well as additional statistical data, for all genes are contained in the Supplemental Material.
Identification of Differential Gene Expression Patterns Resulting from Two-variable Perturbation Experiments—A basic paradigm for understanding regulatory networks at a system level involves performance of perturbation experiments. When only one parameter is perturbed, gene regulation patterns can be of only two types: they can go up or they can go down. However, when two or more parameters are perturbed, data mining methods designed to identify more complex gene expression patterns are needed. This is the case for the set of experiments described here, where we examined the effects of perturbing two variables, one genetic variable and one environmental variable. To identify the global changes and adjustments of gene expression patterns that facilitate a transition from aerobic to anaerobic growth conditions, and to determine the effects of genotype on these gene expression patterns, we analyzed E. coli gene expression profiles obtained from cells cultured under aerobic or anaerobic growth conditions (+O2 or –O2) and under anaerobic growth conditions in the presence or absence of the global regulatory protein for anaerobic metabolism, FNR (–O2, +FNR or –O2, –FNR). Because FNR is presumed to be inactive under aerobic (+O2) conditions, we did not perform experiments comparing fnr genotypes under aerobic conditions. Only two general regulatory patterns can be observed when only two experimental conditions are compared, for example growth in the presence or absence of oxygen. However, when three conditions are compared, at least eight general regulatory patterns are expected. Fig. 4 diagrams the eight basic regulatory patterns that could be observed among three experiments conducted in the presence and absence of oxygen in an fnr+ strain and in the absence of oxygen in an fnr– strain. For simplicity, only three expression levels for each of these three experimental conditions are assumed: low, medium, and high. An intuitive method to identify genes with these regulatory patterns could be to simply use any of several popular clustering methods on the entire data set. However, in experiments like the ones presented here, where a limited number of replications or sample measurements are performed and many genes are therefore measured with low confidence levels that result in false positives as well as false negatives, such a clustering approach could be misleading, placing many genes in the wrong clusters. To circumvent this problem, the approach described here is based on selecting those genes differentially expressed with high confidence levels for the initial clustering. Once the genes of these regulatory patterns are established, it is possible to "fish" for other genes with similar regulatory patterns with lower confidence levels that can be included at the discretion of the investigator.
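The idea of seeding the clustering with high-confidence genes can be sketched as follows. This is one possible reading, not the authors' procedure: it assumes that a gene's confidence across the two comparisons (oxygen and FNR) is summarized by the larger of its two p values, and the function names, the `n_seed` parameter, and the gene labels are hypothetical.

```python
def high_confidence_seed(p_oxygen, p_fnr, n_seed):
    """Pick genes that rank as confidently changed in both comparisons.

    p_oxygen, p_fnr: dicts mapping a gene label to its p value in each comparison.
    Genes are ranked by their worse (larger) p value, an assumed criterion.
    """
    shared = set(p_oxygen) & set(p_fnr)
    ranked = sorted(shared, key=lambda gene: (max(p_oxygen[gene], p_fnr[gene]), gene))
    return ranked[:n_seed]


def direction_pattern(aerobic, anaerobic, anaerobic_no_fnr):
    """Label the direction of change for each of the two comparisons."""
    oxygen_step = "up" if anaerobic > aerobic else "down"
    fnr_step = "up" if anaerobic_no_fnr > anaerobic else "down"
    return oxygen_step, fnr_step


if __name__ == "__main__":
    p_oxygen = {"geneA": 0.001, "geneB": 0.2, "geneC": 0.01}
    p_fnr = {"geneA": 0.002, "geneC": 0.5, "geneD": 0.0001}
    print(high_confidence_seed(p_oxygen, p_fnr, n_seed=1))
    print(direction_pattern(0.8, 0.2, 0.6))
```

Seeding with genes that are confidently changed in both comparisons keeps poorly measured genes from defining the clusters; lower-confidence genes can then be matched to the established patterns afterwards.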
To identify genes differentially expressed at a high confidence level that correspond to each of the patterns (I–VIII) diagrammed in Fig. 4, the genes differentially expressed due to the treatment condition of Experiments 1 and 2 were sorted in ascending order according to their p values based on the regularized t test as described under "Materials and Methods." Next, the genes differentially expressed due to the treatment condition of Experiments 2 and 3 were sorted in ascending order according to their p values. 100 genes with the lowest p values present in both lists were selected. These genes exhibited either an increased or decreased expression level between both treatment conditions (i.e. between Experiment 1 and 2 and Experiment 2 and