Article Open access Peer-reviewed

Global analysis of gene function in yeast by quantitative phenotypic profiling

2006; Springer Nature; Volume: 2; Issue: 1 Language: English

10.1038/msb4100043

ISSN

1744-4292

Authors

James A. L. Brown, Gavin Sherlock, Chad L. Myers, Nicola M. Burrows, Changchun Deng, H. Irene Wu, Kelly McCann, Olga G. Troyanskaya, J. Martin Brown

Topic(s)

Gene expression and cancer classification

Abstract

Report. 17 January 2006. Open Access.

Global analysis of gene function in yeast by quantitative phenotypic profiling

James A Brown1, Gavin Sherlock2, Chad L Myers3, Nicola M Burrows1, Changchun Deng1, H Irene Wu1, Kelly E McCann1, Olga G Troyanskaya3 and J Martin Brown*,1

1 Department of Radiation Oncology, Stanford University School of Medicine, Stanford, CA, USA
2 Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
3 Lewis-Sigler Institute for Integrative Genomics, Department of Computer Science, Princeton University, Princeton, NJ, USA

*Corresponding author. Department of Radiation Oncology, Stanford University School of Medicine, Stanford, 269 Campus Drive West, CCSR So. Room 1255, Stanford, CA 94305-5152, USA. Tel.: +1 650 723 5881; Fax: +1 650 723 7382; E-mail: [email protected]

Molecular Systems Biology (2006) 2:2006.0001. https://doi.org/10.1038/msb4100043
We present a method for the global analysis of the function of genes in budding yeast based on hierarchical clustering of the quantitative sensitivity profiles of the 4756 strains with individual homozygous deletion of nonessential genes to a broad range of cytotoxic or cytostatic agents. This method is superior to other global methods of identifying the function of genes involved in the various DNA repair and damage checkpoint pathways as well as other interrogated functions. Analysis of the phenotypic profiles of the 51 diverse treatments places a total of 860 genes of unknown function in clusters with genes of known function. We demonstrate that this can not only identify the function of unknown genes but can also suggest the mechanism of action of the agents used. This method will be useful when used alone and in conjunction with other global approaches to identify gene function in yeast.

Introduction

A major challenge facing biologists today is the assignment of function to the novel genes identified during the sequencing phase of the human genome project. A useful resource for this task in the baker's yeast, Saccharomyces cerevisiae, has been the creation of a set of homozygous deletions of all nonessential genes, with each gene replaced by a cassette containing a 20-mer molecular ‘barcode’ unique to each deletion mutant (Giaever et al, 2002). This set of deletion mutants has been used by a number of investigators to identify genes involved in response to DNA-damaging agents and in other processes (reviewed in Scherens and Goffeau, 2004).
In most cases, investigators have tested the deletion strains individually rather than by hybridizing the amplified barcodes from a pool of all mutants to a high-density oligonucleotide array, which allows the relative abundance of all the strains in a pool of all deletion mutants to be determined. The hybridization method has the advantage that it allows each deletion strain to be rapidly ranked on a continuum for sensitivity or resistance to the environmental change rather than in discrete bins, such as sensitive, refractory or neutral, where the boundaries are subjective. We and others have shown that hybridization of the amplified DNA barcodes is a highly reproducible method of identifying genes responsible for resistance to DNA damage (Birrell et al, 2001; Wu et al, 2004; Lee et al, 2005).

In the present study, we have explored further the use of quantitative phenotypic profiling of the 4756 viable yeast deletion mutants in response to a variety of agents, to identify gene function (or the ‘biological process’ in gene ontology (GO) terms). We show that, at least for the processes interrogated, this is a powerful method and appears superior to other genome-wide methods of identifying gene function, including protein–protein interactions, gene expression profiling and synthetic lethality.

Results and discussion

Generation of phenotypic profiles for nonessential gene deletions

We obtained phenotypic profiles of the pool of homozygous diploid deletions of all nonessential genes for a total of 51 diverse stresses, including some that we reanalyzed from publicly available databases (Table I). Each of these profiles provides a quantitative distribution of the sensitivity or resistance for an individual gene deletion. The final complete data set is provided in the Supplementary Information I, and all of the raw CEL files generated by us are available on the supporting website (http://microarray-pubs.stanford.edu/phenotypic_profiling/).

Table 1. List of agents used in clustering analysis

Name | Rep | Treatment | Control | Ref
AAPO | 2 | Antimycin A 1 μg/ml, 1 mM hydrogen peroxide chronic 16 h | Mock | a
ACTD | 3 | Actinomycin D 400 μM 4 h, YPD 17 h | Mock | a
ALK-15G | 2 | pH 8.0 15 generations | Historic | b
ALK-5G | 2 | pH 8.0 five generations | Historic | b
ANTA | 3 | Antimycin A 1 μg/ml in YPD chronic 16 h | Mock | a
ARAC | 2 | AraC 400 μM 4 h, YPD 16 h | Mock | a
ARN | 4 | Arsenite 1 mM (2), 2.5 mM (1), 5 mM (1) 1 h, YPD 16 h | Mock | f
ARS | 3 | Arsenic 20 μM (1) or 100 μM (2) 2 h, YPD 16 h | Mock | a
BEN | 3 | Benomyl 10 μM (2) or 15 μM (1) 2 h, YPD 16 h | Mock | a
BLEO | 4 | Bleomycin 0.01 U/ml 4 h, YPD 16 h | Mock | a
CAFF | 3 | Caffeine 6 mM chronic 16 h | Mock | a
CALC | 3 | Calcofluor white 3 μg/ml chronic 16 h | Mock | a
CIS1 | 6 | Cisplatin 1.0 mM 1 h, YPD 16 h | Mock | e
CIS4 | 6 | Cisplatin 0.23 mM 4 h, YPD 16 h | Mock | e
CPTA | 3 | Camptothecin 250 μM (2) or 300 μM (1) 2 h, YPD 16 h | Mock | a
CPTC | 3 | Camptothecin 5 μg/ml chronic 16 h | Mock | a
DOX | 6 | Doxorubicin 0.2 mM 4 h, YPD 16 h | Mock | a
GAL-15G | 2 | YPGalactose 15 generations | Historic | b
GAL-5G | 2 | YPGalactose five generations | Historic | b
GLYE | 3 | YEP 2% glycerol 2% ethanol chronic 16 h | Mock | a
H2O2 | 4 | Hydrogen peroxide 3 mM chronic 16 h | Mock | a
HU | 3 | Hydroxyurea 100 mM chronic 16 h | Mock | a
HYGB | 3 | Hygromycin B 7 μg/ml chronic 16 h | Mock | a
IDA | 3 | Idarubicin 50 μM (2) or 100 μM (1) 2 h, YPD 16 h | Mock | a
IR | 3 | IR 200 Gy Cs137, YPD 18 h | Mock | d
LOVA | 3 | Lovastatin 100 μg/ml (0.75% EtOH) chronic 16 h | Mock | a
LYS | 2 | Lys minus five generations | Mock | b
MECH | 1 | Mechlorethamine 20 μM 3 h, YPD 16 h | Mock | a
MEL | 3 | Melphalan 800 μM 4 h, YPD 16 h | Mock | a
MIN-15G | 2 | Minimal+his/leu/ura 15 generations | Historic | b
MIN-5G | 2 | Minimal+his/leu/ura five generations | Historic | b
MMC | 5 | Mitomycin C 0.5 mM 4 h, YPD 16 h | Mock | e
MMS | 3 | Methyl methanesulfonate 0.03% chronic 16 h | Mock | a
NACL-15G | 2 | NaCl 1 M 15 generations | Historic | b
NACL-5G | 2 | NaCl 1 M five generations | Historic | b
NYS-15G | 2 | Nystatin 10 μM 15 generations | Historic | b
NYS-5G | 2 | Nystatin 10 μM five generations | Historic | b
OXA | 3 | Oxaliplatin 4 h 10 mM, YPD 16 h | Mock | e
RAFA | 3 | Raffinose 6% with 1 μg/ml antimycin A chronic 16 h | Mock | a
SC | 2 | Minimal complete five generations | Historic | b
SORB-15G | 2 | Sorbitol 1.5 M 15 generations | Historic | b
SORB-5G | 2 | Sorbitol 1.5 M five generations | Historic | b
THR | 1 | Thr minus five generations | Historic | b
TPT | 1 | Topotecan 20 μM 3 h, YPD 16 h | Mock | a
TPZ | 4 | Tirapazamine 250 μM (3) or 300 μM (1) 2 h, YPD 16 h | Mock | a
TRP | 2 | Trp minus five generations | Historic | b
UVA | 4 | UVA 36 J/cm2 (1), 288 J/cm2 (3), 16 h YPD | Mock | a
UVB | 5 | UVB 3400 J/m2, YPD 16 h | Mock | c
UVC | 5 | UVC 200 J/m2, YPD 16 h | Mock | c
WORT | 3 | Wortmannin 1.5 μM (DMSO 1 μg/ml SC) chronic 16 h | Mock | a
YPD | 3 | Growth in YPD media 16 h | Time 0 | a

Name is the label given to each type of treatment; Rep is the number of repetitions that make up the geometric mean ratio of treated over control. Treatment is a brief description of the treatment parameters: drug, concentration, and time. Chronic exposure is batch growth in continuous presence of the treatment. Acute exposures are for a defined time period followed by a recovery phase in YPD media. The type of control is indicated as a matched ‘mock’ control, a ‘time 0’ control used for change over time, or a ‘historic’ control taken from Giaever et al (2002), in which a highly replicated control condition was tested for the given number of generations. The references cited are (a) this work; (b) Giaever et al (2002); (c) Birrell et al (2001); (d) Game et al (2003); (e) Wu et al (2004); and (f) Haugen et al (2004).

Hierarchical clustering of the phenotypic profiles identifies treatments by mechanism of action

To assess the degree of similarity of the phenotypic profiles of individual deletion strains, and therefore the likely gene products in various functional pathways, we clustered all the data without filtering out any of the phenotypically neutral deletions. Figure 1 shows the hierarchical clustering of the phenotypic profiles of 4281 genes after filtering for data quality against the 51 different treatments.
We used two-way unsupervised uncentered clustering with a Pearson's correlation coefficient in order to favor trends in the profiles rather than the absolute magnitudes. It is apparent from the vertical axis of Figure 1 showing the treatments used that hierarchical clustering groups the agents by mechanism of action. This is expected, as agents with the same mechanism of action should produce similar phenotypic profiles. The shorter the vertical lengths of the arms of the dendrogram connecting adjacent treatments, the closer the mechanisms of action of the agents used. For example, UVB and UVC with two different wavelengths and intensities produce the same lesions and have almost indistinguishable phenotypes, whereas a third wavelength (UVA) producing a different spectrum of lesions does not (Cadet et al, 2005). Similar profiles are also produced by chronic or acute exposures to camptothecin (CPTC and CPTA), amino-acid deprivation (TRP, LYS and SC), the two platinum analogs, cisplatin (CIS1 and CIS4 for a 1 and 4 h exposure to cisplatin) and oxaliplatin (OXA), and two bifunctional alkylating anticancer agents that kill cells by forming interstrand crosslinks, mitomycin C and mechlorethamine. Also, the novel anticancer agent tirapazamine, which we have recently shown produces DNA double-strand breaks by poisoning topoisomerase II (Peters and Brown, 2002), has a profile similar to that for the known topoisomerase II poison idarubicin.

Figure 1. Two-way unsupervised uncentered unnormalized hierarchical clustering using a Pearson's correlation of the phenotypic profiles of 4281 nonessential genes to 51 different treatments. The expanded region shows the DNA-damage cluster, which contains the components of the DNA-damage checkpoint function, nucleotide excision repair, and homologous recombination.
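
The clustering described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the strain names and log-ratio values are hypothetical, and average linkage is an assumption, since the text does not state which linkage rule was used. The uncentered Pearson correlation omits the mean subtraction of the ordinary coefficient, so two profiles with the same trend but different magnitudes still score close to 1.

```python
import math
from itertools import combinations


def uncentered_pearson(x, y):
    """Pearson-style correlation computed without subtracting the means."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den if den else 0.0


def average_linkage(profiles):
    """Agglomerative clustering (average linkage is an assumption here).

    Returns the merges in order as (cluster_a, cluster_b, similarity).
    """
    # Pairwise similarity between every two strains, keyed by the unordered pair.
    sim = {frozenset(pair): uncentered_pearson(profiles[pair[0]], profiles[pair[1]])
           for pair in combinations(profiles, 2)}

    def linkage(pair):
        # Mean similarity over all cross-cluster strain pairs.
        a, b = pair
        return sum(sim[frozenset((i, j))] for i in a for j in b) / (len(a) * len(b))

    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical log2(treated/control) ratios for four deletion strains
# across four treatments; the values are made up for illustration only.
profiles = {
    "rad17": [-2.0, -1.8, -0.1, 0.0],
    "rad24": [-1.0, -0.9, 0.0, 0.1],   # same trend as rad17, half the magnitude
    "gal1": [0.1, 0.0, -2.2, -0.2],
    "gal7": [0.0, 0.1, -1.1, -0.1],
}

merges = average_linkage(profiles)
for a, b, s in merges:
    print(a, "+", b, "similarity", round(s, 3))
```

In this toy input the two checkpoint-like strains merge first and the two galactose-like strains merge second, even though the members of each pair differ two-fold in magnitude, which is the behavior the text attributes to favoring trends over absolute values.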
Hierarchical clustering identifies gene function and compares favorably with other methods in identifying the genes in the DNA-damage response pathway

The expanded portion of Figure 1 shows the genes whose deletion produces sensitivity to the diverse set of DNA-damaging agents used. Of note is the fact that the DNA-damage checkpoint genes RAD17, RAD24, MEC3, RAD9 and DDC1 form a tight group and represent all of the nonessential DNA-damage cell-cycle checkpoint genes involved in sensing DNA damage (Zhou and Elledge, 2000). In addition, all of the nonessential genes involved in nucleotide excision repair (NER) are in their own subcluster with no false positives (genes in the cluster not involved in NER). Note that the two uncharacterized open-reading frames (ORFs) at the top of Figure 1 (inset), YBR099C and YBR100W, are not separate bona fide genes: YBR099C is characterized as a dubious ORF on the opposite strand of MMS4 (so its deletion would also delete MMS4), and YBR100W has now been annotated as part of MMS4 following correction of a sequencing error (Brachat et al, 2003). The fact that both of these ORFs cocluster with MMS4 provides additional support for the robustness of the clustering analysis.

Despite this efficient functional classification of the genes involved in the response of the cell to DNA damage, some are missing. These have hybridization signals in the control pool that are too low to give informative data (e.g. RAD6, RAD52, MRE11 and XRS2). This applied to 9% of the nonessential genes (see Supplementary Information II).

We analyzed the other three global methods for their ability to group the five genes involved in the DNA-damage checkpoint and NER (Table II). Protein–protein interactions identify only three members, Rad17p, Ddc1p and Mec3p, of the checkpoint group, and, in addition, identify 105 other proteins, most, if not all, of which are likely to be false positives.
The data on synthetic lethality for these five genes also identify a number of genes that are likely not involved in the DNA-damage checkpoint, although hierarchical clustering of the profiles of synthetic lethality clusters four of the genes (all except MEC3) (Tong et al, 2004). Only four genes are synthetically lethal with all of the DNA-damage checkpoint genes, and the remaining unique lethalities would argue, incorrectly, that these genes do not function in the same pathway. Table II also shows that expression profiling of yeast with highly similar sorts of treatments cannot cluster this group of genes, as none of the genes are coordinately regulated in response to DNA damage (Gasch et al, 2001) or to stress (Gasch et al, 2000).

The nonessential genes involved in NER represent another well-studied pathway that is successfully clustered by phenotypic profiles as shown in Figure 1, but not by the other methods. Combining all the protein–protein interactions from the Yeast Grid database (http://biodata.mshri.on.ca/yeast_grid/servlet/SearchPage), a network of direct linkages can be constructed linking all but one, Rad2, of the known members of the NEFs (NER factors), although incorporating 81 other potential false-positive interactions as shown in Table II. Synthetic lethality fails to show any shared interactions with the NER genes, and this failure is not due to the lack of a coessential function, as 10 unique lethalities are annotated on the Yeast Grid website. As seen with the DNA-damage checkpoint genes, there is an overall lack of transcriptional coordination that would implicate the NER genes in a common function. Of the nine NER genes, only RAD16 and RAD4 share a similar expression profile as shown in Table II, and the large number of additional genes implicated by coordinated expression are not functionally related.

Table 2. Interacting proteins, synthetic lethal interactions, and coordinated gene expression for the nonessential genes in the DNA-damage checkpoint and NER pathways

Gene | Clustered by phenotypic profiling? | Interacting proteins by two-hybrid, co-IP and mass spec. analysis (a) | Common synthetic lethality (a) | Coordinated expression to stress (b) | Coordinated expression to DNA damage (b) | Cluster no. by integration analysis (c)
DNA damage checkpoint
DDC1 | Yes (0) | Mec3, Rad17 (3) | 4 (17) d | None (0) | None (0) | 8
MEC3 | Yes (0) | Ddc1, Rad17 (84) | 4 (1) | None (0) | None (0) | 14
RAD9 | Yes (0) | None (4) | 4 (11) d | None (0) | None (0) | 14
RAD17 | Yes (0) | Ddc1, Mec3 (3) | 4 (1) d | None (0) | None (0) | 14
RAD24 | Yes (0) | None (11) | 4 (6) d | None (0) | None (0) | 14
NER
RAD1 | Yes (0) | Rad10, Rad14 (33) | 0 (0) | None (0) | None (0) | NC
RAD10 | Yes (0) | Rad1 (13) | 0 (1) | None (0) | None (2) | 9
RAD14 | Yes (0) | Rad1, Rad4, Rad16 (3) | 0 (0) | None (0) | None (8) | 9
RAD4 | Yes (0) | Rad14, Rad23 (1) | 0 (3) | Rad16 (>20) | None (0) | 9
RAD23 | Yes (0) | Rad4 (9) | 0 (4) | None (0) | None (0) | NC
RAD2 | Yes (0) | None (1) | 0 (0) | None (>20) | None (1) | 9
RAD7 | Yes (0) | Rad16, Elc1 (1) | 0 (0) | None (0) | None (4) | 9
RAD16 | No (0) | Rad14, Rad7 (20) | 0 (2) | Rad4 (>20) | None (10) | 9
ELC1 | Yes (0) | Rad7 (0) | 0 (0) | None (0) | None (0) | NC

(a) Interaction data show the gene names of intrapathway interactions, with the number of additional nonpathway interactions in parentheses, obtained from Yeast Grid, as well as the number of synthetically lethal interactions found at http://biodata.mshri.on.ca/yeast_grid/servlet/SearchPage.
(b) The number of genes that are coordinately regulated using either response to DNA damage (Gasch et al, 2001) or to stress (Gasch et al, 2000) with a Pearson correlation of >0.8 to the query gene, from http://db.yeastgenome.org/cgi-bin/expression/expressionConnection.pl.
(c) Coclusters identified by probabilistic functional analysis by Lee et al (2004). Cluster number is given, or NC for genes that failed to cluster.
(d) Genes that are coclustered by Tong et al (2004).
Recently, efforts to combine the data sets and filter out the inherent noise have improved the ability to predict functional clusters of genes (Troyanskaya et al, 2003; Lee et al, 2004). As shown in Table II, combining the data sets filtered for quality is an improvement over any of the individual methods. The fact that phenotypic profiling is as good as, if not better than, all of the other methods combined for at least these two functional groups demonstrates the interrogative power of the methodology.

Analysis of gene clusters by GO

A critical test of the value of phenotypic profiling is to identify the function of genes of previously unknown function. To determine the feasibility of this on a genome-wide scale, we first performed a rigorous statistical analysis (see Materials and methods) to divide the hierarchical cluster into subclusters of genes, such that the correlations by which the members of a subcluster were joined are significant. Using a false discovery rate (FDR) of 10%, we found 630 nonoverlapping subclusters, containing 3084 of the original 4281 genes. Some 860 and 1151 genes in these 630 subclusters are of unknown biological process or molecular function, respectively (not counting ‘dubious’ ORFs). For the remaining 1197 nonessential genes not currently assigned to a cluster at this cutoff, we have failed to elicit a significant shared phenotype for functionally related genes, suggesting that testing of more conditions designed to probe other cellular functions would cluster more functionally related genes.

Next, we used GO (Ashburner et al, 2000), a set of three structured, controlled vocabularies that define the biological processes, molecular functions and cellular components of gene products, in conjunction with GO annotations for yeast gene products curated by the Saccharomyces Genome Database (http://www.yeastgenome.org/GOContents.shtml).
Using these, we determined whether GO annotations were significantly enriched among the genes within each significant subcluster using GO::TermFinder (Boyle et al, 2004; http://search.cpan.org/dist/GO-TermFinder/), which, when given a list of genes, determines whether any of the GO terms are significantly enriched as compared to the background of GO annotations in the population of all genes. A caveat to this is that, as a large proportion of the GO terms are based on mutant phenotypes, the process of relating phenotypes to GO terms is somewhat circular. However, there is at present no alternative gold standard against which to test functional clusters.

Of the 630 subclusters generated, 84 showed significant associations (with a Bonferroni-corrected P-value, to allow for multiple hypothesis testing, of less than 0.01) with one or more biological processes, 51 with molecular functions and 61 with cellular components. The lack of significance of the majority of the subclusters with GO terms does not mean that the genes in these clusters are not functionally related, because (i) we have tested only 51 conditions, and therefore have not interrogated all the biological processes in the cell, and (ii) 176 of the subclusters had only two genes, thereby precluding statistical significance.

We have developed a web-based tool for rapidly browsing the results of our analyses that displays the GO structure and the phenotypic profiling data for our significant subclusters (http://microarray-pubs.stanford.edu/phenotypic_profiling/index.shtml). This should prove useful for other researchers to find genes clustered with their genes or process of interest, and its utility will grow as new phenotypic profiles interrogating additional cellular processes are added.
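
The enrichment test described above can be illustrated with a short Python sketch. It is not GO::TermFinder (a Perl module) and does not reproduce its interface. It assumes a one-sided hypergeometric test, with a Bonferroni correction over the terms that annotate at least one gene in the cluster. The gene identifiers, term sizes and the `enriched_terms` helper are hypothetical; only the population size of 4281 and the 0.01 threshold come from the text.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k): chance of seeing at least k annotated genes in a cluster of n,
    drawn without replacement from N genes of which K carry the annotation."""
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / comb(N, n)


def enriched_terms(cluster, annotations, population, alpha=0.01):
    """Return {term: Bonferroni-corrected p} for terms below alpha (hypothetical helper)."""
    cluster = set(cluster)
    N, n = len(population), len(cluster)
    # Only terms that annotate at least one cluster gene are tested.
    tested = [t for t, genes in annotations.items() if cluster & genes]
    results = {}
    for term in tested:
        K = len(annotations[term] & population)
        k = len(annotations[term] & cluster)
        p = min(1.0, hypergeom_tail(k, n, K, N) * len(tested))
        if p < alpha:
            results[term] = p
    return results


# Hypothetical background of 4281 genes and two made-up annotation sets.
population = {f"g{i}" for i in range(4281)}
annotations = {
    "DNA repair": {f"g{i}" for i in range(100)},
    "transport": {f"g{i}" for i in range(4000, 4200)},
}
cluster = ["g0", "g1", "g2", "g3", "g4000"]  # four of the five genes share one term

results = enriched_terms(cluster, annotations, population)
print(results)
```

The sketch also makes point (ii) above concrete: with only two genes in a cluster, even a perfect match to a moderately common term yields a tail probability that rarely survives correction.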
Comparison of phenotypic profiles as indicators of functional relationships with other genome-wide approaches

In order to compare the present phenotypic profiling method with the other genome-wide data sets, we evaluated the enrichment of known functional relationships between pairs of genes with highly correlated phenotypic profiles using the biological process GO as a gold standard (Ashburner et al, 2000). To test the predictive power of our data for biological processes that were directly targeted with our selection of 51 conditions and agents, we limited the gold standard for comparison to the GO terms: DNA repair (GO:0006281), amino-acid biosynthesis (GO:0008652), cell-cycle checkpoint (GO:0000075), response to osmotic stress (GO:0006970), aerobic respiration (GO:0009060) and galactose metabolism (GO:0006012). Figure 2A illustrates the precision–recall characteristics of Pearson's correlations over pairs of phenotype profiles relative to a variety of high-throughput genomic data types (see Supplementary Information II). For the biological processes our study focuses on, the phenotype data are both more precise and more sensitive than any of the other evidence types evaluated. For instance, at comparable specificity, phenotypic profile correlations predict five-fold more gene relationships than synthetic lethal interactions and 10-fold more than either high-throughput yeast two-hybrid or affinity precipitation experiments. For the processes evaluated here, the phenotypic data also provide more predictive power than microarray expression correlation over a variety of conditions. At the highest precision achieved by microarray correlation (∼7% at 0.6% recall), the phenotype data predict the same number of functional relationships at seven-fold higher precision (50% compared to 7%).

Figure 2. Precision–recall evaluation of phenotype data on GO biological processes. (A) The predictive power of phenotype profile correlations was evaluated against a gold standard based on six biological processes as defined by the GO: DNA repair, amino-acid biosynthesis, cell cycle checkpoint, response to osmotic stress, aerobic respiration, and galactose metabolism. The fraction of known functionally related gene pairs to total predictions (precision) at a range of thresholds is plotted versus the percentage of the number of known gene relationships recovered (recall). The characteristics of other high-throughput experimental data (affinity precipitation, yeast two hybrid, synthetic lethality, transcription factor binding site data, microarray correlation, and functional data derived from Hughes et al (2000)) are shown for comparison. Two supervised feature selection methods were used to select the relevant features from the diverse collection of microarray data, one selecting single data set features independently and the other including or excluding entire data sets. The phenotype data are both more sensitive and more precise than other high-throughput data on this set of processes. The phenotype profiles were also evaluated against a more general set of GO terms for comparison against existing data including (B) and excluding (C) the ribosome biogenesis GO term (GO:0007046), which tends to dominate gene pairs implicated by coexpression. The phenotype profiles implicate gene relationships over a broad range of biological processes.

Since the expression correlation used for comparison was computed over a set of 11 diverse data sets, one might expect that functional signal for the processes evaluated here might be obscured by the variety of other relationships present in the data. To test this hypothesis, we applied two different approaches to supervised feature selection on the 11 microarray data sets.
First, we used a rank-sum statistic to test each microarray experiment (column) individually and selected the most functionally relevant set for the six GO terms of interest (Figure 2A, single feature selection). We also tried a supervised feature selection at the data set level, in which sets of experiments from the same data set were either all included or all excluded based on a comparison of correlations between the genes of interest with random pairs of genes from the same data set (Figure 2A, data set selection). Details of these approaches are discussed in Materials and methods. While supervised feature selection amplifies the functional signal present in correlations of expression profiles, neither approach yields results comparable to unsupervised correlations on the phenotype profiles. For instance, at 50% precision, the phenotype data predict 10-fold more gene relationships than the microarray data set feature selection, the more successful of the two supervised approaches (Figure 2A). Overall, the phenotypic data are clearly superior to existing high-throughput studies in predicting functional relationships specific to the processes interrogated.

We have also studied the enrichment of gene relationships in highly correlated phenotypic profiles across a broader range of biological processes. Figure 2B and C illustrate the precision–recall characteristics of our data compared to other high-throughput studies evaluated against a more general gold standard based on the biological process GO (see Materials and methods for details). Although the phenotypic data are not as precise or sensitive at predicting general functional relationships as they are in the target processes, they compare favorably with previous studies in this general evaluation. This is particularly evident if we exclude the ribosome biogenesis GO term (GO:0007046), which often dominates the gene relationships implicated by microarray coexpression (Figure 2C).
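
The precision–recall evaluation used throughout this comparison can be sketched as follows. This is a simplified illustration under stated assumptions: every scored gene pair absent from the gold standard is counted as a false positive, whereas the actual evaluation may define negatives differently (see Materials and methods). The pair scores and the `precision_recall_curve` helper are hypothetical.

```python
def precision_recall_curve(scored_pairs, gold_positive):
    """Walk down the pairs ranked by score and record (recall, precision).

    scored_pairs:  {gene_pair: correlation score}
    gold_positive: set of gene pairs known to share a biological process
    """
    ranked = sorted(scored_pairs, key=scored_pairs.get, reverse=True)
    curve, true_pos = [], 0
    for rank, pair in enumerate(ranked, start=1):
        true_pos += pair in gold_positive          # bool counts as 0 or 1
        recall = true_pos / len(gold_positive)     # known relationships recovered
        precision = true_pos / rank                # correct among predictions so far
        curve.append((recall, precision))
    return curve


# Hypothetical profile correlations for five gene pairs (values are made up).
scores = {
    ("RAD17", "RAD24"): 0.95,
    ("RAD1", "RAD10"): 0.90,
    ("RAD17", "GAL1"): 0.60,
    ("GAL1", "GAL7"): 0.55,
    ("RAD1", "GAL7"): 0.10,
}
gold = {("RAD17", "RAD24"), ("RAD1", "RAD10"), ("GAL1", "GAL7")}

curve = precision_recall_curve(scores, gold)
for recall, precision in curve:
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

Comparing two data types at a fixed precision, as the text does, amounts to reading off the recall (or the number of correct pairs) each curve reaches at that precision.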
Excluding this GO term, we find that the phenotype data can predict 100 gene–gene relationships correctly at 67% precision, while microarray coexpression and the Hughes et al (2000) data set, a functional profiling of deletion mutants, both predict 100 correct relationships at 30% precision.

Identifying the function of unknown genes coclustered with known genes

To test the hypothesis that an uncharacterized gene would function in the same pathway as the other genes in that cluster, we chose one of the subclusters identified by the GO analysis that included an ORF of unknown function. We chose the RIM subcluster (Supplementary Figure S1, Supplementary Information II), which contains many of the known proteins invo

Reference(s)