Pooled ORF Expression Technology (POET)
2005; Elsevier BV; Volume: 4; Issue: 11; Language: English
10.1074/mcp.m500128-mcp200
ISSN: 1535-9484
Authors: William Gillette, Dominic Esposito, Peter Frank, Ming Zhou, Li‐Rong Yu, Catherine Jozwik, Xiuying Zhang, Brighid McGowan, David M. Jacobowitz, Harvey B. Pollard, Tong Hao, David E. Hill, Marc Vidal, Thomas P. Conrads, Timothy D. Veenstra, James L. Hartley
Topic(s): Genomics and Phylogenetic Studies
Abstract: We have developed a pooled ORF expression technology, POET, that uses recombinational cloning and proteomic methods (two-dimensional gel electrophoresis and mass spectrometry) to identify ORFs that, when expressed, are likely to yield high levels of soluble, purified protein. Because the method works on pools of ORFs, the procedures needed to subclone, express, purify, and assay protein expression for hundreds of clones are greatly simplified. Small scale expression and purification of 12 positive clones identified by POET from a pool of 688 Caenorhabditis elegans ORFs expressed in Escherichia coli yielded on average 6 times as much protein as 12 negative clones. Larger scale expression and purification of six of the positive clones yielded 47–374 mg of purified protein/liter. Using POET, pools of ORFs can be constructed, and the pools of the resulting proteins can be analyzed and manipulated to rapidly acquire information about the attributes of hundreds of proteins simultaneously.
Projects aiming to convert the thousands of genes made accessible by genomic sequences into their corresponding proteins have met with limited success ([1] Service R.F. Tapping DNA for structures produces a trickle. Science. 2002; 298: 948-950) despite the expenditure of significant resources ([2] Lattman E. The state of the Protein Structure Initiative. Proteins. 2004; 54: 611-615; [3] Frazier M.E. Johnson G.M. Thomassen D.G. Oliver C.E. Patrinos A. Realizing the potential of the genome revolution: the genomes to life program. Science. 2003; 300: 290-293). Expression of recombinant proteins in Escherichia coli, the primary host organism for high throughput applications, has been especially unsuccessful for metazoan proteins. For example, one effort directed at producing Caenorhabditis elegans proteins successfully purified only about 2% of those attempted ([4] Luan C.H. Qiu S. Finley J.B. Carson M. Gray R.J. Huang W. Johnson D. Tsao J. Reboul J. Vaglio P. Hill D.E. Vidal M. Delucas L.J. Luo M. High-throughput expression of C. elegans proteins. Genome Res. 2004; 14: 2102-2110). In the standard approach to high throughput protein expression and purification used in these programs, genes are cloned individually into an expression vector, introduced into an expression host, and expressed in separate cultures. Each culture is subsequently tested for expression and solubility of the corresponding protein, at which point new cultures of positive clones are grown and induced so that protein purification can be attempted. Even with intensive use of robotics, the logistics and costs of this strategy are considerable when thousands of genes are put into such a pipeline.
Here we describe a method called pooled ORF expression technology (POET) that avoids many of the logistical issues associated with high throughput protein expression and purification. [Footnote 1: The abbreviations used are: POET, pooled ORF expression technology; attL, site-specific recombination sites for Gateway cloning; LR Clonase, site-specific recombination enzyme mixture for Gateway cloning; 2D, two-dimensional; 2DGE, two-dimensional gel electrophoresis; LIT, linear ion trap.] POET combines recombinational cloning and collections of sequenced ORFs with proteomic methods (two-dimensional gel electrophoresis (2DGE) and MS) to predict which ORFs in a pool will yield soluble, purified protein. We applied POET to a pool of 688 C. elegans ORFs. A high percentage of ORFs identified in this experiment yielded expressed, soluble, purified proteins in agreement with POET predictions.

The C. elegans ORFeome version 1.1 has been described previously ([5] Li S. Armstrong C.M. Bertin N. Ge H. Milstein S. Boxem M. Vidalain P.O. Han J.D. Chesneau A. Hao T. Goldberg D.S. Li N. Martinez M. Rual J.F. Lamesch P. Xu L. Tewari M. Wong S.L. Zhang L.V. Berriz G.F. Jacotot L. Vaglio P. Reboul J. Hirozane-Kishikawa T. Li Q. Gabel H.W. Elewa A. Baumgartner B. Rose D.J. Yu H. Bosak S. Sequerra R. Fraser A. Mango S.E. Saxton W.M. Strome S. Van Den Heuvel S. Piano F. Vandenhaute J. Sardet C. Gerstein M. Doucette-Stamm L. Gunsalus K.C. Harper J.W. Cusick M.E. Roth F.P. Hill D.E. Vidal M. A map of the interactome network of the metazoan C. elegans. Science. 2004; 303: 540-543). The predicted initiating methionine of each ORF was changed to leucine (ATG → TTG), and the stop codon of each ORF was omitted.
The DNA concentrations of the 752 Gateway entry clones used in this experiment were determined by PicoGreen (Molecular Probes) fluorescence and used to calculate the molar concentration of each plasmid (based on the size of each ORF and the size of the pDONR201 backbone), which ranged from 0 to 8.73 nM. All wells containing […] 0.08 and charge state-dependent cross-correlation (Xcorr) criteria as follows were considered as legitimate identifications: >1.9 for +1 charged peptides, >2.2 for +2 charged peptides, and >3.1 for +3 charged peptides. Based on visual inspection of the 2D gel and mass spectrometer identifications of the 165 total spots, 12 individual ORFs were retrieved from the ORF plates. These were subcloned into pDest527 and expressed in E. coli Rosetta in 700-μl cultures in a 24-well dish to an A600 of 0.5, then transferred to 17 × 100-mm polypropylene tubes (Falcon 2059), cooled to 16 °C, induced with isopropyl 1-thio-β-d-galactopyranoside (0.5 mM), and expressed overnight at 16 °C. To determine the fraction of soluble and insoluble protein, cells were lysed with detergent (ReadyPreps, Epicenter), and soluble and insoluble fractions were applied to SDS-PAGE. Recombinant His6 fusion proteins were purified from the soluble fractions with Swell Gel beads (Pierce) and spin columns. Six ORFs chosen from this small scale experiment were grown at 1-liter scale and purified using the same procedure as the pool of 688 (above). Concentrations of the proteins in the small and large scale preparations were determined using the Bio-Rad protein assay.

The POET scheme is shown in Fig. 1. Hundreds of ORFs are pooled, and the pooled ORFs are subcloned en masse into a protein expression vector supplying an affinity purification tag. The resulting pool of expression plasmids is introduced into an appropriate host and expressed in a single culture of host cells, and the tagged expressed proteins are purified away from host proteins.
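The plasmid molarity calculation mentioned in the methods above (a PicoGreen mass concentration converted to a molar concentration using the ORF size plus the backbone size) can be sketched as follows. This is only a minimal illustration, not the authors' actual calculation: the function name, the input units (ng/µl), and the example sizes are assumptions, and the figure of roughly 650 g/mol per base pair is the usual approximation for double-stranded DNA rather than a value taken from the paper.

```python
# Approximate average mass of one double-stranded DNA base pair, in g/mol.
# This is a common rule-of-thumb value, not a number from the paper.
AVG_BP_MASS_G_PER_MOL = 650.0


def plasmid_molarity_nM(mass_conc_ng_per_ul: float, orf_bp: int, backbone_bp: int) -> float:
    """Convert a dsDNA mass concentration (ng/µl) into a molar concentration (nM).

    orf_bp and backbone_bp are the ORF length and the vector backbone length
    in base pairs; both are caller-supplied (hypothetical) inputs.
    """
    total_bp = orf_bp + backbone_bp
    if total_bp <= 0:
        raise ValueError("plasmid length must be positive")
    molar_mass = total_bp * AVG_BP_MASS_G_PER_MOL  # g/mol
    # 1 ng/µl equals 1e-3 g/L; dividing by g/mol gives mol/L; 1e9 converts to nM.
    return mass_conc_ng_per_ul * 1e-3 / molar_mass * 1e9


if __name__ == "__main__":
    # Hypothetical clone: 2,000-bp ORF in a 3,000-bp backbone at 3.25 ng/µl.
    print(plasmid_molarity_nM(3.25, orf_bp=2000, backbone_bp=3000))
```

Because the conversion divides by total plasmid length, two wells with the same mass concentration can differ noticeably in molarity when their ORFs differ in size, which is why the ORF size enters the calculation at all.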
(For a culture of volume V containing n ORFs, the protein from any particular ORF is derived from the cells in a volume V/n of the culture, whereas host proteins are derived from the cells in the full volume V. Thus from mass considerations host proteins are much more abundant than proteins from individual ORFs.) The mixture of ORF proteins is then separated by 2DGE, and individual proteins are identified by MS. Proteins in intensely staining spots are predicted to be expressed as abundant, soluble proteins that can be easily purified. These predictions are confirmed by retrieving and expressing individual ORFs from the original ORF collection.

A pool of 688 ORFs in the form of Gateway entry clones ([6] Hartley J.L. Temple G.F. Brasch M.A. DNA cloning using in vitro site-specific recombination. Genome Res. 2000; 10: 1788-1795) was created from the C. elegans ORFeome ([5]). Wide variations in ORF DNA concentration were corrected by first creating subpools of clones having similar concentrations and then combining the subpools volumetrically (overall molar variation ≤2-fold). The pooled ORFs were subcloned via a Gateway LR reaction into an E. coli expression vector (pDest527) that added a hexahistidine (His6) tag to the amino terminus of the protein expressed from each ORF.
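The V/n argument and the concentration-matched subpooling described above can be illustrated with a short sketch. The text does not say how the subpools were actually defined, so the greedy binning below is an assumption; so are the function names, the example clone names, and the decision to skip wells with zero measured concentration.

```python
def per_orf_culture_share(culture_volume_ml: float, n_orfs: int) -> float:
    """Culture volume effectively devoted to each ORF in a pooled culture (V/n)."""
    if n_orfs <= 0:
        raise ValueError("pool must contain at least one ORF")
    return culture_volume_ml / n_orfs


def two_fold_subpools(conc_nM_by_clone: dict[str, float], max_fold: float = 2.0) -> list[list[str]]:
    """Greedily bin clones so that, within a bin, max/min concentration <= max_fold.

    Clones are sorted by concentration; a new bin is opened whenever the next
    clone exceeds max_fold times the lowest concentration in the current bin.
    Wells with a non-positive concentration are skipped (an assumption here).
    """
    usable = sorted((conc, name) for name, conc in conc_nM_by_clone.items() if conc > 0)
    pools: list[list[str]] = []
    current: list[str] = []
    floor = 0.0
    for conc, name in usable:
        if current and conc > floor * max_fold:
            pools.append(current)
            current = []
        if not current:
            floor = conc  # lowest concentration in the newly opened bin
        current.append(name)
    if current:
        pools.append(current)
    return pools


if __name__ == "__main__":
    # 1-liter culture shared by 688 ORFs.
    print(round(per_orf_culture_share(1000.0, 688), 2))  # 1.45
    # Hypothetical molar concentrations (nM) for six wells.
    wells = {"a": 0.5, "b": 0.9, "c": 1.0, "d": 1.1, "e": 4.0, "f": 0.0}
    print(two_fold_subpools(wells))
```

Sorting first means each bin covers a contiguous concentration range, so the ≤2-fold limit only has to be checked against the lowest member of the bin. How the bins are then mixed so that every clone contributes a similar molar amount is a separate step that this sketch does not attempt.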
The LR reaction was transformed into a non-expression E. coli strain (DH5α), and purified DNA from this transformation was then transformed into E. coli strain Rosetta (DE3) for subsequent protein expression. Proteins were expressed from the pooled ORFs at 16 °C (1-liter culture, equivalent to 1.45 ml for each ORF in the pool), the E. coli cells were lysed by sonication, and soluble proteins were purified by IMAC. The purified protein pool was precipitated with acid, dissolved in urea/CHAPS buffer, and resolved by 2DGE (Fig. 2). A large number of spots on the gel were identified by MS to understand more fully the parameters important to the POET process. The most intense spots on the gel were identified as E. coli proteins DnaK, GroEL, SlyD, and OmpF (Fig. 2). Because the isoelectric focusing range on the gel was pH 4–7, about 200 of the 688 C. elegans proteins were predicted to appear on the 2D gel based on their calculated pI values. A total of 50 C. elegans proteins and 37 E. coli proteins were identified by MS of 165 spots selected from the gel for analysis (see the supplemental table). Twelve C. elegans proteins were chosen from the 2D gel in a blinded fashion using only the number of spots and their intensity and purity (i.e. spots containing more than one protein were given less weight) as selection parameters (Table I). These 12 positive proteins were identified by an average of nine different tryptic peptides. Twelve negative proteins were also chosen from the set of 688 as having a predicted pI between 4 and 7 and not being identified in the 2D gel spots that were examined. The 24 ORFs corresponding to these proteins were cloned individually into the same E. coli expression vector (amino-terminal His6 fusions) described above and expressed in 700-μl cultures at 16 °C, and the resulting proteins were affinity-purified with spin columns. Total and soluble proteins from each culture (Fig. 3, a and b) and comparison of the purified proteins (Fig.
3c and Table I) validate the predictions of the POET method. Proteins from the positive clones were much more likely to be soluble (Fig. 3, a versus b) and give abundant purified protein (Fig. 3c) than the negative clones. An average of 6 times as much protein was recovered from the positive clones as from the negative clones (Table I). Most of the proteins migrated in accord with their predicted molecular weights. Proteins in lanes 2 ("protein with tau-like repeats, isoform a") and 6 ("hypothetical protein F09G2.9") had extensive predicted hydrophobic regions that may account for their abnormally low electrophoretic mobilities in both the 2D (Fig. 2) and one-dimensional (Fig. 3) gels. We speculate that the "negative" protein in lane 13 was not identified on the 2D gel because its solubility was low at its isoelectric point, and it failed to leave the isoelectric focusing strip.

Table I. Expression and purification of 12 POET positive (1–12) and negative (13–24) ORFs

| Lane no. | Protein annotation | Swiss-Prot accession no. | Protein molecular weight (a) | Yield, small scale (mg/liter) | Yield, large scale (mg/liter) |
| --- | --- | --- | --- | --- | --- |
| 1 | TAB1-like protein TAP-1 | Q93375 | 47,735 | 37 | ND |
| 2 | Protein with tau-like repeats, isoform a | O02592 | 53,682 | 5 | ND |
| 3 | Machado-Joseph disease-like protein | O17850 | 40,130 | 10 | ND |
| 4 | Hypothetical protein C17G10.2 | Q09974 | 53,316 | 98 | 100 |
| 5 | Skp1p homolog (SKR-12) | Q22871 | 23,192 | 34 | ND |
| 6 | Hypothetical protein F09G2.9 | O17406 | 48,553 | 77 | 47 |
| 7 | Bag1 homolog protein 1 | O44739 | 28,276 | 32 | ND |
| 8 | Hypothetical protein F53F4.3 | Q20728 | 29,707 | 144 | 364 |
| 9 | Hypothetical protein D2096.8 | Q19007 | 39,931 | 130 | 374 |
| 10 | 14-3-3-like protein 1 | P41932 | 32,428 | 59 | 60 |
| 11 | Troponin C, isoform 2 | Q09665 | 22,493 | 163 | ND |
| 12 | Troponin T | Q27371 | 51,307 | 80 | 125 |
| 13 | Hypothetical protein F46A9.5 | Q93647 | 24,298 | 76 | ND |
| 14 | Mitochondrial import receptor subunit TOM20 homolog | Q19766 | 24,692 | 20 | ND |
| 15 | Probable mediator complex subunit soh-1 | P91869 | 23,744 | 9 | ND |
| 16 | Eukaryotic peptide chain release factor subunit 1 | O16520 | 53,484 | 0 | ND |
| 17 | CCR4-NOT transcription complex subunit 7 | Q17345 | 38,104 | 0 | ND |
| 18 | Probable coatomer ζ subunit | O17901 | 25,023 | 0 | ND |
| 19 | Glutathione S-transferase 3 | O16116 | 28,001 | 31 | ND |
| 20 | Spermatocyte protein spe-27 (precursor) | P54218 | 19,449 | 0 | ND |
| 21 | CCR4-NOT transcription complex subunit 7 | Q17345 | 38,104 | 0 | ND |
| 22 | Myoblast determination protein 1 homolog | P22980 | 40,715 | 0 | ND |
| 23 | G2/mitotic-specific cyclin A1 | P34638 | 59,776 | 0 | ND |
| 24 | Hypothetical protein B0035.11 | Q17431 | 52,329 | 0 | ND |

(a) Predicted molecular weight of the protein expressed in pDest527.

To verify that the small scale experiments could predict successful larger scale behavior, six of the positive ORFs were expressed in E. coli individually in 1-liter cultures. Soluble proteins were released from cells by sonication and ultracentrifugation and purified on preparative IMAC columns (Fig. 4 and Table I). All six ORFs yielded large amounts (47–374 mg/liter) of purified protein.

POET is a procedure for finding which ORFs in a collection of hundreds of ORFs can be most efficiently converted, by cloning, expression, and purification, into their corresponding recombinant proteins. By first combining n ORFs into a single pool, tasks that are difficult to accomplish hundreds of times (transformation, plating, colony picking, culture, induction, lysis, assays of solubility, and purification) are reduced in number n-fold. The problem that then arises, of course, is how to identify which proteins in the purified pool are the most abundant. MS can identify proteins with spectacular sensitivity, but it is also dramatically non-quantitative (due to the unpredictable ionization of peptides of different amino acid sequence). Thus MS alone, although it can identify all the proteins in the pool, cannot distinguish between the most abundant and least abundant proteins in that pool. (Isotopic labeling methods such as ICAT ([10] Gygi S.P. Rist B. Gerber S.A. Turecek F. Gelb M.H. Aebersold R. Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat. Biotechnol.
1999; 17: 994-999) are not useful, because they give relative quantitation of the same protein in two different samples, whereas POET requires relative quantitation of many different proteins in one sample.) We chose 2DGE to determine abundance of the expressed, purified proteins in our POET pool. 2D gels have limitations of size and pI, and running them is not trivial. But they can resolve thousands of proteins, and often individual spots contain a single protein (see the supplemental table). For the purposes of the present study we assumed that the size and intensity of stained spots were a reasonable indicator of the abundance of each protein. Combinations of liquid chromatography columns could be used to generate dozens of fractions of the protein pool, but few if any fractions would contain single proteins, and quantitation would thus be unsatisfactory. We wished to identify a large number of the spots on the 2D gel so that we could understand the parameters of the POET experiment more completely. Many of the highly abundant spots that were identified were E. coli stress response proteins that would not require reidentification in the analysis of subsequent ORF pools, allowing the focus to be only on new spots from C. elegans clones. Because many of the manual steps at the end of POET can be automated (spot identification, picking, digestion, MALDI plate preparation, and MALDI-TOF/TOF peptide identification), we estimate the net efficiency gain for POET to be 10–100-fold when compared with automated or manual one-by-one methods, respectively.

Although the POET scheme is straightforward, we recognize there are underlying assumptions that can affect its results. 1) All the ORFs in a pool should retain their representation during subcloning and subsequent transfer into expression hosts. Recombinational cloning appears to be essential to minimize size bias and maximize efficiency.
Loss of some clones due to toxic effects of ORF expression will tend to increase the representation of remaining clones on the 2D gels. 2) POET assumes that the intensities of spots on 2D gels reflect the amounts of each protein in the purified pool prior to electrophoresis. However, not all proteins remain soluble during isoelectric focusing, and some proteins migrate as multiple spots. 3) Soluble proteins may interact as they are released from cells. The assumption is that as the pool is purified the effect of any one recombinant protein on the behavior of any other member of the pool is small. 4) More than one ORF may be expressed in a particular host cell. POET assumes that coexpression of any two ORFs in the same cell will be distributed more or less randomly among all the ORFs in the pool, and the influence of any one ORF on the behavior of any other ORF is small.

We foresee numerous applications of POET. 1) The large data sets that can be produced by POET experiments could provide researchers with a priori knowledge of the most appropriate context in which to express and purify any candidate protein of interest. 2) Because the solubility of overexpressed proteins is often low, one could take the insoluble fraction of proteins from a POET experiment, divide the insoluble proteins into aliquots, subject each aliquot to a different refolding regimen, and identify which procedure works best for any protein in the pool, thus obtaining hundreds of results for each refolding protocol. 3) Small differences in amino acid sequence can cause vastly different behaviors of proteins during overexpression and purification. Using POET, proteins from mouse, rat, and other model organisms can be attempted if purification of the homologous human proteins fails.
4) Because membrane proteins are difficult to extract and purify, ORF pools comprising the extracellular and/or intracellular domains of hundreds of membrane proteins could be constructed, expressed, purified, and analyzed in POET experiments. 5) The optimal number of ORFs in a pool can be adjusted in conjunction with protein expression, purification, and separation parameters. Clearly, the larger the number of ORFs in each pool, the fewer overall experiments are required. However, as the number of ORFs increases the average intensity of each ORF spot on the 2D gel decreases, and the intensities of ORF spots decrease compared with spots from expression host proteins. Analysis of relatively large pools of ORFs should be improved by comparison of multiple pools that contain unrelated ORFs because host proteins in each pool are relatively constant and can be ignored.

Gene libraries such as the National Institutes of Health Mammalian Gene Collection (mgc.nci.nih.gov/) and collections of ORFs from Homo sapiens ([11] Rual J.F. Hirozane-Kishikawa T. Hao T. Bertin N. Li S. Dricot A. Li N. Rosenberg J. Lamesch P. Vidalain P.O. Clingingsmith T.R. Hartley J.L. Esposito D. Cheo D. Moore T. Simmons B. Sequerra R. Bosak S. Doucette-Stamm L. Le Peuch C. Vandenhaute J. Cusick M.E. Albala J.S. Hill D.E. Vidal M. Human ORFeome version 1.1: a platform for reverse proteomics. Genome Res. 2004; 14: 2128-2135; [12] Pearlberg J. LaBaer J. Protein expression clone repositories for functional proteomics. Curr. Opin. Chem. Biol. 2004; 8: 98-102; [13] Messina D.N. Glasscock J. Gish W. Lovett M. An ORFeome-based analysis of human transcription factor genes and the construction of a microarray to interrogate their expression. Genome Res. 2004; 14: 2041-2047; [14] Oyama M. Itagaki C. Hata H. Suzuki Y. Izumi T. Natsume T. Isobe T. Sugano S. Analysis of small human proteins reveals the translation of upstream open reading frames of mRNAs. Genome Res. 2004; 14: 2048-2052; [15] Wiemann S. Arlt D. Huber W. Wellenreuther R. Schleeger S. Mehrle A. Bechtel S. Sauermann M. Korf U. Pepperkok R. Sultmann H. Poustka A. From ORFeome to biology: a functional genomics pipeline. Genome Res. 2004; 14: 2136-2144; [16] Collins J.E. Wright C.L. Edwards C.A. Davis M.P. Grinham J.A. Cole C.G. Goward M.E. Aguado B. Mallya M. Mokrab Y. Huckle E.J. Beare D.M. Dunham I. A genome annotation-driven approach to cloning the human ORFeome. Genome Biol. 2004; 5: R84) and other model organisms ([5]; [17] Labaer J. Qiu Q. Anumanthan A. Mar W. Zuo D. Murthy T.V. Taycher H. Halleck A. Hainsworth E. Lory S. Brizuela L. The Pseudomonas aeruginosa PA01 gene collection. Genome Res. 2004; 14: 2190-2200; [18] Dricot A. Rual J.F. Lamesch P. Bertin N. Dupuy D. Hao T. Lambert C. Hallez R. Delroisse J.M. Vandenhaute J. Lopez-Goni I. Moriyon I. Garcia-Lobo J.M. Sangari F.J. Macmillan A.P. Cutler S.J. Whatmore A.M. Bozak S. Sequerra R. Doucette-Stamm L. Vidal M. Hill D.E. Letesson J.J. De Bolle X. Generation of the Brucella melitensis ORFeome version 1.1. Genome Res. 2004; 14: 2201-2206; [19] Bonaldo M.F. Bair T.B. Scheetz T.E. Snir E. Akabogu I. Bair J.L. Berger B. Crouch K. Davis A. Eyestone M.E. Keppel C. Kucaba T.A. Lebeck M. Lin J.L. de Melo A.I. Rehmann J. Reiter R.S. Schaefer K. Smith C. Tack D. Trout K. Sheffield V.C. Lin J.J. Casavant T.L. Soares M.B. 1274 full-open reading frames of transcripts expressed in the developing mouse nervous system. Genome Res. 2004; 14: 2053-2063; [20] Wilm M. Shevchenko A. Houthaeve T. Breit S. Schweigerer L. Fotsis T. Mann M. Femtomole sequencing of proteins from polyacrylamide gels by nano-electrospray mass spectrometry. Nature. 1996; 379: 466-469; [21] Hilson P. Allemeersch J. Altmann T. Aubourg S. Avon A. Beynon J. Bhalerao R.P. Bitton F. Caboche M. Cannoot B. Chardakov V. Cognet-Holliger C. Colot V. Crowe M. Darimont C. Durinck S. Eickhoff H. de Longevialle A.F. Farmer E.E. Grant M. Kuiper M.T. Lehrach H. Leon C. Leyva A. Lundeberg J. Lurin C. Moreau Y. Nietfeld W. Paz-Ares J. Reymond P. Rouze P. Sandberg G. Segura M.D. Serizet C. Tabrett A. Taconnat L. Thareau V. Van Hummelen P. Vercruysse S. Vuylsteke M. Weingartner M. Weisbeek P.J. Wirta V. Wittink F.R. Zabeau M. Small I. Versatile gene-specific sequence tags for Arabidopsis functional genomics: transcript profiling and reverse genetics applications. Genome Res. 2004; 14: 2176-2189) already exist or are coming into being. In combination with these resources, POET may help make large numbers of important proteins accessible to the scientific community.

We thank Robert Stephens of the Advanced Biomedical Computing Center, SAIC-Frederick, Inc., for valuable bioinformatics support and Sukanya Chowdhury, Megan Bucheimer, and Kelly Esposito of the Protein Expression Laboratory for skilled technical assistance.