Article Open access Peer reviewed

Analysis of High Throughput Protein Expression in Escherichia coli

2006; Elsevier BV; Volume: 5; Issue: 9; Language: English

10.1074/mcp.m600140-mcp200

ISSN

1535-9484

Authors

Yair Benita, Michael J. Wise, M.C. Lok, Ian Humphery‐Smith, Ronald S. Oosting

Topic(s)

Viral Infectious Diseases and Gene Expression in Insects

Abstract

The ability to efficiently produce hundreds of proteins in parallel is the most basic requirement of many aspects of proteomics. Overcoming the technical and financial barriers associated with high throughput protein production is essential for the development of an experimental platform to query and browse the protein content of a cell (e.g. protein and antibody arrays). Proteins are inherently different one from another in their physicochemical properties; therefore, no single protocol can be expected to successfully express most of the proteins. Instead of optimizing a protocol to express a specific protein, we used sequence analysis tools to estimate the probability of a specific protein to be expressed successfully using a given protocol, thereby avoiding a priori proteins with a low success probability. A set of 547 proteins, to be used for antibody production and selection, was expressed in Escherichia coli using a high throughput protein production pipeline. Protein properties derived from sequence alone were correlated to successful expression, and general guidelines are given to increase the efficiency of similar pipelines. A second set of 68 proteins was expressed to investigate the link between successful protein expression and inclusion body formation. More proteins were expressed in inclusion bodies; however, the formation of inclusion bodies was not a requirement for successful expression.

The completion of the human genome project and the biotechnical advances in the field of genomics have radically transformed biological and medical research. We now have the ability to monitor the mRNA expression of thousands of genes simultaneously in cells and tissues. However, it is the proteins encoded by these genes that carry out most biological functions. The proteome is much more daunting in size and complexity than the genome, and to understand how cells work we must study which proteins are present, how they interact with each other, and what they do. The difficulty of studying proteins is that they are each distinctively different from the other and are usually present in tissue in very low amounts. In the absence of a PCR equivalent, it has been suggested to call upon affinity ligands, such as monoclonal antibodies, for detection and identification of proteins (1: Humphery-Smith I. A human proteome project with a beginning and an end. Proteomics. 2004; 4: 2519-2521).
Regardless of the specific affinity ligand used, purified proteins must first be acquired in large quantities for generation and/or selection of specific affinity ligands. Thus, there is a need to define expression and purification conditions that are amenable to hundreds or even thousands of proteins in parallel. However, because proteins differ significantly in their physicochemical properties, the success rate of high throughput protein production is often too low, increasing the financial and technical constraints on such projects.

Several groups have previously attempted high throughput expression of proteins or protein fragments. High throughput is defined as the ability to automate protein production, often using a 96-well format. Braun et al. (2: Braun P. Hu Y. Shen B. Halleck A. Koundinya M. Harlow E. LaBaer J. Proteome-scale purification of human proteins from bacteria. Proc. Natl. Acad. Sci. U. S. A. 2002; 99: 2654-2659) expressed 336 randomly selected human cDNAs in Escherichia coli and successfully purified 60% under denaturing conditions using His6 constructs and 50% under non-denaturing conditions using GST constructs. Luan et al. (3: Luan C.H. Qiu S. Finley J.B. Carson M. Gray R.J. Huang W. Johnson D. Tsao J. Reboul J. Vaglio P. Hill D.E. Vidal M. Delucas L.J. Luo M. High-throughput expression of C. elegans proteins. Genome Res. 2004; 14: 2102-2110) expressed 10,176 Caenorhabditis elegans proteins using a robotic pipeline and observed an overall expression of 50% (15% in soluble form). Agaton et al. (4: Agaton C. Galli J. Höidén Guthenberg I. Janzon L. Hansson M. Asplund A. Brundell E. Lindberg S. Ruthberg I. Wester K. Wurtz D. Höög C. Lundeberg J. Ståhl S. Pontén F. Uhlén M. Affinity proteomics for systematic protein profiling of chromosome 21 gene products in human tissues. Mol. Cell. Proteomics.
2003; 2: 405-414) reported a success rate of 76% for the expression of 142 human proteins in E. coli. Other groups reported success rates in the range of 60–80% (5: Christendat D. Yee A. Dharamsi A. Kluger Y. Gerstein M. Arrowsmith C.H. Edwards A.M. Structural proteomics: prospects for high throughput sample preparation. Prog. Biophys. Mol. Biol. 2000; 73: 339-345, 6: Pizza M. Scarlato V. Masignani V. Giuliani M.M. Aricò B. Comanducci M. Jennings G.T. Baldi L. Bartolini E. Capecchi B. Galeotti C.L. Luzzi E. Manetti R. Marchetti E. Mora M. Nuti S. Ratti G. Santini L. Savino S. Scarselli M. Storni E. Zuo P. Broeker M. Hundt E. Knapp B. Blair E. Mason T. Tettelin H. Hood D.W. Jeffries A.C. Saunders N.J. Granoff D.M. Venter J.C. Moxon E.R. Grandi G. Rappuoli R. Identification of vaccine candidates against serogroup B meningococcus by whole-genome sequencing. Science. 2000; 287: 1816-1820, 7: Dobrovetsky E. Lu M.L. Andorn-Broza R. Khutoreskaya G. Bray J.E. Savchenko A. Arrowsmith C.H. Edwards A.M. Koth C.M. High-throughput production of prokaryotic membrane proteins. J. Struct. Funct. Genomics. 2005; 6: 33-50).

The three-dimensional structure of a protein can often provide functional clues, primarily by detecting structural homology with a protein of known function (8: Cort J.R. Koonin E.V. Bash P.A. Kennedy M.A. A phylogenetic approach to target selection for structural genomics: solution structure of YciH. Nucleic Acids Res. 1999; 27: 4018-4027, 9: Zarembinski T.I. Hung L.W. Mueller-Dieckmann H.J. Kim K.K. Yokota H. Kim R. Kim S.H. Structure-based assignment of the biochemical function of a hypothetical protein: a test case of structural genomics. Proc. Natl. Acad. Sci. U. S. A. 1998; 95: 15189-15193).
Structural proteomics attempts to determine protein structure on a genome-wide scale. It not only requires high throughput expression of target proteins but also that the proteins be produced in a form that is soluble, correctly folded, and suitable for x-ray crystallography or NMR studies. Previous attempts to produce proteins on a large scale for structural studies resulted in success rates of ∼10% (10: Bertone P. Kluger Y. Lan N. Zheng D. Christendat D. Yee A. Edwards A.M. Arrowsmith C.H. Montelione G.T. Gerstein M. SPINE: an integrated tracking database and data mining approach for identifying feasible targets in high-throughput structural proteomics. Nucleic Acids Res. 2001; 29: 2884-2898, 11: Goh C.S. Lan N. Echols N. Douglas S.M. Milburn D. Bertone P. Xiao R. Ma L.C. Zheng D. Wunderlich Z. Acton T. Montelione G.T. Gerstein M. SPINE 2: a system for collaborative structural proteomics within a federated database framework. Nucleic Acids Res. 2003; 31: 2833-2838). This low success rate motivated studies that attempted to link the primary sequence of a protein to its propensity to be soluble upon overexpression in E. coli (10, 11, 12: Idicula-Thomas S. Balaji P.V.
Understanding the relationship between the primary structure of proteins and its propensity to be soluble on overexpression in Escherichia coli. Protein Sci. 2005; 14: 582-592, 13: Shimada K. Nagano M. Kawai M. Koga H. Influences of amino acid features of glutathione S-transferase fusion proteins on their solubility. Proteomics. 2005; 5: 3859-3863). On the other hand, protein production for affinity ligands does not necessarily require the heterologous protein to be soluble. Agaton et al. (4) reported a success rate of 56% for eliciting affinity-purified antibodies against proteins that were expressed in E. coli and purified under denaturing conditions. In this respect protein production for affinity ligands is significantly less demanding than production for structural studies.

To better cope with the financial constraints of high throughput protein production, it would be beneficial to identify a priori proteins that are likely to fail expression in a pipeline designed for affinity ligand target generation. Although prediction of protein solubility upon overexpression has drawn scientific attention, prediction of successful expression has been largely disregarded. Prediction of protein expression is bound to be more complicated because expression can fail in any of several different steps from plasmid construct stability to the final purified protein. Many of those steps, such as mRNA decay, are not necessarily related to the primary protein sequence or to the physicochemical properties of the amino acids.
Solubility, on the other hand, is more likely to be dependent on the amino acid composition of the protein. In this study we present results on the expression of 547 recombinant proteins, produced as targets for affinity ligand generation, and investigate the link between their DNA and protein sequences and successful expression. Finally we investigate the relationship between solubility and expression level on a set of 68 human proteins.

We randomly selected 615 human ORFs, 547 for high throughput expression and 68 for inclusion body analysis, from disease-related genes available in publicly accessible clone libraries in late 2001 and retrieved their DNA coding sequence from GenBank™ (www.ncbi.nlm.nih.gov). Coding sequences were compared with the human genome (GenBank™ build 25) using BLAST, and the exons were extracted and set in-frame. For ORFs containing multiple exons, the first was discarded to reduce the likelihood of a signal peptide, and from the remaining exons the longest was chosen. Primer selection criteria, genomic template, and PCR protocols for our protein production pipeline have been described previously (14: Benita Y. Oosting R.S. Lok M.C. Wise M.J. Humphery-Smith I. Regionalized GC content of template DNA as a predictor of PCR success. Nucleic Acids Res. 2003; 31: e99). The plasmid construct used, named HZS, contained a His6 tag, a ZZ domain, a Gateway-compatible insert, and a Strep-tag. The ZZ domain is the tandem repeat dimer of the modified immunoglobulin binding domain of protein A of Staphylococcus aureus (15: Nilsson B. Moks T. Jansson B. Abrahmsén L. Elmblad A. Holmgren E. Henrichson C. Jones T.A. Uhlén M. A synthetic IgG-binding domain based on staphylococcal protein A. Protein Eng. 1987; 1: 107-113). The Strep-tag (16: Skerra A. Schmidt T.G. Use of the Strep-Tag and streptavidin for detection and purification of recombinant proteins. Methods Enzymol.
2000; 326: 271-304) was constructed using custom oligos. Plasmid construction, gene cloning, and bacterial transformation and induction have been described previously by our group (17: Zhao Y. Benita Y. Lok M. Kuipers B. van der Ley P. Jiskoot W. Hennink W.E. Crommelin D.J. Oosting R.S. Multi-antigen immunization using IgG binding domain ZZ as carrier. Vaccine. 2005; 23: 5082-5090). The E. coli BL21 codon-plus RP strain (Stratagene) was used as the expression host. These cells contain extra copies of the argU and proL tRNA to enable expression of genes restricted by either AGG/AGA or CCC codons.

Protein purification for the high throughput protein pipeline was done under denaturing conditions. The bacteria were grown in 24 deep well plates. Each well contained 5 ml of LB medium supplemented with 50 μg/ml ampicillin and chloramphenicol. At the end of the 4-h isopropyl β-D-thiogalactopyranoside induction period, bacterial plates were centrifuged at 3500 rpm for 15 min. The supernatants were removed, and the bacterial pellets were resuspended each in 1 ml of lysis buffer containing 8 M urea (lysis buffer: 100 mM NaH2PO4, 20 mM Tris, 10% glycerol, 0.1% Tween 20, pH 8.0, 20 mM β-mercaptoethanol plus one tablet of Complete protease inhibitor (Roche Applied Science)). The content of each well was sonicated two times for 15 s with 10 s in between. Then the plates were centrifuged for 20 min at 3500 rpm. The next steps in the protein purification protocol were done using the Biorobot 8000 (Qiagen). Aliquots of 800 μl of the supernatants were transferred to a 96-well filter plate (Qiagen) containing 200 μl of Ni-NTA Superflow that was washed once with 500 μl of lysis buffer containing 8 M urea before applying the supernatant. A vacuum of 900 millibars was then applied for 3 min. The resin was successively washed with 4, 2, 1, and 0 M solutions of urea in lysis buffer. After each wash step vacuum was applied for 1.5 min at 900 millibars, and the flow-through was discarded. Finally 1 ml of elution buffer (50 mM NaH2PO4, 300 mM NaCl, 250 mM imidazole, pH = 8.0) was added to each well. After 10 min, vacuum was applied for 2 min at 700 millibars, and the eluate was collected in a deep well 96-well block.

(The abbreviations used are: Ni-NTA, nickel-nitrilotriacetic acid; aa, amino acids; AUC, area under the curve; CAI, codon adaptation index; POPPs, Protein or Oligonucleotide Probability Profiles; ANOVA, analysis of variance; 1D, one-dimensional.)

Protein purification for inclusion body analysis was performed separately for the soluble and insoluble fractions of the bacterial lysate. The bacterial pellet from 10 ml of induced bacteria was resuspended in 600 μl of B-PER (Bacterial Protein Extraction Reagent, Pierce) containing one tablet of Complete protease inhibitor (Roche Applied Science)/25 ml, vortexed for 1 min at 3000 rpm, and centrifuged for 10 min in a standard tabletop microcentrifuge at 13,000 rpm and 4 °C. The supernatant was removed and placed on a custom made column containing 100 μl of Ni-NTA Superflow. Columns were washed twice with 500 μl of wash buffer (50 mM NaH2PO4, 300 mM NaCl, and 20 mM imidazole, pH = 8.0) and eluted with 500 μl of elution buffer. The remaining pellet of the lysed bacteria containing the insoluble fraction was resuspended by sonication (2 × 5 s) in 1 ml of 8 M urea and centrifuged for 10 min at 13,000 rpm. The supernatant was then placed on columns containing 100 μl of Ni-NTA Superflow and washed with 4, 2, 1, and 0 M solutions of urea in BR buffer (0.1% Tween-20, 10% glycerol, 100 mM NaH2PO4, 20 mM Tris, pH = 8). Proteins were eluted with 500 μl of elution buffer. All proteins were visualized on Criterion Tris-HCl 12.5% precast polyacrylamide gels (Bio-Rad) with Coomassie staining.
A sequence analysis module was written in Python (Python Software Foundation) for this study and is being distributed as part of the SeqUtils module of Biopython (18: Chapman B. Chang J. Biopython: python tools for computational biology. ACM SIGBIO Newslett. 2000; 20: 15-19). An aromaticity score was calculated according to Lobry and Gautier (19: Lobry J.R. Gautier C. Hydrophobicity, expressivity and aromaticity are the major trends of amino-acid usage in 999 Escherichia coli chromosome-encoded genes. Nucleic Acids Res. 1994; 22: 3174-3180), and a protein instability index was calculated according to Guruprasad et al. (20: Guruprasad K. Reddy B.V. Pandit M.W. Correlation between stability of a protein and its dipeptide composition: a novel approach for predicting in vivo stability of a protein from its primary sequence. Protein Eng. 1990; 4: 155-161). Isoelectric point, charge, and amino acid content (aliphatic, aromatic, polar, non-polar, charged, basic, acidic, small, and tiny) were calculated using pepstats (21: Harrison R.G. Expression of soluble heterologous proteins via fusion with NusA protein. inNovations. 2000; 11: 4-7) from the EMBOSS package (22: Rice P. Longden I. Bleasby A. EMBOSS: The European Molecular Biology Open Software Suite. Trends Genet. 2000; 16: 276-277). Average and maximum protein flexibility were calculated according to Vihinen et al. (23: Vihinen M. Torkkila E. Riikonen P. Accuracy of protein flexibility predictions. Proteins. 1994; 19: 141-149). Protein disorder was calculated using FoldIndex (24: Prilusky J. Felder C.E. Zeev-Ben-Mordehai T. Rydberg E.H. Man O. Beckmann J.S. Silman I. Sussman J.L. FoldIndex: a simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics.
2005; 21: 3435-3438), and from the output the longest disorder segment and the total number of residues in disorder segments were extracted. DNA sequence complexity was calculated using both nSEG (25: Wootton J.C. Federhen S. Analysis of compositionally biased regions in sequence databases. Methods Enzymol. 1996; 266: 554-571) and G1 (26: Wan H. Wootton J.C. A global compositional complexity measure for biological sequences: AT-rich and GC-rich genomes encode less complex proteins. Comput. Chem. 2000; 24: 71-94). Protein secondary structure was assessed using garnier (27: Garnier J. Osguthorpe D.J. Robson B. Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol. 1978; 120: 97-120) from the EMBOSS package, and the fractions of α-helix, β-sheet, coil, and turn were calculated from the output. The garnier method is considered to have a low reliability in predicting the secondary structure of a protein; however, here we are not interested in the accuracy position by position but in the overall α-helix, β-sheet, coil, and turn propensities of the protein based on the tendency of individual amino acids to be present in one structure or another. Secondary structure of mRNA was predicted using mfold (28: Zuker M. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 2003; 31: 3406-3415), and the most stable structure was selected (lowest ΔG). Protein low complexity was calculated using 0j.py (29: Wise M.J. 0j.py: a software tool for low complexity proteins and protein domains. Bioinformatics. 2001; 17: S288-S295). Local GC content was calculated as previously described (14: Benita Y. Oosting R.S. Lok M.C. Wise M.J. Humphery-Smith I.
Regionalized GC content of template DNA as a predictor of PCR success. Nucleic Acids Res. 2003; 31: e99) (see also Fig. 1).

The grand average of hydropathy (GRAVY) was calculated according to Kyte and Doolittle (30: Kyte J. Doolittle R.F. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 1982; 157: 105-132). The GRAVY is a single average value for the entire protein, which in many cases may be misleading. For example, on average a protein could be hydrophilic but still have a large internal hydrophobic region. Therefore, we further calculated local hydrophilic and hydrophobic regions along the protein using the Kyte and Doolittle hydrophobicity plot generated with a sliding window of 11 amino acids. The area under the curve (AUC) and the area above/below 0 were calculated using the trapezoid method as shown in Fig. 1 for GC content. The area above 0 was labeled “sum hydrophobic AUC,” and the area below 0 was labeled “sum hydrophilic AUC.” The single largest local hydrophobic and hydrophilic regions were located and labeled “max hydrophobic AUC” and “max hydrophilic AUC,” respectively. The hydrophobic AUC and hydrophilic AUC were also normalized by dividing the area by the total number of amino acids in the entire sequence. In addition, the hydrophobic to hydrophilic ratio was calculated, i.e. the ratio of the sum of all hydrophobic regions to the sum of all hydrophilic regions.

Codon usage was calculated according to Sharp and Li (31: Sharp P.M. Li W.H. The Codon Adaptation Index--a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 1987; 15: 1281-1295). A set of 121 highly expressed E. coli proteins was selected from Swiss-2D PAGE (www.expasy.org/ch2d).
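The sliding-window hydropathy attributes described above can be illustrated with a minimal Python sketch. It is not the authors' SeqUtils code: the function names, the unit spacing between window positions, the reporting of hydrophilic areas as positive magnitudes, and the way a segment is split where the curve crosses 0 are all assumptions made for illustration. Only the Kyte and Doolittle scale, the window of 11 amino acids, and the trapezoid method come from the text.

```python
# Kyte-Doolittle hydropathy scale (Kyte and Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=11):
    """Mean hydropathy of every full window along the sequence."""
    scores = [KD[aa] for aa in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


def region_areas(profile):
    """Signed trapezoid area of each contiguous region above or below 0.

    Assumption: neighbouring window positions are one unit apart.
    """
    pieces = []
    for y0, y1 in zip(profile, profile[1:]):
        if y0 * y1 < 0:
            # The segment crosses 0: split it at the crossing point.
            t = y0 / (y0 - y1)
            pieces += [0.5 * y0 * t, 0.5 * y1 * (1 - t)]
        else:
            pieces.append(0.5 * (y0 + y1))
    regions, current = [], 0.0
    for piece in pieces:
        if current * piece < 0:  # sign change: close the current region
            regions.append(current)
            current = 0.0
        current += piece
    if current != 0:
        regions.append(current)
    return regions


def hydropathy_attributes(seq, window=11):
    """GRAVY plus the regional AUC attributes (hypothetical key names)."""
    regions = region_areas(hydropathy_profile(seq, window))
    hydrophobic = [a for a in regions if a > 0]
    hydrophilic = [-a for a in regions if a < 0]  # stored as magnitudes
    sum_pho, sum_phi = sum(hydrophobic), sum(hydrophilic)
    return {
        "gravy": sum(KD[aa] for aa in seq) / len(seq),
        "sum_hydrophobic_auc": sum_pho,
        "sum_hydrophilic_auc": sum_phi,
        "max_hydrophobic_auc": max(hydrophobic, default=0.0),
        "max_hydrophilic_auc": max(hydrophilic, default=0.0),
        "norm_hydrophobic_auc": sum_pho / len(seq),
        "norm_hydrophilic_auc": sum_phi / len(seq),
        # None when the sequence has no hydrophilic region at all.
        "hydrophobic_to_hydrophilic": sum_pho / sum_phi if sum_phi else None,
    }
```

A sequence made of 11 isoleucines followed by 11 arginines shows why the regional attributes add information: its GRAVY is 0, yet the profile contains one strongly hydrophobic and one strongly hydrophilic region of equal area.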
All selected proteins were identified on a two-dimensional protein gel and were present in large amounts with percent volume average above 0.2 as calculated using the software Melanie (Version 2, Swiss Institute of Bioinformatics, GeneBio, Geneva, Switzerland). This set of proteins is available upon request. The codon usage index was generated using a Python codon usage module (available through Biopython) and the CAI for each gene was calculated. Regionalized CAI values were calculated using a CAI plot that was generated using a sliding window of 4 codons. The area below a threshold and above the curve was calculated for several thresholds (Fig. 2). Both the sum of all areas and the single largest area were used for the analysis.

A modified version of the CAI, AAcai (amino acid codon adaptation index), was introduced by taking into account amino acid shortage due to overexpression of a protein with an amino acid content different from the average E. coli protein. This attribute is based on the observation that ribosomes translating a heterologous mRNA may stall at positions calling for a tRNA that is largely deacylated because of the heavier than normal drain of its amino acid into protein (32: Kurland C. Gallant J. Errors of heterologous protein expression. Curr. Opin. Biotechnol. 1996; 7: 489-493). In other words, the ribosome may stall even at an optimal codon if not enough amino acid is available to be loaded onto the tRNA. The average amino acid content of E. coli was calculated using the same set of 121 highly expressed E. coli proteins mentioned above. The amino acid content of each overexpressed sequence and the deviation from the average protein content were calculated. For each amino acid that was used more frequently than average the proportion of average usage to specific usage was calculated, and the index used to calculate the CAI value was adjusted accordingly.
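As a rough illustration of the CAI, the regionalized CAI, and the AAcai adjustment described above, the following Python sketch computes the CAI as the geometric mean of the relative adaptiveness values w (Sharp and Li), evaluates it in sliding windows of 4 codons, and rescales an index for amino acids that the target protein uses more often than average. It does not reproduce the Biopython module mentioned in the text; all function names, the decision to skip codons absent from the index, and the toy alanine index are assumptions for illustration only.

```python
import math


def split_codons(dna):
    """Split an in-frame coding sequence into codons, dropping any remainder."""
    usable = len(dna) - len(dna) % 3
    return [dna[i:i + 3] for i in range(0, usable, 3)]


def cai(codons, index):
    """Geometric mean of w over the codons found in the index (all w > 0)."""
    logs = [math.log(index[c]) for c in codons if c in index]
    if not logs:
        raise ValueError("no codon of the sequence is present in the index")
    return math.exp(sum(logs) / len(logs))


def windowed_cai(codons, index, window=4):
    """CAI of every full sliding window of `window` codons."""
    return [cai(codons[i:i + window], index)
            for i in range(len(codons) - window + 1)]


def rescale_for_amino_acid_usage(index, codon_to_aa, target_usage, average_usage):
    """Return a copy of the index adjusted for over-used amino acids.

    Usage values are fractions of the protein (0.20 means 20%).
    """
    adjusted = {}
    for codon, w in index.items():
        aa = codon_to_aa[codon]
        target = target_usage.get(aa, 0.0)
        average = average_usage.get(aa, 0.0)
        # Only amino acids over-represented in the target are down-weighted.
        factor = average / target if target > average else 1.0
        adjusted[codon] = w * factor
    return adjusted


# Hypothetical toy index for the four alanine codons (not measured values).
toy_index = {"GCG": 1.0, "GCC": 0.8, "GCA": 0.6, "GCT": 0.4}
toy_codon_to_aa = {codon: "A" for codon in toy_index}
adjusted = rescale_for_amino_acid_usage(
    toy_index, toy_codon_to_aa, {"A": 0.20}, {"A": 0.10})
```

With 20% alanine in the target protein against 10% on average, the toy index value of the most abundant alanine codon drops from 1 to 0.5 and the other alanine codons are scaled by the same factor. The adjusted index can then be passed to `cai` and `windowed_cai` unchanged.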
For instance, if a specific protein had 20% alanine and the average E. coli protein had 10% alanine, the most abundant alanine codon was rescaled from 1 to 0.5, and all other alanine codons were adjusted accordingly. The probability of finding a loaded alanine tRNA was reduced by 2-fold due to the 2-fold increase in usage of alanine in the heterologous protein. Once the index values were calibrated to the amino acid usage, the exact same methods that were described above for CAI were used.

Protein compositional bias was assessed using POPPs (Protein or Oligonucleotide Probability Profiles) (33: Wise M.J. The POPPs: clustering and searching using peptide probability profiles. Bioinformatics. 2002; 18: S38-S45), a suite of inter-related software tools that enable the user to discover statistically “unusual” peptides. POPPs were created for each protein sequence versus the E. coli and human proteomes, scaled to a sequence length of 100 amino acids. The E. coli proteome was created using E. coli K-12 genome annotations (GenBank™ accession number NC_000913). The human proteome was fetched from the International Protein Index (34: Kersey P.J. Duarte J. Williams A. Karavidopoulou Y. Birney E. Apweiler R. The International Protein Index: an integrated database for proteomics experiments. Proteomics. 2004; 4: 1985-1988). Redundant proteins with more than 99 and 98% similarity to another protein in the respective databases were removed using nrdb90 (35: Holm L. Sander C. Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics. 1998; 14: 423-429).

t tests and one-way ANOVA were performed using the stats.py and pstat.py modules (Version 0.6; G. Strangman, unpublished software).

Decision trees are predictive models that are often used in machine learning.
Here we used decision trees to classify proteins into expression groups based on DNA and protein sequence attributes. Inner nodes in the tree represented decision variables, e.g. degree of aromaticity, and leaf nodes represented the predicted expression groups. Decision trees were previously applied to similar data, linking sequence attributes to successful expression for structural analysis (10, 11). Decision trees were generated using the rpart module of the R statistical package (36: R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. 2006). The decision trees were pruned using a complexity value of 0.05. To minimize the effect of unequal group sizes, a random subset of proteins, equal in number to the smaller group, was drawn from the larger group. The process was repeated three times, generating three decision trees.

An initial set of 547 human exons, each representing a different gene, was transferred using the Gateway high throughput cloning system into the HZS vector construct. The protein fragments were a relatively small part of the entire recombinant protein. The average length of the protein insert was 76 ± 29 amino acids, corresponding to 8.6 ± 3.3 kDa.
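The group balancing used before fitting the decision trees (a random subset of the larger group, equal in size to the smaller group, drawn three times) can be sketched in Python as follows. This covers only the resampling step; the trees themselves were fitted with rpart in R and are not reproduced here. The function name, the `seed` parameter, and sampling without replacement are assumptions for illustration.

```python
import random


def balanced_subsets(group_a, group_b, repeats=3, seed=None):
    """Return `repeats` pairs of (smaller group, equal-sized sample of larger).

    Each sample is drawn without replacement from the larger group, so both
    members of a pair contain the same number of proteins.
    """
    small, large = sorted((list(group_a), list(group_b)), key=len)
    rng = random.Random(seed)
    return [(small, rng.sample(large, len(small))) for _ in range(repeats)]


# Hypothetical protein identifiers standing in for two expression groups.
expressed = [f"p{i}" for i in range(10)]
failed = [f"q{i}" for i in range(4)]
subsets = balanced_subsets(expressed, failed, repeats=3, seed=1)
```

Each of the three pairs would then be used to fit one tree, which gives a simple check of whether the same decision variables recur across different random draws.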
The constant part of the protein (HisZZ on the amino-terminal end and Strep-tag on the carboxyl-terminal end) was in total 170 aa long with a molecular mass of 19.45 kDa. The final HZS vectors containing the inserts were confirmed to be correct by observing the expected fragments on agarose gel after restriction enzyme digestion. The proteins were classified into one of five groups: I, no visible bands; II, faint bands with correct size; III, faint bands with wrong size; IV, strong bands with correct size; and V, strong bands with wrong size (Fig. 3). Classification into faint/strong was performed visually on a scanned gel image. Gray bands were labeled faint, and black bands were labeled strong. In 77% of the proteins a band was visible on the gel, and overall in 58.5% of the proteins th

Reference(s)