Article Open access Peer-reviewed

Motif Decomposition of the Phosphotyrosine Proteome Reveals a New N-terminal Binding Motif for SHIP2

2007; Elsevier BV; Volume: 7; Issue: 1; Language: English

10.1074/mcp.m700241-mcp200

ISSN

1535-9484

Authors

Martin L. Miller, S. Hanke, Anders M. Hinsby, Carsten Friis, Søren Brunak, Matthias Mann, Nikolaj Blom

Topic(s)

Glycosylation and Glycoproteins Research

Abstract

Advances in mass spectrometry-based proteomics have yielded a substantial mapping of the tyrosine phosphoproteome and thus provided an important step toward a systematic analysis of intracellular signaling networks in higher eukaryotes. In this study we decomposed an uncharacterized proteomics data set of 481 unique phosphotyrosine (Tyr(P)) peptides by sequence similarity to known ligands of the Src homology 2 (SH2) and the phosphotyrosine binding (PTB) domains. From 20 clusters we extracted 16 known and four new interaction motifs. Using quantitative mass spectrometry we pulled down Tyr(P)-specific binding partners for peptides corresponding to the extracted motifs. We confirmed numerous previously known interaction motifs and found 15 new interactions mediated by phosphosites not previously known to bind SH2 or PTB. Remarkably, a novel hydrophobic N-terminal motif ((L/V/I)(L/V/I)pY) was identified and validated as a binding motif for the SH2 domain-containing inositol phosphatase SHIP2. Our decomposition of the in vivo Tyr(P) proteome furthermore suggests that two-thirds of the Tyr(P) sites mediate interaction, whereas the remaining third govern processes such as enzyme activation and nucleic acid binding.

The abbreviations used are: PTB, phosphotyrosine binding; PAM, partitioning around medoids; ITIM, immunoreceptor tyrosine-based inhibition motif; SH2, Src homology 2; GO, Gene Ontology; IPI, International Protein Index; zf-C2H2, Cys2-His2 zinc finger protein; N-WASP, neural Wiskott-Aldrich syndrome protein; PI3K, phosphatidylinositol 3-kinase; GAP, GTPase-activating protein; STAT, signal transducers and activators of transcription; PIP3, phosphatidylinositol 3,4,5-trisphosphate.

Phosphorylation-dependent protein-protein interaction is one of the key organizing principles in intracellular signaling events. The phosphotyrosine binding (PTB) domain and the Src homology 2 (SH2) domain are modular domains that typically bind phosphotyrosine (Tyr(P))-containing peptides (1. Yaffe M.B. Phosphotyrosine-binding domains in signal transduction. Nat. Rev. Mol. Cell Biol. 2002; 3: 177-186, 2. Pawson T.
Specificity in signal transduction: from phosphotyrosine-SH2 domain interactions to complex cellular systems. Cell. 2004; 116: 191-203). "Linear motifs" (unstructured sequence recognition patches with conserved residues at specific positions (3. Bork P. Koonin E.V. Protein sequence motifs. Curr. Opin. Struct. Biol. 1996; 6: 366-376)) that direct Tyr(P)-dependent interaction have traditionally been studied using degenerate oriented peptide libraries. Such studies revealed that PTB and SH2 domains have a preference for specific amino acids N- and C-terminal to the Tyr(P) residue, respectively (4. Songyang Z. Shoelson S.E. McGlade J. Olivier P. Pawson T. Bustelo X.R. Barbacid M. Sabe H. Hanafusa H. Yi T. Specific motifs recognized by the SH2 domains of Csk, 3BP2, fps/fes, GRB-2, HCP, SHC, Syk, and Vav. Mol. Cell. Biol. 1994; 14: 2777-2785, 5. Songyang Z. Margolis B. Chaudhuri M. Shoelson S.E. Cantley L.C. The phosphotyrosine interaction domain of SHC recognizes tyrosine-phosphorylated NPXY motif. J. Biol. Chem. 1995; 270: 14863-14866). Recent methodological developments in MS-based proteomics have made it possible to identify hundreds to thousands of protein phosphorylation sites in a single project (6. Ballif B.A. Villen J. Beausoleil S.A. Schwartz D. Gygi S.P. Phosphoproteomic analysis of the developing mouse brain. Mol. Cell. Proteomics. 2004; 3: 1093-1101, 7. Beausoleil S.A. Jedrychowski M. Schwartz D. Elias J.E. Villen J. Li J. Cohn M.A. Cantley L.C. Gygi S.P. Large-scale characterization of HeLa cell nuclear phosphoproteins. Proc. Natl. Acad. Sci. U. S. A. 2004; 101: 12130-12135, 8. Brill L.M. Salomon A.R. Ficarro S.B. Mukherji M. Stettler-Gill M.
Peters E.C. Robust phosphoproteomic profiling of tyrosine phosphorylation sites from human T cells using immobilized metal affinity chromatography and tandem mass spectrometry. Anal. Chem. 2004; 76: 2763-2772, 9. Ficarro S. Chertihin O. Westbrook V.A. White F. Jayes F. Kalab P. Marto J.A. Shabanowitz J. Herr J.C. Hunt D.F. Visconti P.E. Phosphoproteome analysis of capacitated human sperm. Evidence of tyrosine phosphorylation of a kinase-anchoring protein 3 and valosin-containing protein/p97 during capacitation. J. Biol. Chem. 2003; 278: 11579-11589, 10. Ficarro S.B. Salomon A.R. Brill L.M. Mason D.E. Stettler-Gill M. Brock A. Peters E.C. Automated immobilized metal affinity chromatography/nano-liquid chromatography/electrospray ionization mass spectrometry platform for profiling protein phosphorylation sites. Rapid Commun. Mass Spectrom. 2005; 19: 57-71, 11. Olsen J.V. Blagoev B. Gnad F. Macek B. Kumar C. Mortensen P. Mann M. Global, in vivo, and site-specific phosphorylation dynamics in signaling networks. Cell. 2006; 127: 635-648, 12. Rush J. Moritz A. Lee K.A. Guo A. Goss V.L. Spek E.J. Zhang H. Zha X.M. Polakiewicz R.D. Comb M.J. Immunoaffinity profiling of tyrosine phosphorylation in cancer cells. Nat. Biotechnol. 2005; 23: 94-101, 13. Salomon A.R. Ficarro S.B. Brill L.M. Brinker A. Phung Q.T. Ericson C. Sauer K. Brock A. Horn D.M. Schultz P.G. Peters E.C. Profiling of tyrosine phosphorylation pathways in human cells using mass spectrometry. Proc. Natl. Acad. Sci. U. S. A. 2003; 100: 443-448, 14. Zheng H. Hu P. Quinn D.F. Wang Y.K.
Phosphotyrosine proteomic study of interferon α signaling pathway using a combination of immunoprecipitation and immobilized metal affinity chromatography. Mol. Cell. Proteomics. 2005; 4: 721-730). Extensive mapping of the phosphoproteome is an important step toward analyzing the regulatory components of the cell. Because the majority of newly identified phosphopeptides are uncharacterized with respect to signaling context, there is now a unique opportunity to mine the phosphoproteome for novel phosphorylation motifs. Methods have been developed that successfully mine for overrepresented motifs from large protein data sets in general (15. Jonassen I. Collins J.F. Higgins D.G. Finding flexible patterns in unaligned protein sequences. Protein Sci. 1995; 4: 1587-1595, 16. Nevill-Manning C.G. Wu T.D. Brutlag D.L. Highly specific protein sequence motifs for genome analysis. Proc. Natl. Acad. Sci. U. S. A. 1998; 95: 5865-5871, 17. Rigoutsos I. Floratos A. Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm. Bioinformatics. 1998; 14: 55-67) and more recently also from phosphoproteomics data sets (18. Schwartz D. Gygi S.P. An iterative statistical approach to the identification of protein phosphorylation motifs from large-scale data sets. Nat. Biotechnol. 2005; 23: 1391-1398). However, these methods do not partition the data set into smaller subsets with high sequence similarity prior to motif extraction. Sequence patches flanking the motif also govern phosphorylation-dependent recognition (19. Forman-Kay J.D. Pawson T. Diversity in protein recognition by PTB domains. Curr. Opin. Struct. Biol.
1999; 9: 690-695); consequently there is a risk of extracting false positive motifs from functionally unrelated peptides. Furthermore, the above-mentioned methods are in silico approaches and do not combine prediction with experimental validation. To overcome such limitations in the area of Tyr(P) motif discovery and classification, one may partition the data set into smaller subsets, e.g. by sequence similarity with known kinase or binding substrates, prior to motif extraction. Thus, the risk of retrieving false positive motifs is minimized because overrepresented motifs are extracted from peptides closely related in sequence and function. Besides Tyr(P) recognition motifs for kinases and interaction domains, there may also be Tyr(P) motifs that mediate processes other than binding, such as enzyme activation and nucleic acid binding. Thus, it is essential to validate the extracted motifs by both experimental and bioinformatic means to obtain a functional classification.

With this in mind we developed a motif extraction and validation methodology and classified Tyr(P) motifs on a proteome level. Operating in sequence space, we stretched the MS-mapped Tyr(P) peptides over a backbone of ligands already known to be involved in Tyr(P)-dependent interaction. Using experimentally verified Tyr(P) ligands of the PTB and SH2 domains both as a clustering backbone and as a control for the partitioning, we split a literature-extracted data set of mammalian Tyr(P) peptides into 20 different clusters. We obtained a meaningful clustering because the controls partitioned correctly into separate clusters. From the 20 clusters we extracted both known and unknown phosphorylation motifs, and peptides matching these motifs were synthesized and assayed for phosphorylation-specific interaction partners using a peptide pulldown assay based on quantitative proteomics (20. Schulze W.X. Mann M.
A novel proteomic screen for peptide-protein interactions. J. Biol. Chem. 2004; 279: 10756-10764). In contrast to the oriented peptide library approach that uses artificial degenerate peptides, we used naturally occurring peptides as baits to pull down binding partners from the cell lysate. Moreover, because the interaction partners are in competition for binding, mimicking the in vivo binding situation, the risk of finding kinetically unfavorable interaction motifs is minimized. Finally, this technique can potentially identify new types of domains with modification-specific binding capability.

Using the pulldown assay we identified the expected binding partners for numerous known C-terminally directed SH2 domain motifs. We also found 15 new phosphorylation-dependent interactions mediated by phosphosites not previously shown to direct interaction. Surprisingly, we identified a new N-terminal hydrophobic motif ((L/V/I)(L/V/I)pY where pY is phosphotyrosine) for the SH2 domain-containing inositol phosphatase SHIP2. The specificity of the motif was confirmed by mutational analysis. This motif is N-terminally directed, in contrast to previous observations showing that binding of prototypical SH2 domains is directed by C-terminal recognition (21. Songyang Z. Cantley L.C. Recognition and specificity in protein tyrosine kinase-mediated signalling. Trends Biochem. Sci. 1995; 20: 470-475). Until now the only other known SH2 domain binding motif that is partly directed by N-terminal recognition is the immunoreceptor tyrosine-based inhibition motif (ITIM) (I/L/V)XpYXX(I/L/V) (22. Vely F. Vivier E. Conservation of structural features reveals the existence of a large family of inhibitory cell surface receptors and noninhibitory/activatory counterparts. J. Immunol. 1997; 159: 2075-2077, 23. Burshtyn D.N. Yang W. Yi T.
Long E.O. A novel phosphotyrosine motif with a critical amino acid at position −2 for the SH2 domain-mediated activation of the tyrosine phosphatase SHP-1. J. Biol. Chem. 1997; 272: 13066-13072). On a proteome level we analyzed which Gene Ontology (GO) categories were overrepresented in proteins matching the extracted motifs. We found that motifs that mediate interaction in the pulldown assay are typically found in proteins involved in signal transduction, whereas non-binding motifs are found in enzymes and ion- and nucleic acid-binding proteins. Thus, we estimate that one-third of the in vivo Tyr(P) sites are not directly involved in interaction via domains such as SH2 and PTB but rather are sites that could alter the catalytic activity of enzymes or modulate the DNA binding affinity of e.g. transcription factors.

Large scale data sets of tyrosine phosphorylation sites mapped in MS/MS experiments with mammalian cell lines were collected from the literature (8, 10, 12, 13, 14, 24. Hinsby A.M. Olsen J.V. Bennett K.L. Mann M. Signaling initiated by overexpression of the fibroblast growth factor receptor-1 investigated by mass spectrometry. Mol. Cell. Proteomics. 2003; 2: 29-36, 25. Hinsby A.M. Olsen J.V. Mann M. Tyrosine phosphoproteomics of fibroblast growth factor signaling: a role for insulin receptor substrate-4. J. Biol. Chem. 2004; 279: 46438-46447), yielding a total of 847 tyrosine phosphorylation sites. To filter out phosphopeptides from closely related homologs and orthologs, only unique 13-mer peptide sequences with the Tyr(P) centrally positioned were considered. This reduced the MS-based data set to 481 phosphopeptides distributed in 380 proteins. Furthermore, 162 experimentally verified Tyr(P) peptide ligands of one PTB domain and 10 different SH2 domains were extracted from the Phospho.ELM database (26. Diella F. Cameron S. Gemund C. Linding R. Via A. Kuster B. Sicheritz-Ponten T. Blom N. Gibson T.J. Phospho.ELM: a database of experimentally verified phosphorylation sites in eukaryotic proteins. BMC Bioinformatics. 2004; 5: 79). The 162 peptides were included in the data set as positive controls, resulting in a data set of 643 Tyr(P) peptides (see supplemental Table 3).
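The filtering step above (unique 13-mers with the Tyr(P) in the center) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the 1-based site convention, and the "-" padding for sites near a terminus are hypothetical choices, since the text does not say how terminal sites were handled.

```python
from typing import Iterable, List, Optional, Tuple

FLANK = 6  # six residues on each side of the central tyrosine gives a 13-mer


def centered_13mer(sequence: str, site: int, pad: str = "-") -> Optional[str]:
    """Return the 13-residue window centered on a 1-based tyrosine position.

    Windows that run past either terminus are padded with `pad`
    (an assumption made for this sketch, not a detail from the source).
    """
    index = site - 1
    if index < 0 or index >= len(sequence) or sequence[index] != "Y":
        return None  # the reported site is not a tyrosine in this sequence
    left = sequence[max(0, index - FLANK):index].rjust(FLANK, pad)
    right = sequence[index + 1:index + 1 + FLANK].ljust(FLANK, pad)
    return left + "Y" + right


def unique_windows(sites: Iterable[Tuple[str, int]]) -> List[str]:
    """Collapse (sequence, site) pairs to unique 13-mers in first-seen order."""
    seen = set()
    windows = []
    for sequence, site in sites:
        window = centered_13mer(sequence, site)
        if window is not None and window not in seen:
            seen.add(window)
            windows.append(window)
    return windows
```

Two orthologous sites that share the same 13 residues around the tyrosine collapse to a single entry, which is the effect the filtering is described as having.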
The criteria for selecting the positive controls were the existence of a consensus binding motif and the availability of a sufficient number of examples. 13-mers of the 162 phosphopeptide ligands of the 11 respective PTB and SH2 domains (see Table I) were used to create 11 weight matrices using the weight matrix mode of EasyGibbs 1.0 (27. Nielsen M. Lundegaard C. Worning P. Hvid C.S. Lamberth K. Buus S. Brunak S. Lund O. Improved prediction of MHC class I and class II epitopes using a novel Gibbs sampling approach. Bioinformatics. 2004; 20: 1388-1397). Default settings were used except that the motif length was set to 13, fixed around the central Tyr(P) residue. Subsequently all phosphopeptides in the MS-based data set (481) and the positive control data set (162) were scored by each of the 11 weight matrices, and thus each phosphopeptide could be represented as a vector of the 11 weight matrix scores.

Table I. Clustering and motif extraction of the Tyr(P) proteome

Cluster | Size | Positive controls | Extracted motifs | Matched motifs, expected partner | PubMed ID | Peptides synthesized | Identified partners
1 | 41 | PTB 10 of 10 | NPXpYX(S/T) 7 of 7 | SHC PTB (NPXpY) | 7542744 7541030 | KEVCDGWSLPNPEpYYTLRYA ELMO2_MOUSE, 48 | SHC
2 | 36 | Crk SH2 8 of 10 | (I/L/M/V)pYX(I/L/M/V)P 8 of 14 | Crk/RasGAP SH2 (pYXXP) | 9233798 11607838 | KPSTDPLpYDTPDTRG RIN1_MOUSE, 35 | RasGAP Nck1
3 | 29 | Vav SH2 4 of 4 | pYESPXX(D/E) 5 of 5 | Vav SH2 (pYESP) | 9151714 | TETKTITpYESPQIDG E41L2_MOUSE, 889 | None
4 | 28 | - | (D/E)XXX(I/L/V)(I/L/V)pY 4 of 6 | New motif | - | RETSKVIpYDFIEKTG WASL_MOUSE, 253 | SHIP2
5 | 20 | - | (D/E)(D/E)XXXpYXN 4 of 6 | Grb2 SH2 (pYXN) | 11994738 | VYDEDSPpYQNIKILH SPSY_MOUSE, 147 | Grb2 RasGAP
6 | 41 | Grb2 SH2 14 of 31 | pYXN(I/L/M/V)XXL 5 of 7 | Grb2 SH2 (pYXN) | 11994738 | ELFDDPSpYVNIQNLD SHC1_MOUSE, 423 | Grb2
7 | 21 | Grb2 SH2 6 of 31 | (D/E)pYXN(I/L/M/V) 4 of 11 | Grb2 SH2 (pYXN) | 11994738 | QPASVTDpYQNVSFSN ITSN2_HUMAN, 858 | Grb2
8 | 45 | - | pY(I/L/M/V)XMXP 4 of 10 | p85-PI3K SH2 (pYXXM) | 7511210 11994738 | PQRVDPNGpYMMMSPS IRS1_MOUSE, 658 | p85-PI3Kα p85-PI3Kβ
9 | 40 | - | pY(D/E)X(I/L/M/V)X(I/L/M/V) 5 of 22 | Fps/Fes SH2 (pY(E/D)X(I/V)) | 7511210 | AGKQKLQpYEGIFIKD SF3A1_MOUSE, 757 | None
10 | 32 | - | (D/E)XXpY(D/E)X(I/L/M/V) 7 of 27 | Fps/Fes SH2 (pY(E/D)X(I/V)) | 7511210 | DGGSDQNpYDIVTIGA INP4A_HUMAN, 355 | None
11 | 35 | PI3K SH2 16 of 24 | pY(I/L/M/V)PMXP 6 of 7 | p85 PI3K SH2 (pYXXM) | 7511210 11994738 | NLHTDDGpYMPMSPGV IRS1_MOUSE, 608 | p85-PI3Kα
12 | 23 | - | DpY(I/L/M/V)X(I/L/M/V) 7 of 18 | SHP2 SH2 (pY(I/V/L)X(I/V/L)) | 7680959 | DLINRMDpYVEINIDH VIGLN_MOUSE, 437 | SHP2
13 | 28 | SHP2 SH2 7 of 12 | (I/L/M/V)XpY(I/L/M/V)X(I/L/M/V)D 6 of 7 | SHP2 SH2 (pY(I/V/L)X(I/V/L)) | 7680959 | DIKEKLCpYVALDFEQ ACTB_MOUSE, 218 | SHP2
14 | 32 | PLCγ SH2 5 of 16 | (I/L/M/V)pYXX(I/L/M/V)(I/L/M/V) 5 of 11 | General/SHP2 SH2 (pY(I/V/L)X(I/V/L)) | 7511210 7680959 | GKSKQPLpYSSIVTVE O88185_MOUSE, 948 | SHP2 SHIP2
15 | 29 | RasGAP SH2 7 of 8 | (A/G)(I/L/M/V)pYXXP 6 of 10 | Crk/RasGAP SH2 (pYXXP) | 9233798 11607838 | GVVDSGVpYAVPPPAE BCAR1_HUMAN, 410 | RasGAP
16 | 37 | SHC SH2 9 of 13 | PXEpYXXXXX(I/L/M/V) 3 of 3 | New motif | - | TTEAPGEYFFSDGVR IMDH1_MOUSE, 400 | None
17 | 25 | Src SH2 7 of 14 | pY(D/E)X(I/L/M/V)H 4 of 6 | Fps/Fes SH2 (pY(E/D)X(I/V)) | 7511210 | ELTAEFLpYDEVHPKQ TWF2_MOUSE, 309 | RasGAP
18 | 32 | STAT SH2 15 of 19 | pY(I/L/M/V)PQ 4 of 4 | STAT SH2 (pYXXQ) | 14966128 | SGENFVPpYMPQFQTC LEPR_MOUSE, 1138 | None
19 | 33 | - | H(S/T)GXKPpYXCXXCG 10 of 10 | New motif | - | RIHTGEKPpYECVQCGK ZNF24_MOUSE, 335 | None
20 | 36 | - | (A/G)XpYXX(I/L/M/V)X(K/R) 8 of 15 | New motif | - | KKNRIAIpYELLFKEG RS10_MOUSE, 12 | None

A matrix consisting of the 11 weight matrix scores and the 643 phosphopeptides was generated and subsequently clustered by the PAM method (28. Kaufman L. Rousseeuw P.J. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley, New York. 1990) using the cluster package in R. The PAM algorithm is a robust version of k-means, and it searches for a specified number of medoids (representatives), k, around which clusters are constructed. The clusters are generated by assigning each observation to its closest medoid so that the sum of the dissimilarities between observations and their medoids is minimized.
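To make the medoid idea concrete, here is a small pure-Python sketch of partitioning around medoids on score vectors. It is not the R cluster package used in the study: the function names are hypothetical, the swap phase accepts the first improving exchange rather than the best one, and the column z-scoring helper is included only because the text mentions standardizing parameters with different numeric ranges.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two score vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def zscore_columns(rows: List[Vector]) -> List[List[float]]:
    """Rescale every column to mean 0 and population standard deviation 1."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    # A constant column has sd 0; fall back to 1.0 to avoid dividing by zero.
    sds = [math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
           for col, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def total_cost(dist: List[List[float]], medoids: List[int]) -> float:
    """Sum over all observations of the distance to the closest medoid."""
    return sum(min(row[m] for m in medoids) for row in dist)


def pam(rows: List[Vector], k: int) -> Tuple[List[int], List[int]]:
    """Return (medoid row indices, cluster label per row) for k clusters."""
    n = len(rows)
    dist = [[euclidean(rows[i], rows[j]) for j in range(n)] for i in range(n)]
    # BUILD: greedily add the observation that lowers the total cost the most.
    medoids: List[int] = []
    for _ in range(k):
        candidates = [i for i in range(n) if i not in medoids]
        medoids.append(min(candidates,
                           key=lambda c: total_cost(dist, medoids + [c])))
    # SWAP: exchange a medoid with a non-medoid while that lowers the cost.
    cost = total_cost(dist, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                trial_cost = total_cost(dist, trial)
                if trial_cost < cost - 1e-12:
                    medoids, cost, improved = trial, trial_cost, True
    # Final assignment: each observation joins its closest medoid.
    labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
    return medoids, labels
```

In the setting described above, `rows` would hold one 11-element score vector per phosphopeptide (643 rows) and `k` would be the number of clusters under evaluation; the full pairwise distance matrix is affordable at that size.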
Using a hypergeometric test (see "Statistics") the optimal number of clusters (k = 20) was inferred because this resulted in the best partitioning of the positive controls. We use z-scores, i.e. multiples of standard deviations from the mean, to account for the different numeric ranges of the measured parameters. The choice of an appropriate clustering algorithm is a complex one because no given algorithm is universally superior (29. Fraley C. Raftery A.E. How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput. J. 1998; 41: 578-588, 30. Jain A.K. Murty M.N. Flynn P.J. Data clustering: a review. ACM Comput. Surv. 1999; 31: 264-323). Rather, the best choice will depend on the data set and in particular on what constitutes a good distance measure for it. Another relevant concern is the desired outcome and whether a hierarchical or partitional result is preferable. Many sophisticated methods exist that are capable of automatically determining the number of "natural clusters" in the data, such as the popular density-based clustering algorithms that can describe very complex non-circular relations in the data (31. Ester M. Kriegel H.P. Sander S. Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. in: Simoudis E. Han J. Fayyad U. Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining, Portland, August 2–4, 1996. 1. Association for the Advancement of Artificial Intelligence (AAAI) Press, Menlo Park, CA. 1996: 226-231). It is, however, not clear whether the ability to recognize non-circular structures in the data is beneficial in this case. Proteins that share the same features are likely to be related and will form a circular relation in feature space.
On the other hand, an elongated cluster in feature space will contain proteins that share only some features but not others, and the biological implications thereof can be quite diverse. Besides being computationally efficient and easy to implement, the PAM algorithm was selected because it satisfies the need for a robust clustering algorithm and because its reliance on a Euclidean distance measure ensures that the result can be easily interpreted. The primary weakness of PAM is the need to arbitrarily select a number of clusters for the data, which in this case is overcome by the mentioned application of the hypergeometric test.

Weight matrices of the peptides in the 20 clusters were made using positional weighting of the three residues flanking the central Tyr(P) residue (27) and used to calculate distance matrices as described previously (32. Lund O. Nielsen M. Kesmir C. Petersen A.G. Lundegaard C. Worning P. Sylvester-Hvid C. Lamberth K. Roder G. Justesen S. Buus S. Brunak S. Definition of supertypes for HLA molecules using clustering of specificity matrices. Immunogenetics. 2004; 55: 797-810). The distance matrices were used as input to the program neighbor from version 3.5 of PHYLIP (Phylogeny Inference Package). To estimate the significance of the neighbor-joining clustering we used the bootstrap method and estimated the consensus tree by bootstrapping for 1000 repetitions as described earlier (32).

The frequencies of amino acids at particular positions in each cluster were calculated, and subsequently sequence logo plots were used for graphic visualization (33. Schneider T.D. Stephens R.M. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 1990; 18: 6097-6100). Each position in the aligned sequences corresponds to a column in the logo plot. The height of the column represents the degree of conservation at that position, whereas the height of the individual letters is proportional to the relative frequency of this amino acid residue. The maximal height of the column for the 20-amino acid alphabet is log2(20) = 4.32 bits.

The phosphomotifs in each of the 20 clusters were found using the publicly available TEIRESIAS pattern discovery tool from IBM Bioinformatics (17). Parameters were set so the extracted motifs were within a window of 13 residues centered on the phosphoresidue. The minimal number of literals in the motif was set to 4, and the amino acids were grouped according to their chemical nature (Ala/Gly, Asp/Glu, Phe/Tyr, Lys/Arg, Ile/Leu/Met/Val, Gln/Asn, Ser/Thr, Pro, Trp, His, and Cys (17)). For each of the 20 clusters the most abundant motif was selected, and subsequently one peptide matching the motif was chosen from the respective cluster. Because multiple peptides in each cluster matched the extracted motif, peptides from mouse and peptides not previously known to be involved in phosphorylation-dependent interaction were preferred.
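The column heights described for the logo plots follow from a short calculation: the information content of a column is log2(20) minus its Shannon entropy, and each letter receives a share proportional to its frequency. The sketch below illustrates that calculation only; the function names are hypothetical, and the small-sample correction used by some logo tools is deliberately omitted.

```python
import math
from collections import Counter
from typing import Dict, List

ALPHABET_SIZE = 20
MAX_BITS = math.log2(ALPHABET_SIZE)  # about 4.32 bits for 20 amino acids


def column_information(peptides: List[str], position: int) -> float:
    """Information content in bits of one aligned column: log2(20) - entropy."""
    counts = Counter(p[position] for p in peptides)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return MAX_BITS - entropy


def letter_heights(peptides: List[str], position: int) -> Dict[str, float]:
    """Split the column height among residues in proportion to frequency."""
    counts = Counter(p[position] for p in peptides)
    total = sum(counts.values())
    info = column_information(peptides, position)
    return {residue: info * c / total for residue, c in counts.items()}
```

A fully conserved column, such as the central tyrosine of aligned 13-mers, reaches the maximum of log2(20) bits, while a column split evenly between two residues loses exactly one bit.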
In the few cases (three) where mouse sequences could not be obtained, human peptides with high homology to mouse were chosen.

Gene Ontology categories were obtained from Gene Ontology Annotation mouse database version 29.0. The extracted motifs were matched to proteins in the International Protein Index (IPI) mouse proteome version 3.20. Using a hypergeometric test (see "Statistics") with the total proteome as background we found the 10 most overrepresented GO terms in retrieved proteins. The hits were inspected manually, and the consensus GO term was assessed for each motif. For the purpose of the hypergeometric test, each annotated GO category was taken to include all of its ancestral terms to avoid problems with diverging levels of annotation.

To determine whether the positive controls were significantly overrepresented in specific clusters compared with the whole data set, hypergeometric sampling without replacement (34. Johnson L.N. Kotz S. Kemp A.W. Univariate Discrete Distributions. 2nd Ed. Wiley-Interscience, New York. 1992) was performed. The hypergeometric test is a statistical test used to describe the arbitrariness of a sampling without replacement from a background of true or false examples. The probability (p) of observing a given or more extreme situation by pure coincidence is obtained from the hypergeometric distribution,

$$P(X = x \mid N, M, K) = \frac{\binom{M}{x}\binom{N-M}{K-x}}{\binom{N}{K}} \qquad \text{(Eq. 1)}$$

where N is the total number of peptides, M is the number of peptides in the given set, K is the number of peptides in a particular cluster, and x is the number of the K peptides in the cluster that belong to M. A Bonferroni correction was performed to correct for multiple comparisons.
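As a worked illustration of Eq. 1, the sketch below evaluates the point probability, sums it over the "given or more extreme" outcomes to obtain a one-sided p value, and applies a Bonferroni correction. The function names are hypothetical, and treating "more extreme" as the upper tail (x or more hits) is an assumption suited to an overrepresentation test.

```python
from math import comb


def hypergeom_pmf(x: int, N: int, M: int, K: int) -> float:
    """Eq. 1: probability of exactly x members of the set among K draws."""
    return comb(M, x) * comb(N - M, K - x) / comb(N, K)


def hypergeom_tail(x: int, N: int, M: int, K: int) -> float:
    """One-sided p value: probability of observing x or more members."""
    upper = min(M, K)  # cannot draw more members than exist or than drawn
    return sum(hypergeom_pmf(i, N, M, K) for i in range(x, upper + 1))


def bonferroni(p: float, n_tests: int) -> float:
    """Bonferroni-adjusted p value, capped at 1."""
    return min(1.0, p * n_tests)
```

Reading the first row of Table I as x = 10 of the M = 10 PTB controls falling in a cluster of K = 41 out of N = 643 peptides, the call would be `hypergeom_tail(10, 643, 10, 41)`, with the result multiplied by the number of comparisons performed.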
In the case of GO analysis, we performed the test once for each GO category present in the data and evaluated the probability of sampling the set of retrieved proteins from the background of the total proteome by mere chance, considering a protein "true" or "false" depending on whether it had been assigned the category in question. The end result of this test was one p value for each GO category, describing the degree of overrepresentation of that particular assignment in the retrieved set against the background of the entire proteome.

Mouse C2C12 muscle cells were grown in arginine- and lysine-deficient Dulbecco's modified Eagle's medium with 10% dialyzed fetal bovine serum for at least five passages and then switched to 2% dialyzed fetal bovine serum to differentiate the cells for 8 days. In accordance with the stable isotope labeling by amino acids in cell culture (SILAC) procedure, one cell population was supplemented with normal isotopic abundance L-arginine (Sigma) and L-lysine, and the other was supplemented with >99% isotopic a
