Article Open access Peer-reviewed

Template Proteogenomics: Sequencing Whole Proteins Using an Imperfect Database

2010; Elsevier BV; Volume: 9; Issue: 6; Language: English

10.1074/mcp.m900504-mcp200

ISSN

1535-9484

Authors

Natalie Castellana, Victoria C. Pham, David Arnott, Jennie R. Lill, Vineet Bafna

Topic(s)

Mass Spectrometry Techniques and Applications

Abstract

Database search algorithms are the primary workhorses for the identification of tandem mass spectra. However, these methods are limited to the identification of spectra for which peptides are present in the database, preventing the identification of peptides from mutated or alternatively spliced sequences. A variety of methods has been developed to search a spectrum against a sequence allowing for variations. Some tools determine the sequence of the homologous protein in the related species but do not report the peptide in the target organism. Other tools consider variations, including modifications and mutations, in reconstructing the target sequence. However, these tools will not work if the template (homologous peptide) is missing in the database, and they do not attempt to reconstruct the entire protein target sequence. De novo identification of peptide sequences is another possibility, because it does not require a protein database. However, the lack of database reduces the accuracy. We present a novel proteogenomic approach, GenoMS, that draws on the strengths of database and de novo peptide identification methods. Protein sequence templates (i.e. proteins or genomic sequences that are similar to the target protein) are identified using the database search tool InsPecT. The templates are then used to recruit, align, and de novo sequence regions of the target protein that have diverged from the database or are missing. We used GenoMS to reconstruct the full sequence of an antibody by using spectra acquired from multiple digests using different proteases. Antibodies are a prime example of proteins that confound standard database identification techniques. The mature antibody genes result from large-scale genome rearrangements with flexible fusion boundaries and somatic hypermutation. Using GenoMS we automatically reconstruct the complete sequences of two immunoglobulin chains with accuracy greater than 98% using a diverged protein database. 
Using the genome as the template, we achieve accuracy exceeding 97%.

Database search algorithms, such as Sequest (1. Eng J. McCormack A. J. Y. III, An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein data base. J. Am. Soc. Mass Spectrom. 1994; 5: 976-989), Mascot (2. Perkins D.N. Pappin D.J. Creasy D.M. Cottrell J.S. Probability-based protein identification by searching sequence data bases using mass spectrometry data. Electrophoresis. 1999; 20: 3551-3567), and InsPecT (3. Tanner S. Shu H. Frank A. Wang L.C. Zandi E. Mumby M. Pevzner P.A. Bafna V. InsPecT: identification of posttranslationally modified peptides from tandem mass spectra. Anal. Chem. 2005; 77: 4626-4639), are the primary workhorses for the identification of tandem mass spectra. However, these methods are limited to the identification of spectra for which peptides are present in the database. It is well recognized that curated protein databases are, at best, an imperfect template for the extant peptides. For example, peptides arising from novel splice forms or fusion proteins would be difficult to identify using most protein databases. Recent developments have extended the identifications to peptides that have diverged from the database entry. By allowing divergence, the methods enable the identification of small-scale mutations and post-translational modifications, albeit with some loss of sensitivity (4. Shevchenko A. Sunyaev S. Loboda A. Shevchenko A. Bork P. Ens W. Standing K.G. Charting the proteomes of organisms with unsequenced genomes by MALDI-quadrupole time-of-flight mass spectrometry and BLAST homology searching. Anal. Chem. 
2001; 73: 1917-1926; 5. Tsur D. Tanner S. Zandi E. Bafna V. Pevzner P.A. Identification of post-translational modifications by blind search of mass spectra. Nat. Biotechnol. 2005; 23: 1562-1567; 6. Han Y. Ma B. Zhang K. SPIDER: software for protein identification from sequence tags with de novo sequencing error. J. Bioinform. Comput. Biol. 2005; 3: 697-716; 7. Searle B.C. Dasari S. Wilmarth P.A. Turner M. Reddy A.P. David L.L. Nagalla S.R. Identification of protein modifications using MS/MS de novo sequencing and the OpenSea alignment algorithm. J. Proteome Res. 2005; 4: 546-554). Among these tools, MS-Blast is able to determine a homologous protein in the related species but does not report the (diverged) protein in the target organism. The other tools consider variations, including modifications and mutations, in reconstructing the target sequence. However, these tools will not work if the template (homologous peptide) is missing in the database or comes from a novel splice form. In addition, these tools do not attempt to reconstruct the entire protein target sequence. De novo identification of peptide sequences (8. Frank A. Pevzner P. PepNovo: de novo peptide sequencing via probabilistic network modeling. Anal. Chem. 2005; 77: 964-973; 9. Ma B. Zhang K. Hendrie C. Liang C. Li M. Doherty-Kirby A. Lajoie G. PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Commun. Mass Spectrom. 2003; 17: 2337-2342) is another possibility and does not require a protein database. However, these methods are prone to error.

The issue of discovering spliced peptides (more generally, eukaryotic gene structures) has been investigated using a combination of approaches, loosely termed proteogenomics. 
Often, these approaches start by creating specialized databases of splice forms, combining evidence from protein (e.g. NCBI nr (10. Benson D.A. Karsch-Mizrachi I. Lipman D.J. Ostell J. Wheeler D.L. GenBank. Nucleic Acids Res. 2008; 36: D25-D30)) and cDNA sequencing (11. Boguski M.S. Lowe T.M. Tolstoshev C.M. dbEST–data base for "expressed sequence tags". Nat. Genet. 1993; 4: 332-333; 12. Fermin D. Allen B.B. Blackwell T.W. Menon R. Adamski M. Xu Y. Ulintz P. Omenn G.S. States D.J. Novel gene and gene model detection using a whole genome open reading frame analysis in proteomics. Genome Biol. 2006; 7: R35; 13. Menon R. Zhang Q. Zhang Y. Fermin D. Bardeesy N. DePinho R.A. Lu C. Hanash S.M. Omenn G.S. States D.J. Identification of novel alternative splice isoforms of circulating proteins in a mouse model of human pancreatic cancer. Cancer Res. 2009; 69: 300-309). To discover novel splicing events, the tools also search databases derived directly from the genome such as a six-frame translation or a compact encoding of multiple putative splicing events (14. Baerenfaller K. Grossmann J. Grobei M.A. Hull R. Hirsch-Hoffmann M. Yalovsky S. Zimmermann P. Grossniklaus U. Gruissem W. Baginsky S. Genome-scale proteomics reveals Arabidopsis thaliana gene models and proteome dynamics. Science. 2008; 320: 938-941; 15. Castellana N.E. Payne S.H. Shen Z. Stanke M. Bafna V. Briggs S.P. Discovery and revision of Arabidopsis genes by proteogenomics. Proc. Natl. Acad. Sci. U.S.A. 2008; 105: 21034-21038; 16. Tanner S. Shen Z. Ng J. Florea L. Guigó R. Briggs S.P. Bafna V. Improving gene annotation using peptide mass spectrometry. Genome Res. 2007; 17: 231-239; 17. Edwards N.J. 
Novel peptide identification from tandem mass spectra using ESTs and sequence data base compression. Mol. Syst. Biol. 2007; 3: 102). For example, Castellana et al. (15) achieved this by constructing a database, represented as a graph (16), containing many putative exons and exon splice junctions. However, this approach also has its shortcomings. The putative gene models are constructed based on prior assumptions about splice junctions and proximal exons. In addition, recent genomic discoveries point to extensive structural variation in the genome in the form of large-scale deletions, insertions, inversions, and translocations that might fuse different genic regions or create nonstandard splice forms (18. Iafrate A.J. Feuk L. Rivera M.N. Listewnik M.L. Donahoe P.K. Qi Y. Scherer S.W. Lee C. Detection of large-scale variation in the human genome. Nat. Genet. 2004; 36: 949-951; 19. Sebat J. et al. Large-scale copy number polymorphism in the human genome. Science. 2004; 305: 525-528). Indeed, many cancers are characterized by such large-scale mutations of the genome (20. Campbell P.J. et al. Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat. Genet. 2008; 40: 722-729). Other examples of variation that confound standard database identification techniques are immunoglobulins and antibodies. 
Here, recombination events fuse disparate regions of the genome, often inserting nontemplated sequence and creating many novel gene structures in every individual. The common theme in all of the scenarios described is that it is not possible to maintain all possible encodings in a database to allow for a standard proteogenomic search.

In this study, we sought to determine whether the imperfect template provided by the genome can still be used as a basis for peptide (and protein) identification. We are motivated in our approach by the work of Bandeira et al. (21. Bandeira N. Pham V. Pevzner P. Arnott D. Lill J.R. Automated de novo protein sequencing of monoclonal antibodies. Nat. Biotechnol. 2008; 26: 1336-1338), who were able to sequence monoclonal antibodies de novo, making no use of a database at all. In their method, an all-to-all comparison of spectra allowed the creation of spectral contigs, similar to sequence contigs in shotgun sequencing projects. The sequences of the spectral contigs were determined de novo. Using full antibody sequences as references, they were able to order the contigs and infer the missing sequence. Because the construction and sequencing of the contigs was performed completely de novo, Bandeira et al. (21) were able to sequence highly divergent proteins or proteins for which there is no database. However, the ordering of the sequenced contigs relies on a database of full antibody sequences for mapping. Sequences that cannot be mapped to an antibody in the database may be discarded. In contrast, the templates used in our method are not full proteins, but substrings of proteins, such as exons, which are combinatorially chained together to best explain the spectrometric evidence.

Liu et al. (22. Liu X. Han Y. 
Yuen D. Ma B. Automated protein (re)sequencing with MS/MS and a homologous data base yields almost full coverage and accuracy. Bioinformatics. 2009; 25: 2174-2180) have developed Champs, a method for sequencing a divergent protein using a homologous protein database. In their method, a single reference protein was chosen, and the de novo interpretations of spectra were mapped to the reference. They were able to sequence a protein with high accuracy using a reference protein with only 77% similarity to the target. Although Champs is able to map peptides that differ from the reference by one or two amino acids, it does not look for large insertions or deletions in the target sequence, as in a novel splice form. In our work, use of the database as an incomplete template lends additional confidence to the target sequencing without substantially limiting the ability to identify diverged sequences.

Here, we describe a novel method for template proteogenomics, implemented in the tool GenoMS. GenoMS takes as input a collection of spectra (acquired from multiple protease digests) and a collection of imperfect templates and constraints (defined under Experimental Procedures). It returns a target protein sequence. At the heart of the approach is a novel method of extending a target amino acid sequence by recruiting and aligning spectra that match it partially. By using spectral data sets with multiple protease digests, we are able to identify many overlapping peptides. We then align the overlapping spectra and produce an extended consensus spectrum. We are able to extend 89% of the target amino acid sequences. More than 40% of these extensions are three or more amino acids.

We test the performance of GenoMS in reconstructing monoclonal antibody sequences. Antibodies are an interesting test case because of their highly variable nature and because no complete antibody database exists. 
They are composed of four polypeptide chains: two identical heavy chains and two identical light chains (Fig. 1). An antibody's preference and efficiency in the detection and removal of encountered antigens is heavily dependent on its amino acid sequence. Consequently, antibodies are extremely diverse. A principal way in which antibody diversity is achieved is through genome rearrangement of the germline locus (Fig. 1). An antibody's heavy chain comprises four gene segments: a variable (V) segment, a diversity (D) segment, a joining (J) segment, and a constant (C) segment. Likewise, the light chain is composed of three gene segments: a V segment, a J segment, and a C segment. Each segment is chosen from potentially hundreds present in the genome, and many combinations of gene segments may be joined. Imprecise boundaries with the possible insertion of additional nucleotides allow the creation of many sequences from a single germline locus. Somatic hypermutation also plays a role in achieving antibody diversity. Although antibody sequence may be determined by sequencing the DNA of the source cell line, few direct protein-sequencing options exist when the source is unavailable or for ensuring antibody integrity.

The antibody structure provides enough complexity to serve as a test case for template proteogenomics. Using the technique of extending the peptide sequence without reference to a database, we are able to reconstruct the full protein sequence for the antibody raised against the B- and T-lymphocyte attenuator molecule (aBTLA) (21). (The abbreviations used are: aBTLA, antibody raised against the B- and T-lymphocyte attenuator molecule; PRM, prefix residue mass; HMM, hidden Markov model; AA, amino acid.) 
We also test our approach by using an available data set of spectra acquired using multiple protease digests for bovine serum albumin (BSA). The sequence of BSA is determined using the bovine genome as a template database. Both chains of aBTLA were sequenced using unrearranged gene segments as templates. An independent reconstruction of the aBTLA heavy chain was performed using the unrearranged heavy-chain genomic locus as a template.

Our goal is to reconstruct the target amino acid sequence, using a chain of templates. A template is defined as an amino acid sequence that may be present in the target protein, although possibly in a mutated or modified form. The target protein might contain multiple templates chained together. We provide additional abstraction to model constraints on the templates. First, the user can specify a partial order $t_1 \to t_2$ to enforce that template $t_1$ must precede $t_2$ in the chain. Second, the user can provide mutual exclusion constraints on $(t_1, t_2)$, a pair of templates, to enforce that only one of the two templates is in the chain. For example, in antibody sequences, all V, D, J, and C genes are templates. The constraints help specify the ordering of V, D, J, and C genes, and the exclusion of any pair of genes from the same class (e.g. V). An anchor is defined as a substring of a template that is present in the target with no mutations. Each template may contain zero or more anchors.

Fig. 2 describes an overview of our algorithm. GenoMS takes a collection of tandem MS spectra as input, along with a set of templates and their constraints, and requires at least one anchor sequence. It outputs a target protein sequence using a chain of templates as a guide. There are three stages: template-chain selection, anchor extension, and sequence construction, all described below. 
We create a custom database of all template sequences and use the database search tool InsPecT to search all spectra against the database (3) (see supplemental Methods). The best templates to use as guides are those that show a good match to the spectra. Coverage[t] is defined as the number of amino acids on t that were confirmed by the database search. Peptides that appear in multiple templates count toward the coverage of all of them. This reuse is eliminated in the next step.

The goal of the template-chain selection phase is to select a chain of templates with maximum coverage while satisfying all constraints. To find the chain of templates, we define a graph in which the nodes are templates. There are two sets of edges. Directed edges $t_1 \to t_2$ model the ordering constraints, while a set of undirected edges, $(t_1, t_2) \in E_f$, models the exclusion. In addition to the constraints specified by the user, we also create forbidden edges between templates that share more than the minimum of two peptides or half of the peptides belonging to one of the templates. A chain $T = \{t_1, t_2, \ldots, t_k\}$ is valid if $(t_i, t_j) \notin E_f$ for all $t_i, t_j$ in $T$, and $t_1 \to t_2 \to \cdots \to t_k$. The objective is to compute a valid chain so that $\sum_{i=1}^{k} \mathrm{Coverage}[t_i]$ is maximized. Solving this problem generally is hard. We use a heuristic method based on dynamic programming to find a valid chain (see supplemental Methods). Let $V_j$ denote the maximum score of a valid chain ending at $t_j$, and $T_j$ denote the corresponding chain. Then,

$$V_j = \mathrm{Coverage}[t_j] + \max_{i \,:\, T_i + t_j \text{ is valid}} V_i \qquad \text{(Eq. 1)}$$

and $T_j$ is constructed by chaining $t_j$ to the optimal $T_i$. The template-chain determined by this heuristic is considered for subsequent stages of GenoMS.
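The recurrence in Eq. 1 can be illustrated with a minimal Python sketch. This is not the GenoMS implementation (which is described in the supplemental Methods); the function name `select_template_chain`, the input layout, and the toy coverage numbers are all assumptions made for illustration. The sketch assumes the ordering edges form an acyclic graph and that every allowed consecutive pair is listed explicitly.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def select_template_chain(coverage, precedes, forbidden):
    """Heuristic chain selection following Eq. 1 (illustrative sketch only).

    coverage  : dict template -> number of residues confirmed by database search
    precedes  : set of (a, b) pairs, meaning a may directly precede b in a chain
    forbidden : set of frozenset({a, b}) pairs that may not share a chain (E_f)
    """
    # Predecessor lists, in the format expected by TopologicalSorter.
    preds = {t: set() for t in coverage}
    for a, b in precedes:
        preds[b].add(a)

    best_score = {}  # V_j
    best_chain = {}  # T_j
    for t in TopologicalSorter(preds).static_order():
        score, chain = coverage[t], [t]  # chain consisting of t alone
        for p in preds[t]:
            candidate = best_chain[p]
            # T_p + t is valid only if t is not forbidden with any member of T_p.
            if any(frozenset((u, t)) in forbidden for u in candidate):
                continue
            if best_score[p] + coverage[t] > score:
                score, chain = best_score[p] + coverage[t], candidate + [t]
        best_score[t], best_chain[t] = score, chain

    end = max(best_score, key=best_score.get)
    return best_chain[end], best_score[end]


# Toy antibody-like example (hypothetical template names and coverage values).
coverage = {"V1": 90, "V2": 60, "D1": 5, "J1": 12, "C1": 300}
precedes = {("V1", "D1"), ("V2", "D1"), ("V1", "J1"),
            ("V2", "J1"), ("D1", "J1"), ("J1", "C1")}
forbidden = {frozenset(("V1", "V2"))}
chain, score = select_template_chain(coverage, precedes, forbidden)
```

Because only the single best chain ending at each template is retained, the procedure is a heuristic: a lower-scoring prefix that would have avoided a later forbidden pair is not reconsidered.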
For an antibody, the template chain will often link V(D)JC together in that order. However, not all templates are required. Missing templates will be filled in by anchor extension. In addition, we are not limited to a single chain. A variant of this heuristic can output multiple chains when needed (e.g. alternative splicing).

Recall that the template chain was created by connecting templates that were well covered by target peptides. For each selected template in the chain, anchors are created by merging overlapping peptides. Anchors are ordered by their position on the chain. Spectra not annotated using the database search are reconsidered in the subsequent phases of the algorithm.

In the second step, we extend the sequence of each anchor. Before extension, all spectra are first clustered to reduce the overall number of spectra and improve spectrum quality (23. Frank A.M. Bandeira N. Shen Z. Tanner S. Briggs S.P. Smith R.D. Pevzner P.A. Clustering millions of tandem mass spectra. J. Proteome Res. 2008; 7: 113-122). The clustered spectra are converted to prefix residue mass (PRM) spectra (8). A PRM spectrum is represented by a list of mass values, and a PRM-score function ϕ that computes the likelihood that a mass value is a PRM. The procedure for extending the sequence of an anchor is shown in ExtendAnchor below.

procedure EXTENDANCHOR
    Recruit PRM spectra that overlap the N/C-terminal of the anchor
    repeat
        1.1 Align the recruited spectra
        1.2 Construct a consensus spectrum from the aligned spectra
        1.3 Recruit spectra that overlap the N/C-terminal of the consensus spectrum
    while
    Sequence the consensus spectrum

All spectra that do not contribute to an anchor and have not already been recruited are examined for overlap with each anchor.
Any spectra that have been recruited in previous rounds to the same terminus of the anchor are eligible for recruitment in subsequent rounds for that terminus as well. We determine overlap by using a modified spectral alignment method (24. Pevzner P.A. Dancík V. Tang C.L. Mutation-tolerant protein identification by mass spectrometry. J. Comput. Biol. 2000; 7: 777-787). When aligning a spectrum to an anchor, we allow the spectrum to only partially overlap the anchor (Fig. 3). Because the extended target sequence is determined by aligning the recruited spectra, it is critical to reduce false-positive recruitment and maintain enough coverage to reliably extend the sequence. We consider three parameters: the minimum additive score of the spectral alignment, Q (24); the minimum number of overlapping peaks, β, for a spectral alignment to be considered; and the exact number of spectra recruited, NS. Q could be learned by the algorithm independently for each experiment by looking at the alignment score of spectra identified by InsPecT (supplemental Methods). We tested for the dependence on β and NS using a training set of 206 uniformly selected anchor ends from the aBTLA heavy-chain sequence (supplemental Figs. 1 and 2). Values β = 4 and NS = 5 were chosen to balance the accuracy (fraction of recruited spectra that are correct) and sensitivity (fraction of true spectra recruited).

The recruited spectra and the anchor sequence must then be aligned. The sequence helps to anchor the spectral alignment, and the spectral alignment is then used to produce a consensus extension of the sequence. We do this using hidden Markov models (HMMs). Profile HMMs are a popular tool for performing multiple sequence alignment (25. Durbin R. Eddy S. Krogh A. Mitchison G. 
Biological Sequence Analysis. Cambridge University Press, Cambridge, UK, 1998). We alter the scheme slightly to perform multiple spectrum alignment. The use of HMMs for scoring peptide-spectrum alignments has previously been proposed (26. Wan Y. Yang A. Chen T. PepHMM: a hidden Markov model based scoring function for mass spectrometry data base search. Anal. Chem. 2006; 78: 432-437). A novel part of our approach is that the HMM is not static, but is updated by model surgery, as we extend the anchor sequence.

Recall that the anchor sequence can also be interpreted as a list of PRMs $[m_1, m_2, m_3, \ldots]$. For example, the anchor VCAK corresponds to the PRM list [0, 99.07, 259.21, 330.28, 458.32]. Intuitively, the HMM is an automaton that generates these PRMs (Fig. 4A). In the absence of noise, we have a set of Match states $(M_1, M_2, \ldots)$. The automaton starts in Match state $M_1$. In each Match state $M_i$, the PRM $m_i$ is emitted, followed by a transition to the next Match state.

An HMM is formally described by a 5-tuple $M = (\Omega, A, B, \pi, \Sigma)$, where $\Omega$ is the set of states. The HMM is initially in state $\omega_i \in \Omega$ according to the distribution $\pi$. In state $\omega_i$, $M$ emits a symbol $o \in \Sigma$ according to the distribution $B_{i,o}$, and transitions to state $\omega_j$ according to the transition probability $A_{i,j}$. To model measurement errors, the Match state $M_j$ outputs a mass $m$ according to $B_{M_j,m} \sim \mathcal{N}(m_j, \sigma)$, where $\sigma$ (i.e. S.D.) is obtained by empirically measured instrument accuracy. Noise peaks are modeled by Insert states in between each adjacent pair of Match states, with the emission probabilities defined by

$$B_{I_j,m} \propto \begin{cases} e^{-\phi(m)} & \text{if } m_j < m < m_{j+1} \\ 0 & \text{otherwise} \end{cases} \qquad \text{(Eq. 2)}$$

Missing peaks in spectra are modeled by moving from a Match state to a Delete state, where no symbol is emitted. The transition probabilities $A_{i,j}$ are initialized to favor match transitions, and penalize delete transitions (supplemental Methods).
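As a rough illustration of the two emission models above, the following Python sketch evaluates the Gaussian Match-state density and the unnormalized Insert-state weight of Eq. 2. The function names and the value of sigma are assumptions chosen for the example; the paper takes sigma from measured instrument accuracy and does not report the normalization constant of Eq. 2.

```python
import math


def match_emission(mass, mean, sigma):
    """Gaussian density N(mean, sigma) evaluated at an observed PRM mass."""
    z = (mass - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def insert_emission(mass, prm_score, left_prm, right_prm):
    """Unnormalized Insert-state weight from Eq. 2.

    The weight is exp(-phi(m)) for masses strictly between the two
    flanking Match-state PRMs, and zero elsewhere.
    """
    if left_prm < mass < right_prm:
        return math.exp(-prm_score)
    return 0.0


# PRM list for the anchor VCAK, as given in the text.
anchor_prms = [0, 99.07, 259.21, 330.28, 458.32]
SIGMA = 0.5  # assumed value, for illustration only

# A peak near the second PRM is well explained by Match state M_2 ...
near_match = match_emission(99.10, anchor_prms[1], SIGMA)
# ... while a low-scoring peak between M_2 and M_3 is absorbed by Insert state I_2.
noise = insert_emission(150.0, 0.3, anchor_prms[1], anchor_prms[2])
outside = insert_emission(300.0, 0.3, anchor_prms[1], anchor_prms[2])
```

Peaks with a high PRM score receive a small Insert weight, so the alignment prefers to explain them with Match states rather than treat them as noise.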
All parameters are updated at each iteration using a Bayesian approach described in the next section. In this generative model, each spectrum is produced by traversing a (hidden) path through the states of the HMM. Reconstructing the most likely path is equivalent to aligning the spectrum to the HMM and can be determined using the Viterbi algorithm (27. Viterbi A. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory. 1967; 13: 260-269). An Insert state is created after the final Match state for C-terminal extension or before the initial Match state for N-terminal extension. Model surgery is performed to generate additional Match states from these terminal Insert states, which are used to reconstruct the template extension. The procedure for learning the HMM by aligning recruited spectra is shown in AlignSpectrum below.

procedure ALIGNSPECTRUM
    Create an initial HMM using the anchor
    For each recruited spectrum, S
        2.1 Align S to the model using the Viterbi algorithm
        2.2 Update model parameters
        2.3 Perform model surgery

Transitions $A_{i,j}$ are updated according to

$$\rho_i \leftarrow \frac{1}{\sum_k c_{i,k} + 1} \qquad \text{(Eq. 3)}$$

$$A_{i,j} \leftarrow \frac{c_{i,j} + \alpha_j + \rho_i A_{i,j}}{\sum_k [c_{i,k} + \alpha_k] + \rho_i} \qquad \text{(Eq. 4)}$$

$$\alpha_j = \begin{cases} 7 & \text{if } \omega_j \text{ is a Match state} \\ 1 & \text{otherwise} \end{cases} \qquad \text{(Eq. 5)}$$

To update $B_{M_j,m}$, the mean is recomputed in each step by using spectral PRMs that were emitted in state $M_j$. The variance remains unchanged.

The initial HMM is constructed using the anchor PRMs. The aligned spectra overlap only partially. The PRMs preceding the N-terminal Match state (or succeeding the C-terminal Match state in the case of right extension) are emitted by Insert states. The observed masses emitted by an Insert state cluster around certain PRM values, specifically at the preceding (or succeeding) PRMs of the target sequence. Model surgery is used to create a Match state that can emit the cluster of PRMs (see Fig. 4B). In this way, the HMM is extended to better represent the target sequence.
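The transition update of Eqs. 3 to 5 can be written out for a single source state as the short Python sketch below. The text does not define $c_{i,j}$; the sketch assumes it is the number of times the transition from state $i$ to state $j$ was used in the Viterbi alignments so far. The function name `update_transition_row` and the dictionary layout are likewise hypothetical.

```python
def update_transition_row(prior_row, counts, is_match):
    """Update the outgoing transition probabilities of one state (Eqs. 3-5).

    prior_row : dict target state -> current A[i][target]
    counts    : dict target state -> c[i][target], ASSUMED to be observed
                transition counts from the alignments
    is_match  : dict target state -> True if the target is a Match state
    """
    total = sum(counts.get(s, 0) for s in prior_row)
    rho = 1.0 / (total + 1.0)                                     # Eq. 3
    alpha = {s: 7.0 if is_match[s] else 1.0 for s in prior_row}   # Eq. 5
    denom = sum(counts.get(s, 0) + alpha[s] for s in prior_row) + rho
    return {                                                      # Eq. 4
        s: (counts.get(s, 0) + alpha[s] + rho * prior_row[s]) / denom
        for s in prior_row
    }


# Example: a Match state with transitions to the next Match, Insert, and Delete states.
prior = {"M": 0.8, "I": 0.1, "D": 0.1}
counts = {"M": 3, "I": 1, "D": 0}
is_match = {"M": True, "I": False, "D": False}
updated = update_transition_row(prior, counts, is_match)
```

When the prior row sums to one, the numerators in Eq. 4 add up to the denominator, so the updated row is again a probability distribution. The larger pseudocount for Match targets keeps match transitions favored while few spectra have been aligned.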
Let $W_I$ denote the set of mass values emitted by Insert state $I$. Consider a subset $W' \subseteq W_I$. Let $\mu_{W'}$ and $\sigma_{W'}$ denote the mean of the values in $W'$ and the S.D., respectively. Define

$$\mathrm{Score}(W') = \sum_{m \in W'} \phi(m) \qquad \text{(Eq. 6)}$$

We compute

$$W^* = \underset{\substack{W' \subseteq W_I \\ |W'| \ge 2 \\ \sigma_{W'} < 0.25}}{\arg\max}\ \mathrm{Score}(W') \qquad \text{(Eq. 7)}$$

Note that the computation can be done efficiently by sorting the mass values, and looking at intervals. If $\mathrm{Score}(W^*)$ exceeds the minimum PRM score $\phi(m)$ for any spectrum, we add a new Match state with mean $\mu_{W^*}$, along with the corresponding Delete and Insert states (Fig. 4B). All spectra are realigned to the new HMM.

The HMM, once learned from the recruited spectra, is used to produce a consensus spectrum. The

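The interval search suggested for Eq. 7 can be sketched in Python as follows. This is an illustrative reading rather than the authors' code: the function name `best_cluster` is hypothetical, candidate subsets are restricted to contiguous runs of the sorted masses, and the population standard deviation is assumed because the text does not say which estimator is used.

```python
import statistics


def best_cluster(peaks, max_sd=0.25):
    """Search contiguous intervals of sorted masses for the best-scoring cluster.

    peaks  : list of (mass, prm_score) pairs emitted by one Insert state
    max_sd : upper bound on the standard deviation of the cluster (Eq. 7)

    Returns (score, mean_mass, masses) for the best interval with at least
    two peaks, or None if no interval satisfies the constraint.
    """
    peaks = sorted(peaks)
    best = None
    for i in range(len(peaks)):
        for j in range(i + 2, len(peaks) + 1):  # intervals of size >= 2
            window = peaks[i:j]
            masses = [m for m, _ in window]
            if statistics.pstdev(masses) >= max_sd:
                continue
            score = sum(s for _, s in window)    # Eq. 6
            if best is None or score > best[0]:
                best = (score, statistics.fmean(masses), masses)
    return best


# Hypothetical Insert-state emissions: three peaks cluster near 128.09.
emitted = [(113.0, 0.5), (128.05, 2.0), (128.10, 1.5),
           (128.12, 3.0), (129.5, 0.2), (200.0, 1.0)]
result = best_cluster(emitted)
```

The mean of the returned cluster would serve as the mean of the new Match state. The comparison of the cluster score against the minimum PRM score is left to the caller.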
Reference(s)