WO1999062930A2 - Protein sequencing using tandem mass spectroscopy - Google Patents

Protein sequencing using tandem mass spectroscopy

Info

Publication number
WO1999062930A2
Authority
WO
WIPO (PCT)
Prior art keywords
spectrum
mass
graph
peptide
ions
Prior art date
Application number
PCT/US1999/012221
Other languages
French (fr)
Inventor
Vladimir Dancik
Original Assignee
Millennium Pharmaceuticals, Inc.
Priority date
Filing date
Publication date
Application filed by Millennium Pharmaceuticals, Inc.
Priority to AU42284/99A (AU4228499A)
Publication of WO1999062930A2

Classifications

    • H ELECTRICITY
    • H01 ELECTRIC ELEMENTS
    • H01J ELECTRIC DISCHARGE TUBES OR DISCHARGE LAMPS
    • H01J49/00 Particle spectrometers or separator tubes
    • H01J49/0027 Methods for using particle spectrometers
    • H01J49/0036 Step by step routines describing the handling of the data generated during a measurement
    • C CHEMISTRY; METALLURGY
    • C07 ORGANIC CHEMISTRY
    • C07K PEPTIDES
    • C07K1/00 General methods for the preparation of peptides, i.e. processes for the organic chemical preparation of peptides or proteins of any length
    • C07K1/12 General methods for the preparation of peptides, i.e. processes for the organic chemical preparation of peptides or proteins of any length by hydrolysis, i.e. solvolysis in general
    • C07K1/128 General methods for the preparation of peptides, i.e. processes for the organic chemical preparation of peptides or proteins of any length by hydrolysis, i.e. solvolysis in general sequencing
    • H ELECTRICITY
    • H01 ELECTRIC ELEMENTS
    • H01J ELECTRIC DISCHARGE TUBES OR DISCHARGE LAMPS
    • H01J49/00 Particle spectrometers or separator tubes
    • H01J49/004 Combinations of spectrometers, tandem spectrometers, e.g. MS/MS, MSn

Definitions

  • a tandem mass spectrometer is capable of automatically ionizing a mixture of peptides and measuring their respective parent mass/charge ratios, then selectively fragmenting each peptide into constitutive pieces and measuring the mass/charge ratios of the fragment ions (MS/MS spectra of peptides).
  • the peptide sequencing problem is then to derive the sequences of peptides given their MS/MS spectra.
  • sequence of the peptide could be simply determined by converting the mass differences of consecutive fragment ions in the spectrum to their corresponding amino acids.
  • de novo peptide sequencing remains an open problem and even a simple spectrum may require tens of minutes for a trained expert to interpret.
  • the number of sequence permutations examined can be further pruned by limiting the possible amino acid composition derived either through chemical amino acid analysis or through composition measurement for ions below m/z 160 in the tandem mass spectrum.
  • the difficulty with the prefix approach is that pruning frequently discards the correct sequence if its prefixes are poorly represented in the spectrum.
  • Another intrinsic problem with the global approach is that the spectrum information is used for scoring only after the potential peptide sequences are generated.
  • the global approach de novo programs typically have running time on the order of hours.
  • the peaks in the spectrum serve as vertices in the spectrum graph while the edges of the graph correspond to linking of vertices differing by the mass of an amino acid residue.
  • Fundamental to graph theory approaches is the prior transformation of each peak in the experimental spectrum into several vertices in a spectrum graph. Each vertex represents a different possible fragment ion type assignment for the peak.
  • the de novo peptide sequencing problem is thus cast as finding the longest path in the resulting directed acyclic graph. Since the number of edges in the spectrum graph is at most quadratic in the number of ions in the spectrum and since efficient algorithms for finding the longest paths are known such approaches have the potential to efficiently prune the set of all peptides to the set of high-scoring paths in the spectrum graph.
  • A be the set of amino acids with molecular masses $w(a)$, $a \in A$.
  • a (parent) peptide $P = p_1 \ldots p_n$ is a sequence of amino acids
  • a partial peptide $P' \subset P$ is a substring $p_i \ldots p_j$ of $P$ of mass $\sum_{i \le t \le j} m(p_t)$.
  • Electronic spectrum $E(P)$ of peptide $P$ is a set of masses of its partial peptides.
  • a match $m(S,P) = \sum_{s \in S} m(s,P)$ between spectrum $S$ and peptide $P$ is the number of ions from the spectrum $S$ that match peptide $P$.
  • $m(S,P)$ is the number of masses that experimental and electronic spectra have in common.
  • the peptide sequencing problem can be stated as follows. Given spectrum S and a parent mass m, find a peptide of mass m with the maximal match to spectrum S.
  • a δ-ion of a partial peptide $P' \subset P$ is a modification of $P'$ that has molecular mass $m(P') - \delta$.
  • electronic spectrum E of peptide P is created by subtracting all offsets from Δ from the masses of all partial peptides of P (denoted as $E \ominus \Delta$).
  • $W_\Delta(P) = W_\Delta(P_1) \cup \cdots \cup W_\Delta(P_{n-1})$.
  • the set of vertices of the spectrum graph then is $\{s_{\text{initial}}\} \cup V(s_1) \cup \cdots \cup V(s_m) \cup \{s_{\text{final}}\}$
  • a spectrum S of a peptide P is called "complete" if S contains an ion corresponding to $P_i$ for every $1 \le i \le n$.
  • the use of the spectrum graph is based on the observation that S is a complete spectrum of a peptide P when there exists a path of length n from $v_{\text{initial}}$ to $v_{\text{final}}$ in $G_\Delta(S)$ that is labeled by P and for which $|W_\Delta(P) \cap S| = \sum_{v} s(v)$, the sum being taken over the vertices $v$ of the path, where $s(v)$ denotes the multiplicity with which vertex $v$ was created.
  • An offset frequency function is introduced that represents an important new tool for defining the ion type tendencies for particular mass-spectrometers.
  • the offset frequency function allows one to compare different mass spectrometers based on their propensity to generate different ion types, thus making our algorithm instrument-independent.
  • Peaks in a spectrum either represent random noise or δ-ions of partial peptides.
  • d(S) be the average distance between the peaks.
  • is approximately $(1 - p(\delta)) \frac{1}{d(S)} + p(\delta)$, where $p(\delta)$ is the probability of a δ-ion (the portion of partial peptides that produce δ-ions).
  • $p(\delta)$ is the probability of a δ-ion (the portion of partial peptides that produce δ-ions).
  • the average d(S) for our sample spectra is 17.5, therefore probability of random offset is 0.057.
  • the probability of an a-ion with offset -27 is 0.23.
  • the offset -27 is observed 4 times more frequently than the average offset.
  • the statistics of offsets over all ions and all partial peptides provides a reliable learning algorithm for ion types.
  • Offsets $\Delta = \{\delta_1, \ldots, \delta_k\}$ corresponding to peaks of H(x) represent the ion-types produced by a given mass-spectrometer. Under normal circumstances we expect these offsets to correspond to the ion types that have sufficient support by chemistry.
  • Table 1 Information about terminal ion types learned from experimental spectra. The remaining offsets have average count 45 and average intensity 0.431024. When computing filtered counts, the peaks that have been identified as ions are not counted again for subsequent ion types.
  • Table 1 contains the list of offsets that have larger than expected counts and the corresponding ion types as known in chemistry. All the significant offsets we found correspond to known ion types. Surprisingly enough, some ion types turned out to be more significant than previously thought (e.g. b-H2O-H2O has a larger count than y-NH3). Also, Fig. 1 clearly shows the presence of internal b-ions in the spectra.
  • a part of the learning of ion types is to decide what interval of offsets should be considered for particular ion type.
  • Peaks in a spectrum differ in intensity and one has to address the question of setting a threshold for distinguishing the signal from noise in a spectrum prior to transforming it to a spectrum graph. Low thresholds lead to excessive growth of the spectrum graph while high thresholds lead to fragmentation of the spectrum graph.
  • Earlier de novo sequencing algorithms set up the intensity thresholds for experimental spectra in a largely heuristic manner and have not addressed the fact that the intensity thresholds are ion-type dependent.
  • the offset frequency function allows one to set up intensity thresholds in a rigorous way.
  • K the length of the underlying peptide. Since this information is usually unavailable, K may be chosen as the ratio of the peptide mass and the average mass of an amino acid.
  • K may be chosen as the ratio of the peptide mass and the average mass of an amino acid.
  • the analysis of b-ions can be limited to intensity ranks 1, 2 and 3, while the analysis of b-H2O can be limited to intensity ranks 3, 4 and 5.
  • a similar analysis implies that only intensities ranked 1 and 2 (i.e. 20-30 high-intensity peaks) should be considered for y-ions while intensities ranked 2, 3 and 4 represent potential y-H2O ions.
  • Fig. 3 shows that only intensities ranked 1 and 2 should be considered for y-ions while intensities ranked 2, 3 and 4 represent potential y-H2O ions.
  • the merging algorithm decides what vertices in the spectrum graph are to be merged into one vertex. It is important to merge appropriate vertices; if we do not merge vertices that correspond to the same partial peptide, we will interpret meaningful peaks of spectra as a noise. On the other hand, if we merge vertices that do not correspond to the same peptide, we may interpret noise as meaningful peaks.
  • SHERENGA uses a greedy algorithm for merging vertices and introduces bridge edges in the resulting graph.
  • a gap edge in the spectrum graph is a directed edge from u to v such that v - u is - l i ⁇
  • the goal of scoring is to answer the question of how well a candidate peptide "explains" a spectrum and to choose the peptide that explains the spectrum the best.
  • p(P,S) be the probability that a peptide P produces spectrum S. It is appropriate to design the scoring schema so that high-scoring peptides P have a high probability p(P,S).
  • evaluate p(P,S) and derive a scoring schema for paths in the spectrum graph by the probabilities of the corresponding peptides. The longest path in the weighted spectrum graph corresponds to the peptide P that "explains" spectrum S the best.
  • the protein sequencing algorithm involves the generation of the weighted spectrum graph (as described above) and the search for the highest scoring paths in the spectrum graph.
  • Every peak in the spectrum may be interpreted either as an N-terminal ion or C-terminal ion. Therefore, every "real" vertex (corresponding to a mass m) has a
  • G be a graph and let T be a set of forbidden pairs of vertices of G (twins).
  • a path in G is called anti-symmetric if it contains at most one vertex from every forbidden pair.
  • Anti-symmetric longest path problem is to find a longest anti-symmetric path in G with a set of forbidden pairs T.
  • the intrinsic property of the conventional longest path algorithms is that they use only neighbors of a given vertex while computing the shortest path ending in this vertex.
  • Vertices in the spectrum graph are numbers that correspond to masses of potential partial peptides.
  • Two forbidden pairs of vertices $(x_1, y_1)$ and $(x_2, y_2)$ are non-interleaving if the intervals $(x_1, y_1)$ and $(x_2, y_2)$ do not interleave, i.e. one of them is contained inside another.
  • a graph G with a set of forbidden pairs is called proper if every two forbidden pairs of vertices are non-interleaving.
  • Tandem mass-spectrometry peptide sequencing problem corresponds to antisymmetric longest path problem in a proper graph. We submit that there exists an efficient algorithm for anti-symmetric longest path problem in a proper graph.
  • C(G) a graph having a path that corresponds to a path in spectrum graph that is folded in the middle.
  • the vertices of the combined graph are pairs (e,x) such that edge e covers vertex x.
  • An initial vertex corresponds to the pair $(v_{\text{initial}}, v_{\text{final}})$, and a final vertex corresponds to a folding point of the spectrum graph.
  • the weight of new vertex will be the weighted average (i(s) u+i(t) v)/(i(s)+i(t)) of weights of u and v.
  • the greedy algorithm for merging provides satisfying results for most spectra.
  • a peak of a spectrum is actually a mass/charge (m/z) ratio of the corresponding ion.
  • m/z 1
  • m/z of the peak is the same as the mass of the corresponding ion.
  • some mass-spectrometers are capable of producing ions with charge 2 or even more; in this case the observed mass is half (a third, ...) of the ion's actual mass.
  • c(S, S (x)) be the number of peaks s ; e S and ⁇ e S(x) such that
  • the value of x that maximizes c(S, S(x)) then would be an appropriate choice for parent mass. Should there be many choices for x, we can select one that minimizes the sum of distances
  • This approach significantly improves the accuracy of the parent mass determination.
  • This approach can similarly be used to correct a mis-assignment of the parent mass/charge value resulting from an incorrect charge assignment.

Landscapes

  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Organic Chemistry (AREA)
  • General Health & Medical Sciences (AREA)
  • Biochemistry (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Medicinal Chemistry (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)

Abstract

A new algorithm, SHERENGA, for de novo spectral interpretation is described that automatically learns fragment ion-types and intensity thresholds from a collection of test spectra generated from any type of mass spectrometer. The algorithm employs a graph theory approach. The test data is used to construct optimal path scoring in the graph representations of tandem mass spectra. A ranked list of high scoring paths corresponds to potential peptide sequences. SHERENGA is most useful for interpreting sequences of peptides resulting from unknown proteins not yet encountered in genome sequencing, and leveraging text based pattern matching for homology matching to known proteins. The algorithm also serves as a powerful adjunct for validating the results of database matching algorithms in fully automated, high-throughput peptide sequencing.

Description

PROTEIN SEQUENCING USING TANDEM MASS SPECTROSCOPY
Background of the Invention
In a few seconds, a tandem mass spectrometer is capable of automatically ionizing a mixture of peptides and measuring their respective parent mass/charge ratios, then selectively fragmenting each peptide into constitutive pieces and measuring the mass/charge ratios of the fragment ions (MS/MS spectra of peptides). The peptide sequencing problem is then to derive the sequences of peptides given their MS/MS spectra. For an "ideal" fragmentation process and an "ideal" mass-spectrometer, the sequence of the peptide could be determined simply by converting the mass differences of consecutive fragment ions in the spectrum to their corresponding amino acids. In practice, de novo peptide sequencing remains an open problem, and even a simple spectrum may require tens of minutes for a trained expert to interpret.
Previous attempts to develop automated de novo peptide sequencing algorithms followed either global or local search paradigms. One prior approach involves the generation of all amino acid sequences and the corresponding electronic spectra, i.e. calculation of all theoretically possible fragment masses for each sequence. The goal is to find a sequence with the best match between the experimental and electronic spectrum. Since the number of sequence permutations grows exponentially with the length of the peptide, different pruning techniques were designed to limit the combinatorial explosion in global methods. Prefix pruning restricts the computational space to sequences whose prefixes match the experimental spectrum well. Unfortunately, prefix pruning frequently discards the correct sequence if its prefixes are poorly represented in the spectrum. The number of sequence permutations examined can be further pruned by limiting the possible amino acid composition, derived either through chemical amino acid analysis or through composition measurement for ions below m/z 160 in the tandem mass spectrum. Another intrinsic problem with the global approach is that the spectrum information is used for scoring only after the potential peptide sequences are generated. De novo programs based on the global approach typically have running times on the order of hours.
Local approaches tend to be more efficient techniques for de novo peptide sequencing because they use the spectral information before any candidate sequence is evaluated. In different modifications of the local approach the fragment ions correspond (sometimes implicitly) to vertices of the spectrum graph, as described in Christian Bartels, "Fast algorithm for peptide sequencing by mass spectroscopy," 19:363-368, 1990, which is incorporated by reference herein.
The peaks in the spectrum serve as vertices in the spectrum graph, while the edges of the graph link vertices that differ by the mass of an amino acid residue. Fundamental to graph theory approaches is the prior transformation of each peak in the experimental spectrum into several vertices in a spectrum graph. Each vertex represents a different possible fragment ion type assignment for the peak. The de novo peptide sequencing problem is thus cast as finding the longest path in the resulting directed acyclic graph. Since the number of edges in the spectrum graph is at most quadratic in the number of ions in the spectrum, and since efficient algorithms for finding the longest paths are known, such approaches have the potential to efficiently prune the set of all peptides to the set of high-scoring paths in the spectrum graph.
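The longest-path formulation can be illustrated with a minimal Python sketch. It assumes integer vertex masses, a per-vertex score, and a dictionary of labeled edges; the function name `best_path` and the toy data are hypothetical and are not taken from the described software. Because every edge runs from a lower mass to a higher mass, processing vertices in increasing mass order is a valid topological order for the dynamic program.

```python
def best_path(vertices, edges, score):
    """Highest-scoring path from vertices[0] to vertices[-1] in a DAG.

    vertices: masses in increasing order (a topological order, since
              every edge goes from a lower to a higher mass).
    edges:    dict u -> list of (v, label) with v > u.
    score:    dict vertex -> score (hypothetical scoring, for illustration).
    Returns (total_score, label_string), or None if the end is unreachable.
    """
    best = {vertices[0]: score[vertices[0]]}
    back = {}
    for u in vertices:
        if u not in best:
            continue  # u cannot be reached from the start vertex
        for v, label in edges.get(u, []):
            candidate = best[u] + score[v]
            if candidate > best.get(v, float("-inf")):
                best[v] = candidate
                back[v] = (u, label)
    end = vertices[-1]
    if end not in best:
        return None
    labels = []
    v = end
    while v in back:  # walk the back-pointers to recover the edge labels
        v, label = back[v]
        labels.append(label)
    return best[end], "".join(reversed(labels))


# Toy graph: G = 57 and A = 71 (nominal residue masses).
vertices = [0, 57, 71, 128, 199]
edges = {0: [(57, "G"), (71, "A")], 57: [(128, "A")],
         71: [(128, "G")], 128: [(199, "A")]}
score = {0: 0, 57: 1, 71: 2, 128: 1, 199: 1}
print(best_path(vertices, edges, score))
```

The running time is linear in the number of vertices plus edges, which is what makes the graph formulation attractive compared with enumerating candidate sequences.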
Although de novo sequencing software programs were developed beginning in the late 1980s, none are in widespread use today. The more widely used database search programs rely on the ability "to look the answer up in the back of the book" when studying genomes of extensively sequenced organisms. While de novo interpretation is limited to a certain extent by ambiguities arising from incomplete fragmentation of a peptide in a tandem mass spectrometer, current de novo algorithm implementations of graph-theory approaches face the following unsolved computational problems.
Existing algorithms tend to be instrument-dependent, i.e. they are designed for the kind of fragment ions that are most likely for the authors' particular type of mass spectrometer. No rigorous approach to defining ion-types and intensity thresholds in an instrument-independent fashion has yet been proposed. If the peptide fragmentation is incomplete the spectrum graph may break into a number of disconnected components. Random noise in the spectrum may generate many false vertices and edges in the spectrum graph that can mimic the correct peptide in the absence of a good scoring schema. Errors in the parent mass/charge assignment lead to misalignment between N-terminal and C-terminal vertices in the spectrum graph. No computational approach to adjust inappropriate parent mass/charge assignment has yet been proposed.
No rigorous approach to scoring paths in the spectrum graph has yet been proposed. The longest path in the spectrum graph may correspond to unrealistic solutions because it uses multiple graph vertices associated with the same spectral peak (antisymmetric path problem). No approach to take into account internal fragment ion types has yet been proposed. No approach to analyze ions of unknown charge state has yet been proposed. High-throughput peptide sequencing via tandem mass spectrometry (MS/MS) is emerging as one of the most powerful tools in proteomics for identifying proteins. While de novo MS/MS peptide sequencing remains a difficult problem, our method, as implemented in SHERENGA Software, is not limited to the near-complete sequences contained in spectra generated on magnetic sector instruments employing high-energy collision induced dissociation. A new method implemented in software for de novo peptide sequencing by tandem mass-spectrometry is desirable. Our algorithm automatically learns ion-types, error rates and intensity thresholds from a collection of spectra.
Detailed Description of an Illustrative Embodiment
Let $A$ be the set of amino acids with molecular masses $w(a)$, $a \in A$. A (parent) peptide $P = p_1 \ldots p_n$ is a sequence of amino acids; the mass of peptide $P$ is $m(P) = \sum_{i=1}^{n} m(p_i)$. A partial peptide $P' \subset P$ is a substring $p_i \ldots p_j$ of $P$ of mass $\sum_{i \le t \le j} m(p_t)$. The electronic spectrum $E(P)$ of peptide $P$ is the set of masses of its partial peptides. An (experimental) spectrum $S = \{s_1, \ldots, s_m\}$ is a set of masses of (fragment) ions. A mass $s$ matches a peptide $P$ if $m(P') = s$ for a partial peptide $P' \subset P$. Denote $m(s,P) = 1$ if $s$ matches $P$, and $m(s,P) = 0$ otherwise. A match $m(S,P) = \sum_{s \in S} m(s,P)$ between spectrum $S$ and peptide $P$ is the number of ions from the spectrum $S$ that match peptide $P$. In other words, $m(S,P)$ is the number of masses that the experimental and electronic spectra have in common.
The peptide sequencing problem can be stated as follows. Given a spectrum S and a parent mass m, find a peptide of mass m with the maximal match to spectrum S.
However, different mass-spectrometers lead to different variations of the peptide sequencing problem. In particular, peptide fragmentation in a tandem mass-spectrometer is characterized by a set of numbers $\Delta = \{\delta_1, \ldots, \delta_k\}$ called ion-types. A δ-ion of a partial peptide $P' \subset P$ is a modification of $P'$ that has molecular mass $m(P') - \delta$. In this case, a molecular mass $s$ matches a peptide $P$ if $m(P') - \delta = s$ for a partial peptide $P' \subset P$ and an ion-type $\delta \in \Delta$. For tandem mass-spectrometry, the electronic spectrum $E$ of peptide $P$ is created by subtracting all offsets in $\Delta$ from the masses of all partial peptides of $P$ (denoted as $E \ominus \Delta$).
The problem can be further stated as follows: given a spectrum S, the set of ion types Δ, and the mass w, find a peptide of mass w with the maximal match to spectrum S.
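As a minimal sketch of the match count defined above, the following Python computes the electronic spectrum of a short peptide, shifts it by a set of offsets, and counts the matching peaks. The nominal integer residue masses, the exact-equality matching, and the function names are illustrative assumptions rather than part of the described method; with `offsets=(0,)` the code reduces to the basic match m(S,P).

```python
# Nominal (integer) residue masses for a few amino acids; illustration only.
RESIDUE_MASS = {"G": 57, "A": 71, "S": 87, "V": 99}


def electronic_spectrum(peptide, offsets=(0,)):
    """Masses of all partial peptides (substrings), shifted by every offset.

    A delta-ion of a partial peptide P' has mass m(P') - delta.
    """
    masses = set()
    for i in range(len(peptide)):
        total = 0
        for j in range(i, len(peptide)):
            total += RESIDUE_MASS[peptide[j]]
            masses.update(total - delta for delta in offsets)
    return masses


def match(spectrum, peptide, offsets=(0,)):
    """Number of spectrum masses found in the (shifted) electronic spectrum."""
    expected = electronic_spectrum(peptide, offsets)
    return sum(1 for s in spectrum if s in expected)


spectrum = [57, 128, 100, 158]
print(match(spectrum, "GAS"))  # 57, 128 and 158 are substring masses of GAS
```

A real implementation would compare masses within an instrument-dependent tolerance instead of testing exact equality.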
Denote the partial N-terminal peptide sequences $p_1 \ldots p_i$ as $P_i$, $i = 1, \ldots, n-1$, and the partial C-terminal peptide sequences $p_i \ldots p_n$ as $P_i^{-}$, $i = 2, \ldots, n$ (note that this indexing differs from the usual indexing for C-terminal ions). In tandem mass-spectrometry, a spectrum $S$ consists mainly of δ-ions of partial N-terminal and C-terminal peptides, with δ being limited to a set of ion types $\Delta = \{\delta_1, \ldots, \delta_k\}$. For example, the most frequent N-terminal ions are b, a, b-H2O and b-NH3 (Δ = {1, -27, 17, 16}) for our mass-spectrometer.
For a given partial peptide $P_i$ let $W_\Delta(P_i)$ be the set of masses of all δ-ions of $P_i$ for $\delta \in \Delta$, i.e. $W_\Delta(P_i) = \{m(P_i) - \delta_1, \ldots, m(P_i) - \delta_k\}$. For a peptide $P$ we set $W_\Delta(P) = W_\Delta(P_1) \cup \cdots \cup W_\Delta(P_{n-1})$. The mass spectrometry reconstruction problem can then be formulated as follows. For a given molecular mass $w$, spectrum $S$ and set of ion types $\Delta$, find a peptide $P$ such that $m(P) = w$ and $W_\Delta(P) = S$. Realistically, we try to find a peptide with the best match between the spectrum and $W_\Delta(P)$, i.e. to maximize the size of $W_\Delta(P) \cap S$.
Assume that a spectrum from a tandem mass-spectrometer consists mainly of N-terminal ions and random noise. The correspondence between elements of spectra and vertices of spectrum graphs is closely tied to the possible ion types. We capture the relationship between spectrum $S$ and ion types $\Delta = \{\delta_1, \ldots, \delta_k\}$ by a spectrum graph $G_\Delta(S)$. Vertices of the graph are integers representing potential masses of partial peptides; for a vertex $v$ we denote this mass by $m(v)$. Every peak $s \in S$ of the spectrum generates $k$ vertices $V(s) = \{s + \delta_1, \ldots, s + \delta_k\}$, with $m(v_i) = s + \delta_i$, $i = 1, \ldots, k$.
The set of vertices of the spectrum graph is then $\{s_{\text{initial}}\} \cup V(s_1) \cup \cdots \cup V(s_m) \cup \{s_{\text{final}}\}$, where $s_{\text{initial}} = 0$ and $s_{\text{final}} = m(P)$. Two vertices $u$ and $v$ are connected by a directed edge from $u$ to $v$ if $v - u$ is the mass of some amino acid, and the edge is labeled by this amino acid. If we regard vertices as potential partial N-terminal peptides, the edge from $u$ to $v$ implies that the sequence at $v$ may be obtained by extending the sequence at $u$ by one amino acid. A spectrum $S$ of a peptide $P$ is called "complete" if $S$ contains an ion corresponding to $P_i$ for every $1 \le i \le n$. The use of the spectrum graph is based on the observation that $S$ is a complete spectrum of a peptide $P$ when there exists a path of length $n$ from $v_{\text{initial}}$ to $v_{\text{final}}$ in $G_\Delta(S)$ that is labeled by $P$ and for which $|W_\Delta(P) \cap S| = \sum_{v} s(v)$, the sum being taken over the vertices $v$ of the path, where $s(v)$ denotes the multiplicity with which vertex $v$ was created.
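The construction above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the patented implementation: masses are integers, edges require an exact mass difference, and the residue table, the offsets (-1, 27) and the toy peaks are hypothetical. Each peak s contributes the vertices s + δ, the vertices 0 and the parent mass are always present, and the multiplicity s(v) is kept in a counter.

```python
from collections import Counter


def spectrum_graph(spectrum, offsets, parent_mass, residue_mass):
    """Build vertices, labeled edges and vertex multiplicities s(v).

    Every peak s generates the vertices s + delta for each offset delta;
    the vertices 0 and parent_mass are always present.
    """
    multiplicity = Counter()
    for s in spectrum:
        for delta in offsets:
            v = s + delta
            if 0 < v < parent_mass:
                multiplicity[v] += 1
    vertices = sorted(set(multiplicity) | {0, parent_mass})
    present = set(vertices)
    edges = {}
    for u in vertices:
        # Edge u -> v exists when v - u is the mass of some amino acid.
        edges[u] = [(u + mass, aa) for aa, mass in residue_mass.items()
                    if u + mass in present]
    return vertices, edges, multiplicity


residue_mass = {"G": 57, "A": 71, "S": 87}  # nominal masses, illustration only
spectrum = [30, 58, 129, 90]                # toy peaks; 90 plays the role of noise
vertices, edges, multiplicity = spectrum_graph(
    spectrum, (-1, 27), 215, residue_mass)
print(edges[0], edges[57], edges[128])
```

In this toy example the peaks 30 and 58 both map to the vertex 57, so that vertex has multiplicity 2, and the path 0, 57, 128, 215 spells G, A, S. The noise peak and the wrong ion-type assignments add vertices and even a stray edge, which is why path scoring is needed.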
This observation transforms the tandem mass-spectrometry protein sequencing problem into finding the correct path in the set of all paths. Since the number of paths in the graph is enormous, we need some way of evaluating the paths. Previous implementations of the spectrum graph searched for a path visiting as many vertices as possible. Unfortunately, experimental spectra are frequently incomplete and noisy, i.e. they contain many peaks that do not correspond to any ions. Thus, in order to find a peptide sequence corresponding to the given spectrum, we have to develop a new approach to the spectrum graph and a scoring schema that deal with incomplete, noisy spectra and evaluate the weight of the paths in the spectrum graph. Another problem is that different mass-spectrometers have different characteristics and different ion-types, and therefore every algorithm for de novo peptide sequencing should be adjusted for a particular type of mass-spectrometer. To address this problem, an offset frequency function is described, together with an algorithm for automatic learning of ion types, intensity thresholds and scoring parameters from a sample of experimental spectra.
An offset frequency function is introduced that represents an important new tool for defining the ion type tendencies for particular mass-spectrometers. The offset frequency function allows one to compare different mass spectrometers based on their propensity to generate different ion types, thus making our algorithm instrument-independent.
Another observation is that in spectra we often observe ions that correspond to partial sequences $P_i^{-} = p_i \ldots p_n$ (these are called C-terminal ions, as opposed to N-terminal ions corresponding to the $P_i$'s) and in some cases internal partial sequences $P_{ij} = p_i \ldots p_j$.
We distinguish between the name of ion type δ and the mass difference (offset) δ corresponding to ion type δ.
Consider a spectrum corresponding to peptide P from the learning sample. We will concentrate on peaks of spectra that are close to the mass of a partial peptide $P_i$.
Peaks in a spectrum either represent random noise or δ-ions of partial peptides.
If we don't know the ion types $\Delta = \{\delta_1, \ldots, \delta_k\}$ produced by a given mass spectrometer we cannot interpret the spectrum. We must distinguish between random noise and δ-ions. We describe how to learn the set Δ and ion propensities from a sample of experimental spectra.
Let $S = \{s_1, \ldots, s_m\}$ be a spectrum corresponding to the peptide $P$, with peaks listed in increasing order of mass, and let $d(S) = \frac{s_m - s_1}{m - 1}$ be the average distance between the peaks. A partial peptide $P_i$ and a peak $s_j$ have an offset $x_{ij} = m(P_i) - s_j$; for illustration purposes we shall treat $x_{ij}$ as a discrete random variable. Given an arbitrary offset $x$, the probability that there is $s_j \in S$ such that $x_{ij} = x$ can be roughly estimated as $\frac{1}{d(S)}$. For an offset $\delta \in \Delta$, the probability that there is a peak in $S$ with offset $x_{ij} = \delta$ is approximately $(1 - p(\delta)) \frac{1}{d(S)} + p(\delta)$, where $p(\delta)$ is the probability of a δ-ion (the portion of partial peptides that produce δ-ions). For example, the average $d(S)$ for our sample spectra is 17.5, therefore the probability of a random offset is 0.057. The probability of an a-ion with offset -27 is 0.23. Thus the offset -27 is observed 4 times more frequently than the average offset. The statistics of offsets over all ions and all partial peptides provide a reliable learning algorithm for ion types.
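The arithmetic in this example can be checked with a short Python sketch. The form of d(S) used here, the span of the sorted peaks divided by the number of gaps, is an assumption about the average distance between peaks, and the function names are hypothetical.

```python
def average_peak_distance(spectrum):
    """Average gap between consecutive peaks (assumed form of d(S))."""
    peaks = sorted(spectrum)
    return (peaks[-1] - peaks[0]) / (len(peaks) - 1)


def random_offset_probability(d):
    """Rough chance that some peak shows a given arbitrary offset: 1/d(S)."""
    return 1.0 / d


def peak_probability(p_delta, d):
    """Chance of seeing a peak at an ion-type offset:
    (1 - p(delta)) * 1/d(S) + p(delta)."""
    return (1.0 - p_delta) / d + p_delta


d = 17.5                               # average d(S) quoted in the text
p_random = random_offset_probability(d)
print(round(p_random, 3))              # probability of a random offset
print(round(0.23 / p_random, 1))       # a-ion probability relative to random
```

With d(S) = 17.5 the random-offset probability rounds to 0.057, and 0.23 is about four times that value, in line with the figures quoted in the text.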
Given a spectrum $S$, an offset $x$ and a precision ε, we compute the number $H(x,S)$ of pairs $(P_i, s_j)$, $i = 1, \ldots, n-1$, $j = 1, \ldots, m$, that have offset $m(P_i) - s_j$ within distance ε of $x$. The offset frequency function is defined as $H(x) = \sum_{S} H(x,S)$, where the sum is taken over all spectra from the learning sample. To learn about C-terminal ions we do the same for pairs $(P_i^{-}, s_j)$. Fig. 1 presents the plots of the function $H(x)$ for N-terminal, C-terminal, internal and doubly charged ion types. We consider only offsets within the interval $(-m, m)$, where $m$ is the mass of the lightest amino acid. Vertical axes represent normalized offset counts, with 1 being the average count; the offset increment is 0.2. The only significant offsets for internal ions correspond to b and b-H2O ions. The only significant offsets for doubly charged ions correspond to y and y-H2O ions.
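A minimal Python sketch of the offset frequency function follows. It assumes that the learning sample is a list of (prefix masses, spectrum) pairs for peptides of known sequence; that data layout, the function names and the toy numbers are hypothetical. Offsets are computed as m(P_i) - s_j, as in the definition above.

```python
def offset_count(prefix_masses, spectrum, x, eps):
    """H(x, S): number of (partial peptide, peak) pairs whose offset
    m(P_i) - s_j lies within eps of x."""
    return sum(1 for p in prefix_masses for s in spectrum
               if abs((p - s) - x) <= eps)


def offset_frequency(sample, x, eps):
    """H(x): sum of H(x, S) over all spectra in the learning sample."""
    return sum(offset_count(prefixes, spectrum, x, eps)
               for prefixes, spectrum in sample)


# Toy learning sample: peaks were placed at prefix mass + 1 and prefix mass - 27,
# plus one unrelated peak (50.0) in the second spectrum.
sample = [
    ([57.0, 128.0], [58.0, 129.0, 30.0, 101.0]),
    ([71.0], [72.0, 50.0]),
]
for x in (-1.0, 27.0, 5.0):
    print(x, offset_frequency(sample, x, eps=0.45))
```

Scanning x over a grid and looking for values where H(x) stands well above the average count is the learning step described in the text; the pairwise loop is quadratic per spectrum, which is acceptable for spectra with tens to hundreds of peaks.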
Offsets $\Delta = \{\delta_1, \ldots, \delta_k\}$ corresponding to peaks of H(x) represent the ion-types produced by a given mass-spectrometer. Under normal circumstances we expect these offsets to correspond to the ion types that have sufficient support by chemistry.
TABLE 1
[Table 1 is reproduced as an image in the original publication.]
Table 1: Information about terminal ion types learned from experimental spectra. The remaining offsets have average count 45 and average intensity 0.431024. When computing filtered counts, the peaks that have been identified as ions are not counted again for subsequent ion types.
Table 1 contains the list of offsets that have larger than expected counts and the corresponding ion types as known in chemistry. All the significant offsets we found correspond to known ion types. Surprisingly enough, some ion types turned out to be more significant than previously thought (e.g. b-H2O-H2O has a larger count than y-NH3). Also, Fig. 1 clearly shows the presence of internal b-ions in the spectra.
A part of the learning of ion types is to decide what interval of offsets should be considered for a particular ion type. The error range is chosen to be the width of the corresponding peak in the plot of the offset frequency function H(x). This analysis suggests that ε = 0.45. If we need to be more precise, we can assume that offsets are distributed according to a mixture of uniform and normal distributions and use maximum likelihood methods to estimate appropriate values for the error ranges.
Once we have learned and selected significant ion types we can annotate spectra from our learning sample. Annotated spectra will provide support for learning other features needed for the construction of spectrum graphs.
Peaks in a spectrum differ in intensity and one has to address the question of setting a threshold for distinguishing the signal from noise in a spectrum prior to transforming it to a spectrum graph. Low thresholds lead to excessive growth of the spectrum graph while high thresholds lead to fragmentation of the spectrum graph. Earlier de novo sequencing algorithms set up the intensity thresholds for experimental spectra in a largely heuristic manner and have not addressed the fact that the intensity thresholds are ion-type dependent. The offset frequency function allows one to set up intensity thresholds in a rigorous way.
Given a spectrum, we address this concern by normalizing intensities and ranking them in bins of size K: the K peaks with the largest intensity are given rank 1, the next K peaks are given rank 2, and so on. A natural choice for K is the length of the underlying peptide. Since this information is usually unavailable, K may be chosen as the ratio of the peptide mass to the average mass of an amino acid. We normalize the intensities of peaks in a spectrum in such a way that the average intensity of the peaks in the spectrum is 1. The frequencies of ion types depending on intensity are shown in Figure 3.
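The ranking scheme can be sketched as follows. This is only a minimal illustration: the function name, the representation of peaks as (m/z, intensity) pairs, and the default average amino acid mass of 110 Da are assumptions made for the example and are not values taken from this description.

```python
def rank_peaks(peaks, parent_mass, avg_residue_mass=110.0):
    """Normalize intensities to mean 1 and rank peaks in bins of K.

    peaks: non-empty list of (mz, intensity) tuples with positive intensities.
    Returns (mz, normalized_intensity, rank) tuples, most intense first.
    """
    # K approximates the peptide length: peptide mass / average amino acid mass.
    # The 110 Da default is an assumption, not a value from the text.
    k = max(1, round(parent_mass / avg_residue_mass))
    mean_intensity = sum(intensity for _, intensity in peaks) / len(peaks)
    ordered = sorted(peaks, key=lambda p: p[1], reverse=True)
    # The K most intense peaks get rank 1, the next K get rank 2, and so on.
    return [
        (mz, intensity / mean_intensity, index // k + 1)
        for index, (mz, intensity) in enumerate(ordered)
    ]
```

Peaks whose rank exceeds a chosen threshold for a given ion type can then be dropped before the spectrum graph is built.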
The change of H(x) depending on the intensity rank is shown in Fig. 2, which guides us in selecting intensity thresholds. Fig. 2 convincingly demonstrates that the intensities ranked below 5 (i.e. ranks 6 and higher) represent nothing but random noise, since the offset frequency function has no pronounced peaks in this region. It implies that for an average MS/MS spectrum on an ion-trap instrument no more than about 60 top intensities should be considered as a potential signal. This observation suggests a limit for the number of peaks analyzed by any peptide MS/MS interpretation program and indicates that the analysis of 100+ peaks with any program is likely to hamper rather than help in interpreting peptide sequences. Moreover, Fig. 3 demonstrates that intensity thresholds are ion-type dependent. For example, the analysis of b-ions can be limited to intensity ranks 1, 2 and 3, while the analysis of b-H2O can be limited to intensity ranks 3, 4 and 5. A similar analysis implies that only intensities ranked 1 and 2 (i.e. 20-30 high-intensity peaks) should be considered for y-ions, while intensities ranked 2, 3 and 4 represent potential y-H2O ions.
The approach to construction of the spectrum graph described above is incomplete since it does not take into account inaccuracies in experimental mass measurements of fragment and parent ions. Let partial peptide $P_i$ produce peaks $s_1,\dots,s_k$ in the spectrum corresponding to the ion types $\delta_1,\dots,\delta_k$. Above we assumed that $s_1 + \delta_1 = s_2 + \delta_2 = \dots = s_k + \delta_k = m(P_i)$ and that all k ion types generate the same vertex in the spectrum graph. Of course, this is not the case for real spectra. Due to inaccuracies of experimental mass measurements the peaks $s_1,\dots,s_k$ correspond to different vertices with weights $s_j + \delta_j$, $1 \le j \le k$, that agree only within a mass tolerance that is instrument dependent.
The merging algorithm decides which vertices in the spectrum graph are to be merged into one vertex. It is important to merge the appropriate vertices: if we do not merge vertices that correspond to the same partial peptide, we will interpret meaningful peaks of the spectrum as noise. On the other hand, if we merge vertices that do not correspond to the same peptide, we may interpret noise as meaningful peaks. To address this problem SHERENGA uses a greedy algorithm for merging vertices and introduces bridge edges in the resulting graph.
If a peptide undergoes incomplete fragmentation, the spectrum graph does not contain a vertex corresponding to an underrepresented position in the peptide. Since fragmentation is frequently incomplete, many peptides contain positions that have no corresponding peaks in the spectra. This can lead to a fragmented graph or, more frequently, a graph with paths that do not correspond to feasible solutions. This effect is only amplified as we introduce thresholds and exclude low intensity peaks from the spectrum. To overcome this problem we modify the spectrum graph by introducing gap edges. A gap edge in the spectrum graph is a directed edge from u to v such that v - u is the mass of a dipeptide, i.e. the sum of masses of two amino acids. In a more general approach we consider tri-peptides or even longer peptides.
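Gap edges can be illustrated with a short sketch. The function name, the small residue table (standard monoisotopic residue masses for four amino acids only), and the default tolerance of 0.5 are assumptions made for the example; a real table would cover all amino acids.

```python
from itertools import combinations_with_replacement

# Monoisotopic residue masses in Da; an illustrative subset only.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841}


def gap_edges(vertices, eps=0.5):
    """Return (u, v, label) for vertex pairs whose distance matches a dipeptide mass.

    The label names the two residues in alphabetical order, since a gap edge
    does not determine the order of the two amino acids.
    """
    dipeptides = {
        a + b: RESIDUE_MASS[a] + RESIDUE_MASS[b]
        for a, b in combinations_with_replacement(sorted(RESIDUE_MASS), 2)
    }
    ordered = sorted(vertices)
    edges = []
    for i, u in enumerate(ordered):
        for v in ordered[i + 1:]:
            for label, mass in dipeptides.items():
                if abs((v - u) - mass) < eps:
                    edges.append((u, v, label))
    return edges
```

Extending the table of dipeptide masses to sums of three residues would give the tri-peptide variant mentioned above.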
Accurate determination of the peptide parent mass/charge is extremely important in de novo peptide sequencing. An error in parent mass leads to systematic errors in the masses of vertices for C-terminal ions, thus making peptide reconstruction difficult. In practice, the offsets between the real peptide masses (given by the sum of the amino acid masses of a peptide) and the experimentally observed parent mass/charge, as shown in Fig. 4, are frequently so large that errors in peptide reconstruction become almost unavoidable. To address this problem we have designed a combinatorial algorithm for parent mass/charge computation that provides a more accurate determination of the parent mass.
The goal of scoring is to answer the question of how well a candidate peptide "explains" a spectrum and to choose the peptide that explains the spectrum the best. Below we introduce a probabilistic model for tandem mass-spectrometry and derive a rigorous scoring algorithm (versus largely heuristic previous approaches).
Let p(P,S) be the probability that a peptide P produces spectrum S. It is appropriate to design the scoring schema so that high-scoring peptides P have a high probability p(P,S). Below we describe a probabilistic model, evaluate p(P,S) and derive a scoring schema for paths in the spectrum graph that weights them by the probabilities of the corresponding peptides. The longest path in the weighted spectrum graph corresponds to the peptide P that "explains" spectrum S the best.
In a probabilistic approach tandem mass spectrometry is characterized by a set of ion types $\Delta = \{\delta_1,\dots,\delta_k\}$ and their probabilities $\{p(\delta_1),\dots,p(\delta_k)\}$ such that $\delta_i$-ions of a partial peptide $P' \subset P$ are produced independently with probabilities $p(\delta_i)$. A mass spectrometer also produces "random noise" that in any position may generate a peak with probability $q_R$. Therefore, a peak at a position corresponding to a $\delta_i$-ion is generated with probability $q_i = p(\delta_i) + (1 - p(\delta_i))q_R$, which can be estimated from the observed empirical distributions (Table 1). Partial peptide $P'$ may theoretically have up to k corresponding peaks in the spectra. It has all k peaks with probability $\prod_{i=1}^{k} q_i$ and it has no peaks with probability $\prod_{i=1}^{k} (1 - q_i)$. The described probabilistic model defines the probability p(P,S) that a peptide P produces spectrum S. We can now formulate the peptide sequencing problem as follows:
For a given spectrum S find a peptide P maximizing p(P,S), i.e. $p(P,S) = \max_{P'} p(P',S)$. To illustrate the idea of scoring informally, let us assume that only 4 types of ions are possible: y, b, y-H2O and b-H2O ions, with probabilities of appearing $q_1, q_2, q_3, q_4$. Assume also that the probability of random noise is $q_R$.
Suppose that a candidate partial peptide $P_i$ produces ions $\delta_1,\dots,\delta_l$ ("present" ions) and does not produce the ions $\delta_{l+1},\dots,\delta_k$ ("missing" ions) in the spectrum S. These l "present" ions will result in a vertex in the spectrum graph corresponding to $P_i$. How should we score this vertex? The existing database search algorithms use a "premium for present ions" approach, suggesting that the score for this vertex should be proportional to $q_1 \cdots q_l$ or maybe $\frac{q_1}{q_R} \cdots \frac{q_l}{q_R}$ to normalize the probabilities against the noise. (The ratios $q_i/q_R$ can be taken from the offset frequency function.) Normalizing against the noise has the additional effect of penalizing peaks of the experimental spectrum that are not explained in relation to a candidate sequence. Below we show, however, that this approach is not correct: a significant improvement results from also penalizing the non-presence of ions in the experimental spectrum that are possible from fragmentation of a candidate sequence. The probability score of the vertex is then given by

$$\frac{q_1}{q_R} \cdots \frac{q_l}{q_R} \cdot \frac{1 - q_{l+1}}{1 - q_R} \cdots \frac{1 - q_k}{1 - q_R}$$
("premium for present ions, penalty for missing ions"). This important observation was overlooked in scoring the database search hits for peptide mass-spectrometry. Although the "premium for present ions, penalty for missing ions" approach may sound counter-intuitive, it is confirmed both by our theoretical analysis and by improvements in SHERENGA performance as compared to the previous approach. We explain the role of this principle for the resolution of a simple alternative between dipeptide GG and amino acid N of the same mass. In the absence of a "penalty for missing ions", GG is selected against N in the presence of any peak (even a very weak random noise peak) supporting the position of the first G. Our results imply that such a rule leads to many wrong GG-abundant predictions, since our learning procedure implies that a weak peak after the first G is, in fact, a vote against GG. The correct rule is to vote for GG if it is supported by a b or y-type ion. This rule is automatically enforced by our "premium for present ions, penalty for missing ions" scoring. The same concepts extend to ambiguities between AG and GA vs. K or Q (all of mass 128).

For the sake of simplicity we assume that all partial peptides $P_i$ and $P_i^-$ are equally likely and ignore the intensities of peaks for now. We discretize the space of all masses in the interval from 0 to the parent mass m(P) = M, denote T = {0,..., M}, and represent the spectrum as an M-mer vector $S = \{s_1,\dots,s_M\}$ such that $s_t$ is the indicator of presence/absence of peaks at position t ($s_t = 1$ if there is a peak at position t and $s_t = 0$ otherwise). For a given peptide P and position t, $s_t$ is a 0-1 random variable with probability distribution $p(P,s_t)$. For a given P the probabilities $p(P,s_t)$ are independent and

$$p(P,S) = \prod_{t=1}^{M} p(P,s_t).$$
Let $T_i = \{t_{i1},\dots,t_{ik}\}$ be the set of positions that represent $\Delta$-ions of a partial peptide $P_i$, where $\Delta = \{\delta_1,\dots,\delta_k\}$. Let $R = T \setminus \bigcup_i T_i$ be the set of positions that are not associated with any partial peptides. The probability distribution $p(P,s_t)$ depends on whether $t \in T_i$ or $t \in R$. For a position $t = t_{ij} \in T_i$ the probability $p(P,s_t)$ is given by

$$p(P,s_t) = \begin{cases} q_j, & \text{if } s_t = 1 \text{ (i.e. a peak is generated at position } t\text{)} \\ 1 - q_j, & \text{otherwise} \end{cases} \qquad (1)$$

Similarly, for $t \in R$ the probability $p(P,s_t)$ is given by

$$p_R(P,s_t) = \begin{cases} q_R, & \text{if } s_t = 1 \text{ (i.e. there is random noise at position } t\text{)} \\ 1 - q_R, & \text{otherwise} \end{cases} \qquad (2)$$

and the overall probability of "noisy" peaks in the spectrum can be estimated as
$\prod_{t \in R} p_R(P,s_t)$. Let $p(P_i,S) = \prod_{t \in T_i} p(P,s_t)$ be the probability that a peptide P (through its partial peptides $P_i$ and $P_i^-$) produces a given spectrum at positions from the set $T_i$ (all other positions ignored). For the sake of simplicity, assume that each peak of the spectrum belongs to only one set $T_i$ and that all positions are independent. Then

$$p(P,S) = \left( \prod_i p(P_i,S) \right) \prod_{t \in R} p_R(P,s_t).$$

We also assume that all positions from R have the same probability distribution $p_R(t)$ independent of P. For a given spectrum S the value $\prod_{t \in T} p_R(P,s_t)$ does not depend on P, and the maximization of p(P,S) is the same as the maximization of

$$\frac{p(P,S)}{\prod_{t \in T} p_R(P,s_t)} = \prod_i \prod_{t \in T_i} \frac{p(P,s_t)}{p_R(P,s_t)}.$$
In logarithmic scale the above formula together with (1) and (2) implies the additive "premium for present ions, penalty for missing ions" scoring of vertices in the spectrum graph.
Although we explain our approach in terms of probability, the calculations are done in logarithmic scale to avoid dealing with very small numbers that may lead to a loss of precision. Up to this point we have ignored the intensities of the peaks, and the scoring described above assigns the same score to low intensity and high intensity peaks.
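The additive scoring in logarithmic scale can be sketched as below. The function names and the numeric probabilities used in the usage note are hypothetical and serve only to illustrate the "premium for present ions, penalty for missing ions" rule; peak intensities are ignored, as in the discussion so far.

```python
import math


def peak_probability(p_ion, q_r):
    """q_i = p(delta_i) + (1 - p(delta_i)) * q_R for one ion type."""
    return p_ion + (1.0 - p_ion) * q_r


def vertex_log_score(q, present, q_r):
    """Additive log score of one vertex.

    q: probabilities q_i, one per ion type.
    present: booleans, True where the corresponding ion has a peak.
    q_r: probability of a random noise peak at any position.
    """
    score = 0.0
    for q_i, seen in zip(q, present):
        if seen:
            score += math.log(q_i / q_r)                  # premium for a present ion
        else:
            score += math.log((1.0 - q_i) / (1.0 - q_r))  # penalty for a missing ion
    return score
```

With hypothetical values q = [0.55] and q_R = 0.1, a present ion contributes log(5.5) > 0 while a missing ion contributes log(0.5) < 0, so a vertex supported only by weak evidence can receive a negative score.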
To incorporate the intensities into scoring we assume that the intensity for ion type $\delta_i$ is distributed according to an empirical distribution $I_{\delta_i}(\cdot)$ and modify formulas (1) and (2) accordingly.

The protein sequencing algorithm involves the generation of the weighted spectrum graph (as described above) and the search for the highest scoring paths in the spectrum graph.
After the weighted spectrum graph is constructed we cast the peptide sequencing problem as the longest path problem in a directed acyclic graph. This problem is solved by a fast linear-time dynamic programming algorithm with running time O(E), where E is the number of edges in the spectrum graph. For a typical spectrum the algorithm is very fast, thus giving the spectrum graph approach an advantage over the global approaches.
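A minimal sketch of this dynamic programming step is given below. The data layout (vertices identified by their masses, a dictionary of successor lists, and a dictionary of vertex scores) and the function name are assumptions made for the example. Because every edge leads from a smaller mass to a larger mass, sorting the vertices by mass yields a topological order, after which each edge is examined once.

```python
def longest_path(edges, score, source, sink):
    """Highest-scoring source-to-sink path in a DAG whose vertices are masses.

    edges: dict mapping a vertex u to an iterable of successors v with v > u.
    score: dict mapping every vertex to its score.
    Returns (best_score, path) or None if the sink is unreachable.
    """
    order = sorted(score)  # increasing mass is a topological order
    best = {v: float("-inf") for v in order}
    prev = {}
    best[source] = score[source]
    for u in order:
        if best[u] == float("-inf"):
            continue  # u cannot be reached from the source
        for v in edges.get(u, ()):
            candidate = best[u] + score[v]
            if candidate > best[v]:
                best[v] = candidate
                prev[v] = u
    if best[sink] == float("-inf"):
        return None
    path = [sink]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return best[sink], path[::-1]
```

The initial sort costs O(V log V); the relaxation loop itself is linear in the number of edges.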
Unfortunately, this simple algorithm does not quite work in practice. The problem is that every peak in the spectrum may be interpreted either as an N-terminal ion or as a C-terminal ion. Therefore, every "real" vertex (corresponding to a mass m) has a "fake" twin vertex (corresponding to a mass m(P)-m-offset). Moreover, if the real vertex has a high score then its fake twin also has a high score. The longest path in the spectrum graph then tends to include both the real vertex and its fake twin since they both have high scores. Such paths do not correspond to feasible protein reconstructions and should be avoided. However, the known longest path algorithms cannot avoid such paths, since they cannot check back on whether one of the twins was already included in the growing path. This problem was overlooked in the previous work on de novo protein reconstruction.
Therefore, the simple reduction of tandem mass-spectrometry peptide sequencing to the longest path problem described earlier is inadequate. We now describe the anti-symmetric longest path problem, which adequately models peptide sequence reconstruction.
Let G be a graph and let T be a set of forbidden pairs of vertices of G (twins). A path in G is called anti-symmetric if it contains at most one vertex from every forbidden pair. The anti-symmetric longest path problem is to find a longest anti-symmetric path in G with a set of forbidden pairs T.
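To make the definition concrete, the following sketch searches exhaustively for the best anti-symmetric path in a small graph. It is a brute-force illustration with exponential running time, not the efficient algorithm developed in this description; the function name and data layout are assumptions, and each vertex is assumed to belong to at most one forbidden pair.

```python
def best_antisymmetric_path(edges, score, source, sink, forbidden_pairs):
    """Exhaustive search over source-to-sink paths that contain at most one
    vertex of every forbidden pair. Returns (best_score, path); the path is
    None if no anti-symmetric path exists."""
    twin = {}
    for x, y in forbidden_pairs:
        twin[x] = y
        twin[y] = x

    best = (float("-inf"), None)

    def extend(path, used, total):
        nonlocal best
        node = path[-1]
        if node == sink:
            if total > best[0]:
                best = (total, list(path))
            return
        for nxt in edges.get(node, ()):
            if nxt in twin and twin[nxt] in used:
                continue  # the twin of nxt is already on the path
            path.append(nxt)
            used.add(nxt)
            extend(path, used, total + score[nxt])
            used.remove(nxt)
            path.pop()

    extend([source], {source}, score[source])
    return best
```

On a graph where two high-scoring twins lie on the same unconstrained longest path, the search returns a lower-scoring path that uses only one of them.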
The intrinsic property of the conventional longest path algorithms is that they use only neighbors of a given vertex while computing the longest path ending in this vertex. Since vertices in a forbidden pair are not necessarily neighbors, these algorithms cannot be adjusted to find anti-symmetric longest paths. The anti-symmetric longest path problem is NP-hard, indicating that efficient algorithms for solving this problem in general are unlikely.
This negative result does not yet imply that it is futile to look for an efficient algorithm for tandem mass-spectrometry peptide sequencing, since this problem has a special structure of forbidden pairs that leads to an efficient algorithm for finding anti-symmetric longest paths. Below we show that this is exactly the case and design an efficient algorithm for the tandem mass-spectrometry problem.
Vertices in the spectrum graph are numbers that correspond to masses of potential partial peptides. Two forbidden pairs of vertices $(x_1, y_1)$ and $(x_2, y_2)$ are non-interleaving if the intervals $(x_1, y_1)$ and $(x_2, y_2)$ do not interleave, i.e. one of them is contained inside the other. A graph G with a set of forbidden pairs is called proper if every two forbidden pairs of vertices are non-interleaving.
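The definition of a proper set of forbidden pairs can be checked directly, as in the sketch below. The function name is an assumption; each forbidden pair is treated as the interval between its two vertex masses.

```python
from itertools import combinations


def is_proper(forbidden_pairs):
    """True if every two forbidden pairs are non-interleaving, i.e. one of the
    two intervals is contained inside the other."""
    intervals = [(min(x, y), max(x, y)) for x, y in forbidden_pairs]
    for (a1, b1), (a2, b2) in combinations(intervals, 2):
        first_inside_second = a2 <= a1 and b1 <= b2
        second_inside_first = a1 <= a2 and b2 <= b1
        if not (first_inside_second or second_inside_first):
            return False
    return True
```

Pairs of twins that are placed roughly symmetrically around a common midpoint, such as (10, 90), (30, 70) and (45, 55), are nested and therefore proper.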
The tandem mass-spectrometry peptide sequencing problem corresponds to the anti-symmetric longest path problem in a proper graph. We submit that there exists an efficient algorithm for the anti-symmetric longest path problem in a proper graph.
We assume that there are no two vertices u and v in the spectrum graph G such that w(u)+w(v)=w(P); if this happens we shift one of the vertices by a "microscopic" distance ε. We say that edge e={u,v} "covers" vertex x when w(u) < w(P)-w(x) < w(v).
We define the "combined graph" C(G) as a graph whose paths correspond to paths in the spectrum graph that are folded in the middle. The vertices of the combined graph are pairs (e,x) such that edge e covers vertex x. There are two distinguished vertices in the combined graph. An initial vertex corresponds to the pair $(v_{\text{initial}}, v_{\text{final}})$ and a final vertex $(v_{P/2}, v_{P/2})$ corresponds to a folding point of the spectrum graph.
Two vertices $(e_1 = \{u_1,v_1\}, x_1)$ and $(e_2 = \{u_2,v_2\}, x_2)$ are connected by a (directed) edge when $x_1 = u_2$ and $x_2 = v_1$, or when $e_1 = e_2$ and there is an edge $x_1x_2$ in the spectrum graph G. The rules for the initial and final vertices of the combined graph are slightly different. There is an edge from $(v_{\text{initial}}, v_{\text{final}})$ to $(\{u,v\},x)$ when $u = v_{\text{initial}}$ and there is an edge $x\,v_{\text{final}}$ in G, or when $v = v_{\text{final}}$ and $v_{\text{initial}}\,x \in G$. Vertex $(\{u,v\},x)$ is connected with the final vertex of the combined graph C(G) whenever x=u or x=v. The major property of the combined graph we use in our algorithm is that forbidden pairs will get close to each other.
The following establishes the locality of forbidden pairs. Let the maximal distance m between offsets from Δ be smaller than the weight of the smallest amino acid. If $x_1, x_2$ is a forbidden pair and p is a path in C(G) from $(e_1,x_1)$ to $(e_2,x_2)$, then p consists of one edge. A proof follows. Every path p with length more than 1 contains an edge of the spectrum graph, therefore the distance between $x_1$ and $x_2$ is more than the weight of an amino acid. Therefore $x_1$ and $x_2$ cannot be generated from the same peak of the spectrum and the pair $(x_1,x_2)$ is not a forbidden pair.
The algorithm for creating a graph without forbidden pairs follows:
• generate spectrum graph G
• generate combined graph C(G)
• for every forbidden pair $x_1, x_2$ remove edges $(e_1,x_1) \to (e_2,x_2)$ from C(G)
• find the longest path p from the initial to the final vertex in C(G)
• recover the longest path without forbidden pairs in G from p.
Although the proof of this theorem is complicated, the resulting algorithm is rather fast and practical. Also, sometimes we can obtain a reasonable solution by searching for paths in the opposite direction, starting from the vertex $v_{\text{final}}$ and ending in vertex $v_{\text{initial}}$.

To make the spectrum graph approach work, all vertices that correspond to ion types of a partial peptide $P_i$ have to be merged into a single vertex corresponding to $P_i$. Since ε = 0.45, the distance between $s_i + \delta_i$ and $s_j + \delta_j$ is bounded by 0.45 + 0.45 = 0.9. This rather large error range presents a serious problem for merging vertices in the spectrum graph. We use a greedy algorithm to merge vertices. At every step we find the closest vertices, u (generated from peak s) and v (generated from peak t), and merge them. The weight of the new vertex is the weighted average (i(s)u + i(t)v)/(i(s) + i(t)) of the weights of u and v. We repeat merging until all vertices are at least ε apart for a given precision ε. Note that in the later stages of this merging algorithm we might merge vertices that were themselves created by merging; in such a case the new weight of the vertex is the weighted average of three (or even more) weights of original vertices.

The greedy algorithm for merging provides satisfying results for most spectra. However, there are cases when the algorithm does not merge vertices related to the same partial peptide or merges vertices that are not associated with the same partial peptide. The doubly charged ions frequently cause problems since their error range is actually twice as large as the error ranges of singly charged ion types. Unfortunately, the greedy merging algorithm described above allows only a uniform error range.
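The greedy merging step can be sketched as follows. The sketch assumes that the weights i(s) and i(t) in the weighted average are peak intensities and that a merged vertex carries the summed intensity of its parts, so that repeated merges yield the weighted average of all original vertices; the function name and the default ε = 0.5 are likewise assumptions made for the example.

```python
def greedy_merge(vertices, eps=0.5):
    """Greedily merge the closest pair of vertices until all remaining
    vertices are at least eps apart.

    vertices: list of (weight, intensity) pairs with positive intensities,
              one per candidate vertex s + delta.
    Returns the merged (weight, total_intensity) pairs sorted by weight.
    """
    merged = sorted(vertices)
    while len(merged) > 1:
        # In a list sorted by weight the closest pair is always adjacent.
        gap, i = min(
            (merged[j + 1][0] - merged[j][0], j) for j in range(len(merged) - 1)
        )
        if gap >= eps:
            break
        (u, iu), (v, iv) = merged[i], merged[i + 1]
        # Intensity-weighted average; the result lies between u and v,
        # so the list stays sorted after the replacement.
        merged[i:i + 2] = [((iu * u + iv * v) / (iu + iv), iu + iv)]
    return merged
```

As the text notes, a single uniform eps is a limitation of this scheme: ion types with a wider error range would need a separate pass with a different tolerance.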
When different error ranges are needed we can proceed in a hierarchical manner. Instead of generating all vertices at once and merging them afterwards, we first generate only the vertices corresponding to the most significant ion types and merge those vertices using the greedy merging algorithm. In the next step we generate the vertices for the third most significant ion type and then merge the new vertices with the old ones. We continue until all vertices are generated and merged. Analysis of histograms of the frequencies of offsets between the most frequent ion types leads to the conclusion that the error range in vertex merging can be chosen as 0.5 rather than 0.9 as one would expect (data are not shown).
Whenever the distance between two vertices u and v in the spectrum graph is equal to the mass of an amino acid a, we connect u and v with an edge and label it a. In the last sections we redefined vertices and allowed their weights to be non-integer. In a more realistic approach, to join vertices u and v we require that the mass of an amino acid a is approximately equal to the distance between the two vertices, i.e. -ε < |v-u| - m(a) < ε for error range ε. To determine the appropriate value for ε we check the peaks (say s and t) corresponding to the same type of ions of partial peptides $P_i$, $P_{i+1}$ (say a is the last amino acid of $P_{i+1}$ not present in $P_i$). Analysis of the histograms of offsets |t-s|-m(a) for all such pairs of peaks s and t implies that ε=0.5 is an appropriate choice for the error range in defining edges of the spectrum graph (data are not shown).
We have observed that, when creating the spectrum graph, it sometimes happens that due to the merging procedure the weights of appropriate vertices are off by more than ε = 0.5 even when there are corresponding peaks whose difference is within 0.5 of the amino acid mass. Since such vertices are not connected by an edge, we are at risk of losing important edges in the spectrum graph. To avoid this we introduce bridge edges in the spectrum graph. We connect two vertices u and v either by a (regular) edge with label a if -ε < |v-u|-m(a) < ε, or by a bridge edge if there are peaks s, t ∈ S and an ion type δ ∈ Δ such that -ε < |s-t|-m(a) < ε, vertex s+δ was merged into u and vertex t+δ was merged into v.
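The two edge rules can be sketched together. The data layout (each merged vertex remembers the (peak, ion type) pairs that were merged into it), the function name, and the three-entry residue table of standard monoisotopic masses are assumptions made for the example.

```python
# Monoisotopic residue masses in Da; an illustrative subset only.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203}


def labeled_edges(vertices, eps=0.5):
    """Build regular and bridge edges between merged vertices.

    vertices: dict mapping a vertex weight to the list of (peak, ion_type)
              pairs that were merged into that vertex.
    Returns (u, v, amino_acid, kind) tuples with kind "regular" or "bridge".
    """
    weights = sorted(vertices)
    edges = []
    for i, u in enumerate(weights):
        for v in weights[i + 1:]:
            for aa, mass in RESIDUE_MASS.items():
                if abs((v - u) - mass) < eps:
                    # Regular edge: the vertex weights themselves match.
                    edges.append((u, v, aa, "regular"))
                elif any(
                    ion_s == ion_t and abs(abs(s - t) - mass) < eps
                    for s, ion_s in vertices[u]
                    for t, ion_t in vertices[v]
                ):
                    # Bridge edge: two peaks of the same ion type match even
                    # though merging moved the vertex weights too far apart.
                    edges.append((u, v, aa, "bridge"))
    return edges
```

In the example used for testing this sketch, vertices at 100.4 and 158.0 are 57.6 apart, which is outside the tolerance for G, but their underlying b-type peaks at 101.0 and 158.1 differ by 57.1, so a bridge edge labeled G is created.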
A peak of a spectrum is actually a mass/charge (m/z) ratio of the corresponding ion. Up to this point we worked as if z = 1 and assumed that m/z of the peak is the same as the mass of the corresponding ion. However, some mass spectrometers are capable of producing ions with charge 2 or even more; in this case the observed mass is half (a third, ...) of the ion's actual mass.
We analyze doubly charged ions in the same manner as we did ordinary ions by treating them as a "new" ion type. We investigate the offset frequency function H+2(x,S) where offsets are given by $m(P_i) - 2s_j$. The analysis of the corresponding offset frequency function demonstrates that the only two significant multiple-charged ion types are y+2 and y+2 - H2O.
We use simple alignment of spectra to compute parent masses. If $S = \{s_1,\dots,s_m\}$ is the spectrum of a peptide P, then the reflection of S is a spectrum $\bar S = \{\bar s_1,\dots,\bar s_m\}$ such that $\bar s_i = m(P) - s_i - d$, where d = m(y-ion) - m(b-ion) is the difference of offsets of y-ions and b-ions. Note that if a spectrum S contains a peak s that corresponds to a b-ion of a partial peptide $P_i$ and a peak t that corresponds to a y-ion of $P_i^-$, then $\bar s = t$ and therefore spectra S and $\bar S$ have a common element. For the correct m(P) we should see good alignment between peaks corresponding to b-ions in S and peaks corresponding to y-ions in $\bar S$ (and vice versa because of symmetry).
We use this observation to devise an algorithm for computing the parent mass. For a spectrum $S = \{s_1,\dots,s_m\}$ and a number x we define $\bar S(x) = \{\bar s_1,\dots,\bar s_m\}$ where $\bar s_i = x - s_i - d$. Spectra S and $\bar S(x)$ may have some peaks in common just by chance; for a "random" mass x the number of peaks in common is approximately 0.5 (an estimate whose denominator is $d^2(S) = 2601$, for thresholded spectra with d(S) = 51). It implies that two random spectra have approximately 0.5 peaks in common. However, for x = m(P) spectra S and $\bar S(x)$ tend to have more peaks in common due to the alignment between b-ions and y-ions. Since the condition that both $P_i$ and $P_i^-$ ions are present in the spectra is satisfied in 45% of cases (the average number of aligned peaks is 6.4), we are able to devise the following combinatorial approach to estimate m(P).
Let $c(S, \bar S(x))$ be the number of pairs of peaks $s_i \in S$ and $\bar s_j \in \bar S(x)$ such that $|s_i - \bar s_j| < \varepsilon$, where ε is a given precision. The value of x that maximizes $c(S, \bar S(x))$ would then be an appropriate choice for the parent mass. Should there be many choices for x, we can select the one that minimizes the sum of distances $|s_i - \bar s_j|$ of the aligned peaks $s_i \in S$ and $\bar s_j \in \bar S(x)$.
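A minimal sketch of this estimate is shown below. It assumes that a list of candidate parent masses is supplied (for example, a grid around the measured value) and that the reflection is $\bar s_i = x - s_i - d$; the function names are hypothetical, and the value of d used in the usage note is an arbitrary number chosen only to build a consistent toy spectrum.

```python
def count_aligned(spectrum, x, d, eps=0.5):
    """Return (number of aligned peak pairs, sum of their distances) between
    the spectrum and its reflection about the candidate parent mass x."""
    reflected = [x - s - d for s in spectrum]
    distances = [
        abs(s - r) for s in spectrum for r in reflected if abs(s - r) < eps
    ]
    return len(distances), sum(distances)


def estimate_parent_mass(spectrum, candidates, d, eps=0.5):
    """Pick the candidate x that maximizes the number of aligned peaks,
    breaking ties by the smallest total distance of the aligned peaks."""
    best_key, best_x = None, None
    for x in candidates:
        count, total = count_aligned(spectrum, x, d, eps)
        key = (-count, total)
        if best_key is None or key < best_key:
            best_key, best_x = key, x
    return best_x
```

For a toy spectrum built from peaks 100, 200 and 300 together with their partners 398, 298 and 198 (parent mass 500 and d = 2), the candidate 500 aligns all six peaks while the neighbouring candidates 499.5 and 500.5 align none.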
This approach significantly improves the accuracy of the parent mass determination. It can similarly be used to correct a mis-assignment of the parent mass/charge value resulting from an incorrect charge assignment.

Claims

What is claimed:
1. A method for generating a partial amino acid sequence for a fragmented peptide using mass spectroscopy, the method comprising the steps of: producing a mass spectrum for said fragmented peptide and transforming said mass spectrum into a spectrum graph whereby each peak in the mass spectrum is represented in said spectrum graph as a plurality of peaks which are offset by values related to a family of possible peptide ion types, and generating said partial amino acid sequence by deriving the longest path in said spectrum graph that does not include vertices for both a N-terminus and C-terminus fragment ion type representing a single peak in said mass spectrum.
2. A method for determining the precursor mass/charge of a fragmented peptide from a mass spectrum of said fragmented peptide, comprising: a) reflecting the mass spectrum about the axis of a proposed precursor mass/charge, taking into account the mass/charge offset between a pair of symmetric N-terminal and C-terminal fragment ion types; and b) aligning the original mass spectrum and the reflected mass spectrum while varying the mass/charge offset necessary to optimize alignment; and c) adjusting a proposed precursor mass/charge by the mass/charge offset to provide optimal alignment of the original and reflected mass spectra.
3. A method for generating a partial amino acid sequence for a fragmented peptide using mass spectroscopy, the method comprising the steps of: producing a mass spectrum for said fragmented peptide and transforming said mass spectrum into a spectrum graph whereby each peak in the mass spectrum is represented in said spectrum graph as a plurality of peaks which are offset by values related to a family of possible peptide ion types, generate a combined graph from said spectrum graph, removing each edge from said combined graph representing an edge between a forbidden pair, finding the longest path in said combined graph without forbidden pairs, generating said partial amino acid sequence from said longest path in said combined graph without forbidden pairs.
PCT/US1999/012221 1998-06-03 1999-06-02 Protein sequencing using tandem mass spectroscopy WO1999062930A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU42284/99A AU4228499A (en) 1998-06-03 1999-06-02 Protein sequencing using tandem mass spectroscopy

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US8778598P 1998-06-03 1998-06-03
US60/087,785 1998-06-03

Publications (1)

Publication Number Publication Date
WO1999062930A2 true WO1999062930A2 (en) 1999-12-09

Family

ID=22207246

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1999/012221 WO1999062930A2 (en) 1998-06-03 1999-06-02 Protein sequencing using tandem mass spectroscopy

Country Status (2)

Country Link
AU (1) AU4228499A (en)
WO (1) WO1999062930A2 (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7158862B2 (en) * 2000-06-12 2007-01-02 The Arizona Board Of Regents On Behalf Of The University Of Arizona Method and system for mining mass spectral data
WO2002021139A3 (en) * 2000-09-08 2003-02-06 Oxford Glycosciences Uk Ltd Automated identification of peptides
WO2002021139A2 (en) * 2000-09-08 2002-03-14 Oxford Glycosciences (Uk) Ltd. Automated identification of peptides
US6963807B2 (en) 2000-09-08 2005-11-08 Oxford Glycosciences (Uk) Ltd. Automated identification of peptides
EP1366360A2 (en) * 2001-03-09 2003-12-03 Applera Corporation Methods for large scale protein matching
EP1366360A4 (en) * 2001-03-09 2005-03-16 Applera Corp Methods for large scale protein matching
US6800449B1 (en) 2001-07-13 2004-10-05 Syngenta Participations Ag High throughput functional proteomics
WO2003046577A1 (en) * 2001-11-30 2003-06-05 The European Molecular Biology Laboratory A system and method for automatic protein sequencing by mass spectrometry
WO2003075306A1 (en) * 2002-03-01 2003-09-12 Applera Corporation Method for protein identification using mass spectrometry data
WO2003098190A2 (en) * 2002-05-20 2003-11-27 Purdue Research Foundation Protein identification from protein product ion spectra
WO2003098190A3 (en) * 2002-05-20 2004-07-15 Purdue Research Foundation Protein identification from protein product ion spectra
WO2004008371A1 (en) * 2002-07-10 2004-01-22 Institut Suisse De Bioinformatique Peptide and protein identification method
WO2004083233A3 (en) * 2003-02-10 2004-12-29 Battelle Memorial Institute Peptide identification
WO2004083233A2 (en) * 2003-02-10 2004-09-30 Battelle Memorial Institute Peptide identification
US7979214B2 (en) 2003-02-10 2011-07-12 Battelle Memorial Institute Peptide identification
DE10323917A1 (en) * 2003-05-23 2004-12-16 Protagen Ag Process and system for elucidating the primary structure of biopolymers
DE102011014805A1 (en) * 2011-03-18 2012-09-20 Friedrich-Schiller-Universität Jena Method for identifying in particular unknown substances by mass spectrometry

Also Published As

Publication number Publication date
AU4228499A (en) 1999-12-20

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AU CA JP

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

121 Ep: the epo has been informed by wipo that ep was designated in this application
WA Withdrawal of international application
122 Ep: pct application non-entry in european phase