WO2000073787A1 - An expert system for protein identification using mass spectrometric information combined with database searching - Google Patents

An expert system for protein identification using mass spectrometric information combined with database searching

Info

Publication number
WO2000073787A1
WO2000073787A1 PCT/US2000/014809 US0014809W WO0073787A1 WO 2000073787 A1 WO2000073787 A1 WO 2000073787A1 US 0014809 W US0014809 W US 0014809W WO 0073787 A1 WO0073787 A1 WO 0073787A1
Authority
WO
WIPO (PCT)
Prior art keywords
mass
experimental
biological molecule
theoretical
data
Application number
PCT/US2000/014809
Other languages
French (fr)
Inventor
Wenzhu Zhang
Brian T. Chait
David FENYÖ
Chao Tang
Original Assignee
Rockefeller University
Proteometrics, Llc
Application filed by Rockefeller University and Proteometrics, Llc
Publication of WO2000073787A1

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48 Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/68 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving proteins, peptides or amino acids
    • G01N33/6803 General methods of protein analysis not limited to specific proteins or families of proteins
    • G01N33/6848 Methods of protein analysis involving mass spectrometry
    • G01N33/6818 Sequencing of polypeptides

Definitions

  • the rapid expansion of protein and DNA sequence databases together with technological improvements in biological mass spectrometry has made the combination of mass spectrometric peptide mapping with database searching (Henzel et al., 1993; Yates et al., 1993; Mann et al., 1993; James et al., 1993; Pappin et al., 1993) a superb method for rapid protein identification.
  • the method (Fig. 1) involves cleavage of proteins with an enzyme having high specificity (usually trypsin), whereupon the resulting proteolytic products are subjected to analysis by either matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS) or electrospray ionization mass spectrometry (ESI-MS).
  • the masses determined for the proteolytic peptides are compared with masses calculated for theoretically possible enzymatic cleavage products for every sequence in a protein or DNA sequence database.
  • the protein is identified based on an evaluation of this comparison.
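  • The theoretical side of this comparison can be sketched as an in-silico digest followed by a mass calculation. The following is a minimal sketch, not the patent's implementation: the function names are hypothetical, the cleavage rule (after K or R, but not before P) is the commonly used trypsin rule, missed cleavages and modifications are ignored, and the residue masses are standard monoisotopic values.

```python
# Standard monoisotopic residue masses in Da; one water is added per peptide.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_digest(sequence):
    """Hypothetical helper: split after K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        is_last = i == len(sequence) - 1
        if residue in "KR" and not is_last and sequence[i + 1] != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    # Whatever remains after the last cleavage site is the C-terminal peptide.
    peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


# Theoretical mass list for one (made-up) database sequence.
theoretical = [peptide_mass(p) for p in tryptic_digest("MKWVTFRPLLK")]
```

  • Repeating this for every database sequence gives the theoretical mass lists against which the measured peptide masses are compared.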
  • This peptide mapping method for protein identification is fast because the mass spectra are rapidly collected (on the order of 1 min per spectrum for MALDI-time-of-flight analysis) and because the analysis can be performed on the same time-scale.
  • the method is relatively insensitive to unspecified modifications and/or sequence errors in the database because high confidence identifications can be made even when the mapping experiment yields information on only a small percentage of the sequence.
  • Identification of proteins by the above-described approach requires a scheme for determining the best match between the experimental data and a sequence in the database.
  • Existing schemes for determining the best match include ranking by number of matches (Henzel et al., 1993; Yates et al., 1993; Mann et al., 1993; James et al., 1993) and a scoring system based on the observed frequency of peptides from all proteins in a database in a given molecular weight range (the so-called "MOWSE score" (Pappin et al., 1993)).
  • the object of the present invention is to provide a more accurate method to identify biological molecules.
  • the present invention provides a system for identifying biological molecules, for example proteins, using MS peptide mapping data.
  • the system makes use of a Bayesian algorithm that takes into account individual properties of each protein in the database as well as other information relevant to the experiment.
  • Bayesian probability theory has been widely used to make scientific inference from incomplete information in various disciplines, including biopolymer sequence alignment (Liu & Lawrence, 1999), NMR spectral analysis (Bretthorst, 1988) and radar target identification (Bretthorst, 1996).
  • Bayesian probability theory is applied to make logical inference about the identity of an unknown protein sample against a protein sequence database.
  • the probability for the sample protein to be a specific protein in the database is calculated using the MS data as well as other background information such as protein mass range, species from which the protein originated, mass accuracy, enzyme cleavage chemistry, protein sequence, previous experiments on the sample protein, etc.
  • FIGURES. Figure 1 Flow chart showing protein identification by database searching in conjunction with a mass spectrometric peptide mapping experiment.
  • FIG. 2 Delayed-extraction reflectron MALDI-TOF spectrum of an in-gel tryptic digest of a 30 kDa SDS-PAGE protein band. ProFound determined that the band was a single protein: RPS7A (40s ribosomal protein S7A). Trypsin self-digestion products are labeled with 'Trypsin'. The labeled masses are monoisotopic masses. Two peaks labeled with asterisks (*) have masses 16.0 Da higher than the adjacent peaks (see discussion on the use of tag information section).
  • Figure 3 Normalized probability distribution for top 20 protein candidates using data shown in Figure 2.
  • FIG. 5 Delayed-extraction reflectron MALDI-TOF spectrum of an in-gel tryptic digest of a SDS-PAGE protein band. ProFound determined that the band was a mixture, identifying two protein components: YLR409c and YDLO ⁇ Ow. Trypsin self-digestion products are labeled with 'Trypsin'. The labeled masses are monoisotopic masses.
  • FIG. 6 Delayed-extraction reflectron MALDI-TOF spectrum of an in-gel tryptic digest of a SDS-PAGE protein band. ProFound determined that the band was a mixture, identifying two protein components: RPS1B and RPS1A. Trypsin self-digestion products are labeled with 'Trypsin'. Both monoisotopic and average (indicated by brackets) masses were simultaneously submitted to ProFound with separately specified mass tolerances.
  • the present invention relates to improving current methods for identifying biological molecules.
  • the invention provides a method for determining the probability that an experimental biological molecule is a particular biological molecule described in a database given certain experimental mass data and background information.
  • Biological molecules include any biological polymer that can be degraded into constituent parts. The degradation is preferably into constituent parts at predictable positions to form predictable masses. Examples of biological molecules include proteins, nucleic acid molecules, polysaccharides and carbohydrates.
  • An experimental biological molecule is a biological molecule which is to be identified; the experimental biological molecule can also be referred to as an unknown biological molecule.
  • a theoretical biological molecule is a known biological molecule described in a database.
  • Proteins are polymers of amino acids. Constituent parts of proteins comprise amino acids.
  • a protein typically contains at least about ten amino acids, preferably at least fifty amino acids and more preferably at least 100 amino acids.
  • Nucleic acids are polymers of nucleotides. Constituent parts of nucleic acids comprise nucleotides. Typically, a nucleic acid contains at least 100 nucleotides, preferably at least 500 nucleotides.
  • Polysaccharides are polymers of monosaccharides. Constituent parts of polysaccharides comprise one or more monosaccharides. Typically, a polysaccharide contains at least five monosaccharides, preferably at least ten monosaccharides.
  • Mass data of biological molecules are quantifiable information about the masses of the constituent parts of the biological molecule.
  • Mass data include individual mass spectra and groups of mass spectra.
  • the mass spectra can be in the form of peptide maps, oligonucleotide maps or oligosaccharide maps.
  • the method of the present invention includes generating experimental mass data (D) for the experimental biological molecule within a certain mass range.
  • D includes the measured masses and standard deviations, σ, associated with the measured masses.
  • the method also includes generating theoretical mass data in the same mass range.
  • the experimental mass data (D) is a subset of the experimental mass data (D).
  • mass data for proteins can be generated in any manner that provides mass data within a certain accuracy. Examples include matrix-assisted laser desorption/ionization mass spectrometry, electrospray ionization mass spectrometry, chromatography and electrophoresis. Mass data can also be generated by a general purpose computer configured by software or otherwise.
  • the mass data, for example a peptide mass m, is determined to a relative accuracy Δm/m, preferably within 10,000 ppm, more preferably within 100 ppm and most preferably within 30 ppm.
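  • A relative accuracy quoted in ppm can be checked with a few lines of arithmetic. This is a small illustrative sketch; the function names are hypothetical and the default tolerance simply reuses the 30 ppm figure quoted above.

```python
def ppm_error(measured_mass, theoretical_mass):
    """Relative mass error in parts per million (signed)."""
    return (measured_mass - theoretical_mass) / theoretical_mass * 1e6


def within_tolerance(measured_mass, theoretical_mass, tolerance_ppm=30.0):
    """True if the measured mass lies within +/- tolerance_ppm of the theoretical mass."""
    return abs(ppm_error(measured_mass, theoretical_mass)) <= tolerance_ppm
```

  • For a peptide of mass 1000 Da, 30 ppm corresponds to an absolute window of about 0.03 Da, whereas 10,000 ppm corresponds to 10 Da.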
  • a step in generating mass data of a biological molecule may include first cleaving the biological molecule into constituent parts.
  • Biological molecules may be cleaved by methods known in the art.
  • the biological molecules are cleaved into constituent parts at predictable positions to form predictable masses.
  • Methods of cleaving include chemical degradation of the biological molecules.
  • Biological molecules may be degraded by contacting the biological molecule with any chemical substance.
  • proteins may be predictably degraded into peptides by means of cyanogen bromide and enzymes, such as trypsin, endoproteinase Asp-N, V8 protease, endoproteinase Arg-C, etc.
  • Nucleic acids may be predictably degraded into constituent parts by means of restriction endonucleases, such as EcoRI, SmaI, BamHI, HincII, etc.
  • Polysaccharides may be degraded into constituent parts by means of enzymes, such as maltase, amylase, alpha-mannosidase, etc.
  • a mass range (m_min, m_max) is determined for the experimental mass data.
  • the mass range can be any mass range of the mass data.
  • the mass range is the minimum and maximum measured masses of the experimental biological molecule mass data.
  • a biological molecule database is any compilation of information about characteristics of biological molecules. Databases are the preferred method for storing both polypeptide amino acid sequences and the nucleic acid sequences that code for these polypeptides. The databases come in a variety of different types that have advantages and disadvantages when viewed as the hypothesis for a polypeptide identification experiment.
  • while a database entry for an amino acid sequence may appear to be a simple text file to a user browsing for a particular polypeptide, many databases are organized into very flexible, complicated structures.
  • the detailed implementation of the database on a particular system may be based on a collection of simple text files (a "flat-file" database), a collection of tables (a "relational" database), or it may be organized around concepts that stem from the idea of a protein, gene, or organism (an "object-oriented" database).
  • Protein mass data may be predicted from nucleic acid sequence databases.
  • protein mass data may be obtained directly from protein sequence databases, which contain a collection of amino acid sequences represented by a string of single-letter or three-letter codes for the residues in a polypeptide, starting at the N-terminus of the sequence. These codes may contain nonstandard characters to indicate ambiguity at a particular site (such as "B", indicating that the residue may be "D" (aspartic acid) or "N" (asparagine)).
  • the sequences typically have a unique number-letter combination associated with them that is used internally by the database to identify the sequence, usually referred to as the accession number for the sequence.
  • Databases may contain a combination of amino acid sequences, comments, literature references, and notes on known posttranslational modifications to the sequence.
  • a database that contains these elements is referred to as "annotated.”
  • Annotated databases are used if some functional or structural information is known about the mature protein, as opposed to a sequence that is known only from the translation of a stretch of nucleic acid sequence.
  • Non-annotated databases only contain the sequence, an accession number, and a descriptive title.
  • the background information known about an experimental biological molecule by which the data base search can be constrained can include any information. Some examples of background information include information about the species of the experimental biological molecule, knowledge or an assumption about the mass of the experimental biological molecule and the isoelectric point of the experimental biological molecule.
  • the observed molecular mass or the observed isoelectric point of a protein can be used in combination with the measured masses of peptides generated by proteolysis to constrain the search for a polypeptide.
  • the comparison between the theoretical mass data of the database proteins and the mass data of the unknown protein may be constrained to only those proteins of the database which are within a chosen mass range.
  • the chosen mass range is preferably within 50% of the mass of the unknown protein, more preferably within 35%, most preferably within 25%.
  • the comparison between the theoretical mass data of the database proteins and the mass data of the unknown protein may be constrained to only those proteins of the database which are within a chosen isoelectric point range.
  • the isoelectric point (pI) of a protein is the pH at which its net charge is zero.
  • the chosen isoelectric point range is preferably within 50% of the isoelectric point of the unknown protein, more preferably within 35%, most preferably within 25%.
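  • The mass and isoelectric point constraints above amount to a pre-filter on the database. The sketch below is illustrative only: the dictionary keys 'name', 'mass' and 'pI' are hypothetical, and "within 25%" is interpreted as a relative window around the observed value of the unknown protein.

```python
def within_fraction(value, reference, fraction):
    """True if value lies within +/- fraction * reference of reference."""
    return abs(value - reference) <= fraction * reference


def constrain_candidates(candidates, observed_mass=None, observed_pi=None,
                         mass_fraction=0.25, pi_fraction=0.25):
    """Keep database entries compatible with the observed mass and/or pI.

    candidates: iterable of dicts with hypothetical keys 'name', 'mass', 'pI'.
    A constraint is skipped when the corresponding observation is None.
    """
    kept = []
    for entry in candidates:
        if observed_mass is not None and not within_fraction(
                entry["mass"], observed_mass, mass_fraction):
            continue
        if observed_pi is not None and not within_fraction(
                entry["pI"], observed_pi, pi_fraction):
            continue
        kept.append(entry)
    return kept
```

  • Only the entries that survive this filter need to be digested in silico and compared with the measured peptide masses.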
  • fragment mass data for a peptide can be generated in any manner which provides fragment mass data within a certain accuracy.
  • Experimental conditions include the type of energy used to generate the fragment mass data.
  • Vibrational excitation energy can be used.
  • the vibrational excitation may be generated by collisions of the peptide with electrons, photons, gas molecules or a surface.
  • Electronic excitation can be used.
  • the electronic excitation may be generated by collisions of the peptide with electrons, photons, gas molecules (e.g. argon) or a surface.
  • the experimental fragment mass spectrum of a peptide from an enzymatically digested unknown protein is compared with the theoretical masses calculated by applying the rules for the specificity of the enzyme, and the rules for the fragmentation as known to those of ordinary skill in the art, to the amino acid sequence of a database protein.
  • the software tool PepFrag allows for searching protein or nucleotide sequence databases using a combination of mass spectra data and fragmentation mass spectra data.
  • Fragment mass data for the purposes of this invention can be generated by using multidimensional mass spectrometry (MS/MS), also known as tandem mass spectrometry.
  • a number of types of mass spectrometers can be used including a triple-quadrupole mass spectrometer, a Fourier-transform cyclotron resonance mass spectrometer, a tandem time-of-flight mass spectrometer, and a quadrupole ion trap mass spectrometer.
  • a single peptide from a protein digest is subjected to MS/MS measurement and the observed pattern of fragment ions is compared to the patterns of fragment ions predicted from database sequences.
  • the present invention provides a method to determine the probability that an experimental biological molecule is a biological molecule k described in a database given experimental mass data D and background information I.
  • the probability, P(k|DI), is calculated from the following formula:
  • the difference between each measured mass of the experimental biological molecule and each theoretical mass of the biological molecule in the data base is calculated. If one or more differences are within a mass tolerance for a particular measured mass, the particular measured mass is considered to be one hit; that is, the particular measured mass and the particular theoretical mass are considered to be matching.
  • the total number of hits found for a particular experimental molecule for a particular database molecule is designated as r.
  • Each measured mass associated with a hit is designated as m_i, wherein i is an ordinal number from 1 to r.
  • each theoretical mass associated with the ith hit is designated as m_i0.
  • the difference between each measured mass, m_i, associated with an ith hit and one of the theoretical masses associated with the ith hit is determined. Any one of the theoretical masses, m_i0, associated with the ith hit can be used to determine these differences. For example, the theoretical mass which produces the smallest difference between the measured mass, m_i, and the theoretical mass can be used. Alternately, the average of the theoretical masses associated with the hit can be used to determine these differences.
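  • The hit-counting step can be sketched as follows. This is a minimal illustration, assuming an absolute tolerance in Da (a ppm tolerance would work the same way) and using the "smallest difference" option described above; the function name and the returned tuple layout are hypothetical.

```python
def find_hits(measured, theoretical, tolerance_da):
    """Return one (m_i, m_i0, n_matches) tuple per measured mass that is a hit.

    A measured mass counts as ONE hit if at least one theoretical mass lies
    within the tolerance; m_i0 is taken here as the closest such mass, and
    n_matches records how many theoretical masses fell inside the window.
    """
    hits = []
    for m in measured:
        matches = [t for t in theoretical if abs(m - t) <= tolerance_da]
        if matches:
            closest = min(matches, key=lambda t: abs(m - t))
            hits.append((m, closest, len(matches)))
    return hits


hits = find_hits([1000.02, 1500.0, 2000.5],
                 [1000.0, 1000.05, 1499.8, 2000.45],
                 tolerance_da=0.1)
r = len(hits)  # total number of hits for this database entry
```

  • In this made-up example the first measured mass matches two theoretical masses but still counts as a single hit, and the second measured mass has no match, so r is 2.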
  • N is the quantity of masses in the theoretical mass data and x is a function of the measured mass of the ith hit.
  • P(k|I) is determined from background information (I) known about the experimental biological molecule, prior to consideration of the experimental mass data (D). This background information can be any information about the experimental biological molecule. In one embodiment the P(k|I) is a P(k|DI) obtained from previous experimental data generated for the experimental biological molecule.
  • the formula includes a factor which incorporates a determination of whether the measured mass data contains certain digestion patterns; the factor is designated as F_pattern.
  • the digestion patterns can be any digestion pattern that can be observed for biological molecules. Examples of particular digestion patterns for proteins are described below. If certain patterns occur, the P(k|DI) will increase accordingly.
  • the F_pattern is calculated by taking a number greater than one and less than 1000 to the power of the quantity of occurrences of such patterns. In a preferred embodiment the number is from 1.5 to 10; most preferably the number is 2.5.
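  • Written out, the pattern factor described above takes the following form, where b (the base) and n (the number of pattern occurrences) are symbols introduced here for illustration only:

```latex
F_{\mathrm{pattern}} = b^{\,n}, \qquad 1 < b < 1000 .
% Worked example with the most preferred base b = 2.5 and n = 2 occurrences:
% F_pattern = 2.5^2 = 6.25
```

  • With no observed patterns (n = 0) the factor is 1 and leaves the probability unchanged.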
  • the above formula can further include information regarding each theoretical mass associated with an ith hit.
  • the number of theoretical masses within the mass tolerance for each measured mass, m_i, is counted; and the total number of such theoretical masses is designated as g_i for a particular m_i.
  • the difference between each measured mass, m_i, associated with an ith hit and each theoretical mass, m_ij0, associated with the hit, is determined, wherein j is an ordinal number from 1 to g_i.
  • the P(k|DI) is then calculated from the following formula:
  • f(m_i) is a normalized distribution of theoretical masses of the database, and c is in the range of 0.1 to 100.
  • the above probability formulae can further include a function of j, designated as y_j, incorporated into the formulae as follows:
  • y_j can be any function of j.
  • y_j can be defined as (W^(j-1))^(-1), wherein W is a constant equal to or greater than one. In a preferred embodiment W is four.
  • the probability formulae of the present invention can be used to identify components of mixtures of biological molecules.
  • a database can be extended to contain entries which are additive combinations of the single proteins of a database.
  • a P(k|DI) is assigned to each theoretical protein in the database for data generated from a particular experimental protein; and the theoretical proteins which have the highest P(k|DI) are chosen.
  • the highest P(k|DI) can be from the top 50% of the database proteins to the top 0.01% of the database proteins. From these chosen proteins a new database is formed which contains additive combinations of the chosen proteins. Additive combinations are database proteins added together in various combinations. P(k|DI) calculations are performed using this new database.
  • the present invention provides a means for determining the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental mass data (D) and background information (I).
  • the means is any means by which the probability can be determined.
  • the means includes a computer or mass spectra, as would be recognized by a person skilled in the art.
  • a means for generating experimental mass data (D) for the experimental biological molecule, wherein D comprises measured masses and standard deviations, σ, associated with the measured masses; a means for determining a mass range (m_min, m_max) for the experimental mass data; a means for generating theoretical mass data for the biological molecule k within the mass range (m_min, m_max); a means for counting the number of masses, N, in the theoretical mass data; a means for calculating the difference between each measured mass and each theoretical mass, wherein if one or more differences are within a mass tolerance for a particular measured mass, the particular measured mass is considered to be one hit; a means for designating each measured mass associated with a hit as m_i, wherein i is an ordinal number from 1 to r, wherein r is the total number of hits for a particular biological molecule; a means for determining the difference between each measured mass, m_i, associated with an ith hit and one of the theoretical masses, m_i0,
  • x is a function of the measured mass of the ith hit
  • P(k|DI) is the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental data (D) and background information (I).
  • the means for determining the probability that an experimental biological molecule is a biological molecule further includes a means for counting the number of theoretical masses within the mass tolerance for each measured mass, m_i, wherein the total number of such theoretical masses is designated as g_i for a particular m_i; a means for determining the difference between each measured mass, m_i, associated with an ith hit and each theoretical mass, m_ij0, associated with the hit, wherein j is an ordinal number from 1 to g_i; and a means for calculating P(k|DI) from the following formula:
  • P(k|DI) is the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental data (D) and background information (I).
  • the present invention provides a computer program product including a computer usable medium having computer readable program code means embodied in said medium for determining the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental mass data (D) and background information (I).
  • the computer program product includes computer readable program code means for causing a computer to generate experimental mass data (D) for the experimental biological molecule, wherein D comprises measured masses and standard deviations, σ, associated with the measured masses; computer readable program code means for causing a computer to determine a mass range (m_min, m_max) for the experimental mass data; computer readable program code means for causing a computer to generate theoretical mass data for the biological molecule k within the mass range (m_min, m_max); computer readable program code means for causing a computer to count the number of masses, N, in the theoretical mass data; computer readable program code means for causing a computer to calculate the difference between each measured mass and each theoretical mass, wherein if one or more differences are within a mass tolerance for a particular measured mass, the particular measured mass is considered to be one hit; computer readable program code means for causing a computer to designate each measured mass associated with a hit as m_i, wherein i is an ordinal number from 1 to r, wherein r
  • x is a function of the measured mass of the ith hit
  • P(k|DI) is the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental data (D) and background information (I).
  • the invention provides a computer program product which includes a computer usable medium having computer readable program code means embodied in said medium for determining the probability that an experimental biological molecule is a biological molecule (k) further including computer readable program code means for causing a computer to count the number of theoretical masses within the mass tolerance for each measured mass, m_i, wherein the total number of such theoretical masses is designated as g_i for a particular m_i; computer readable program code means for causing a computer to determine the difference between each measured mass, m_i, associated with an ith hit and each theoretical mass, m_ij0, associated with the hit, wherein j is an ordinal number from 1 to g_i; and computer readable program code means for causing a computer to calculate P(k|DI) from the following formula:
  • P(k|DI) is the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental data (D) and background information (I).
  • in a method for determining the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental mass data (D) and background information (I), the probability is calculated from the following formula:
  • the present invention includes a means for determining the P(k|DI) of the above formula.
  • the means is any means by which the probability can be determined.
  • the means includes a computer or mass spectra, as would be recognized by a person skilled in the art.
  • the present invention includes a computer program product for determining the P(k|DI) of the above formula.
  • Every protein is specified by its particular linear sequence of amino acids.
  • One defining signature of a protein is the set of masses of peptide fragments produced by cleavage of the protein by an enzyme of high cleavage specificity.
  • the problem we seek to solve is to use the peptide masses obtained in such a mass spectrometric peptide mapping experiment to identify a protein from a protein sequence database.
  • P(k|I) is the probability for hypothesis k given only the background information, I; N is the theoretical number of peptides generated by fragmentation of protein k by the protease used in the study; r is the number of hits (i.e., the number of matches between the measured and calculated peptide masses); g_i is the multiplicity of the ith hit (i.e., the number of theoretical peptides that match a given experimental peptide mass value); (m_min-m_max) is the range of measured peptide masses; σ is the standard deviation of the mass measurement; m_i is the measured mass of the ith hit; m_i0 is the calculated mass of the ith hit; and F_pattern is an empirical term, which increases the probability when overlapping and/or adjacent peptides are observed (see program description).
  • the ranking of the candidate proteins is based on the values determined for their probability P(k|DI).
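  • Ranking and normalizing candidate scores can be sketched as below. This assumes each candidate already has an unnormalized log-probability (how that score is computed is outside this sketch); the function name is hypothetical, and subtracting the maximum before exponentiating is a standard numerical-stability device, not something stated in the text.

```python
import math


def rank_candidates(log_scores):
    """Normalize unnormalized log-probabilities and sort candidates, best first.

    log_scores: dict mapping a candidate name to its unnormalized log P(k|DI).
    Returns a list of (name, normalized_probability) pairs summing to 1.
    """
    peak = max(log_scores.values())
    # Shift by the maximum so that exp() cannot overflow.
    weights = {name: math.exp(value - peak) for name, value in log_scores.items()}
    total = sum(weights.values())
    return sorted(((name, w / total) for name, w in weights.items()),
                  key=lambda item: item[1], reverse=True)
```

  • The normalized values only express confidence relative to the other candidates that were scored; they do not account for proteins absent from the database.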
  • the Bayesian probability is consistent with common sense. For any given protein k in the database, the probability that protein k is the sample protein increases with increasing number of hits r, increasing mass accuracy (i.e., smaller σ and (m_i-m_i0)), and decreasing number of theoretically digested fragments N. It has been shown previously that tryptic peptides of higher molecular mass occur with lower frequency than do those with lower molecular mass (Pappin et al., 1993; Fenyö et al., 1998), and are therefore more constraining for protein identification. The present algorithm takes into account this different information content of peptides with different masses through the normalization condition given above.
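  • The qualitative behaviour described above can be explored numerically. The probability formula itself is not reproduced in this text, so the sketch below uses an ASSUMED functional form built only from the quantities defined above: a prior, a combinatorial term (N - r)!/N!, a Gaussian mass-error term summed over the matching theoretical masses of each hit, the mass range divided by σ, and the pattern factor. It should be read as one plausible illustration, not as the patent's formula; the function name and argument layout are hypothetical.

```python
import math


def log_probability(prior, n_theoretical, hits, sigma, m_min, m_max,
                    n_patterns=0, pattern_base=2.5):
    """Unnormalized log P(k|DI) under an ASSUMED functional form (see lead-in).

    hits: list of (m_i, [m_i10, m_i20, ...]) pairs, one per measured mass that
          matched at least one theoretical mass; requires len(hits) <= N.
    """
    r = len(hits)
    # (N - r)! / N! : fewer theoretical peptides and more hits raise the score.
    log_p = (math.log(prior)
             + math.lgamma(n_theoretical - r + 1)
             - math.lgamma(n_theoretical + 1))
    for m_i, matched in hits:
        # Gaussian error term summed over all theoretical masses for this hit.
        gauss = sum(math.exp(-((m_i - m_ij0) ** 2) / (2.0 * sigma ** 2))
                    for m_ij0 in matched)
        log_p += math.log(math.sqrt(2.0 / math.pi) * (m_max - m_min) / sigma * gauss)
    # Empirical pattern factor: base raised to the number of observed patterns.
    return log_p + n_patterns * math.log(pattern_base)
```

  • Under this assumed form, adding a well-matched hit, tightening the mass deviations, or reducing N each raises the score, which matches the trends stated above. Working in log space avoids underflow from the factorial ratio for large N.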
  • the Bayesian probability should be viewed as a measure of confidence level of the hypothesis that protein k is the sample protein based on available information. There is no absolute certainty for any given identification, only the probability - i.e., the higher the probability, the higher is the confidence level. In a critical situation (i.e., where a false positive result cannot be tolerated), it may be desirable to check the identification with an independent method such as tandem mass spectrometry (MS/MS) (Qin et al., 1997). Ultimately, the value of any given identification is provided by the outcome of the biological experiment that results from the information.
  • the Bayesian algorithm can be readily extended to identify the components of such mixtures.
  • the protein sequence database is expanded to include entries that are 'fused' combinations of single protein sequences.
  • the program will identify the components of binary mixtures.
  • the entries representing binary mixtures are binary combinations of single proteins (usually, the top 50 hits obtained in a prior search for single proteins).
  • the Bayesian probabilities for these 'fused' proteins are calculated in the same way as for single proteins.
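  • Building the 'fused' entries for binary mixtures can be sketched as follows. This is an illustrative sketch only: it assumes each top-ranked candidate is represented by its list of theoretical peptide masses, and the function name and the "A+B" naming of fused entries are hypothetical.

```python
from itertools import combinations


def fused_entries(top_candidates):
    """Build binary 'fused' entries from the top single-protein candidates.

    top_candidates: dict mapping a protein name to its list of theoretical
    peptide masses (e.g. the top hits of a prior single-protein search).
    Each fused entry simply pools the peptide masses of its two components.
    """
    fused = {}
    for (name_a, masses_a), (name_b, masses_b) in combinations(top_candidates.items(), 2):
        fused[f"{name_a}+{name_b}"] = sorted(masses_a + masses_b)
    return fused
```

  • With 50 single-protein candidates this yields 50 × 49 / 2 = 1225 binary entries, each of which can then be scored like a single protein.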
  • the Bayesian algorithm can incorporate any additional information obtained for measured peptides (Appendix 1).
  • the additional information provides constraint in database searching to reduce the occurrence of database peptides that randomly match the experimental mass spectral data, thereby improving the confidence level for identifications.
  • Fenyo et al. have investigated the value offered by knowledge of the presence (or absence) and number of particular amino acids contained within a given peptide (so-called 'tag information') (Fenyo et al., 1998).
  • tag information can be obtained in a number of ways.
  • cysteine residues can be identified through chemical alkylation of free thiol moieties (Sechi & Chait.,
  • methionine residues can be inferred by observation of pairs of peaks separated by 16 Da (because methionine residues contained in proteolytic peptides are frequently partially oxidized).
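As a hedged sketch of how such pairs might be flagged in a peak list, the following looks for peaks separated by 16 Da. The function name, the example masses, and the 0.1 Da matching window are assumptions chosen for illustration; a pair found this way only suggests, and does not prove, a partially oxidized methionine-containing peptide.

```python
def oxidation_pairs(masses, delta=16.0, tol=0.1):
    """Return (lighter, heavier) peak pairs separated by `delta` Da within `tol` Da."""
    ordered = sorted(masses)
    pairs = []
    for i, low in enumerate(ordered):
        for high in ordered[i + 1:]:
            if abs((high - low) - delta) <= tol:
                pairs.append((low, high))
    return pairs

# Hypothetical monoisotopic peak masses (Da).
peaks = [1045.56, 1061.55, 1300.70, 1479.80, 1495.79]
candidates = oxidation_pairs(peaks)
```

In this made-up list the peaks at 1045.56/1061.55 and 1479.80/1495.79 are reported, and the lighter member of each pair could then be tagged as containing methionine.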
  • the database searched is the NCBI NR non-redundant database (URL http://www.ncbi.nlm.nih.gov/BLAST/blast_databases.html). The presence or absence of signal peptides is considered when such information is available in the corresponding NCBI GenPept format flatfile (URL http://www.ncbi.nlm.nih.gov/Entrez/batch.html). Taxonomy data is derived from the NCBI GenPept flatfiles and Taxonomy databases (URL http://www.ncbi.nlm.nih.gov/Taxonomy/).
  • the mean of the experimental minus calculated masses (for all hits of a given protein in the database) is removed during the probability calculation.
  • An empirical factor has been introduced in the probability calculation to take into account two kinds of commonly observed digestion patterns.
  • the first pattern which we term adjacency, occurs when proteolytic peptides are observed to be adjacent to one another in the protein sequence (see Fig. 4A).
  • the second pattern, which we term common-end overlapping, occurs when the observed peptides have one common terminus but differ at the other terminus by a peptide segment (see Fig. 4A).
  • the probability is increased by a factor of 2.5.
  • the search is performed in two steps. In the first step, the database is searched using the dominant term in the probability equation given above, i.e..
  • NCBI NR non-redundant database is not truly non redundant, even containing protein sequences with trivial differences (highly homologous sequences).
  • a method was devised to detect redundancy based on experimental data and to take such redundancy into account.
  • the two protein sequences are considered homologous.
  • the sequences which do not have the highest likelihood will be removed from the list for normalization calculation.
  • the method of the present invention includes determining whether the data base includes biological molecules which form a homologous set.
  • whether biological molecules in the database are homologous can be determined by various methods.
  • a homologous set of biological molecules are the biological molecules in the database which have the same theoretical masses associated with the hits for an experimental biological molecule, within a certain percentage. This percentage can be any percentage. A preferred percentage is over fifty percent.
  • biological molecules which make up a homologous set of biological molecules are the biological molecules in the database which have the same theoretical masses associated with the hits for an experimental biological molecule, within a certain percentage, and which have the same amino acid sequences associated with the hits for an experimental biological molecule, within a certain percentage. Preferred percentages for both the mass and sequence information are over fifty percent.
  • DI) is calculated for each of the biological molecules in the homologous set; and the highest P(k
  • FIG. 7 shows the program flowchart.
  • Taxonomic category: A representation of a phylogenetic tree is provided through which the user can specify the origin of the sample protein, if known.
  • Search mode: The program can be set to search in either "Single protein only" mode or "Single or binary mixture" mode.
  • Mass range: If known, the approximate protein mass range of the sample protein can be specified.
  • Number of candidate proteins: The number of top candidate proteins in the output display can be specified.
  • Digestion chemistry: The proteolytic enzyme or chemical reagent used to cleave the sample protein(s) must be specified. Current choices are trypsin, endoproteinase Arg-C, endoproteinase Asp-N, endoproteinase Lys-C, V8 protease (cleavage at D and E), V8 protease (cleavage at E), and cyanogen bromide.
  • Maximum number of missed cleavage sites: The maximum number of missed cleavage sites within a peptide (yielding incompletely cleaved peptides) must be specified. Allowed values are in the range 0-4.
  • Peptide masses and mass tolerances: There are three alternative methods for specifying the masses of peptides used to search the database. These are average mass, monoisotopic mass, and a combination of average and monoisotopic masses (this latter alternative is useful when only some of the peaks in the mass spectrum are isotopically resolved).
  • the mass tolerances for average and monoisotopic masses are specified independently (either as an absolute mass or as a relative tolerance). Because there is a 95% probability for Gaussian distributed measurement errors to be within ±2σ (where σ is the standard deviation), the mass tolerance is taken as 2σ. Either neutral or protonated peptide masses can be specified.
  • Amino acid tags: When peptides have been experimentally determined to contain particular amino acids, amino acid tags can be associated with the masses of the peptides.
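The convention of taking the mass tolerance as two standard deviations can be checked numerically. The following is a minimal sketch; the function names are hypothetical and the 0.1 Da tolerance is only an example value.

```python
import math

def gaussian_coverage(k):
    """Probability that a Gaussian error falls within +/- k standard deviations."""
    return math.erf(k / math.sqrt(2.0))

def sigma_from_tolerance(tolerance):
    """Convert a user-specified mass tolerance (taken as 2 sigma) to sigma."""
    return tolerance / 2.0

coverage = gaussian_coverage(2.0)   # close to the 95% figure quoted above
sigma = sigma_from_tolerance(0.1)   # a 0.1 Da tolerance corresponds to sigma = 0.05 Da
```

The coverage at two standard deviations evaluates to slightly more than 95%, which is consistent with the rounded figure used in the text.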
  • the ProFound output consists of a search result page, which is hyperlinked to pages that provide details about the search results.
  • the search result page consists of a list of protein candidates ranked by probability as well as summary of the input data and search parameters.
  • the sequences of the candidate proteins can be retrieved through links and can be further analyzed by sequence analysis tools contained in PROWL, an interactive environment on the World Wide Web for protein MS (Fenyo et al., 1996).
  • Hyperlinked output pages show graphical and text representations of the matched peptides from the protein candidates. For each candidate protein, graphs are provided to allow the user to quickly assess the experimental peptide mass coverage of the protein and the mass measurement errors
  • MALDI-time-of-flight (TOF) MS was carried out using a commercial instrument (Perseptive Biosystems STR, Framingham, MA) operated in the delayed-extraction reflector mode (FWHM resolution ≈ 5000) or an instrument constructed in-house (Beavis & Chait, 1989, 1990) operated in the continuous-extraction linear mode (FWHM resolution ≈ 500).
  • the MALDI-ion trap data was obtained using an instrument constructed in-house and described previously.
  • Figure 2 shows a delayed-extraction reflectron MALDI-TOF spectrum of the mixture of peptides produced by in-gel trypsin digestion of a 30 kDa SDS-PAGE protein band from a Saccharomyces cerevisiae nuclear extract. Thirty-five monoisotopic masses derived from Figure 2 were submitted to ProFound in order to identify the protein. Other search parameters were: S. cerevisiae for the taxonomic category; a protein mass range of 0-3000 kDa; unmodified cysteines; a maximum of 2 missed cleavage sites; and a mass tolerance of 0.1 Da.
  • the specified taxonomic category and protein mass range include the complete set of proteins (or open reading frames (ORFs)) in the S. cerevisiae genome.
  • Table 1 lists the top 4 protein candidates (ranked by normalized probability) found by the search.
  • the top-ranked protein, the ribosomal protein S7A, has a probability of 1 and is readily distinguished from the next ranked candidates, which have probabilities of 2 × 10^-51, 8 × 10^-53 and 5 × 10^-53, respectively.
  • Figures 4A-C show sequence coverage maps and an error map for the top ranked candidate.
  • the segment coverage map (Fig. 4A) (in which a segment represents a peptide resulting from complete digestion of the protein by trypsin) is useful for visualizing digestion patterns indicative of an authentic protein identification. Bona fide identifications are often characterized by the observation of peptides that are adjacent to one another in the sequence and/or that overlap and have a common terminal (while differing by one segment at the other terminal.) Examples of these two commonly observed patterns are shown in Figure 4A. Because the observation of such patterns raises our confidence level that a candidate protein is present in the sample, we have empirically included a term in the ProFound probability calculation to incorporate this information.
  • the sequence residue coverage map (Fig. 4A)
  • FIG. 4B shows the portion of the ribosomal protein S7A sequence that was observed in the MS peptide mapping experiment. Twenty-three measured masses match 24 theoretical tryptic peptide masses from the ribosomal protein S7A, covering 70% of the sequence.
  • the error map (Fig. 4C) provides a scatter plot showing error (i.e., measured mass - calculated mass) versus mass for each match. The scatter plot is useful for visualizing systematic errors in the mass measurement. When the spectral calibration is free of systematic error, the errors for an authentic hit are normally distributed about zero and are independent of mass value, as in Figure 4C.
  • the bottom portion of Figure 4C is a histogram projection of the scatter plot.
  • the MALDI-TOF mass spectrum shown in Figure 5 was obtained from the products of in-gel digestion of a 105 kDa SDS-PAGE protein band. Peptide masses consisting of 47 monoisotopic masses were submitted to ProFound. (Sometimes the resolution or statistics for higher mass peaks are insufficient for unambiguous identification of the monoisotopic component. In such cases, we determine the average mass of the peptide.) Other search parameters were: S. cerevisiae as taxonomic category; protein mass range of 0-3000 kDa; unmodified cysteines; a maximum of 2 missed cleavage sites; mass tolerances of 1 Da for average masses and 0.1 Da for monoisotopic masses.
  • the top two candidates (YLR409c and YDL060w) had probabilities of 0.99 and 0.01, respectively, which were considerably higher than the probabilities for all the rest of the candidate proteins (< 10^-22, with slowly decreasing values).
  • the number of peaks matching theoretical tryptic peptide masses from YLR409c and YDL060w is 18 each (with respective sequence coverages of 24% and 30%), and the two proteins have no sequence homology. Two such proteins with dominating probabilities provide an indication that the sample may be a binary mixture.
  • ProFound was set up to search for a possible binary mixture using the same data set and other search parameters that were used for the 'single protein only' search.
  • Figure 6 shows a MALDI-TOF mass spectrum obtained from in-gel tryptic digestion of a 30 kDa SDS-PAGE protein band. Thirty-six monoisotopic peptide masses and 1 average peptide mass were submitted to ProFound. Other search parameters were: S. cerevisiae as taxonomic category; protein mass range of 0-3000 kDa; unmodified cysteines; a maximum of 2 missed cleavage sites; mass tolerances of 0.5 Da for average and 0.1 Da for monoisotopic masses.
  • the number of peaks that match theoretical tryptic peptide masses from RPS1B and RPS1A are 24 and 23, respectively (with sequence coverages of 24% and 30%).
  • the second possibility is that the sample is a binary protein mixture of two highly similar proteins.
  • ProFound was set up to search for possible binary mixtures with the same data set and other search parameters that were used for the 'single protein only' search.
  • the probabilities for all the other single and binary protein candidates are < 10^-4, with slowly decreasing probability values.
  • the two identified proteins are highly homologous, differing by only 7 amino acids in their 254 amino acid sequences.
  • MS/MS mass spectrometric protein identification
  • Search parameters were: S. cerevisiae as taxonomic category; protein mass range of 0-3000 kDa; a maximum of 4 missed cleavage sites; mass tolerance of 2 Da.
  • Table 4 is a summary of 15 searches using the two independent methods. All the proteins identified with the MS/MS data were confirmed by ProFound using the peptide mapping data, even though the mapping data was of relatively low quality (i.e., resolution 500 FWHM, accuracy ±2 Da). These findings provide independent assurance of the reliability of ProFound for identifying proteins.
  • Improvement of the confidence level of protein identification using tag information: Incorporation of amino acid 'tag information' in the ProFound search can reduce the occurrence of database peptides that randomly match the experimental MS data, thereby improving the confidence level of an identification. For example, we have shown previously that inclusion of information regarding the absence or presence of cysteine residues in tryptic peptides from proteins can significantly improve the confidence level of a protein identification (Sechi & Chait, 1998).
  • Appendix 1 Derivation of the Bayesian probability that protein k is the protein under analysis
  • a peptide mapping experiment involves enzymatic or chemical cleavage of the protein, using cleavage reagents with high specificity for particular amino acids. The resulting mixture of peptide fragments is subjected to mass analysis. Each detected peptide fragment ion appears as a peak in the mass spectrum. The position of the peak along the mass axis provides a measure of the mass of the peptide ion.
  • the current mass spectrometry technology does not provide reliable quantitative information from the height of the various peaks in the spectrum, so that peak intensity information is only used to decide on the presence of a peptide fragment.
  • m_i is a logical notation representing the finding that the mass value for the ith peak is m_i
  • n is the number of peaks within the mass range from m_min to m_max
  • all peptide fragments produced from the protein should be detected.
  • only a subset of the peptide fragments is observed. The reasons for not observing all of the peptides include poor solubility and/or low ionization efficiencies of certain peptides.
  • the posterior probability, P(k|DI), depends on three terms.
  • the first term, P(k|I), is the prior probability for the hypothesis given only the background information.
  • the second term, P(D|kI), is the likelihood probability that the data D would be observed if the hypothesis is true.
  • the third term, P(D|I), is independent of hypothesis k, and is a normalization constant.
  • the posterior probability is proportional to the product of prior probability and likelihood probability
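For reference, the relationship among the posterior, prior, likelihood and normalization terms is Bayes' theorem. Written in the notation of this appendix (k for the hypothesis that protein k is the sample protein, D for the data, I for the background information), it reads as follows. The expansion of the normalization term as a sum over database entries k' is stated here on the assumption that the database hypotheses are mutually exclusive and exhaustive.

```latex
P(k \mid D I) \;=\; \frac{P(k \mid I)\, P(D \mid k I)}{P(D \mid I)},
\qquad
P(D \mid I) \;=\; \sum_{k'} P(k' \mid I)\, P(D \mid k' I)
```

Because the denominator does not depend on k, candidates can be ranked by the product of prior and likelihood alone, and the normalization is applied afterwards to obtain probabilities that sum to one.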
  • H (D-hits/misses) where the superscripts H and M are used to label hits and misses, r is the number of hits, w is the number of misses and the total number of measured masses is r+w.
  • the probability for hits in equation can be factorized as products for individual hits by applying the product rule
  • H is defined as logical one.
  • m_i^H (≡ H_i, m_i) is the logical product of two hypotheses (i.e., the ith hit originates from a particular peptide in the protein k and its measured mass is m_i).
  • P(H_i | k I m^H_{0:i-1}) in equation (hit/mi) is the probability for the ith measured peptide to be a hit, given protein k and i-1 previous hits. Since the number of available peptides for the ith comparison is (N-i+1), the probability for the ith peptide to be a hit is 1/(N-i+1) (using the maximum entropy principle [8]), where N is the total number of theoretical peptides.
  • misses are results of either error in protein sequence, unknown modification of the protein, or unexpected cleavage of the protein.
  • the measured mass alone does not provide information on the identity of the peptide within the protein.
  • the probability for misses depends on the number of 'modified peptides' J, which is between w and N-r (w is the number of misses, r is the number of observed hits, and N is the total number of peptides).
  • the probability for all misses can be expressed as Pfm ⁇ l r+ , M ⁇ kIm, l l )
  • m r ' is defined as a logical one.
  • P(J | k I m^H_{1:r}) in equation (likelihood-misses-1) is the probability for there being J modified peptides, given protein k and r observed hits. The probability is assigned by applying the maximum entropy principle,
  • P(M_{r+j} | k I m^H_{1:r} J m^M_{r+1:r+j-1}) in equation (likelihood-misses-1) is the probability for observing a modified peptide, given protein k, J modified peptides, and r hits plus j-1 misses being observed already. Since the number of available peptides is N-(r+j-1), and the number of remaining unobserved modified peptides is J-(j-1), the probability for observing a modified peptide is assigned, by applying the maximum entropy principle, as follows:
  • P(m_{r+j} | k I m^H_{1:r} J m^M_{r+1:r+j-1} M_{r+j}) in equation (likelihood-misses-1) is the likelihood probability for the modified peptide to have a measured mass m_{r+j}. Since m_{r+j} is a miss and is always within the range of the mass measurement (i.e., between the minimum mass m_min and the maximum mass m_max), using the principle of maximum entropy the probability is assigned
  • the normalized posterior probability is obtained by applying the normalization condition of
  • rand exp pep ' possible peptides encompassing 95% of all amino acid compositions ( ⁇ pep for 2000 Da
  • Appendix 3 Pfm r+I I kl m, f m rr / , M M ⁇ in equation (likelihood-misses— 1) is the likelihood probability for the modified peptide to have a measured mass m ⁇ .
  • W-- + is a miss and the likelihood for it to occur should be determined by the mass distribution of the theoretical peptides of proteins in the database. Therefore the probability is assigned using the normalized mass
  • the normalized mass distribution can be derived from the statistical frequency distribution of the theoretical peptide masses of proteins in the database.
  • the probability can be factored into form of Pfm ⁇ I kl m, /'/ m rr+J , M M ⁇ fl / fm ⁇ - jffm ⁇ -m ffm ⁇ ] where 11 f ⁇ n ⁇ mrt _-m m ⁇ ) corresponds to a uniform distribution.
  • the theoretical peptides of a given protein k can fall into a exclusive multi-subsets, where the probability for the occurrence of peptides in different subsets can be different and can be determined, for instance, empirically.
  • the exclusive multi-subsets would respectively correspond to the number of missed cleavage sites as 1, 2, ..., a.
  • Another example is in ms/ms fragmentation experiment, where the theoretical fragment ions can be classified according to different ion types (b, y, a, a , c", etc) to form exclusive multi-subsets. Designating the subsets as S, ...J_, it can be shown that the experimental data D can further be expressed as
  • ⁇ YI that the t+s)th. hit is originated from a particular peptide (in protein k) and the peptide belongs to the ⁇ th subset and it's measured mass is m
  • m H ⁇ Vq ⁇ m. HS[ m. HSq and is m ⁇ ⁇ defined as logical one. 0:t+s- ⁇ — m l: * • j t+ ⁇ -J+s- ⁇ 0:0 &
  • P(S I kl m 0l+s _ l ) in the above equation is the probability to observe a peptide which originated from the ⁇ th subset given condition and it is p
  • kl ) is the probability for the (t+sjtSa hit to be originated from a particular peptide in the ⁇ th subset given all hits corresponding to subsets S, ... S , and previous /th to ⁇ s-1)t hits, which are identified to be originated from respectively particular s-1 peptides in the q ⁇ i subset. Since the number of available peptides in the ⁇ th subset for the hit is N -s+ 1, the probabihty is assigned as
  • P(m I kl m ⁇ X q S H ) ls die probability for measured mass value to be m t+s given that the (t+s)t hit is originated from a particular peptide whose mass is known to be m ( )0 The probability is assigned as
  • log(p1) - log(p2) is the logarithm difference of probabilities between first and second candidate proteins.
  • p1 is determined as sum of probabilities for them and p2 is the probability for the following candidate.

Abstract

A method for determining the probability that an experimental biological molecule is a biological molecule described in a database given experimental mass data and background information.

Description

AN EXPERT SYSTEM FOR PROTEIN IDENTIFICATION USING MASS SPECTROMETRIC INFORMATION COMBINED WITH DATABASE SEARCHING
This application asserts priority of provisional application 60/136,267, filed on May 27, 1999, the contents of which are incorporated herein by reference.
BACKGROUND OF THE INVENTION
The rapid expansion of protein and DNA sequence databases together with technological improvements in biological mass spectrometry (MS) has made the combination of mass spectrometric peptide mapping with database searching (Henzel et al., 1993; Yates et al., 1993; Mann et al., 1993; James et al., 1993; Pappin et al., 1993) a superb method for rapid protein identification. The method (Fig. 1) involves cleavage of proteins with an enzyme having high specificity (usually trypsin), whereupon the resulting proteolytic products are subjected to analysis by either matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS) or electrospray ionization mass spectrometry (ESI-MS). Using a computer algorithm, the masses determined for the proteolytic peptides are compared with masses calculated for theoretically possible enzymatic cleavage products for every sequence in a protein/DNA sequence database. The protein is identified based on an evaluation of this comparison. This peptide mapping method for protein identification is fast because the mass spectra are rapidly collected (<1 min per spectrum for MALDI-time-of-flight analysis) and because the analysis can be performed on the same time-scale. The method is relatively insensitive to unspecified modifications and/or sequence errors in the database because high confidence identifications can be made even when the mapping experiment yields information on only a small percentage of the sequence.
Identification of proteins by the above-described approach requires a scheme for determining the best match between the experimental data and a sequence in the database. Existing schemes for determining the best match include ranking by number of matches (Henzel et al., 1993; Yates et al, 1993; Mann et al., 1993; James et al., 1993) and a scoring system based on the observed frequency of peptides from all proteins in a database in a given molecular weight range (the so-called "MOWSE score" (Pappin et al., 1993)). When the mass spectral data is incomplete (i.e., only a few peaks in the spectrum) and/or of low mass accuracy, the "number-of-matches" approach may be inadequate to make a useful identification. Although the "MOWSE" scoring scheme is superior to the "number-of-matches" approach, it does not take into account the individual properties of any given protein.
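For orientation, the simple "number-of-matches" ranking mentioned above can be sketched as follows. This is a baseline illustration only and is not the Bayesian algorithm of the invention; the function names, the protein names and all masses are hypothetical.

```python
def count_hits(measured, theoretical, tolerance):
    """Count measured masses lying within `tolerance` (Da) of any theoretical mass."""
    return sum(
        any(abs(m - t) <= tolerance for t in theoretical)
        for m in measured
    )

def rank_by_hits(measured, database, tolerance=0.1):
    """Rank database entries by number of matched peptide masses, best first."""
    scores = {name: count_hits(measured, masses, tolerance)
              for name, masses in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical theoretical peptide masses (Da) for two database entries.
database = {
    "protein_A": [842.51, 1045.56, 1479.80, 2211.10],
    "protein_B": [900.40, 1045.60, 1800.95],
}
measured = [842.50, 1045.57, 1479.82, 1650.00]
ranking = rank_by_hits(measured, database)
```

A ranking of this kind treats every match as equally informative, which is the limitation that motivates weighting matches by mass accuracy and by the properties of each individual protein.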
The object of the present invention is to provide a more accurate method to identify biological molecules.
SUMMARY OF THE INVENTION
The present invention provides a system for identifying biological molecules, for example proteins, using MS peptide mapping data. The system makes use of a Bayesian algorithm that takes into account individual properties of each protein in the database as well as other information relevant to the experiment. Bayesian probability theory has been widely used to make scientific inference from incomplete information in various disciplines, including biopolymer sequence alignment (Liu & Lawrence, 1999), NMR spectral analysis (Bretthorst, 1988) and radar target identification (Bretthorst, 1996). Here, Bayesian probability theory is applied to make logical inference about the identity of an unknown protein sample against a protein sequence database. The probability for the sample protein to be a specific protein in the database is calculated using the MS data as well as other background information such as protein mass range, species from which the protein originated, mass accuracy, enzyme cleavage chemistry, protein sequence, previous experiments on the sample protein, etc.
DESCRIPTION OF THE FIGURES
Figure 1 Flow chart showing protein identification by database searching in conjunction with a mass spectrometric peptide mapping experiment.
Figure 2 Delayed-extraction reflectron MALDI-TOF spectrum of an in-gel tryptic digest of a 30 kDa SDS-PAGE protein band. ProFound determined that the band was a single protein: RPS7A (40s ribosomal protein S7A). Trypsin self-digestion products are labeled with 'Trypsin'. The labeled masses are monoisotopic masses. Two peaks labeled with asterisks (*) have masses 16.0 Da higher than the adjacent peaks (see discussion on the use of tag information section).
Figure 3 Normalized probability distribution for top 20 protein candidates using data shown in Figure 2.
Figure 4 Sequence coverage map and error map (A-C) (D)
Figure 5 Delayed-extraction reflectron MALDI-TOF spectrum of an in-gel tryptic digest of a SDS-PAGE protein band. ProFound determined that the band was a mixture, identifying two protein components: YLR409c and YDL060w. Trypsin self-digestion products are labeled with 'Trypsin'. The labeled masses are monoisotopic masses.
Figure 6. Delayed-extraction reflectron MALDI-TOF spectrum of an in-gel tryptic digest of a SDS-PAGE protein band. ProFound determined that the band was a mixture, identifying two protein components: RPS1B and RPS1A. Trypsin self-digestion products are labeled with 'Trypsin'. Both monoisotopic and average (indicated by brackets) masses were simultaneously submitted to ProFound with separately specified mass tolerances.
Figure 7. Program flowchart.
DETAILED DESCRIPTION OF THE INVENTION
The present invention relates to improving current methods for identifying biological molecules. In one embodiment the invention provides a method for determining the probability that an experimental biological molecule is a particular biological molecule described in a database given certain experimental mass data and background information. Biological molecules include any biological polymer that can be degraded into constituent parts. The degradation is preferably into constituent parts at predictable positions to form predictable masses. Examples of biological molecules include proteins, nucleic acid molecules, polysaccharides and carbohydrates.
An experimental biological molecule is a biological molecule which is to be identified; the experimental biological molecule can also be referred to as an unknown biological molecule. A theoretical biological molecule is a known biological molecule described in a database.
Proteins are polymers of amino acids. Constituent parts of proteins comprise amino acids. A protein typically contains approximately at least ten amino acids, preferably at least fifty amino acids and more preferably at least 100 amino acids.
Nucleic acids are polymers of nucleotides. Constituent parts of nucleic acids comprise nucleotides. Typically, a nucleic acid contains at least 100 nucleotides, preferably at least 500 nucleotides.
Polysaccharides are polymers of monosaccharides. Constituent parts of polysaccharides comprise one or more monosaccharides. Typically, a polysaccharide contains at least five monosaccharides, preferably at least ten monosaccharides.
Mass data of biological molecules are quantifiable information about the masses of the constituent parts of the biological molecule. Mass data include individual mass spectra and groups of mass spectra. The mass spectra can be in the form of peptide maps, oligonucleotide maps or oligosaccharide maps.
The method of the present invention includes generating experimental mass data (D) for the experimental biological molecule within a certain mass range. D includes the measured masses and standard deviations, σ, associated with the measured masses. The method also includes generating theoretical mass data in the same mass range. In one embodiment the experimental mass data (D) is a subset of the experimental mass data (D).
For example, mass data for proteins can be generated in any manner that provides mass data within a certain accuracy. Examples include matrix-assisted laser desorption/ionization mass spectrometry, electrospray ionization mass spectrometry, chromatography and electrophoresis. Mass data can also be generated by a general purpose computer configured by software or otherwise.
For the purposes of the present invention the mass data, for example a peptide mass, m_i, is determined to an accuracy ±Δm_i, with Δm_i/m_i preferably <10,000 ppm, more preferably <100 ppm and most preferably <30 ppm.
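As a brief illustration of relative mass accuracy expressed in parts per million, the following sketch converts a mass difference to ppm and tests it against a limit. The function names and the example masses are hypothetical.

```python
def ppm_error(measured, calculated):
    """Relative mass error in parts per million."""
    return (measured - calculated) / calculated * 1e6

def within_ppm(measured, calculated, limit_ppm):
    """True when the absolute relative error does not exceed `limit_ppm`."""
    return abs(ppm_error(measured, calculated)) <= limit_ppm

# A 0.02 Da error on a peptide of about 1480 Da.
error = ppm_error(1479.82, 1479.80)
```

In this example the error is between 13 and 14 ppm, so the measurement would satisfy the 30 ppm level but not a 10 ppm level.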
A step in generating mass data of a biological molecule may include first cleaving the biological molecule into constituent parts. Biological molecules may be cleaved by methods known in the art. Preferably, the biological molecules are cleaved into constituent parts at predictable positions to form predictable masses. Methods of cleaving include chemical degradation of the biological molecules. Biological molecules may be degraded by contacting the biological molecule with any chemical substance.
For example, proteins may be predictably degraded into peptides by means of cyanogen bromide and enzymes, such as trypsin, endoproteinase Asp-N, V8 protease, endoproteinase Arg-C, etc. Nucleic acids may be predictably degraded into constituent parts by means of restriction endonucleases, such as EcoRI, SmaI, BamHI, HincII, etc. Polysaccharides may be degraded into constituent parts by means of enzymes, such as maltase, amylase, alpha-mannosidase, etc.
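Predictable cleavage is what allows theoretical peptides to be generated from a database sequence. A minimal sketch for trypsin follows, assuming the commonly used rule of cleavage on the C-terminal side of lysine (K) or arginine (R) except when the next residue is proline (P). That rule, the function name and the example sequence are assumptions of this sketch and are not specified in the present text. The code relies on zero-width splitting in `re.split`, which requires Python 3.7 or later.

```python
import re

def tryptic_peptides(sequence, max_missed=2):
    """Digest `sequence` in silico: cut after K or R unless followed by P.

    Returns peptides containing 0..max_missed missed cleavage sites.
    """
    # Zero-width split points: after K/R, not before P; drop empty pieces.
    segments = [s for s in re.split(r"(?<=[KR])(?!P)", sequence) if s]
    peptides = []
    for start in range(len(segments)):
        for missed in range(max_missed + 1):
            end = start + missed + 1
            if end > len(segments):
                break
            peptides.append("".join(segments[start:end]))
    return peptides

# Hypothetical sequence; the R-P bond is not cleaved under the assumed rule.
peptides = tryptic_peptides("MKTAYRPLGKAAR", max_missed=1)
```

Complete digestion of the example gives the segments MK, TAYRPLGK and AAR; allowing one missed cleavage adds the two peptides that span adjacent segments.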
In the present invention a mass range (m_min, m_max) is determined for the experimental mass data. The mass range can be any mass range of the mass data. In one embodiment the mass range is the minimum and maximum measured masses of the experimental biological molecule mass data.
A biological molecule database is any compilation of information about characteristics of biological molecules. Databases are the preferred method for storing both polypeptide amino acid sequences and the nucleic acid sequences that code for these polypeptides. The databases come in a variety of different types that have advantages and disadvantages when viewed as the hypothesis for a polypeptide identification experiment.
While the "database entry" for an amino acid sequence may appear to be a simple text file to a user browsing for a particular polypeptide, many databases are organized into very flexible, complicated structures. The detailed implementation of the database on a particular system may be based on a collection of simple text files (a "flat-file" database), a collection of tables (a "relational" database), or it may be organized around concepts that stem from the idea of a protein, gene, or organism (an "object-oriented" database).
Protein mass data may be predicted from nucleic acid sequence databases. Alternatively, protein mass data may be obtained directly from protein sequence databases which contain a collection of amino acid sequences represented by a string of single-letter or three-letter codes for the residues in a polypeptide, starting at the N-terminus of the sequence. These codes may contain nonstandard characters to indicate ambiguity at a particular site (such as "B" indicating that the residue may be "D" (aspartic acid) or "N" (asparagine)). The sequences typically have a unique number-letter combination associated with them that is used internally by the database to identify the sequence, usually referred to as the accession number for the sequence.
Databases may contain a combination of amino acid sequences, comments, literature references, and notes on known posttranslational modifications to the sequence. A database that contains these elements is referred to as "annotated." Annotated databases are used if some functional or structural information is known about the mature protein, as opposed to a sequence that is known only from the translation of a stretch of nucleic acid sequence. Non-annotated databases only contain the sequence, an accession number, and a descriptive title. The background information known about an experimental biological molecule by which the data base search can be constrained can include any information. Some examples of background information include information about the species of the experimental biological molecule, knowledge or an assumption about the mass of the experimental biological molecule and the isoelectric point of the experimental biological molecule.
For example, the observed molecular mass or the observed isoelectric point of a protein can be used in combination with the measured masses of peptides generated by proteolysis to constrain the search for a polypeptide. In particular, the comparison between the theoretical mass data of the database proteins and the mass data of the unknown protein may be constrained to only those proteins of the database which are within a chosen mass range. The chosen mass range is preferably within 50% of the mass of the unknown protein, more preferably within 35%, most preferably within 25%. Similarly, the comparison between the theoretical mass data of the database proteins and the mass data of the unknown protein may be constrained to only those proteins of the database which are within a chosen isoelectric point range. The isoelectric point (pI) of a protein is the pH at which its net charge is zero. The chosen isoelectric point range is preferably within 50% of the isoelectric point of the unknown protein, more preferably within 35%, most preferably within 25%.
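The mass and isoelectric point constraints described above can be sketched as a simple pre-filter. This is a minimal illustration, not the patented implementation: the `Candidate` record, the function names, and the default window of 25% are assumptions chosen to mirror the "most preferably within 25%" wording.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    # Hypothetical record for one database protein.
    accession: str   # database accession number
    mass_da: float   # theoretical protein mass in Da
    pi: float        # theoretical isoelectric point

def within_fraction(value, reference, fraction):
    """Return True if value lies within +/- fraction of reference."""
    return abs(value - reference) <= fraction * abs(reference)

def constrain(candidates, observed_mass_da=None, observed_pi=None, fraction=0.25):
    """Keep only the candidates inside the chosen mass and pI windows.

    A constraint is skipped when the corresponding observation is unknown.
    """
    kept = []
    for c in candidates:
        if observed_mass_da is not None and not within_fraction(
            c.mass_da, observed_mass_da, fraction
        ):
            continue
        if observed_pi is not None and not within_fraction(c.pi, observed_pi, fraction):
            continue
        kept.append(c)
    return kept
```

Only the proteins that survive this filter would then be compared against the experimental peptide masses.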
Optionally, further information about the experimental biological molecule, such as a protein's sequence, is obtained by generating fragment mass data of the experimental and theoretical biological molecules. Fragment mass data for a peptide can be generated in any manner which provides fragment mass data within a certain accuracy. Experimental conditions include the type of energy used to generate the fragment mass data. Vibrational excitation energy can be used. The vibrational excitation may be generated by collisions of the peptide with electrons, photons, gas molecules or a surface. Electronic excitation can be used. The electronic excitation may be generated by collisions of the peptide with electrons, photons, gas molecules (e.g. argon) or a surface.
In another example, the experimental fragment mass spectrum of a peptide from an enzymatically digested unknown protein is compared with the theoretical masses calculated by applying the rules for the specificity of the enzyme, and the rules for the fragmentation as known to those of ordinary skill in the art, to the amino acid sequence of a database protein. For example, the software tool PepFrag (ProteoMetrics) allows for searching protein or nucleotide sequence databases using a combination of mass spectra data and fragmentation mass spectra data.
Fragment mass data for the purposes of this invention can be generated by using multidimensional mass spectrometry (MS/MS), also known as tandem mass spectrometry. A number of types of mass spectrometers can be used, including a triple-quadrupole mass spectrometer, a Fourier-transform cyclotron resonance mass spectrometer, a tandem time-of-flight mass spectrometer, and a quadrupole ion trap mass spectrometer. A single peptide from a protein digest is subjected to MS/MS measurement and the observed pattern of fragment ions is compared to the patterns of fragment ions predicted from database sequences.
In one embodiment the present invention provides a method to determine the probability that an experimental biological molecule is a biological molecule k described in a database given experimental mass data D and background information I. In one embodiment the probability, P(k|DI), is calculated from the following formula:
Figure imgf000009_0001
The difference between each measured mass of the experimental biological molecule and each theoretical mass of the biological molecule in the database is calculated. If one or more differences are within a mass tolerance for a particular measured mass, the particular measured mass is considered to be one hit; that is, the particular measured mass and the particular theoretical mass are considered to be matching. The total number of hits found for a particular experimental molecule for a particular database molecule is designated as r. Each measured mass associated with a hit is designated as m_i, wherein i is an ordinal number from 1 to r. The theoretical mass associated with the ith hit is designated as m_i0.
There can be more than one theoretical mass associated with the ith hit. The difference between each measured mass, m_i, associated with an ith hit and one of the theoretical masses associated with the ith hit is determined. Any one of the theoretical masses, m_i0, associated with the ith hit can be used to determine these differences. For example, the theoretical mass which produces the smallest difference between the measured mass, m_i, and the theoretical mass can be used. Alternatively, the average of the theoretical masses associated with the hit can be used to determine these differences.
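The hit-counting rule above can be sketched as follows. This is an illustrative sketch only: the function name and the tuple layout are assumptions, and it uses the "smallest difference" option to choose m_i0 when several theoretical masses fall within the tolerance.

```python
def find_hits(measured, theoretical, tolerance):
    """Return one (m_i, m_i0, multiplicity) tuple per measured mass that is a hit.

    A measured mass is a hit when at least one theoretical mass lies within
    `tolerance` of it. m_i0 is taken as the nearest theoretical mass, and the
    multiplicity is the number of theoretical masses within the tolerance.
    """
    hits = []
    for m in measured:
        matches = [t for t in theoretical if abs(m - t) <= tolerance]
        if matches:
            nearest = min(matches, key=lambda t: abs(m - t))
            hits.append((m, nearest, len(matches)))
    return hits

# The number of hits r is simply len(find_hits(...)).
```

Each measured mass contributes at most one hit, no matter how many theoretical masses it matches, which is why the multiplicity is tracked separately.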
N is the quantity of masses in the theoretical mass data and x is a function of the measured mass of the ith hit.
P(k|I) is determined from background information (I) known about the experimental biological molecule, prior to consideration of the experimental mass data (D). This background information can be any information about the experimental biological molecule. In one embodiment the P(k|I) is a P(k|DI) obtained from previous experimental data generated for the experimental biological molecule.
The formula includes a factor which incorporates a determination of whether the measured mass data contains certain digestion patterns; the factor is designated as F_pattern. The digestion patterns can be any digestion pattern that can be observed for biological molecules. Examples of particular digestion patterns for proteins are described below. If certain patterns occur, the P(k|DI) will increase accordingly. In one embodiment, F_pattern is calculated by taking a number greater than one and less than 1000 to the power of the quantity of occurrences of such patterns. In a preferred embodiment the number is from 1.5 to 10; most preferably the number is 2.5.
In one embodiment the above formula can further include information regarding each theoretical mass associated with an ith hit. In this embodiment the number of theoretical masses within the mass tolerance for each measured mass, m_i, is counted; and the total number of such theoretical masses is designated as g_i for a particular m_i. The difference between each measured mass, m_i, associated with an ith hit and each theoretical mass, m_ij0, associated with the hit, is determined, wherein j is an ordinal number from 1 to g_i. The P(k|DI) is then calculated from the following formula:
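$$P(k \mid DI) \propto P(k \mid I)\,\frac{\cdots}{N!}\,\cdots\sum_{j=1}^{g_i}\exp\!\left[-\frac{(m_i - m_{ij0})^2}{2\sigma^2}\right] F_{pattern}$$
(Only the terms shown are legible in the source; the ellipses mark factors that could not be recovered.)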
In one embodiment in the above formulae x is defined as
Figure imgf000011_0001
wherein f(m_i) is a normalized distribution of theoretical masses of the database and wherein c is in the range of 0.1 to 100. A more detailed explanation of the normalized mass distribution f(m) can be found in Appendix 3.
In one embodiment c is
Figure imgf000011_0002
In one embodiment the above probability formulae can further include a function of j, designated as y_j, incorporated into the formulae as follows:
Figure imgf000012_0001
The function y_j can be any function of j. In a preferred embodiment y_j can be defined as (W^(j-1))^(-1), wherein W is a constant equal to or greater than one. In a preferred embodiment W is four.
All of the above probability formulae can be normalized by the following calculation:
Figure imgf000012_0002
The probability formulae of the present invention can be used to identify components of mixtures of biological molecules. For example, a database can be extended to contain entries which are additive combinations of the single proteins of a database. In one embodiment of the present invention a P(k|DI) is assigned to each theoretical protein in the database for data generated from a particular experimental protein; and the theoretical proteins which have the highest P(k|DI) are chosen. The highest P(k|DI) can be from the top 50% of the database proteins to the top 0.01% of the database proteins. From these chosen proteins a new database is formed which contains additive combinations of the chosen proteins. Additive combinations are database proteins added together in various combinations. P(k|DI) calculations are performed using this new database.
In one embodiment the present invention provides a means for determining the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental mass data (D) and background information (I). The means is any means by which the probability can be determined. For example, the means includes a computer or mass spectra, as would be recognized by a person skilled in the art. Included in the means is a means for generating experimental mass data (D) for the experimental biological molecule, wherein D comprises measured masses and standard deviations, σ, associated with the measured masses; a means for determining a mass range (m_min, m_max) for the experimental mass data; a means for generating theoretical mass data for the biological molecule k within the mass range (m_min, m_max); a means for counting the number of masses, N, in the theoretical mass data; a means for calculating the difference between each measured mass and each theoretical mass, wherein if one or more differences are within a mass tolerance for a particular measured mass, the particular measured mass is considered to be one hit; a means for designating each measured mass associated with a hit as m_i, wherein i is an ordinal number from 1 to r, wherein r is the total number of hits for a particular biological molecule; a means for determining the difference between each measured mass, m_i, associated with an ith hit and one of the theoretical masses, m_i0, associated with the hit; a means for determining whether the measured mass data contains a digestion pattern, wherein each occurrence of such pattern is incorporated into a factor designated as F_pattern; a means for determining P(k|I) from background information (I) known about the experimental biological molecule, prior to consideration of the experimental mass data (D); and a means for calculating P(k|DI) from the following formula:
Figure imgf000013_0001
wherein x is a function of the measured mass of the ith hit, and wherein P(k|DI) is the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental data (D) and background information (I).
In one embodiment the means for determining the probability that an experimental biological molecule is a biological molecule (k) further includes a means for counting the number of theoretical masses within the mass tolerance for each measured mass, m_i, wherein the total number of such theoretical masses is designated as g_i for a particular m_i; a means for determining the difference between each measured mass, m_i, associated with an ith hit and each theoretical mass, m_ij0, associated with the hit, wherein j is an ordinal number from 1 to g_i; and a means for calculating P(k|DI) from the following formula:
Figure imgf000014_0001
wherein P(k|DI) is the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental data (D) and background information (I).
In another embodiment the present invention provides a computer program product including a computer usable medium having computer readable program code means embodied in said medium for determining the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental mass data (D) and background information (I). The computer program product includes computer readable program code means for causing a computer to generate experimental mass data (D) for the experimental biological molecule, wherein D comprises measured masses and standard deviations, σ, associated with the measured masses; computer readable program code means for causing a computer to determine a mass range (m_min, m_max) for the experimental mass data; computer readable program code means for causing a computer to generate theoretical mass data for the biological molecule k within the mass range (m_min, m_max); computer readable program code means for causing a computer to count the number of masses, N, in the theoretical mass data; computer readable program code means for causing a computer to calculate the difference between each measured mass and each theoretical mass, wherein if one or more differences are within a mass tolerance for a particular measured mass, the particular measured mass is considered to be one hit; computer readable program code means for causing a computer to designate each measured mass associated with a hit as m_i, wherein i is an ordinal number from 1 to r, wherein r is the total number of hits for a particular biological molecule; computer readable program code means for causing a computer to determine the difference between each measured mass, m_i, associated with an ith hit and one of the theoretical masses, m_i0, associated with the hit; computer readable program code means for causing a computer to determine whether the measured mass data contains a digestion pattern, wherein each occurrence of such pattern is incorporated into a factor designated as F_pattern; computer readable program code means for causing a computer to determine P(k|I) from background information (I) known about the experimental biological molecule, prior to consideration of the experimental mass data (D); and computer readable program code means for causing a computer to calculate P(k|DI) from the following formula:
Figure imgf000015_0001
wherein x is a function of the measured mass of the ith hit, and wherein P(k|DI) is the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental data (D) and background information (I).
In one embodiment the invention provides a computer program product which includes a computer usable medium having computer readable program code means embodied in said medium for determining the probability that an experimental biological molecule is a biological molecule (k), further including computer readable program code means for causing a computer to count the number of theoretical masses within the mass tolerance for each measured mass, m_i, wherein the total number of such theoretical masses is designated as g_i for a particular m_i; computer readable program code means for causing a computer to determine the difference between each measured mass, m_i, associated with an ith hit and each theoretical mass, m_ij0, associated with the hit, wherein j is an ordinal number from 1 to g_i; and computer readable program code means for causing a computer to calculate P(k|DI) from the following formula:
Figure imgf000015_0002
wherein P(k|DI) is the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental data (D) and background information (I).
In another embodiment the present invention provides a method for determining the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental mass data (D) and background information (I), wherein the probability is calculated from the following formula:
Figure imgf000016_0001
The variables are as described above. Additionally the probability can be normalized as above.
Additionally, the present invention includes a means for determining the P(k|DI) of the above formula. The means is any means by which the probability can be determined. For example, the means includes a computer or mass spectra, as would be recognized by a person skilled in the art. Additionally, the present invention includes a computer program product for determining the P(k|DI) of the above formula.
Methods
1) Algorithm
Every protein is specified by its particular linear sequence of amino acids. One defining signature of a protein is the set of masses of peptide fragments produced by cleavage of the protein by an enzyme of high cleavage specificity. The problem we seek to solve is to use the peptide masses obtained in such a mass spectrometric peptide mapping experiment to identify a protein from a protein sequence database.
Let k designate the hypothesis that 'protein k is the protein being analyzed', where protein k is an entry in the protein sequence database; D is the experimental data; and I is the available background information (e.g., species from which the protein originated, approximate molecular mass of the protein, mass accuracy of the peptide mass measurement, enzyme cleavage chemistry, previous experiments on the sample protein). Bayes' probability theory and the maximum entropy principle (Bretthorst, 1996) are applied to derive the probability for the hypothesis k given data D and background information I (Appendix 1). In the derivation, the following assumptions are made: 1) the protein being analyzed exists in the database and 2) all the detected ion species are digestion products of the protein. The probability for each hypothesis k given data D and background information I is given by (Appendix 1)
P(k|DI) ∝ P(k|I)
Figure imgf000017_0001
with normalization condition
$$\sum_{k \in \text{database}} P(k \mid DI) = 1$$
The above formula can be approximated as,
Figure imgf000017_0002
where P(k|I) is the probability for hypothesis k given only the background information, I; N is the theoretical number of peptides generated by fragmentation of protein k by the protease used in the study; r is the number of hits (i.e., the number of matches between the measured and calculated peptide masses); g_i is the multiplicity of the ith hit (i.e., the number of theoretical peptides that match a given experimental peptide mass value); (m_min, m_max) is the range of measured peptide masses; σ is the standard deviation of the mass measurement; m_i is the measured mass of the ith hit; m_i0 is the calculated mass of the ith hit; and F_pattern is an empirical term, which increases the probability when overlapping and/or adjacent peptides are observed (see program description).
The ranking of the candidate proteins is based on the values determined for their probability P(k|DI).
2) Interpretation of the probability P(k|DI)
The Bayesian probability is consistent with common sense. For any given protein k in the database, the probability that protein k is the sample protein increases with increasing number of hits r, increasing mass accuracy (i.e., smaller σ and (m_i - m_i0)), and decreasing number of theoretically digested fragments N. It has been shown previously that tryptic peptides of higher molecular mass occur with lower frequency than do those with lower molecular mass (Pappin et al., 1993; Fenyö et al., 1998), and are therefore more constraining for protein identification. The present algorithm takes into account this different information content of peptides with different masses through the normalization condition given above.
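The normalization condition and the ranking step can be sketched as below. This is a minimal sketch under an assumption not stated in the source: the unnormalized P(k|DI) values are held as natural logarithms, because values as small as those reported later in this document are easier to handle in log space. The function names are hypothetical.

```python
import math

def normalize(log_scores):
    """Map {accession: log of unnormalized P(k|DI)} to probabilities summing to one."""
    peak = max(log_scores.values())
    # Subtracting the largest log score before exponentiating avoids overflow.
    weights = {k: math.exp(v - peak) for k, v in log_scores.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

def rank(log_scores):
    """Return (accession, probability) pairs, most probable candidate first."""
    probabilities = normalize(log_scores)
    return sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
```

Because the probabilities are normalized over the whole database, a candidate's value depends on how its score compares with every other entry, not only on its own matches.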
The Bayesian probability should be viewed as a measure of confidence level of the hypothesis that protein k is the sample protein based on available information. There is no absolute certainty for any given identification, only the probability - i.e., the higher the probability, the higher is the confidence level. In a critical situation (i.e., where a false positive result cannot be tolerated), it may be desirable to check the identification with an independent method such as tandem mass spectrometry (MS/MS) (Qin et al., 1997). Ultimately, the value of any given identification is provided by the outcome of the biological experiment that results from the information.
3) Identification of components in protein mixtures
Frequently, it proves difficult to separate proteins completely from one another, and a protein sample may contain a mixture of proteins. The Bayesian algorithm can be readily extended to identify the components of such mixtures. The protein sequence database is expanded to include entries that are 'fused' combinations of single protein sequences. At the present time, the program will identify the components of binary mixtures. Thus, the entries representing binary mixtures are binary combinations of single proteins (usually, the top 50 hits obtained in a prior search for single proteins). The Bayesian probabilities for these 'fused' proteins are calculated in the same way as for single proteins.
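The construction of 'fused' binary entries can be sketched as follows. This is an illustration under assumptions: each single-protein candidate is represented only by its list of theoretical peptide masses, and the fused entry is taken to be the combined list. The function name and the `A+B` naming scheme are hypothetical.

```python
from itertools import combinations

def fuse_pairs(top_candidates):
    """Build binary 'fused' entries from single-protein candidates.

    top_candidates maps a protein name to its list of theoretical peptide
    masses. Each fused entry holds the combined peptide masses of two proteins.
    """
    fused = {}
    for (name_a, masses_a), (name_b, masses_b) in combinations(top_candidates.items(), 2):
        fused[f"{name_a}+{name_b}"] = sorted(masses_a + masses_b)
    return fused
```

With the top 50 single-protein hits, this yields 50 × 49 / 2 = 1225 binary entries, which is why the expansion is restricted to a short list rather than the whole database.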
4) Improvement of the confidence level of protein identification using additional information obtained for the measured peptides
The Bayesian algorithm can incorporate any additional information obtained for measured peptides (Appendix 1). The additional information provides constraint in database searching to reduce the occurrence of database peptides that randomly match the experimental mass spectral data, thereby improving the confidence level for identifications. Fenyö et al. have investigated the value offered by knowledge of the presence (or absence) and number of particular amino acids contained within a given peptide (so-called 'tag information') (Fenyö et al., 1998). Experimentally, tag information can be obtained in a number of ways. Thus, for example, cysteine residues can be identified through chemical alkylation of free thiol moieties (Sechi & Chait, 1998) and methionine residues can be inferred by observation of pairs of peaks separated by 16 Da (because methionine residues contained in proteolytic peptides are frequently partially oxidized).
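The methionine inference described above can be sketched as a search for peak pairs. This is a hedged sketch: the function name, the default tolerance of 0.1 Da, and the use of the nominal 16 Da shift from the text are assumptions (for monoisotopic data a shift of about 15.995 Da, the mass of one oxygen atom, could be passed instead).

```python
OXIDATION_SHIFT_DA = 16.0  # nominal spacing quoted in the text

def oxidation_pairs(peaks, tolerance=0.1, shift=OXIDATION_SHIFT_DA):
    """Return (unoxidized, oxidized) peak pairs separated by `shift` Da.

    Each pair suggests that the lighter peptide contains a methionine residue.
    """
    ordered = sorted(peaks)
    pairs = []
    for index, low in enumerate(ordered):
        for high in ordered[index + 1:]:
            if abs((high - low) - shift) <= tolerance:
                pairs.append((low, high))
    return pairs
```

A pair found this way would be used as a tag ("contains methionine") attached to the lighter peptide mass when searching the database.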
5) Program
The database searched is the NCBI NR non-redundant database (URL http://www.ncbi.nlm.nih.gov/BLAST/blast_databases.html). The presence or absence of signal peptides is considered when such information is available in the corresponding NCBI GenPept format flatfile (URL http://www.ncbi.nlm.nih.gov/Entrez/batch.html). Taxonomy data is derived from the NCBI GenPept flatfiles and Taxonomy databases (URL http://www.ncbi.nlm.nih.gov/Taxonomy/). To counter the effect of mass-independent systematic errors in the mass measurements, the mean of the experimental minus calculated masses (for all hits of a given protein in the database) is removed during the probability calculation. An empirical factor has been introduced in the probability calculation to take into account two kinds of commonly observed digestion patterns. The first pattern, which we term adjacency, occurs when proteolytic peptides are observed to be adjacent to one another in the protein sequence (see Fig. 4A). The second pattern, which we term common-end overlapping, occurs when the observed peptides have one common terminus, but differ at the other terminus by a peptide segment (see Fig. 4A). Upon each occurrence of adjacency or common-end overlapping, the probability is increased by a factor of 2.5. To increase the speed of the identification program, the search is performed in two steps. In the first step, the database is searched using the dominant term in the probability equation given above, i.e.,
Figure imgf000020_0001
The top 1500 protein candidates selected using this simplified formula are then reanalyzed using the full equation, (Eq. 1).
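The two-step strategy can be sketched generically. This is not the ProFound code: `quick_score` and `full_score` are hypothetical callables standing in for the simplified dominant-term formula and the full probability equation, neither of which is reproduced here.

```python
import heapq

def two_step_search(database, quick_score, full_score, keep=1500):
    """Shortlist with a cheap score, then re-rank the shortlist with the full score.

    database    : iterable of candidate entries
    quick_score : callable returning the fast, approximate score of an entry
    full_score  : callable returning the complete (slower) score of an entry
    keep        : size of the shortlist passed to the second step
    """
    shortlist = heapq.nlargest(keep, database, key=quick_score)
    return sorted(shortlist, key=full_score, reverse=True)
```

The expensive calculation is then run on at most `keep` entries instead of on every entry in the database.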
The NCBI NR non-redundant database is not truly non-redundant; it even contains protein sequences with trivial differences (highly homologous sequences). A method was therefore devised to detect redundancy based on experimental data and to take such redundancy into account. When two protein candidates have more than a certain percentage (over 50%) of common matches whose calculated masses agree to within 10^-5 Da for average masses or 10^-6 Da for monoisotopic masses, the two protein sequences are considered homologous. Among the homologous sequences, those which do not have the highest likelihood are removed from the list used for the normalization calculation.
In one embodiment the method of the present invention includes determining whether the database includes biological molecules which form a homologous set. Whether biological molecules in the database are homologous can be determined by various methods. For example, a homologous set of biological molecules consists of the biological molecules in the database which have the same theoretical masses associated with the hits for an experimental biological molecule, within a certain percentage. This percentage can be any percentage. A preferred percentage is over fifty percent. By another method, a homologous set of biological molecules consists of the biological molecules in the database which have the same theoretical masses associated with the hits for an experimental biological molecule, within a certain percentage, and which have the same amino acid sequences associated with the hits for an experimental biological molecule, within a certain percentage. Preferred percentages for both the mass and sequence information are over fifty percent.
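The redundancy check can be sketched as follows. Several details are assumptions made for illustration: each candidate's matches are stored as a mapping from measured mass to calculated mass, the shared fraction is taken relative to the candidate with fewer matches, and redundant candidates are dropped greedily in order of decreasing likelihood. The source does not specify these choices.

```python
def are_homologous(matches_a, matches_b, mass_eps=1e-5, min_fraction=0.5):
    """Decide whether two candidates share enough identical matches.

    matches_a and matches_b map a measured mass to the calculated mass it matched.
    """
    shared = sum(
        1
        for measured, calc_a in matches_a.items()
        if measured in matches_b and abs(calc_a - matches_b[measured]) <= mass_eps
    )
    smaller = min(len(matches_a), len(matches_b))
    return smaller > 0 and shared / smaller > min_fraction

def drop_redundant(candidates):
    """Keep, from each homologous group, only the highest-likelihood candidate.

    candidates is a list of (name, likelihood, matches) tuples in any order.
    """
    kept = []
    for candidate in sorted(candidates, key=lambda c: c[1], reverse=True):
        if not any(are_homologous(candidate[2], other[2]) for other in kept):
            kept.append(candidate)
    return kept
```

Only the candidates returned by `drop_redundant` would enter the normalization sum, so near-duplicate entries do not dilute the probability of the top hit.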
In one example, the P(k|DI) is calculated for each of the biological molecules in the homologous set; and the highest P(k|DI) is assigned to all of the homologous biological molecules in the homologous set.
Figure 7 shows the program flowchart.
5.1) Program input
1. Taxonomic category: A representation of a phylogenic tree is provided through which the user can specify the origin of the sample protein, if known.
2. Search mode: The program can be specified to search in either "Single protein only" mode or "Single or binary mixture" mode.
3. Mass range: If known, the approximate protein mass range of the sample protein can be specified.
4. Number of candidate proteins: The number of top candidate proteins in the output display can be specified.
5. Digestion chemistry: The proteolytic enzyme or chemical reagent used to cleave the sample protein(s) must be specified. Current choices are trypsin, endoproteinase Arg-C, endoproteinase Asp-N, endoproteinase Lys-C, V8 protease (cleavage at D and E), V8 protease (cleavage at E), and cyanogen bromide.
6. The maximum number of missed cleavage sites: The maximum number of missed cleavage sites within the peptide (yielding incompletely cleaved peptides) must be specified. Allowed values are in the range 0 - 4.
7. Peptide masses and mass tolerances: There are three alternative methods for specifying the masses of peptides used to search the database. These are average mass, monoisotopic mass, and a combination of average and monoisotopic masses (this latter alternative is useful when only some of the peaks in the mass spectrum are isotopically resolved). The mass tolerances for average and monoisotopic masses are specified independently (either as an absolute mass or as a relative tolerance). Because there is a 95% probability for Gaussian distributed measurement errors to be within ± 2σ (where σ is the standard deviation), the mass tolerance is taken as 2σ. Either neutral or protonated peptide masses can be specified.
8. Amino acid tags: When peptides have been experimentally determined to contain particular amino acids, amino acid tags can be associated with the masses of the peptides.
9. Additional digests: If a sample has been separately digested by different enzymes, the data from each different digestion can be fed into the program and used for protein identification.
10. Modifications: At present, modifications of cysteine residues can be specified. In the future, modifications at other amino acid residues will be incorporated.
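The digestion chemistry and missed-cleavage inputs above determine the list of theoretical peptides. A minimal sketch for trypsin follows. The cleavage rule used here (cleave C-terminal to K or R, but not when the next residue is P) is a commonly used convention and an assumption, not a rule stated in this document; the function names are hypothetical and peptide masses are not computed.

```python
def cleavage_sites(sequence):
    """Return the positions after which trypsin is assumed to cleave."""
    sites = []
    for index, residue in enumerate(sequence[:-1]):
        if residue in "KR" and sequence[index + 1] != "P":
            sites.append(index + 1)
    return sites

def digest(sequence, max_missed=2):
    """List the peptides produced with up to `max_missed` missed cleavage sites."""
    bounds = [0] + cleavage_sites(sequence) + [len(sequence)]
    peptides = []
    for start in range(len(bounds) - 1):
        last = min(start + max_missed + 1, len(bounds) - 1)
        for end in range(start + 1, last + 1):
            peptides.append(sequence[bounds[start]:bounds[end]])
    return peptides
```

For example, `digest("AAKGGRCC", max_missed=1)` gives the three fully cleaved peptides plus the two peptides that each span one missed site. Raising the allowed number of missed cleavages increases the theoretical peptide count N, which lowers the probability assigned to any one candidate for the same number of hits.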
5.2) Program output
The ProFound output consists of a search result page, which is hyperlinked to pages that provide details about the search results. The search result page consists of a list of protein candidates ranked by probability as well as a summary of the input data and search parameters. The sequences of the candidate proteins can be retrieved through links and can be further analyzed by sequence analysis tools contained in PROWL, an interactive environment on the World Wide Web for protein MS (Fenyö et al., 1996). Hyperlinked output pages show graphical and text representations of the matched peptides from the protein candidates. For each candidate protein, graphs are provided to allow the user to quickly assess the experimental peptide mass coverage of the protein and the mass measurement errors (Fig. 4).
6) Biochemical procedures and mass spectrometry
The method used for in-gel protein digestions was as described previously [13] except that the gel-band soak time was extended from 4 hours to 24 hours. Trypsin digestion of membrane-bound proteins was as described previously [Zhang et al., 1994]. MALDI-time-of-flight (TOF) MS was carried out using a commercial instrument (PerSeptive Biosystems STR, Framingham, MA) operated in the delayed-extraction reflector mode (FWHM resolution ~ 5000) or an instrument constructed in-house (Beavis & Chait, 1989, 1990) operated in the continuous-extraction linear mode (FWHM resolution ~ 500). The MALDI-ion trap data was obtained using an instrument constructed in-house and described previously.
EXPERIMENTS
Identification of single isolated proteins
Figure 2 shows a delayed-extraction reflectron MALDI-TOF spectrum of the mixture of peptides produced by in-gel trypsin digestion of a 30 kDa SDS-PAGE protein band from a Saccharomyces cerevisiae nuclear extract. Thirty-five monoisotopic masses derived from Figure 2 were submitted to ProFound in order to identify the protein. Other search parameters were: S. cerevisiae for the taxonomic category; a protein mass range of 0 to 3,000 kDa; unmodified cysteines; a maximum of 2 missed cleavage sites; and a mass tolerance of 0.1 Da. The specified taxonomic category and protein mass range include the complete set of proteins (or open reading frames (ORFs)) in the S. cerevisiae genome. Table 1 lists the top 4 protein candidates (ranked by normalized probability) found by the search. The top-ranked protein, the ribosomal protein S7A, has a probability of 1 and is readily distinguished from the next ranked candidates, which have probabilities of 2 × 10^-51, 8 × 10^-53 and 5 × 10^-53, respectively. We plot the probabilities of the top 20 candidates in Figure 3. The probability is observed to make a large transition from the first to the second candidate, and varies much more slowly for the remaining candidates. This type of probability distribution pattern provides an unambiguous, high-confidence identification signature for the top-ranked protein.
Figures 4A--C shows sequence coverage maps and an error map for the top ranked candidate. The segment coverage map (Fig. 4A) (in which a segment represents a peptide resulting from complete digestion of the protein by trypsin) is useful for visualizing digestion patterns indicative of an authentic protein identification. Bona fide identifications are often characterized by the observation of peptides that are adjacent to one another in the sequence and/or that overlap and have a common terminal (while differing by one segment at the other terminal.) Examples of these two commonly observed patterns are shown in Figure 4A. Because the observation of such patterns raises our confidence level that a candidate protein is present in the sample, we have empirically included a term in the ProFound probability calculation to incorporate this information. The sequence residue coverage map (Fig. 4B) shows the portion of the ribosomal protein S7A sequence that was observed in the MS peptide mapping experiment. Twenty-three measured masses match 24 theoretical tryptic peptide masses from the ribosomal protein S7A, covering 70% of the sequence. The error map (Fig. 4C) provides a scatter plot showing error (i.e., measured mass - calculated mass) versus mass for each match. The scatter plot is useful for visualizing systematic errors in the mass measurement. When the spectral calibration is free of systematic error, the errors for an authentic hit are normally distributed about zero and are independent of mass value, as in Figure 4C. The bottom portion of Figure 4C is a histogram projection of the scatter plot. In cases where there are a sufficient number of matched peaks, the histogram of errors for an authentic hit shows a peaked distribution (Fig. 4C). By contrast, the error plot for a randomly hit protein (e.g., the second candidate, myol, Fig. 4D) is a nearly uniform distribution of mass errors within the plotted range. 
We note, however, that this observation only holds true when the error of the mass measurement is either quite small (< 0.1 Da) or relatively large (> 0.5 Da) (Appendix 2).
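The error map described above can be illustrated with a short Python sketch. It is not part of ProFound; the function names, the bin width and the plotted range are assumptions chosen for illustration. It computes the error (measured mass - calculated mass) for each match and bins the errors into the histogram projection.

```python
def mass_errors(matches):
    """Return measured - calculated for each (measured, calculated) pair."""
    return [measured - calculated for measured, calculated in matches]


def error_histogram(errors, bin_width, limit):
    """Count errors in equal-width bins spanning [-limit, +limit)."""
    n_bins = int(round(2 * limit / bin_width))
    counts = [0] * n_bins
    for e in errors:
        if -limit <= e < limit:
            # Clamp the index to guard against floating-point edge effects.
            index = min(int((e + limit) // bin_width), n_bins - 1)
            counts[index] += 1
    return counts


# Hypothetical matches: (measured mass, calculated mass) in Da.
matches = [(1000.52, 1000.50), (1500.78, 1500.80), (2000.01, 2000.00)]
errors = mass_errors(matches)
histogram = error_histogram(errors, bin_width=0.05, limit=0.1)
```

A peaked histogram centred near zero is the signature expected for an authentic hit; a flat histogram suggests random matching.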
Identification of protein components in binary mixtures
The MALDI-TOF mass spectrum shown in Figure 5 was obtained from the products of in-gel digestion of a 105 kDa SDS-PAGE protein band. A peptide mass list consisting of 47 monoisotopic masses was submitted to ProFound. (Sometimes the resolution or statistics for higher mass peaks are insufficient for unambiguous identification of the monoisotopic component. In such cases, we determine the average mass of the peptide.) Other search parameters were: S. cerevisiae as taxonomic category; protein mass range of 0-3000 kDa; unmodified cysteines; a maximum of 2 missed cleavage sites; mass tolerances of 1 Da for average masses and 0.1 Da for monoisotopic masses. When the search was performed in the 'single protein' mode, the top two candidates (YLR409c and YDL060w) had probabilities of 0.99 and 0.01, respectively, which were considerably higher than the probabilities for all the rest of the candidate proteins ($< 10^{-22}$, with slowly decreasing values). The number of peaks matching theoretical tryptic peptide masses from YLR409c and YDL060w is 18 each (respective sequence coverages of 24% and 30%), and the two proteins have no sequence homology. Two such proteins with dominating probabilities provide an indication that the sample may be a binary mixture. To test this hypothesis, ProFound was set up to search for a possible binary mixture using the same data set and other search parameters that were used for the 'single protein only' search. The result of this search identifies, with high confidence, the simultaneous presence of YLR409c and YDL060w (Table 2). This binary protein mixture has a probability of one, while the probabilities for all the other protein candidates are < 10" , with slowly decreasing values. The validity of this mixture identification can be tested by the use of the subtraction method (Jensen et al., 1997). For this purpose, 18 peaks corresponding to the tryptic peptides from YLR409c were removed from the peptide mass list.
The remaining masses, together with the same search parameters, were submitted to ProFound to search for the hypothetical second protein. ProFound identifies YDL060w as the leading candidate with a probability of one (~10 higher than the next protein candidate).
Figure 6 shows a MALDI-TOF mass spectrum obtained from in-gel tryptic digestion of a 30 kDa SDS-PAGE protein band. Thirty-six monoisotopic peptide masses and 1 average peptide mass were submitted to ProFound. Other search parameters were: S. cerevisiae as taxonomic category; protein mass range of 0-3000 kDa; unmodified cysteines; a maximum of 2 missed cleavage sites; mass tolerances of 0.5 Da for average and 0.1 Da for monoisotopic masses. When the search mode was 'single protein only', the top two candidates were RPS1B (40S ribosomal protein, MM=28,681 Da) and RPS1A (40S ribosomal protein, MM=28,612 Da, highly homologous to RPS1B), with probabilities of 1 and 0.001, respectively, while the probabilities for the remaining candidates are $< 10^{-53}$. The numbers of peaks that match theoretical tryptic peptide masses from RPS1B and RPS1A are 24 and 23, respectively (with sequence coverages of 24% and 30%). There are two possibilities that can yield two such dominating candidates. The first is that there is only one protein present in the sample (and the second-ranked candidate represents a closely similar protein that is not present). The second possibility is that the sample is a binary protein mixture of two highly similar proteins. To test this second hypothesis, ProFound was set up to search for possible binary mixtures with the same data set and other search parameters that were used for the 'single protein only' search. The result of this binary search (Table 3) provides strong evidence that the band is a mixture of RPS1B and RPS1A (probability = 1). The probabilities for all the other single and binary protein candidates are < 10" 4, with slowly decreasing probability values. The two identified proteins are highly homologous, differing by only 7 amino acids in their 254 amino acid sequences. Of the 37 peptide peaks in the mass spectrum, 19 are common to both RPS1B and RPS1A, while 5 correspond to RPS1B only and 4 correspond to RPS1A only.
Unlike the previous example, it would be difficult to identify the second protein component using the subtraction method (Jensen et al., 1997) because only 4 peptides (together with 9 peaks that did not match RPS1A) would remain after subtraction of the 24 peaks from the first protein RPS1B.
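The subtraction method referred to in the two examples above can be sketched in Python as follows. This is a minimal illustration rather than the procedure of Jensen et al.; the function name and the example masses are hypothetical. Peaks that match the theoretical peptide masses of the first identified protein are removed, and the remaining masses would then be resubmitted to the search.

```python
def subtract_matched_peaks(measured, theoretical, tolerance):
    """Return the measured masses that match no theoretical mass
    of the already identified protein within +/- tolerance (Da)."""
    return [
        m for m in measured
        if not any(abs(m - t) <= tolerance for t in theoretical)
    ]


# Hypothetical peak list and theoretical masses of the first protein (Da).
measured = [1000.50, 1200.60, 1500.75]
first_protein = [1000.52, 1500.70]
remaining = subtract_matched_peaks(measured, first_protein, tolerance=0.1)
```

When two proteins share most of their peptides, as with RPS1B and RPS1A, few peaks survive the subtraction, which is why the binary mixture search is preferred in that case.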
Independent verification of the MS peptide mapping method for protein identification
Two independent strategies for mass spectrometric protein identification are peptide mapping and fragmentation of individual peptide ions (MS/MS) (Qin et al., 1997). Here, we use MS/MS as a method for independently checking the accuracy of the peptide mapping strategy for identifying proteins.
In a previously described experiment (Qin et al., 1997), we used matrix-assisted laser desorption/ionization ion trap mass spectrometry to obtain both tryptic map MS data and MS/MS fragment ion data from the same samples. The MS/MS fragmentation information was used to identify proteins with the program PepFrag (Fenyo et al., 1998), which requires an exact match of the peptide mass and the peptide fragment masses with theoretical masses generated from a database-derived peptide. Here, we compare the result obtained with ProFound using the peptide mapping data with the independent identification using MS/MS data from the same sample. Search parameters were: S. cerevisiae as taxonomic category; protein mass range of 0-3000 kDa; a maximum of 4 missed cleavage sites; mass tolerance of 2 Da. Table 4 is a summary of 15 searches using the two independent methods. All the proteins identified with the MS/MS data were confirmed by ProFound using the peptide mapping data, even though the mapping data were of relatively low quality (i.e., resolution 500 FWHM, accuracy ± 2 Da). These findings provide independent assurance of the reliability of ProFound for identifying proteins.
Improvement of the confidence level of protein identification using tag information
Incorporation of amino acid 'tag information' in the ProFound search (see Methods) can reduce the occurrence of database peptides that randomly match the experimental MS data, thereby improving the confidence level of an identification. For example, we have shown previously that inclusion of information regarding the absence or presence of cysteine residues in tryptic peptides from proteins can significantly improve the confidence level of a protein identification (Sechi & Chait, 1998).
Appendix
Appendix 1 Derivation of the Bayesian probability that protein k is the protein under analysis
1) Peptide mapping experiment
A peptide mapping experiment involves enzymatic or chemical cleavage of the protein, using cleavage reagents with high specificity for particular amino acids. The resulting mixture of peptide fragments is subjected to mass analysis. Each detected peptide fragment ion appears as a peak in the mass spectrum. The position of the peak along the mass axis provides a measure of the mass of the peptide ion. Current mass spectrometry technology does not provide reliable quantitative information from the height of the various peaks in the spectrum, so peak intensity information is only used to decide on the presence of a peptide fragment. Thus, the experimental data are limited to the information on the masses of the peptide fragments and can be expressed as a joint hypothesis $D = m_1 \ldots m_n$, where $m_i$ is a logical notation representing the finding that the mass value for the ith peak is $m_i$, and n is the number of peaks within the mass range from $m_{min}$ to $m_{max}$. Ideally, all peptide fragments produced from the protein should be detected. However, in practice, only a subset of the peptide fragments is observed. The reasons for not observing all of the peptides include poor solubility and/or low ionization efficiencies of certain peptides.
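The theoretical side of a peptide mapping experiment can be illustrated by an in-silico digestion. The Python sketch below is not taken from ProFound. It assumes the commonly used trypsin rule (cleavage after K or R, except before P), standard monoisotopic residue masses, and hypothetical function names and test sequence.

```python
# Standard monoisotopic residue masses in Da; one water is added per peptide.
RESIDUE_MASS = {
    'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
    'V': 99.06841, 'T': 101.04768, 'C': 103.00919, 'L': 113.08406,
    'I': 113.08406, 'N': 114.04293, 'D': 115.02694, 'Q': 128.05858,
    'K': 128.09496, 'E': 129.04259, 'M': 131.04049, 'H': 137.05891,
    'F': 147.06841, 'R': 156.10111, 'Y': 163.06333, 'W': 186.07931,
}
WATER = 18.01056


def tryptic_segments(sequence):
    """Split a sequence after K or R unless the next residue is P
    (assumed cleavage rule)."""
    segments, start = [], 0
    for i, residue in enumerate(sequence):
        following = sequence[i + 1] if i + 1 < len(sequence) else ''
        if residue in 'KR' and following != 'P':
            segments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        segments.append(sequence[start:])
    return segments


def peptides_with_missed_cleavages(sequence, max_missed=2):
    """Join up to max_missed + 1 adjacent segments into each peptide."""
    segs = tryptic_segments(sequence)
    return [
        ''.join(segs[i:j + 1])
        for i in range(len(segs))
        for j in range(i, min(i + max_missed, len(segs) - 1) + 1)
    ]


def monoisotopic_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


peptides = peptides_with_missed_cleavages("AKRPGKC", max_missed=1)
```

The number of peptides generated for a database protein within the measured mass range corresponds to N in the derivation that follows.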
2) Applying Bayesian probability theory
We apply principles discussed in (Bretthorst, 1996) to the present protein identification problem.
Bayes' theorem {$P(AB) = P(A)P(B \mid A) = P(B)P(A \mid B)$, where $P(X \mid Y)$ is the probability of X given that Y is known} is applied to calculate the probability that a protein existing in a protein sequence database gives rise to the experimental peptide map. Let k designate the hypothesis that 'protein k is the protein being analyzed', where protein k is an entry in the protein sequence database; D is the experimental data; and I is the available background information. By applying Bayes' theorem, the posterior probability that hypothesis k is true, in view of the data D and background information I, is
$$P(k \mid DI) = \frac{P(k \mid I)\,P(D \mid kI)}{P(D \mid I)}$$
The posterior probability, $P(k \mid DI)$, depends on three terms. The first term, $P(k \mid I)$, is the prior probability for the hypothesis given only the background information. The second term, $P(D \mid kI)$, is the likelihood probability that the data D would be observed if the hypothesis is true. The third term, $P(D \mid I)$, is independent of hypothesis k, and is a normalization constant. Thus, the posterior probability is proportional to the product of prior probability and likelihood probability,
$$P(k \mid DI) \propto P(k \mid I)\,P(D \mid kI) \qquad \text{(bayes)}$$
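Because the normalization constant is the same for every hypothesis k, normalized posterior probabilities can be obtained by dividing each product of prior and likelihood by their sum over the database. The Python sketch below is an illustration only; the function name and the two-protein example are hypothetical. It works with logarithms so that very small likelihoods, such as those of order $10^{-51}$ reported above, do not underflow.

```python
import math


def normalized_posteriors(log_priors, log_likelihoods):
    """Combine log prior and log likelihood per hypothesis and normalize
    so that the posterior probabilities sum to one."""
    log_post = {k: log_priors[k] + log_likelihoods[k] for k in log_priors}
    peak = max(log_post.values())
    # Subtract the largest value before exponentiating for numerical stability.
    total = sum(math.exp(v - peak) for v in log_post.values())
    return {k: math.exp(v - peak) / total for k, v in log_post.items()}


# Hypothetical database of two proteins with a uniform prior.
log_priors = {"A": math.log(0.5), "B": math.log(0.5)}
log_likelihoods = {"A": -10.0, "B": -10.0 + math.log(1e-3)}
posterior = normalized_posteriors(log_priors, log_likelihoods)
```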
2.1) Prior probability $P(k \mid I)$
There are two situations with respect to the assignment of the prior probability $P(k \mid I)$. In the first, where the only knowledge is the background information on the species from which the sample protein originated and the assumption about the mass range of the sample protein, the principle of maximum entropy (Bretthorst, 1996) assigns a uniform prior probability to every hypothesis that satisfies the constraints set by the background information and assumption (the prior probability is zero for any hypothesis that does not satisfy the constraints). In the second situation, where we have knowledge concerning the hypothesis k from a previous experiment, the prior probability before considering the current data is the posterior probability from the previous experiment. Thus the prior probability for a hypothesis k that satisfies the background information and assumption is assigned as
$$P(k \mid I) = \begin{cases} \text{constant} & \text{if no previous data are available} \\ P(k \mid D_{prev} I) & \text{if previous data } D_{prev} \text{ are available} \end{cases}$$
In this way, data from multiple digestions are naturally incorporated.
2.2) Likelihood probability for data $P(D \mid kI)$
For a given protein k, measured peptide masses fall into two subsets: hits and misses. Hits are those measured masses that agree with predicted masses (within the mass tolerance). Misses are the remaining masses that cannot be accounted for by the known protein sequence. In the absence of contaminating proteins, misses arise from several possible sources. These include errors in the database, unknown modifications of the proteins, and unexpected cleavage of the protein. The resulting peptides are termed 'modified peptides'. It can be shown that the experimental data D can also be expressed as:
$$D = m_1^H \ldots m_r^H\, m_{r+1}^M \ldots m_{r+w}^M = m_{1:r}^H\, m_{r+1:r+w}^M \qquad \text{(D-hits/misses)}$$
where the superscripts H and M are used to label hits and misses, r is the number of hits, w is the number of misses and the total number of measured masses is r+w. $m_i^H$ ($\equiv H_i m_i$, i=1, ...,r) is a logical product stating that the ith hit originates from a particular peptide in protein k and its measured mass is $m_i$. $m_{r+j}^M$ ($\equiv M_{r+j} m_{r+j}$, j=1, ...,w) is likewise a logical product stating that the jth miss originates from a modified peptide and its measured mass is $m_{r+j}$. $m_{m:n}^H$ and $m_{p:q}^M$ are logical products of $m_m^H m_{m+1}^H \ldots m_n^H$ and $m_p^M m_{p+1}^M \ldots m_q^M$, respectively. When calculating $P(D \mid kI)$, the likelihood probability for data D, the product rule is applied to factor D into the probability for hits and the probability for misses given the hits.
$$P(D \mid kI) = P(m_{1:r}^H\, m_{r+1:r+w}^M \mid kI) = P(m_{1:r}^H \mid kI)\; P(m_{r+1:r+w}^M \mid kI\, m_{1:r}^H) \qquad \text{(hits/misses)}$$
In what follows, hits and misses will be considered separately.
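The partition of the measured masses into hits and misses for a candidate protein k can be sketched in Python as below. The function name and the example masses are hypothetical. Each hit is returned together with all theoretical masses lying within the mass tolerance, since one measured mass may match more than one theoretical peptide.

```python
def split_hits_and_misses(measured, theoretical, tolerance):
    """Partition measured masses into hits (with their matching
    theoretical masses) and misses for one candidate protein."""
    hits, misses = [], []
    for m in measured:
        candidates = [t for t in theoretical if abs(m - t) <= tolerance]
        if candidates:
            hits.append((m, candidates))
        else:
            misses.append(m)
    return hits, misses


# Hypothetical data in Da; two theoretical masses fall near 1500.7.
measured = [1000.50, 1200.60, 1500.75]
theoretical = [1000.52, 1500.70, 1500.80, 1800.90]
hits, misses = split_hits_and_misses(measured, theoretical, tolerance=0.1)
r, w = len(hits), len(misses)
```

Here r and w are the counts used throughout the derivation, and the length of each candidate list is the number of theoretical masses matched by that hit.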
2.2.1) Likelihood probability for hits $P(m_{1:r}^H \mid kI)$
The probability for hits in equation (hits/misses) can be factorized into a product over individual hits by applying the product rule
$$P(m_{1:r}^H \mid kI) = \prod_{i=1}^{r} P(m_i^H \mid kI\, m_{0:i-1}^H) \qquad \text{(hit/hit)}$$
where $m_0^H$ is defined as logical one. $m_i^H$ ($\equiv H_i m_i$) is the logical product of two hypotheses (i.e., the ith hit originates from a particular peptide in the protein k and its measured mass is $m_i$). Each term can be expressed as
$$\begin{aligned} P(m_i^H \mid kI\, m_{0:i-1}^H) &= P(H_i m_i \mid kI\, m_{0:i-1}^H) \\ &= P(H_i \mid kI\, m_{0:i-1}^H)\, P(m_i \mid kI\, m_{0:i-1}^H H_i) \\ &= P(H_i \mid kI\, m_{0:i-1}^H)\, P(m_i \mid kI\, m_{0:i-1}^H m_{i0}), \quad i = 1, \ldots, r \end{aligned} \qquad \text{(hit/mi)}$$
where the condition $H_i$ is equivalent to the condition that the theoretical mass value of peptide i is $m_{i0}$.
$P(H_i \mid kI\, m_{0:i-1}^H)$ in equation (hit/mi) is the probability for the ith measured peptide to be a hit, given protein k and i-1 previous hits. Since i-1 of the N theoretical peptides have already been assigned, the number of available peptides for the ith comparison is N-i+1, and the probability for the ith peptide to be a hit is 1/(N-i+1) (using the maximum entropy principle [8]), where N is the total number of theoretical peptides. We thus have
$$P(H_i \mid kI\, m_{0:i-1}^H) = \frac{1}{N-i+1}, \quad i = 1, \ldots, r \qquad \text{(prior-H)}$$
$P(m_i \mid kI\, m_{0:i-1}^H m_{i0})$ in equation (hit/mi) is the probability for the measured mass value to be $m_i$ given that its theoretical mass is $m_{i0}$. Assuming that the mass measurement is free of systematic error,
$$P(m_i \mid kI\, m_{0:i-1}^H m_{i0}) = P(m_i - m_{i0} \mid I) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(m_i - m_{i0})^2}{2\sigma^2}\right], \quad i = 1, \ldots, r \qquad \text{(likelihood-mi)}$$
where σ is the standard deviation of the mass measurement. In the case where $m_i$ matches g theoretical masses within the given mass tolerance, the probability for the ith hit is
$$P(m_i^H \mid kI\, m_{0:i-1}^H) = P\Big(\sum_{j=1}^{g} H_{ij}\, m_{ij} \,\Big|\, kI\, m_{0:i-1}^H\Big)$$
Figure imgf000033_0001
Thus the probability for hits is given by $P(m_{1:r}^H \mid kI)$
Figure imgf000033_0002
(likelihood-hits)
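A numerical sketch of the hit term is given below in Python. It is an illustration under stated assumptions, not the formula of the equation image above: the mass error is taken to be Gaussian with standard deviation sigma, the probability that the ith peptide is a hit is taken as 1/(N-i+1), and, when a measured mass matches several theoretical masses, the Gaussian terms of the candidates are summed. All names are hypothetical.

```python
import math


def gaussian_density(error, sigma):
    """Normal density of a mass error with standard deviation sigma."""
    return math.exp(-error ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def log_likelihood_hits(hits, n_theoretical, sigma):
    """Log of the product over hits of (hit prior) x (mass error term).

    hits: list of (measured_mass, [matching theoretical masses]).
    Assumption: candidates for one measured mass are combined by summation.
    """
    total = 0.0
    for i, (measured, candidates) in enumerate(hits, start=1):
        prior_hit = 1.0 / (n_theoretical - i + 1)  # assumed form of (prior-H)
        mass_term = sum(gaussian_density(measured - t, sigma) for t in candidates)
        total += math.log(prior_hit * mass_term)
    return total


# One exact hit among ten theoretical peptides, sigma = 0.1 Da.
value = log_likelihood_hits([(1000.0, [1000.0])], n_theoretical=10, sigma=0.1)
```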
2.2.2) Likelihood probability for misses $P(m_{r+1:r+w}^M \mid kI\, m_{1:r}^H)$
We assume that misses result from errors in the protein sequence, unknown modification of the protein, or unexpected cleavage of the protein. However, the measured mass alone does not provide information on the identity of the peptide within the protein. The probability for misses depends on the number of 'modified peptides' J, which is between w and N-r (w is the number of misses, r is the number of observed hits, and N is the total number of peptides). By applying the sum rule and the product rule, the probability for all misses can be expressed as
$$\begin{aligned} P(m_{r+1:r+w}^M \mid kI\, m_{1:r}^H) &= \sum_{J=0}^{N-r} P(J\, m_{r+1:r+w}^M \mid kI\, m_{1:r}^H) \\ &= \sum_{J=0}^{N-r} P(J \mid kI\, m_{1:r}^H)\, P(m_{r+1:r+w}^M \mid kI\, m_{1:r}^H J) \\ &= \sum_{J=w}^{N-r} P(J \mid kI\, m_{1:r}^H)\, P(m_{r+1:r+w}^M \mid kI\, m_{1:r}^H J) \\ &= \sum_{J=w}^{N-r} P(J \mid kI\, m_{1:r}^H) \prod_{j=1}^{w} P(m_{r+j}^M \mid kI\, m_{1:r}^H J\, m_{r:r+j-1}^M) \\ &= \sum_{J=w}^{N-r} P(J \mid kI\, m_{1:r}^H) \prod_{j=1}^{w} P(M_{r+j}\, m_{r+j} \mid kI\, m_{1:r}^H J\, m_{r:r+j-1}^M) \\ &= \sum_{J=w}^{N-r} P(J \mid kI\, m_{1:r}^H) \prod_{j=1}^{w} P(M_{r+j} \mid kI\, m_{1:r}^H J\, m_{r:r+j-1}^M)\, P(m_{r+j} \mid kI\, m_{1:r}^H J\, m_{r:r+j-1}^M M_{r+j}) \end{aligned} \qquad \text{(likelihood-misses-1)}$$
In the above equation, $m_r^M$ is defined as a logical one. The lower limit for summation is changed from J=0 to J=w because $P(m_{r+1:r+w}^M \mid kI\, m_{1:r}^H J)$, the probability for having w misses given J modified peptides, is 0 when J is smaller than w.
$P(J \mid kI\, m_{1:r}^H)$ in equation (likelihood-misses-1) is the probability for there being J modified peptides, given protein k and r observed hits. The probability is assigned by applying the maximum entropy principle,
$$P(J \mid kI\, m_{1:r}^H) = \frac{C_{N-r}^{J}}{\sum_{l=0}^{N-r} C_{N-r}^{l}} = \frac{C_{N-r}^{J}}{2^{N-r}} \qquad \text{(prior-J)}$$
$P(M_{r+j} \mid kI\, m_{1:r}^H J\, m_{r:r+j-1}^M)$ in equation (likelihood-misses-1) is the probability for observing a modified peptide, given protein k, J modified peptides, and r hits plus j-1 misses being observed already. Since the number of available peptides is N-(r+j-1), and the number of remaining unobserved modified peptides is J-(j-1), the probability for observing a modified peptide is assigned, by applying the maximum entropy principle, as follows:
$$P(M_{r+j} \mid kI\, m_{1:r}^H J\, m_{r:r+j-1}^M) = \frac{J-j+1}{N-r-j+1}, \quad j = 1, \ldots, w; \quad N-r \ge J \ge w \qquad \text{(prior-Mr+j)}$$
$P(m_{r+j} \mid kI\, m_{1:r}^H J\, m_{r:r+j-1}^M M_{r+j})$ in equation (likelihood-misses-1) is the likelihood probability for the modified peptide to have a measured mass $m_{r+j}$. Since $m_{r+j}$ is a miss and is always within the range of the mass measurement (i.e., between the minimum mass $m_{min}$ and the maximum mass $m_{max}$), using the principle of maximum entropy the probability is assigned
$$P(m_{r+j} \mid kI\, m_{1:r}^H J\, m_{r:r+j-1}^M M_{r+j}) = \frac{1}{m_{max} - m_{min}}, \quad j = 1, \ldots, w \qquad \text{(likelihood-mr+j)}$$
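Substituting equations (prior-J), (prior-Mr+j) and (likelihood-mr+j) into equation (likelihood-misses-1), the probability for all misses is given by
$$P(m_{r+1:r+w}^M \mid kI\, m_{1:r}^H) = \sum_{J=w}^{N-r} \frac{C_{N-r}^{J}}{2^{N-r}} \prod_{j=1}^{w} \frac{J-j+1}{N-r-j+1} \cdot \frac{1}{m_{max} - m_{min}} = \frac{1}{\left[2\,(m_{max} - m_{min})\right]^{w}} \qquad \text{(likelihood-misses-2)}$$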
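The sum over the number of modified peptides J can be checked numerically. The Python sketch below is illustrative and its function names are hypothetical. It takes a binomial prior $C_{N-r}^{J}/2^{N-r}$ for J, the factor (J-j+1)/(N-r-j+1) for each miss and a uniform mass density over the measured range, evaluates the sum directly, and compares it with the closed form $1/[2(m_{max}-m_{min})]^w$ to which the sum reduces algebraically.

```python
from math import comb


def miss_probability_by_sum(n_total, r, w, m_min, m_max):
    """Evaluate the sum over J of prior(J) x product of miss terms."""
    n_free = n_total - r
    total = 0.0
    for J in range(w, n_free + 1):
        p_J = comb(n_free, J) / 2 ** n_free
        p_modified = 1.0
        for j in range(1, w + 1):
            p_modified *= (J - j + 1) / (n_free - j + 1)
        total += p_J * p_modified
    # Each of the w misses contributes a uniform mass density.
    return total / (m_max - m_min) ** w


def miss_probability_closed_form(w, m_min, m_max):
    """Closed form obtained by carrying out the sum analytically."""
    return 1.0 / (2 * (m_max - m_min)) ** w


by_sum = miss_probability_by_sum(n_total=30, r=10, w=4, m_min=800.0, m_max=3000.0)
closed = miss_probability_closed_form(w=4, m_min=800.0, m_max=3000.0)
```

The closed form depends only on the number of misses w and the width of the mass range, so each unexplained peak lowers the likelihood by the same constant factor.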
2.3) Posterior probability $P(k \mid DI)$
Substituting equations (likelihood-hits) and (likelihood-misses-2) into equation (hits/misses), and then substituting equation (hits/misses) into equation (bayes), the posterior probability is given by $P(k \mid DI)$
Figure imgf000036_0001
(bayes-final)
Assuming that the protein being analyzed exists in the protein sequence database, the normalized posterior probability is obtained by applying the normalization condition
$$\sum_{k \in \text{Database}} P(k \mid DI) = 1$$
3) Extension to incorporate additional information obtained for the measured peptides
The Bayesian algorithm we developed can be naturally extended to incorporate any additional information about the measured peptides. When an experimentally observed peptide is known to have a certain property, such as the number of particular amino acids or a partial amino acid sequence, this knowledge is included in the background information I. In calculating the likelihood probability for the data with a particular type of information ($D_A$), the number of available peptides for experimental detection is $N_A$, the theoretical number of peptides containing the specified type of information within the protein sequence k. By substituting N with $N_A$ in equation (bayes-final), the likelihood probability for $D_A$ is given by $P(D_A \mid kI)$
Figure imgf000037_0001
where A denotes the specified type of information and $r_A$ is the number of hits which satisfy both the mass and the additional information constraints. In addition to the increase in probability discrimination of the authentic protein obtained by eliminating random hits from random proteins, it can be seen that the probability (for the authentic protein) further improves by a factor of $N/N_A$ for each measured peptide having the additional information.
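The effect of tag information on the number of available peptides can be illustrated with a small Python sketch. The peptide list, the tag (presence of cysteine) and the function name are hypothetical. The sketch counts the theoretical peptides that carry the tag and reports the ratio of the total peptide count to the tagged count.

```python
def tag_improvement_factor(peptides, has_tag):
    """Return (N, N_A, N / N_A) for a list of theoretical peptides
    and a predicate that tests for the additional information."""
    n_all = len(peptides)
    n_tagged = sum(1 for p in peptides if has_tag(p))
    if n_tagged == 0:
        raise ValueError("no theoretical peptide carries the tag")
    return n_all, n_tagged, n_all / n_tagged


# Hypothetical tryptic peptides; the tag is 'contains cysteine'.
peptides = ["AK", "CGK", "LLR", "MCR"]
n_all, n_tagged, factor = tag_improvement_factor(peptides, lambda p: "C" in p)
```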
Appendix 2
Let $\Delta_{exp}$ denote the accuracy of the mass measurement, which is also the width of the error distribution of the peptide mass measurement; $\Delta_{pep}$ the half width of the mass distribution of all possible peptides, encompassing 95% of all amino acid compositions ($\Delta_{pep}$ for 2000 Da has been found to be ~0.1 Da (Mann, 1995)); and $\Delta_{rand}$ the half width of the error distribution of randomly matched peptides. Under the condition that $\Delta_{exp}$ is smaller than the distance between neighboring peptide mass distributions, where no peptide mass is possible,
if $\Delta_{exp} \ll \Delta_{pep}$, then $\Delta_{exp} \ll \Delta_{pep} \approx \Delta_{rand}$.
I.e., when the experimental mass accuracy ($\Delta_{exp}$) is considerably smaller than the width of the peptide mass distribution ($\Delta_{pep}$), the error distribution of matched peptides for the hit protein ($\Delta_{exp}$) is considerably narrower than that of randomly matched peptides for unrelated (randomly hit) proteins ($\Delta_{rand}$).
Appendix 3
$P(m_{r+j} \mid kI\, m_{1:r}^H J\, m_{r:r+j-1}^M M_{r+j})$ in equation (likelihood-misses-1) is the likelihood probability for the modified peptide to have a measured mass $m_{r+j}$. $m_{r+j}$ is a miss, and the likelihood for it to occur should be determined by the mass distribution of the theoretical peptides of proteins in the database. Therefore the probability is assigned using the normalized mass distribution $f(m_{r+j})$:
$$P(m_{r+j} \mid kI\, m_{1:r}^H J\, m_{r:r+j-1}^M M_{r+j}) = f(m_{r+j}), \quad j = 1, \ldots, w \qquad \text{(likelihood-mr+j)}$$
The normalized mass distribution can be derived from the statistical frequency distribution of the theoretical peptide masses of proteins in the database. The probability can be factored into the form
$$P(m_{r+j} \mid kI\, m_{1:r}^H J\, m_{r:r+j-1}^M M_{r+j}) = \frac{1}{m_{max} - m_{min}} \left[(m_{max} - m_{min})\, f(m_{r+j})\right]$$
where $1/(m_{max} - m_{min})$ corresponds to a uniform distribution.
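One way to obtain a normalized mass distribution from the statistical frequency distribution of theoretical peptide masses is a simple histogram estimate. The Python sketch below assumes equal-width bins. The function name, the bin width and the example masses are hypothetical and are not taken from ProFound.

```python
import math


def normalized_mass_distribution(masses, m_min, m_max, bin_width):
    """Return f(m), a histogram density (per Da) of the theoretical
    peptide masses inside [m_min, m_max) that integrates to one."""
    n_bins = math.ceil((m_max - m_min) / bin_width)
    counts = [0] * n_bins
    for m in masses:
        if m_min <= m < m_max:
            counts[min(int((m - m_min) / bin_width), n_bins - 1)] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no masses inside the measurement range")
    density = [c / (total * bin_width) for c in counts]

    def f(m):
        if not (m_min <= m < m_max):
            return 0.0
        return density[min(int((m - m_min) / bin_width), n_bins - 1)]

    return f


# Hypothetical theoretical peptide masses (Da) from a database.
f = normalized_mass_distribution([1000.0, 1000.5, 1500.0, 2500.0],
                                 m_min=1000.0, m_max=3000.0, bin_width=500.0)
relative_to_uniform = (3000.0 - 1000.0) * f(1200.0)
```

The quantity relative_to_uniform expresses how much more (or less) likely a miss at a given mass is than under the uniform assignment.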
Substituting equations (prior-J), (prior-Mr+j) and (likelihood-mr+j) into equation (likelihood-misses-1), the probability for all misses is given by
$$P(m_{r+1:r+w}^M \mid kI\, m_{1:r}^H) = \sum_{J=w}^{N-r} \frac{C_{N-r}^{J}}{2^{N-r}} \prod_{j=1}^{w} \frac{J-j+1}{N-r-j+1}\, f(m_{r+j}) \qquad \text{(likelihood-misses-2)}$$
2.3) Posterior probability $P(k \mid DI)$
Substituting equations (likelihood-hits) and (likelihood-misses-2) into equation (hits/misses), and then substituting equation (hits/misses) into equation (bayes), the posterior probability is given by
$P(k \mid DI)$
Figure imgf000039_0001
Figure imgf000039_0002
(bayes-final)
where it can [...] $\sum \prod_{j=1}^{w} \frac{J-j+1}{N-r-j+1}$ [...]
Appendix 4
The theoretical peptides of a given protein k can fall into z exclusive multi-subsets, where the probability for the occurrence of peptides in different subsets can be different and can be determined, for instance, empirically. For example, in a peptide mapping experiment, given a maximum number, a, of missed cleavage sites within peptides, the exclusive multi-subsets would respectively correspond to the number of missed cleavage sites as 1, 2, ..., a. Another example is in an MS/MS fragmentation experiment, where the theoretical fragment ions can be classified according to different ion types (b, y, a, a , c", etc.) to form exclusive multi-subsets. Designating the subsets as $S_1, \ldots, S_z$, it can be shown that the experimental data D can further be expressed as
$$D = m_{1:r_1}^{HS_1} \cdots m_{r_1+\ldots+r_{z-1}+1:r}^{HS_z}\; m_{r+1:r+w}^{M}$$
where $r_1, \ldots, r_z$ ($r_1 + \ldots + r_z = r$) are the numbers of hits corresponding to the respective subsets.
Designating the total number of hits corresponding to subsets 1 to q-1 as t ($= r_1 + \ldots + r_{q-1}$), a hit originating from the qth subset is a logical product
$$m_{t+s}^{HS_q} \equiv S_q\, H_{t+s}\, m_{t+s}$$
stating that the (t+s)th hit originates from a particular peptide (in protein k), that the peptide belongs to the qth subset, and that its measured mass is $m_{t+s}$.
Likelihood probability for hits
Figure imgf000040_0002
Assuming that the number of theoretical peptides in the qth subset is $N_q$ and that the probability for an observed peptide to originate from the qth subset is $p_q$, the probability for the (t+s)th hit can be factored as
$$P(m_{t+s}^{HS_q} \mid kI\, m_{0:t+s-1}^{H\{S\}}) = P(S_q \mid kI\, m_{0:t+s-1}^{H\{S\}})\, P(H_{t+s} \mid kI\, m_{0:t+s-1}^{H\{S\}} S_q)\, P(m_{t+s} \mid kI\, m_{0:t+s-1}^{H\{S\}} S_q H_{t+s}), \quad q = 1, \ldots, z; \; s = 1, \ldots, r_q$$
where $m_{0:t+s-1}^{H\{S\}} = m_{1:r_1}^{HS_1} \cdots m_{t+1:t+s-1}^{HS_q}$ and $m_{0:0}^{H\{S\}}$ is defined as logical one.
$P(S_q \mid kI\, m_{0:t+s-1}^{H\{S\}})$ in the above equation is the probability of observing a peptide which originates from the qth subset under the given conditions, and it is $p_q$:
$$P(S_q \mid kI\, m_{0:t+s-1}^{H\{S\}}) = p_q, \quad q = 1, \ldots, z$$
$P(H_{t+s} \mid kI\, m_{0:t+s-1}^{H\{S\}} S_q)$ is the probability for the (t+s)th hit to originate from a particular peptide in the qth subset, given all hits corresponding to subsets $S_1 \ldots S_{q-1}$ and the previous 1st to (s-1)th hits, which are identified as originating from s-1 particular peptides in the qth subset. Since the number of available peptides in the qth subset for the hit is $N_q - s + 1$, the probability is assigned as
$$P(H_{t+s} \mid kI\, m_{0:t+s-1}^{H\{S\}} S_q) = \frac{1}{N_q - s + 1}$$
$P(m_{t+s} \mid kI\, m_{0:t+s-1}^{H\{S\}} S_q H_{t+s})$ is the probability for the measured mass value to be $m_{t+s}$ given that the (t+s)th hit originates from a particular peptide whose mass is known to be $m_{(t+s)0}$. The probability is assigned as
$P(m_{t+s} \mid kI\, m_{0:t+s-1}^{H\{S\}} S_q H_{t+s})$
Figure imgf000041_0002
In the case where $m_{t+s}$ matches $g_{t+s}$ theoretical masses within the given mass tolerance, the probability for the (t+s)th hit is
Figure imgf000041_0003
Thus, the probability for hits is given by
Figure imgf000041_0004
where f=r,
Tables
Table 1 ProFound search results obtained with the data shown in Figure 2
Figure imgf000043_0001
Table 2. ProFound search results with data obtained from in-gel tryptic digest shown in Figure 8
Figure imgf000043_0002
Table 3 ProFound search results with data obtained from in-gel tryptic digest shown in Figure 9
Figure imgf000044_0001
Table 4 Summary on identifications made by Ion trap MS/MS and peptide mapping
Figure imgf000044_0002
(a) log(p1)-log(p2) is the difference between the logarithms of the probabilities of the first and second candidate proteins. When some top candidates are highly homologous, p1 is determined as the sum of their probabilities and p2 is the probability of the following candidate.
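The quantity defined in footnote (a) can be computed as in the following Python sketch. The base of the logarithm is not stated in the text and base 10 is assumed here. The function name and its n_homologous parameter are hypothetical. The example uses the four probabilities reported for the search of the Figure 2 data.

```python
import math


def log_probability_gap(probabilities, n_homologous=1):
    """Return log10(p1) - log10(p2), where p1 is the summed probability
    of the n_homologous top candidates and p2 is the next candidate."""
    ranked = sorted(probabilities, reverse=True)
    p1 = sum(ranked[:n_homologous])
    p2 = ranked[n_homologous]
    return math.log10(p1) - math.log10(p2)


# Probabilities of the top four candidates from the first example.
gap = log_probability_gap([1.0, 2e-51, 8e-53, 5e-53])
```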

Claims

WE CLAIM:
1. A method for determining the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental mass data (D) and background information (I), the method comprising,
a) Generating experimental mass data (D) for the experimental biological molecule, wherein D comprises measured masses and standard deviations, σ, associated with the measured masses;
b) Determining a mass range ($m_{min}$, $m_{max}$) for the experimental mass data;
c) Generating theoretical mass data for the biological molecule k within the mass range ($m_{min}$, $m_{max}$);
d) Counting the number of masses, N, in the theoretical mass data;
e) Calculating the difference between each measured mass and each theoretical mass, wherein if one or more differences are within a mass tolerance for a particular measured mass, the particular measured mass is considered to be one hit;
f) Designating each measured mass associated with a hit as $m_i$, wherein i is an ordinal number from 1 to r, wherein r is the total number of hits for a particular biological molecule;
g) Determining the difference between each measured mass, $m_i$, associated with an ith hit and one of the theoretical masses, $m_{i0}$, associated with the hit;
h) Determining whether the measured mass data contains a digestion pattern, wherein each occurrence of a digestion pattern is incorporated into a factor designated as $F_{pattern}$;
i) Determining $P(k \mid I)$ from background information (I) known about the experimental biological molecule, prior to consideration of the experimental mass data (D);
j) Calculating $P(k \mid DI)$ from the following formula:
Figure imgf000046_0001
wherein x is a function of the measured mass of the ith hit, and wherein $P(k \mid DI)$ is the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental data (D) and background information (I).
2. A method according to Claim 1, step (h), wherein the one of the theoretical masses, $m_{i0}$, associated with the hit is the theoretical mass which produces the smallest difference between the measured mass, $m_i$, and the theoretical mass.
3. A method according to Claim 1, step (h), wherein the calculated theoretical mass associated with the hit is the average of the theoretical masses associated with the hit.
4. A method according to Claim 1 further comprising:
a) Counting the number of theoretical masses within the mass tolerance for each measured mass, $m_i$, wherein the total number of such theoretical masses is designated as $g_i$ for a particular $m_i$;
b) Determining the difference between each measured mass, $m_i$, associated with an ith hit and each theoretical mass, $m_{ij0}$, associated with the hit, wherein j is an ordinal number from 1 to $g_i$; and
c) Calculating $P(k \mid DI)$ from the following formula:
Figure imgf000046_0002
wherein $P(k \mid DI)$ is the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental data (D) and background information (I).
5. A method according to Claim 1 wherein
Figure imgf000047_0001
6. A method according to Claim 1 wherein x is defined as
Figure imgf000047_0002
wherein $f(m_i)$ is a normalized distribution of theoretical masses of the database and wherein c is in the range of 0.1 to 100.
7. A method according to Claim 6 wherein c is
Figure imgf000047_0003
8. A method according to Claim 4 wherein the formula further comprises a function of j, designated as $y_j$, wherein
P(k I DI) oc P(k
Figure imgf000047_0004
9. A method according to Claim 8 wherein $y_j$ is (WJ'')'' , wherein W is a constant equal to or greater than one.
10. A method according to Claim 9 wherein W is four.
11. A method according to Claim 1 wherein the experimental mass data are generated by a computer.
12. A method according to Claim 1 wherein the experimental mass data are generated by a mass spectrometer.
13. A method according to Claim 1 wherein the theoretical mass data are generated by a computer.
14. A method according to Claim 1 wherein the background information (I) of the $P(k \mid I)$ comprises information about the species of the experimental biological molecule.
15. A method according to Claim 1 wherein the background information (I) of the $P(k \mid I)$ comprises knowledge or an assumption about the mass of the experimental biological molecule.
16. A method according to Claim 1 wherein the background information (I) of the $P(k \mid I)$ comprises information about the isoelectric point of the experimental biological molecule.
17. A method according to Claim 1 wherein the $P(k \mid I)$ is a $P(k \mid DI)$ obtained from previous experimental data generated for the experimental biological molecule.
18. A method according to Claim 17 wherein the previous experimental data is mass data of the experimental biological molecule.
19. A method according to Claim 1 wherein the mass range ($m_{min}$, $m_{max}$) is the minimum and maximum measured masses of the experimental biological molecule.
20. A method according to Claim 1 wherein the $F_{pattern}$ is calculated by taking a number greater than one and less than 1000 to the power of the quantity of occurrences of the pattern.
21. A method according to Claim 20 wherein the number is 2.5.
22. A method according to Claim 1 wherein the experimental mass data (D) is a subset of the experimental mass data (D).
23. A method according to Claim 1 further comprising: a) determining whether the data base includes biological molecules which form a homologous set; b) calculating the P(k|DI) for each of the biological molecules in the homologous set; and c) assigning the highest P(k|DI) to all of the homologous biological molecules in the homologous set.
24. A method according to Claim 23 wherein a homologous set of biological molecules are the biological molecules in the database which have the same theoretical masses associated with the hits for an experimental biological molecule, within a certain percentage.
25. A method according to Claim 24 wherein the percentage is over fifty percent.
26. A method according to Claim 23 wherein a homologous set of biological molecules are the biological molecules in the database which have the same theoretical masses associated with the hits for an experimental biological molecule, within a certain percentage, and which have the same amino acid sequences associated with the hits for an experimental biological molecule, within a certain percentage.
27. A method according to Claim 26 wherein the percentages are over fifty percent.
28. A method according to Claim 1 wherein the experimental and theoretical mass data is fragment mass data.
29. A method according to Claim 1 wherein the experimental biological molecule is a mixture of biological molecules.
30. A method according to Claim 29 wherein the data base comprises additive combinations of biological molecules.
31. The method of Claim 1 wherein the biological molecules are proteins.
32. The method of Claim 1 wherein the biological molecules are nucleic acid molecules.
33. The method of Claim 1 wherein the biological molecules are polysaccharides.
34. A means for determining the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental mass data (D) and background information (I) comprising, a) a means for generating experimental mass data (D) for the experimental biological molecule, wherein D comprises measured masses and standard deviations, σ, associated with the measured masses; b) a means for determining a mass range (mmin, mmax) for the experimental mass data; c) a means for generating theoretical mass data for the biological molecule k within the mass range (mmin, mmax); d) a means for counting the number of masses, N, in the theoretical mass data; e) a means for calculating the difference between each measured mass and each theoretical mass, wherein if one or more differences are within a mass tolerance for a particular measured mass, the particular measured mass is considered to be one hit; f) a means for designating each measured mass associated with a hit as m„ wherein i is an ordinal number from 1 to r, wherein r is the total number of hits for a particular biological molecule; g) a means for determining the difference between each measured mass, m_, associated with an /th hit and one of the theoretical masses, m,o, associated with the hit, h) a means for determining whether the measured mass data contains a digestion pattern, wherein each occurrence of such pattern is incorporated into a factor designated as Fpatlern; i) a means for determining Pfk\I) from background information (I) known about the experimental biological molecule, prior to consideration of the experimental mass data (D); j) a means for calculating P(k\DI) from the following formula:
Figure imgf000051_0001
wherein x is a function of the measured mass of the ith hit, and wherein P(k|DI) is the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental data (D) and background information (I).
35. A means for determining the probability that an experimental biological molecule is a biological molecule (k) according to Claim 34 further comprising: a) a means for counting the number of theoretical masses within the mass tolerance for each measured mass, m_i, wherein the total number of such theoretical masses is designated as g_i for a particular m_i; b) a means for determining the difference between each measured mass, m_i, associated with an ith hit and each theoretical mass, m_ij0, associated with the hit, wherein j is an ordinal number from 1 to g_i; and c) a means for calculating P(k|DI) from the following formula:
P(k|DI) ∝
Figure imgf000052_0001
wherein P(k|DI) is the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental data (D) and background information (I).
36. A computer program product comprising: a computer usable medium having computer readable program code means embodied in said medium for determining the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental mass data (D) and background information (I), said computer program product including: a) computer readable program code means for causing a computer to generate experimental mass data (D) for the experimental biological molecule, wherein D comprises measured masses and standard deviations, σ, associated with the measured masses; b) computer readable program code means for causing a computer to determine a mass range (m_min, m_max) for the experimental mass data; c) computer readable program code means for causing a computer to generate theoretical mass data for the biological molecule k within the mass range (m_min, m_max); d) computer readable program code means for causing a computer to count the number of masses, N, in the theoretical mass data; e) computer readable program code means for causing a computer to calculate the difference between each measured mass and each theoretical mass, wherein if one or more differences are within a mass tolerance for a particular measured mass, the particular measured mass is considered to be one hit; f) computer readable program code means for causing a computer to designate each measured mass associated with a hit as m_i, wherein i is an ordinal number from 1 to r, wherein r is the total number of hits for a particular biological molecule; g) computer readable program code means for causing a computer to determine the difference between each measured mass, m_i, associated with an ith hit and one of the theoretical masses, m_i0, associated with the hit; h) computer readable program code means for causing a computer to determine whether the measured mass data contains a digestion pattern, wherein each occurrence of such pattern is incorporated into a factor designated as F_pattern; i) computer readable program code means for causing a computer to determine P(k|I) from background information (I) known about the experimental biological molecule, prior to consideration of the experimental mass data (D); j) computer readable program code means for causing a computer to calculate P(k|DI) from the following formula:
Figure imgf000053_0001
wherein x is a function of the measured mass of the ith hit, and wherein P(k|DI) is the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental data (D) and background information (I).
37. A computer program product comprising: a computer usable medium having computer readable program code means embodied in said medium for determining the probability that an experimental biological molecule is a biological molecule (k) according to Claim 36 further comprising: a) computer readable program code means for causing a computer to count the number of theoretical masses within the mass tolerance for each measured mass, m_i, wherein the total number of such theoretical masses is designated as g_i for a particular m_i; b) computer readable program code means for causing a computer to determine the difference between each measured mass, m_i, associated with an ith hit and each theoretical mass, m_ij0, associated with the hit, wherein j is an ordinal number from 1 to g_i; and c) computer readable program code means for causing a computer to calculate P(k|DI) from the following formula:
Figure imgf000054_0001
wherein P(k|DI) is the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental data (D) and background information (I).
38. A method for determining the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental mass data (D) and background information (I), the method comprising, a) Generating experimental mass data (D) for the experimental biological molecule, wherein D comprises measured masses and standard deviations, σ, associated with the measured masses; b) Determining a mass range (m_min, m_max) for the experimental mass data; c) Generating theoretical mass data for the biological molecule k within the mass range (m_min, m_max); d) Counting the number of masses, N, in the theoretical mass data; e) Calculating the difference between each measured mass and each theoretical mass, wherein if one or more differences are within a mass tolerance for a particular measured mass, the particular measured mass is considered to be one hit; f) Designating each measured mass associated with a hit as m_i, wherein i is an ordinal number from 1 to r, wherein r is the total number of hits for a particular biological molecule; g) Counting the number of theoretical masses within the mass tolerance for each measured mass, m_i, wherein the total number of such theoretical masses is designated as g_i for a particular m_i; h) Determining the difference between each measured mass, m_i, associated with an ith hit and one of the theoretical masses, m_i0, associated with the hit or a calculated theoretical mass associated with the hit; i) Determining whether the measured mass data contains a digestion pattern, wherein each occurrence of such pattern is incorporated into a factor designated as F_pattern; j) Determining P(k|I) from information (I) known about the experimental biological molecule, prior to consideration of the experimental mass data (D); k) Calculating P(k|DI) from the following formula:
Figure imgf000055_0001
wherein P(k|DI) is the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental data (D) and background information (I).
39. A method according to Claim 30 wherein $\sum_{k \in \text{Database}} P(k \mid DI) = 1$.
40. A method according to Claim 38, step (h), wherein the one of the theoretical masses, m_i0, associated with the hit is the theoretical mass which produces the smallest difference between the measured mass, m_i, and the theoretical mass.
41. A method according to Claim 38, step (h), wherein the calculated theoretical mass associated with the hit is the average of the theoretical masses associated with the hit.
42. A method according to Claim 38 wherein the experimental mass data (D) is a subset of the experimental mass data (D).
43. A method according to Claim 38 wherein the experimental and theoretical mass data is fragment mass data.
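
The matching steps recited in Claim 38, steps (b) through (h), together with the two choices of theoretical mass in Claims 40 and 41 and the normalization in Claim 39, can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the claimed implementation: the names `Hit`, `find_hits`, and `normalize` are hypothetical, the mass range is assumed to be the smallest and largest measured mass, and the mass tolerance is assumed to be an absolute value in the same units as the masses. The sketch does not compute P(k|DI) itself, because the formula is available here only as a figure reference.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean


@dataclass
class Hit:
    """One measured mass m_i that matched at least one theoretical mass."""
    measured: float        # m_i
    matches: list[float]   # theoretical masses within tolerance; g_i = len(matches)
    nearest: float         # closest theoretical mass (the choice in Claim 40)
    average: float         # mean of the matching theoretical masses (Claim 41)


def find_hits(measured, theoretical, tolerance):
    """Return (N, hits) for one candidate molecule k.

    ASSUMPTION: (m_min, m_max) is taken as the min and max measured mass,
    and `tolerance` is an absolute mass difference.
    """
    m_min, m_max = min(measured), max(measured)                    # step (b)
    in_range = [t for t in theoretical if m_min <= t <= m_max]     # step (c)
    n = len(in_range)                                              # step (d)
    hits = []
    for m in measured:                                             # step (e)
        matches = [t for t in in_range if abs(m - t) <= tolerance]
        if matches:  # one or more matches count as a single hit
            hits.append(Hit(
                measured=m,
                matches=matches,                                   # step (g)
                nearest=min(matches, key=lambda t: abs(m - t)),    # step (h)
                average=mean(matches),
            ))
    return n, hits                                                 # r = len(hits)


def normalize(scores):
    """Rescale unnormalized scores so they sum to 1 over the database."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return {k: s / total for k, s in scores.items()}


if __name__ == "__main__":
    n, hits = find_hits(
        measured=[1000.0, 1500.2, 2000.0],
        theoretical=[900.0, 1000.1, 1500.0, 1500.3, 1800.0, 2500.0],
        tolerance=0.5,
    )
    print("N =", n, "r =", len(hits))
    for h in hits:
        print(h.measured, "g_i =", len(h.matches), "nearest =", h.nearest)
```

Here each measured mass contributes at most one hit regardless of how many theoretical masses fall within the tolerance, which mirrors the wording of step (e); the per-hit count of matches corresponds to g_i in step (g).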
PCT/US2000/014809 1999-05-27 2000-05-26 An expert system for protein identification using mass spectrometric information combined with database searching WO2000073787A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US13626799P 1999-05-27 1999-05-27
US60/136,267 1999-05-27

Publications (1)

Publication Number Publication Date
WO2000073787A1 (en) 2000-12-07

Family

ID=22472097

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2000/014809 WO2000073787A1 (en) 1999-05-27 2000-05-26 An expert system for protein identification using mass spectrometric information combined with database searching

Country Status (1)

Country Link
WO (1) WO2000073787A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5538897A (en) * 1994-03-14 1996-07-23 University Of Washington Use of mass spectrometry fragmentation patterns of peptides to identify amino acid sequences in databases
US6051378A (en) * 1996-03-04 2000-04-18 Genetrace Systems Inc. Methods of screening nucleic acids using mass spectrometry

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001096861A1 (en) * 2000-06-14 2001-12-20 Jan Eriksson System for molecule identification
WO2003038728A2 (en) * 2001-11-01 2003-05-08 Biobridge Computing Ab A computer system and method using mass spectrometry data and a protein database for identifying unknown proteins
WO2003038728A3 (en) * 2001-11-01 2003-11-06 Biobridge Computing Ab A computer system and method using mass spectrometry data and a protein database for identifying unknown proteins
DE102004051016A1 (en) * 2004-10-20 2006-05-04 Protagen Ag Method and system for elucidating the primary structure of biopolymers
CN103389335A (en) * 2012-05-11 2013-11-13 中国科学院大连化学物理研究所 Analysis device and method for identifying biomacromolecules
US11768201B1 (en) 2016-12-01 2023-09-26 Nautilus Subsidiary, Inc. Methods of assaying proteins
US11721412B2 (en) 2017-10-23 2023-08-08 Nautilus Subsidiary, Inc. Methods for identifying a protein in a sample of unknown proteins
EP3735259A4 (en) * 2017-12-29 2021-09-08 Nautilus Biotechnology, Inc. Decoding approaches for protein identification
US11282586B2 (en) 2017-12-29 2022-03-22 Nautilus Biotechnology, Inc. Decoding approaches for protein identification
US11282585B2 (en) 2017-12-29 2022-03-22 Nautilus Biotechnology, Inc. Decoding approaches for protein identification
US11545234B2 (en) 2017-12-29 2023-01-03 Nautilus Biotechnology, Inc. Decoding approaches for protein identification

Similar Documents

Publication Publication Date Title
US6393367B1 (en) Method for evaluating the quality of comparisons between experimental and theoretical mass data
Shevchenko et al. Peptide sequencing by mass spectrometry for homology searches and cloning of genes
Mann et al. Error-tolerant identification of peptides in sequence databases by peptide sequence tags
Fischer et al. Protein cleavage strategies for an improved analysis of the membrane proteome
US8278115B2 (en) Methods for processing tandem mass spectral data for protein sequence analysis
US20110093205A1 (en) Proteomics previewer
Liska et al. Combining mass spectrometry with database interrogation strategies in proteomics
US6446010B1 (en) Method for assessing significance of protein identification
US20020046002A1 (en) Method to evaluate the quality of database search results and the performance of database search algorithms
WO2000073787A1 (en) An expert system for protein identification using mass spectrometric information combined with database searching
US7691643B2 (en) Mass analysis method and mass analysis apparatus
WO2002031509A2 (en) Method for determining mass altering moiety in peptides
US8712695B2 (en) Method, system, and computer program product for scoring theoretical peptides
US9702882B2 (en) Method and system for analyzing mass spectrometry data
JP6489224B2 (en) Peptide assignment method and peptide assignment system
JP4702284B2 (en) Protein analysis method
WO2003075306A1 (en) Method for protein identification using mass spectrometry data
US20050192755A1 (en) Methods and systems for identification of macromolecules
US20040044481A1 (en) Method for protein identification using mass spectrometry data
US20020152033A1 (en) Method for evaluating the quality of database search results by means of expectation value
WO2001096861A1 (en) System for molecule identification
JP6003842B2 (en) Protein identification method and identification apparatus
Oh et al. Peptide identification by tandem mass spectra: an efficient parallel searching
Ramachandran et al. FPTMS: Frequency-based approach to identify the peptide from the low-energy collision-induced dissociation tandem mass spectra
US20050074816A1 (en) Method for protein identification from tandem mass spectral employing both spectrum comparison and de novo sequencing for biomedical applications

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): CA JP US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP