WO2000073787A1

WO2000073787A1 - An expert system for protein identification using mass spectrometric information combined with database searching

Info

Publication number: WO2000073787A1
Application number: PCT/US2000/014809
Authority: WO
Inventors: Wenzhu Zhang; Brian T. Chait; David FENYÖ; Chao Tang
Original assignee: Rockefeller University; Proteometrics, Llc
Priority date: 1999-05-27
Filing date: 2000-05-26
Publication date: 2000-12-07

Abstract

A method for determining the probability that an experimental biological molecule is a biological molecule described in a database given experimental mass data and background information.

Description

AN EXPERT SYSTEM FOR PROTEIN IDENTIFICATION USING MASS SPECTROMETRIC INFORMATION COMBINED WITH DATABASE SEARCHING

This application asserts priority of provisional application 60/136,267, filed on May 27, 1999, the contents of which are incorporated herein by reference.

BACKGROUND OF THE INVENTION

The rapid expansion of protein and DNA sequence databases together with technological improvements in biological mass spectrometry (MS) has made the combination of mass spectrometric peptide mapping with database searching (Henzel et al., 1993; Yates et al., 1993; Mann et al., 1993; James et al., 1993; Pappin et al., 1993) a superb method for rapid protein identification. The method (Fig. 1) involves cleavage of proteins with an enzyme having high specificity (usually trypsin), whereupon the resulting proteolytic products are subjected to analysis by either matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS) or electrospray ionization mass spectrometry (ESI-MS). Using a computer algorithm, the masses determined for the proteolytic peptides are compared with masses calculated for theoretically possible enzymatic cleavage products for every sequence in a protein/DNA sequence database. The protein is identified based on an evaluation of this comparison. This peptide mapping method for protein identification is fast because the mass spectra are rapidly collected (<1 min per spectrum for MALDI-time-of-flight analysis) and because the analysis can be performed on the same time-scale. The method is relatively insensitive to unspecified modifications and/or sequence errors in the database because high confidence identifications can be made even when the mapping experiment yields information on only a small percentage of the sequence.
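The comparison step described above relies on computing theoretical peptide masses for each database sequence. The following is a minimal sketch of that calculation, not the implementation used by the inventors. It assumes the commonly applied rule that trypsin cleaves C-terminal to lysine (K) or arginine (R) except when the next residue is proline (P), complete digestion with no missed cleavages, and unmodified peptides. The function names are illustrative only.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # H2O added to the residue sum of a free peptide


def tryptic_peptides(sequence):
    """Split a sequence after K or R, except when the next residue is P.

    Assumption of this sketch: complete digestion, no missed cleavages.
    """
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        is_last = i == len(sequence) - 1
        if residue in "KR" and not is_last and sequence[i + 1] != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of a neutral, unmodified peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER
```

For example, `tryptic_peptides("AKRPGKM")` yields `["AK", "RPGK", "M"]`: the R-P bond is left intact, while both K bonds are cleaved.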

Identification of proteins by the above-described approach requires a scheme for determining the best match between the experimental data and a sequence in the database. Existing schemes for determining the best match include ranking by number of matches (Henzel et al., 1993; Yates et al., 1993; Mann et al., 1993; James et al., 1993) and a scoring system based on the observed frequency of peptides from all proteins in a database in a given molecular weight range (the so-called "MOWSE score" (Pappin et al., 1993)). When the mass spectral data are incomplete (i.e., only a few peaks in the spectrum) and/or of low mass accuracy, the "number-of-matches" approach may be inadequate to make a useful identification. Although the "MOWSE" scoring scheme is superior to the "number-of-matches" approach, it does not take into account the individual properties of any given protein.

The object of the present invention is to provide a more accurate method to identify biological molecules.

SUMMARY OF THE INVENTION

The present invention provides a system for identifying biological molecules, for example proteins, using MS peptide mapping data. The system makes use of a Bayesian algorithm that takes into account individual properties of each protein in the database as well as other information relevant to the experiment. Bayesian probability theory has been widely used to make scientific inference from incomplete information in various disciplines, including biopolymer sequence alignment (Liu & Lawrence, 1999), NMR spectral analysis (Bretthorst, 1988) and radar target identification (Bretthorst, 1996). Here, Bayesian probability theory is applied to make logical inference about the identity of an unknown protein sample against a protein sequence database. The probability for the sample protein to be a specific protein in the database is calculated using the MS data as well as other background information such as protein mass range, species from which the protein originated, mass accuracy, enzyme cleavage chemistry, protein sequence, previous experiments on the sample protein, etc.

DESCRIPTION OF THE FIGURES

Figure 1 Flow chart showing protein identification by database searching in conjunction with a mass spectrometric peptide mapping experiment.

Figure 2 Delayed-extraction reflectron MALDI-TOF spectrum of an in-gel tryptic digest of a 30 kDa SDS-PAGE protein band. ProFound determined that the band was a single protein: RPS7A (40S ribosomal protein S7A). Trypsin self-digestion products are labeled with 'Trypsin'. The labeled masses are monoisotopic masses. Two peaks labeled with asterisks (*) have masses 16.0 Da higher than the adjacent peaks (see the discussion of the use of tag information).

Figure 3 Normalized probability distribution for top 20 protein candidates using data shown in Figure 2.

Figure 4 Sequence coverage map (A-C) and error map (D).

Figure 5 Delayed-extraction reflectron MALDI-TOF spectrum of an in-gel tryptic digest of an SDS-PAGE protein band. ProFound determined that the band was a mixture, identifying two protein components: YLR409c and YDL060w. Trypsin self-digestion products are labeled with 'Trypsin'. The labeled masses are monoisotopic masses.

Figure 6. Delayed-extraction reflectron MALDI-TOF spectrum of an in-gel tryptic digest of an SDS-PAGE protein band. ProFound determined that the band was a mixture, identifying two protein components: RPS1B and RPS1A. Trypsin self-digestion products are labeled with 'Trypsin'. Both monoisotopic and average (indicated by brackets) masses were simultaneously submitted to ProFound with separately specified mass tolerances.

Figure 7. Program flowchart.

DETAILED DESCRIPTION OF THE INVENTION

The present invention relates to improving current methods for identifying biological molecules. In one embodiment the invention provides a method for determining the probability that an experimental biological molecule is a particular biological molecule described in a database given certain experimental mass data and background information. Biological molecules include any biological polymer that can be degraded into constituent parts. The degradation is preferably into constituent parts at predictable positions to form predictable masses. Examples of biological molecules include proteins, nucleic acid molecules, polysaccharides and carbohydrates.

An experimental biological molecule is a biological molecule which is to be identified; the experimental biological molecule can also be referred to as an unknown biological molecule. A theoretical biological molecule is a known biological molecule described in a database.

Proteins are polymers of amino acids. Constituent parts of proteins comprise amino acids. A protein typically contains approximately at least ten amino acids, preferably at least fifty amino acids and more preferably at least 100 amino acids.

Nucleic acids are polymers of nucleotides. Constituent parts of nucleic acids comprise nucleotides. Typically, a nucleic acid contains at least 100 nucleotides, preferably at least 500 nucleotides.

Polysaccharides are polymers of monosaccharides. Constituent parts of polysaccharides comprise one or more monosaccharides. Typically, a polysaccharide contains at least five monosaccharides, preferably at least ten monosaccharides.

Mass data of biological molecules are quantifiable information about the masses of the constituent parts of the biological molecule. Mass data include individual mass spectra and groups of mass spectra. The mass spectra can be in the form of peptide maps, oligonucleotide maps or oligosaccharide maps.

The method of the present invention includes generating experimental mass data (D) for the experimental biological molecule within a certain mass range. D includes the measured masses and standard deviations, σ, associated with the measured masses. The method also includes generating theoretical mass data in the same mass range. In one embodiment the experimental mass data (D) is a subset of the experimental mass data (D).

For example, mass data for proteins can be generated in any manner that provides mass data within a certain accuracy. Examples include matrix-assisted laser desorption/ionization mass spectrometry, electrospray ionization mass spectrometry, chromatography and electrophoresis. Mass data can also be generated by a general purpose computer configured by software or otherwise.

For the purposes of the present invention the mass data, for example a peptide mass, m_i, is determined to an accuracy ±Δm_i, with Δm_i/m_i preferably <10,000 ppm, more preferably <100 ppm and most preferably <30 ppm.
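The relative accuracy Δm_i/m_i quoted above is expressed in parts per million. As a small illustration (the helper names are hypothetical and not part of the described system), the conversion between an absolute mass uncertainty and a ppm figure can be sketched as:

```python
def relative_accuracy_ppm(delta_m, m):
    """Relative accuracy delta_m / m, expressed in parts per million."""
    return abs(delta_m) / m * 1e6


def absolute_tolerance(m, ppm):
    """Absolute mass window (Da) that corresponds to a ppm accuracy at mass m."""
    return m * ppm / 1e6
```

Under these definitions an uncertainty of 0.03 Da on a 1000 Da peptide corresponds to 30 ppm, and a 30 ppm accuracy at 1500 Da corresponds to a window of 0.045 Da.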

A step in generating mass data of a biological molecule may include first cleaving the biological molecule into constituent parts. Biological molecules may be cleaved by methods known in the art. Preferably, the biological molecules are cleaved into constituent parts at predictable positions to form predictable masses. Methods of cleaving include chemical degradation of the biological molecules. Biological molecules may be degraded by contacting the biological molecule with any chemical substance.

For example, proteins may be predictably degraded into peptides by means of cyanogen bromide and enzymes, such as trypsin, endoproteinase Asp-N, V8 protease, endoproteinase Arg-C, etc. Nucleic acids may be predictably degraded into constituent parts by means of restriction endonucleases, such as EcoR I, Sma I, BamH I, Hinc II, etc. Polysaccharides may be degraded into constituent parts by means of enzymes, such as maltase, amylase, alpha-mannosidase, etc.

In the present invention a mass range (m_min, m_max) is determined for the experimental mass data. The mass range can be any mass range of the mass data. In one embodiment the mass range is bounded by the minimum and maximum measured masses of the experimental biological molecule mass data.

A biological molecule database is any compilation of information about characteristics of biological molecules. Databases are the preferred method for storing both polypeptide amino acid sequences and the nucleic acid sequences that code for these polypeptides. The databases come in a variety of different types that have advantages and disadvantages when viewed as the hypothesis for a polypeptide identification experiment.

While the "database entry" for an amino acid sequence may appear to be a simple text file to a user browsing for a particular polypeptide, many databases are organized into very flexible, complicated structures. The detailed implementation of the database on a particular system may be based on a collection of simple text files (a "flat-file" database), a collection of tables (a "relational" database), or it may be organized around concepts that stem from the idea of a protein, gene, or organism (an "object-oriented" database).

Protein mass data may be predicted from nucleic acid sequence databases. Alternatively, protein mass data may be obtained directly from protein sequence databases which contain a collection of amino acid sequences represented by a string of single-letter or three-letter codes for the residues in a polypeptide, starting at the N-terminus of the sequence. These codes may contain nonstandard characters to indicate ambiguity at a particular site (such as "B" indicating that the residue may be "D" (aspartic acid) or "N" (asparagine)). The sequences typically have a unique number-letter combination associated with them that is used internally by the database to identify the sequence, usually referred to as the accession number for the sequence.

Databases may contain a combination of amino acid sequences, comments, literature references, and notes on known posttranslational modifications to the sequence. A database that contains these elements is referred to as "annotated." Annotated databases are used if some functional or structural information is known about the mature protein, as opposed to a sequence that is known only from the translation of a stretch of nucleic acid sequence. Non-annotated databases only contain the sequence, an accession number, and a descriptive title. The background information known about an experimental biological molecule by which the data base search can be constrained can include any information. Some examples of background information include information about the species of the experimental biological molecule, knowledge or an assumption about the mass of the experimental biological molecule and the isoelectric point of the experimental biological molecule.

For example, the observed molecular mass or the observed isoelectric point of a protein can be used in combination with the measured masses of peptides generated by proteolysis to constrain the search for a polypeptide. In particular, the comparison between the theoretical mass data of the database proteins and the mass data of the unknown protein may be constrained to only those proteins of the database which are within a chosen mass range. The chosen mass range is preferably within 50% of the mass of the unknown protein, more preferably within 35%, most preferably within 25%. Similarly, the comparison between the theoretical mass data of the database proteins and the mass data of the unknown protein may be constrained to only those proteins of the database which are within a chosen isoelectric point range. The isoelectric point (pI) of a protein is the pH at which its net charge is zero. The chosen isoelectric point range is preferably within 50% of the isoelectric point of the unknown protein, more preferably within 35%, most preferably within 25%.
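The mass and isoelectric point constraints described above amount to a pre-filter on the database. A minimal sketch follows, assuming each database entry is available as a dictionary with the hypothetical keys "name", "mass" and "pi" (the source does not specify a record layout) and using the most preferred window of 25% as the default:

```python
def within_fraction(value, reference, fraction):
    """True when value lies within +/- fraction of reference."""
    return abs(value - reference) <= fraction * abs(reference)


def constrain_candidates(candidates, observed_mass=None, observed_pi=None,
                         mass_fraction=0.25, pi_fraction=0.25):
    """Keep database entries compatible with the observed mass and/or pI.

    candidates: iterable of dicts with hypothetical keys "name", "mass", "pi".
    A constraint is skipped when the corresponding observation is None.
    """
    kept = []
    for entry in candidates:
        if observed_mass is not None and not within_fraction(
                entry["mass"], observed_mass, mass_fraction):
            continue
        if observed_pi is not None and not within_fraction(
                entry["pi"], observed_pi, pi_fraction):
            continue
        kept.append(entry)
    return kept
```

Leaving both observations as `None` returns every entry, which corresponds to an unconstrained search.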

Optionally, further information of the experimental biological molecule, such as a protein's sequence, is obtained by generating fragment mass data of the experimental and theoretical biological molecules. Fragment mass data for a peptide can be generated in any manner which provides fragment mass data within a certain accuracy. Experimental conditions include the type of energy used to generate the fragment mass data. Vibrational excitation energy can be used. The vibrational excitation may be generated by collisions of the peptide with electrons, photons, gas molecules or a surface. Electronic excitation can be used. The electronic excitation may be generated by collisions of the peptide with electrons, photons, gas molecules (e.g. argon) or a surface.

In another example, the experimental fragment mass spectrum of a peptide from an enzymatically digested unknown protein is compared with the theoretical masses calculated by applying the rules for the specificity of the enzyme, and the rules for the fragmentation as known to those of ordinary skill in the art, to the amino acid sequence of a database protein. For example, the software tool PepFrag (ProteoMetrics) allows for searching protein or nucleotide sequence databases using a combination of mass spectra data and fragmentation mass spectra data.

Fragment mass data for the purposes of this invention can be generated by using multidimensional mass spectrometry (MS/MS), also known as tandem mass spectrometry. A number of types of mass spectrometers can be used including a triple-quadrupole mass spectrometer, a Fourier-transform cyclotron resonance mass spectrometer, a tandem time-of-flight mass spectrometer, and a quadrupole ion trap mass spectrometer. A single peptide from a protein digest is subjected to MS/MS measurement and the observed pattern of fragment ions is compared to the patterns of fragment ions predicted from database sequences.

In one embodiment the present invention provides a method to determine the probability that an experimental biological molecule is a biological molecule k described in a database given experimental mass data D and background information I. In one embodiment the probability, P(k|DI), is calculated from the following formula:

The difference between each measured mass of the experimental biological molecule and each theoretical mass of the biological molecule in the database is calculated. If one or more differences are within a mass tolerance for a particular measured mass, the particular measured mass is considered to be one hit; that is, the particular measured mass and the particular theoretical mass are considered to be matching. The total number of hits found for a particular experimental molecule for a particular database molecule is designated as r. Each measured mass associated with a hit is designated as m_i, wherein i is an ordinal number from 1 to r. The theoretical masses associated with the ith hit are designated as m_i0.

There can be more than one theoretical mass associated with the ith hit. The difference between each measured mass, m_i, associated with an ith hit and one of the theoretical masses associated with the ith hit is determined. Any one of the theoretical masses, m_i0, associated with the ith hit can be used to determine these differences. For example, the theoretical mass which produces the smallest difference between the measured mass, m_i, and the theoretical mass can be used. Alternatively, the average of the theoretical masses associated with the hit can be used to determine these differences.
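The hit-counting rules above can be sketched as follows. This is an illustrative reading of the text rather than the inventors' code: a fixed absolute tolerance is assumed for simplicity, and the function names are hypothetical.

```python
def find_hits(measured, theoretical, tolerance):
    """Match measured masses against the theoretical masses of one protein.

    Returns one (m_i, matches) pair per hit. A measured mass counts as a
    single hit however many theoretical masses fall within the tolerance;
    len(matches) is the multiplicity of that hit, and len(result) is r.
    """
    hits = []
    for m in measured:
        matches = [t for t in theoretical if abs(m - t) <= tolerance]
        if matches:
            hits.append((m, matches))
    return hits


def representative_mass(m, matches, use_average=False):
    """Choose m_i0: the closest theoretical mass, or the average of all matches."""
    if use_average:
        return sum(matches) / len(matches)
    return min(matches, key=lambda t: abs(m - t))
```

Both choices of m_i0 named in the text are covered: the default picks the theoretical mass with the smallest difference, and `use_average=True` uses the mean of the matching theoretical masses.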

N is the quantity of masses in the theoretical mass data and x is a function of the measured mass of the ith hit.

P(k|I) is determined from background information (I) known about the experimental biological molecule, prior to consideration of the experimental mass data (D). This background information can be any information about the experimental biological molecule. In one embodiment the P(k|I) is a P(k|DI) obtained from previous experimental data generated for the experimental biological molecule.

The formula includes a factor which incorporates a determination of whether the measured mass data contains certain digestion patterns; the factor is designated as F_pattern. The digestion patterns can be any digestion pattern that can be observed for biological molecules. Examples of particular digestion patterns for proteins are described below. If certain patterns occur, the P(k|DI) will increase accordingly. In one embodiment, F_pattern is calculated by taking a number greater than one and less than 1000 to the power of the quantity of occurrences of such patterns. In a preferred embodiment the number is from 1.5 to 10; most preferably the number is 2.5.

In one embodiment the above formula can further include information regarding each theoretical mass associated with an ith hit. In this embodiment the number of theoretical masses within the mass tolerance for each measured mass, m_i, is counted; and the total number of such theoretical masses is designated as g_i for a particular m_i. The difference between each measured mass, m_i, associated with an ith hit and each theoretical mass, m_ij0, associated with the hit, is determined, wherein j is an ordinal number from 1 to g_i. The P(k|DI) is then calculated from the following formula:

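$$P(k \mid DI) \propto P(k \mid I)\,\frac{(N-r)!}{N!}\,\prod_{i=1}^{r}\left\{ x \sum_{j=1}^{g_i} \exp\left[-\frac{(m_i - m_{ij0})^2}{2\sigma^2}\right] \right\} F_{\text{pattern}}$$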
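The empirical pattern factor described above can be sketched in a few lines. The representation of a matched peptide as an inclusive (start, end) pair of residue positions, the rule of counting each qualifying pair of peptides once, and the function names are assumptions of this sketch. The two patterns checked are the ones named later in the program description: adjacency (one peptide begins where the other ends) and common-end overlapping (two peptides share exactly one terminus).

```python
from itertools import combinations


def count_patterns(peptides):
    """Count adjacency and common-end overlap among matched peptides.

    peptides: (start, end) residue positions, inclusive, within one protein.
    Assumption: every qualifying pair of distinct peptides counts once.
    """
    occurrences = 0
    for (s1, e1), (s2, e2) in combinations(sorted(set(peptides)), 2):
        adjacent = e1 + 1 == s2 or e2 + 1 == s1
        common_end = (s1 == s2) != (e1 == e2)  # exactly one shared terminus
        if adjacent or common_end:
            occurrences += 1
    return occurrences


def pattern_factor(occurrences, base=2.5):
    """F_pattern: the base raised to the number of pattern occurrences."""
    return base ** occurrences
```

With the most preferred base of 2.5, two pattern occurrences multiply the probability by 6.25, and a protein with no such patterns is left unchanged (factor 1).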

In one embodiment in the above formulae x is defined as

wherein f(m_i) is a normalized distribution of theoretical masses of the database and wherein c is in the range of 0.1 to 100. A more detailed explanation of the normalized mass distribution f(m) can be found in Appendix 3.

In one embodiment c is

In one embodiment the above probability formulae can further include a function of j, designated as y_j, incorporated into the formulae as follows:

The function y_j can be any function of j. In a preferred embodiment y_j can be defined as (W^(j-1))^(-1), wherein W is a constant equal to or greater than one. In a preferred embodiment W is four.

All of the above probability formulae can be normalized by the following calculation:

The probability formulae of the present invention can be used to identify components of mixtures of biological molecules. For example, a database can be extended to contain entries which are additive combinations of the single proteins of a database. In one embodiment of the present invention a P(k|DI) is assigned to each theoretical protein in the database for data generated from a particular experimental protein; and the theoretical proteins which have the highest P(k|DI) are chosen. The highest P(k|DI) can be from the top 50% of the database proteins to the top 0.01% of the database proteins. From these chosen proteins a new database is formed which contains additive combinations of the chosen proteins. Additive combinations are database proteins added together in various combinations. P(k|DI) calculations are performed using this new database.

In one embodiment the present invention provides a means for determining the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental mass data (D) and background information (I). The means is any means by which the probability can be determined. For example, the means includes a computer or mass spectra, as would be recognized by a person skilled in the art. Included in the means is a means for generating experimental mass data (D) for the experimental biological molecule, wherein D comprises measured masses and standard deviations, σ, associated with the measured masses; a means for determining a mass range (m_min, m_max) for the experimental mass data; a means for generating theoretical mass data for the biological molecule k within the mass range (m_min, m_max); a means for counting the number of masses, N, in the theoretical mass data; a means for calculating the difference between each measured mass and each theoretical mass, wherein if one or more differences are within a mass tolerance for a particular measured mass, the particular measured mass is considered to be one hit; a means for designating each measured mass associated with a hit as m_i, wherein i is an ordinal number from 1 to r, wherein r is the total number of hits for a particular biological molecule; a means for determining the difference between each measured mass, m_i, associated with an ith hit and one of the theoretical masses, m_i0, associated with the hit; a means for determining whether the measured mass data contains a digestion pattern, wherein each occurrence of such pattern is incorporated into a factor designated as F_pattern; a means for determining P(k|I) from background information (I) known about the experimental biological molecule, prior to consideration of the experimental mass data (D); and a means for calculating P(k|DI) from the following formula:

wherein x is a function of the measured mass of the ith hit, and wherein P(k|DI) is the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental data (D) and background information (I).

In one embodiment the means for determining the probability that an experimental biological molecule is a biological molecule (k) further includes a means for counting the number of theoretical masses within the mass tolerance for each measured mass, m_i, wherein the total number of such theoretical masses is designated as g_i for a particular m_i; a means for determining the difference between each measured mass, m_i, associated with an ith hit and each theoretical mass, m_ij0, associated with the hit, wherein j is an ordinal number from 1 to g_i; and a means for calculating P(k|DI) from the following formula:

wherein P(k|DI) is the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental data (D) and background information (I).

In another embodiment the present invention provides a computer program product including a computer usable medium having computer readable program code means embodied in said medium for determining the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental mass data (D) and background information (I). The computer program product includes computer readable program code means for causing a computer to generate experimental mass data (D) for the experimental biological molecule, wherein D comprises measured masses and standard deviations, σ, associated with the measured masses; computer readable program code means for causing a computer to determine a mass range (m_min, m_max) for the experimental mass data; computer readable program code means for causing a computer to generate theoretical mass data for the biological molecule k within the mass range (m_min, m_max); computer readable program code means for causing a computer to count the number of masses, N, in the theoretical mass data; computer readable program code means for causing a computer to calculate the difference between each measured mass and each theoretical mass, wherein if one or more differences are within a mass tolerance for a particular measured mass, the particular measured mass is considered to be one hit; computer readable program code means for causing a computer to designate each measured mass associated with a hit as m_i, wherein i is an ordinal number from 1 to r, wherein r is the total number of hits for a particular biological molecule; computer readable program code means for causing a computer to determine the difference between each measured mass, m_i, associated with an ith hit and one of the theoretical masses, m_i0, associated with the hit; computer readable program code means for causing a computer to determine whether the measured mass data contains a digestion pattern, wherein each occurrence of such pattern is incorporated into a factor designated as F_pattern; computer readable program code means for causing a computer to determine P(k|I) from background information (I) known about the experimental biological molecule, prior to consideration of the experimental mass data (D); and computer readable program code means for causing a computer to calculate P(k|DI) from the following formula:

wherein x is a function of the measured mass of the ith hit, and wherein P(k|DI) is the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental data (D) and background information (I).

In one embodiment the invention provides a computer program product which includes a computer usable medium having computer readable program code means embodied in said medium for determining the probability that an experimental biological molecule is a biological molecule (k) further including computer readable program code means for causing a computer to count the number of theoretical masses within the mass tolerance for each measured mass, m_i, wherein the total number of such theoretical masses is designated as g_i for a particular m_i; computer readable program code means for causing a computer to determine the difference between each measured mass, m_i, associated with an ith hit and each theoretical mass, m_ij0, associated with the hit, wherein j is an ordinal number from 1 to g_i; and computer readable program code means for causing a computer to calculate P(k|DI) from the following formula:

In another embodiment the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental mass data (D) and background information (I) is calculated from the following formula:

The variables are as described above. Additionally the probability can be normalized as above.

Additionally, the present invention includes a means for determining the P(k|DI) of the above formula. The means is any means by which the probability can be determined. For example, the means includes a computer or mass spectra, as would be recognized by a person skilled in the art. Additionally, the present invention includes a computer program product for determining the P(k|DI) of the above formula.

Methods

1) Algorithm

Every protein is specified by its particular linear sequence of amino acids. One defining signature of a protein is the set of masses of peptide fragments produced by cleavage of the protein by an enzyme of high cleavage specificity. The problem we seek to solve is to use the peptide masses obtained in such a mass spectrometric peptide mapping experiment to identify a protein from a protein sequence database.

Let k designate the hypothesis that 'protein k is the protein being analyzed', where protein k is an entry in the protein sequence database; D is the experimental data; and I is the available background information (e.g., species from which the protein originated, approximate molecular mass of the protein, mass accuracy of the peptide mass measurement, enzyme cleavage chemistry, previous experiments on the sample protein). Bayes' probability theory and the maximum entropy principle (Bretthorst, 1996) are applied to derive the probability for the hypothesis k given data D and background information I (Appendix 1). In the derivation, the following assumptions are made: 1) the protein being analyzed exists in the database and 2) all the detected ion species are digestion products of the protein. The probability for each hypothesis k given data D and background information I is given by (Appendix 1)

$$P(k \mid DI) \propto P(k \mid I) \, \cdots$$

with normalization condition

$$\sum_{k \in \text{Database}} P(k \mid DI) = 1$$

The above formula can be approximated as,

where P(k|I) is the probability for hypothesis k given only the background information, I; N is the theoretical number of peptides generated by fragmentation of protein k by the protease used in the study; r is the number of hits (i.e., the number of matches between the measured and calculated peptide masses); g_i is the multiplicity of the ith hit (i.e., the number of theoretical peptides that match a given experimental peptide mass value); (m_max - m_min) is the range of measured peptide masses; σ is the standard deviation of the mass measurement; m_i is the measured mass of the ith hit; m_i0 is the calculated mass of the ith hit; and F_pattern is an empirical term, which increases the probability when overlapping and/or adjacent peptides are observed (see program description).
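To make the role of each term concrete, the sketch below combines them into an unnormalized log-probability for one database protein. It is an illustration under stated assumptions, not the formula of the source: it assumes the score is the product of the prior P(k|I), a combinatorial factor (N - r)!/N!, one Gaussian term per hit summed over the g_i matching theoretical masses, and F_pattern, and it further assumes a per-hit prefactor of sqrt(2/π)·(m_max - m_min)/σ. The function name and argument layout are hypothetical.

```python
import math


def log_score(prior, n_theoretical, hits, sigma, m_min, m_max, f_pattern=1.0):
    """Unnormalized log-probability for one database protein (assumed form).

    prior: P(k|I); n_theoretical: N; sigma: standard deviation of the masses.
    hits: list of (m_i, [theoretical masses matching m_i]); len(hits) is r.
    """
    r = len(hits)
    if r > n_theoretical:
        raise ValueError("more hits than theoretical peptides")
    # log[(N - r)! / N!] through log-gamma, which avoids huge factorials
    log_p = (math.log(prior)
             + math.lgamma(n_theoretical - r + 1)
             - math.lgamma(n_theoretical + 1))
    # Assumed per-hit prefactor: sqrt(2/pi) * (m_max - m_min) / sigma
    log_prefactor = math.log(math.sqrt(2.0 / math.pi) * (m_max - m_min) / sigma)
    for m_i, matches in hits:
        # Sum of Gaussian terms over the g_i theoretical masses of this hit
        gauss_sum = sum(math.exp(-(m_i - t) ** 2 / (2.0 * sigma ** 2))
                        for t in matches)
        log_p += log_prefactor + math.log(gauss_sum)
    return log_p + math.log(f_pattern)
```

Working in log space keeps the product of many small factors from underflowing; the scores of different proteins only become probabilities after normalization over the whole database.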

The ranking of the candidate proteins is based on the values determined for their probability P(k|DI).
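The normalization condition and the ranking step can be sketched together. The example assumes that unnormalized log-probabilities have already been computed for every candidate and are held in a dictionary keyed by protein identifier; that layout and the function name are illustrative.

```python
import math


def normalize_and_rank(log_scores):
    """Turn unnormalized log scores into probabilities that sum to one.

    log_scores: dict mapping protein identifier -> unnormalized log P(k|DI).
    Returns (identifier, probability) pairs, most probable first.
    """
    peak = max(log_scores.values())
    # Subtract the maximum before exponentiating to avoid overflow.
    weights = {k: math.exp(v - peak) for k, v in log_scores.items()}
    total = sum(weights.values())
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [(k, w / total) for k, w in ranked]
```

Because only ratios of scores matter, any constant factor shared by all candidates cancels in the normalization.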

2) Interpretation of the probability P(k|DI)

The Bayesian probability is consistent with common sense. For any given protein k in the database, the probability that protein k is the sample protein increases with increasing number of hits r, increasing mass accuracy (i.e., smaller σ and |m_i - m_i0|), and decreasing number of theoretically digested fragments N. It has been shown previously that tryptic peptides of higher molecular mass occur with lower frequency than do those with lower molecular mass (Pappin et al., 1993; Fenyö et al., 1998), and are therefore more constraining for protein identification. The present algorithm takes into account this different information content of peptides with different masses through the normalization condition given above.

The Bayesian probability should be viewed as a measure of confidence level of the hypothesis that protein k is the sample protein based on available information. There is no absolute certainty for any given identification, only the probability - i.e., the higher the probability, the higher is the confidence level. In a critical situation (i.e., where a false positive result cannot be tolerated), it may be desirable to check the identification with an independent method such as tandem mass spectrometry (MS/MS) (Qin et al., 1997). Ultimately, the value of any given identification is provided by the outcome of the biological experiment that results from the information.

3) Identification of components in protein mixtures

Frequently, it proves difficult to separate proteins completely from one another, and a protein sample may contain a mixture of proteins. The Bayesian algorithm can be readily extended to identify the components of such mixtures. The protein sequence database is expanded to include entries that are 'fused' combinations of single protein sequences. At the present time, the program will identify the components of binary mixtures. Thus, the entries representing binary mixtures are binary combinations of single proteins (usually, the top 50 hits obtained in a prior search for single proteins). The Bayesian probabilities for these 'fused' proteins are calculated in the same way as for single proteins.
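The construction of 'fused' entries can be sketched as follows. This is a minimal illustration, not the ProFound implementation: the function name `fuse_candidates`, the dictionary layout, and the example masses are assumptions made for the sketch.

```python
from itertools import combinations

def fuse_candidates(top_hits, peptides_by_protein):
    """Build 'fused' entries for every unordered pair of top-ranked single proteins."""
    fused = {}
    for a, b in combinations(top_hits, 2):
        # A fused entry carries the theoretical peptide masses of both proteins.
        fused[(a, b)] = peptides_by_protein[a] + peptides_by_protein[b]
    return fused

# Hypothetical theoretical peptide masses (Da) for three single-protein candidates.
peptides = {"P1": [500.3, 1200.6], "P2": [800.4], "P3": [950.5, 1500.7, 2100.9]}
pairs = fuse_candidates(["P1", "P2", "P3"], peptides)
print(len(pairs))
print(pairs[("P1", "P3")])
```

With the top 50 single-protein hits, this pairing yields 50 x 49 / 2 = 1225 fused entries, each of which can then be scored in the same way as a single protein.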

4) Improvement of the confidence level of protein identification using additional information obtained for the measured peptides

The Bayesian algorithm can incorporate any additional information obtained for measured peptides (Appendix 1). The additional information constrains the database search, reducing the occurrence of database peptides that randomly match the experimental mass spectral data and thereby improving the confidence level for identifications. Fenyö et al. have investigated the value offered by knowledge of the presence (or absence) and number of particular amino acids contained within a given peptide (so-called 'tag information') (Fenyö et al., 1998). Experimentally, tag information can be obtained in a number of ways. For example, cysteine residues can be identified through chemical alkylation of free thiol moieties (Sechi & Chait, 1998), and methionine residues can be inferred by observation of pairs of peaks separated by 16 Da (because methionine residues contained in proteolytic peptides are frequently partially oxidized).
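The search for methionine-containing peptides by peak spacing can be sketched as below. The sketch assumes monoisotopic masses, for which one added oxygen corresponds to 15.9949 Da; the function name `oxidation_pairs`, the tolerance, and the peak list are illustrative assumptions.

```python
OXYGEN_MONO = 15.9949  # monoisotopic mass of one oxygen atom, in Da

def oxidation_pairs(masses, tol=0.05):
    """Return (m, m_ox) pairs whose spacing matches one added oxygen within tol."""
    ordered = sorted(masses)
    pairs = []
    for i, low in enumerate(ordered):
        for high in ordered[i + 1:]:
            if abs((high - low) - OXYGEN_MONO) <= tol:
                pairs.append((low, high))
    return pairs

# Hypothetical peak list (Da): two peptides appear with their oxidized forms.
peaks = [1045.56, 1061.55, 1300.70, 1478.80, 1494.79]
print(oxidation_pairs(peaks))
```

Each reported pair suggests that the lower-mass peptide contains at least one methionine, which can then be supplied to the search as an amino acid tag.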

5) Program

The database searched is the NCBI NR non-redundant database (URL http://www.ncbi.nlm.nih.gov/BLAST/blast_databases.html). The presence or absence of signal peptides is considered when such information is available in the corresponding NCBI GenPept format flatfile (URL http://www.ncbi.nlm.nih.gov/Entrez/batch.html). Taxonomy data is derived from the NCBI GenPept flatfiles and Taxonomy databases (URL http://www.ncbi.nlm.nih.gov/Taxonomy/). To counter the effect of mass-independent systematic errors in the mass measurements, the mean of the experimental minus calculated masses (for all hits of a given protein in the database) is removed during the probability calculation. An empirical factor has been introduced in the probability calculation to take into account two kinds of commonly observed digestion patterns. The first pattern, which we term adjacency, occurs when proteolytic peptides are observed to be adjacent to one another in the protein sequence (see Fig. 4A). The second pattern, which we term common-end overlapping, occurs when the observed peptides have one common terminus but differ at the other terminus by a peptide segment (see Fig. 4A). Upon each occurrence of adjacency or common-end overlapping, the probability is increased by a factor of 2.5. To increase the speed of the identification program, the search is performed in two steps. In the first step, the database is searched using only the dominant term in the probability equation given above.

The top 1500 protein candidates selected using this simplified formula are then reanalyzed using the full equation (Eq. 1).
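The mean-error correction and the empirical pattern factor described above can be sketched as follows. The representation of each matched peptide as a (first_segment, last_segment) pair of complete-digestion segment indices, the pairwise counting of occurrences, and both function names are assumptions made for this illustration.

```python
from itertools import combinations

def pattern_factor(peptides, boost=2.5):
    """peptides: (first_segment, last_segment) index pairs of the matched peptides."""
    occurrences = 0
    for (a0, a1), (b0, b1) in combinations(sorted(set(peptides)), 2):
        # Adjacency: one peptide starts at the segment right after the other ends.
        adjacent = a1 + 1 == b0 or b1 + 1 == a0
        # Common-end overlap: one shared terminus, one segment apart at the other.
        common_end = (a0 == b0 and abs(a1 - b1) == 1) or (a1 == b1 and abs(a0 - b0) == 1)
        if adjacent or common_end:
            occurrences += 1
    return boost ** occurrences

def remove_mean_error(measured, calculated):
    """Subtract the mean of (measured - calculated) from each hit's error."""
    errors = [m - c for m, c in zip(measured, calculated)]
    mean = sum(errors) / len(errors)
    return [e - mean for e in errors]

factor = pattern_factor([(0, 0), (1, 1), (4, 5), (4, 4)])
centered = remove_mean_error([1000.1, 2000.1], [1000.0, 2000.0])
print(factor)
```

In this example one adjacency, (0, 0) with (1, 1), and one common-end overlap, (4, 4) with (4, 5), are found, so the factor is 2.5 squared.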

The NCBI NR database is not truly non-redundant: it contains protein sequences with only trivial differences (highly homologous sequences). A method was therefore devised to detect such redundancy from the experimental data and to take it into account. When more than a certain percentage (over 50%) of the matches of two protein candidates are common, with calculated masses agreeing to 10^-5 Da for average masses or 10^-6 Da for monoisotopic masses, the two protein sequences are considered homologous. Among the homologous sequences, those that do not have the highest likelihood are removed from the list used for the normalization calculation.

In one embodiment the method of the present invention includes determining whether the database includes biological molecules which form a homologous set. Whether biological molecules in the database are homologous can be determined by various methods. For example, a homologous set of biological molecules consists of the biological molecules in the database which have the same theoretical masses associated with the hits for an experimental biological molecule, within a certain percentage. This percentage can be any percentage. A preferred percentage is over fifty percent. By another method, a homologous set of biological molecules consists of the biological molecules in the database which have the same theoretical masses associated with the hits for an experimental biological molecule, within a certain percentage, and which have the same amino acid sequences associated with the hits for an experimental biological molecule, within a certain percentage. Preferred percentages for both the mass and sequence information are over fifty percent.

In one example, the P(k|DI) is calculated for each of the biological molecules in the homologous set; and the highest P(k|DI) is assigned to all of the homologous biological molecules in the homologous set.
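The mass-based redundancy test can be sketched as below, following the variant in which lower-scoring homologs are removed before normalization. The choice of the smaller hit list as the denominator for the shared fraction, the function names, and the example data are assumptions of this sketch.

```python
def are_homologous(matches_a, matches_b, mass_tol=1e-5, min_shared=0.5):
    """matches_*: calculated masses (Da) of the hits for each candidate protein."""
    shared = sum(
        1 for ma in matches_a
        if any(abs(ma - mb) <= mass_tol for mb in matches_b)
    )
    smaller = min(len(matches_a), len(matches_b))
    return smaller > 0 and shared / smaller > min_shared

def keep_best(candidates):
    """candidates: (name, likelihood, matches). Drop homologs of a better-scoring entry."""
    kept = []
    for name, score, matches in sorted(candidates, key=lambda c: -c[1]):
        if not any(are_homologous(matches, m) for _, _, m in kept):
            kept.append((name, score, matches))
    return [name for name, _, _ in kept]

# Hypothetical candidates: B shares three of four matched masses with A.
cands = [
    ("A", 0.90, [1000.5, 1200.6, 1500.7, 1800.8]),
    ("B", 0.10, [1000.5, 1200.6, 1500.7, 1900.9]),
    ("C", 0.05, [700.3, 900.4]),
]
print(keep_best(cands))
```

Here B is dropped as a homolog of the higher-scoring A, while the unrelated C is retained.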

Figure 7 shows the program flowchart.

5.1) Program input

1. Taxonomic category: A representation of a phylogenetic tree is provided through which the user can specify the origin of the sample protein, if known.

2. Search mode: The program can be set to search in either "Single protein only" mode or "Single or binary mixture" mode.

3. Mass range: If known, the approximate protein mass range of the sample protein can be specified.

4. Number of candidate proteins: The number of top candidate proteins in the output display can be specified.

5. Digestion chemistry: The proteolytic enzyme or chemical reagent used to cleave the sample protein(s) must be specified. Current choices are trypsin, endoproteinase Arg-C, endoproteinase Asp-N, endoproteinase Lys-C, V8 protease (cleavage at D and E), V8 protease (cleavage at E), and cyanogen bromide.

6. The maximum number of missed cleavage sites: The maximum number of missed cleavage sites within the peptide (yielding incompletely cleaved peptides) must be specified. Allowed values are in the range 0 - 4.

7. Peptide masses and mass tolerances: There are three alternative methods for specifying the masses of peptides used to search the database. These are average mass, monoisotopic mass, and a combination of average and monoisotopic masses (this latter alternative is useful when only some of the peaks in the mass spectrum are isotopically resolved). The mass tolerances for average and monoisotopic masses are specified independently (either as an absolute mass or as a relative tolerance). Because there is a 95% probability for Gaussian distributed measurement errors to be within ± 2σ (where σ is the standard deviation), the mass tolerance is taken as 2σ. Either neutral or protonated peptide masses can be specified.

8. Amino acid tags: When peptides have been experimentally determined to contain particular amino acids, amino acid tags can be associated with the masses of the peptides.

9. Additional digests: If a sample has been separately digested by different enzymes, the data from each different digestion can be fed into the program and used for protein identification.

10. Modifications: At present, modifications of cysteine residues can be specified. In the future, modifications at other amino acid residues will be incorporated.
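The interplay of the digestion chemistry and the maximum number of missed cleavage sites (inputs 5 and 6) can be sketched for trypsin as follows. The sketch assumes the common convention that trypsin cleaves after K or R except when the next residue is P; the function name and the example sequence are hypothetical.

```python
def tryptic_peptides(sequence, max_missed=2):
    """Return peptides from cleavage after K or R (not before P), allowing missed sites."""
    # Positions just past each cleavage site, plus both sequence ends.
    cuts = [0]
    for i, residue in enumerate(sequence[:-1]):
        if residue in "KR" and sequence[i + 1] != "P":
            cuts.append(i + 1)
    cuts.append(len(sequence))
    peptides = []
    for start in range(len(cuts) - 1):
        # Joining 1 + missed consecutive segments models `missed` missed cleavages.
        for missed in range(max_missed + 1):
            end = start + 1 + missed
            if end < len(cuts):
                peptides.append(sequence[cuts[start]:cuts[end]])
    return peptides

digest = tryptic_peptides("MKTAYRPGKLLR", max_missed=1)
print(digest)
```

The number of peptides produced this way is the quantity N that enters the probability calculation, so allowing more missed cleavages enlarges N for every candidate protein.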

5.2) Program output

The ProFound output consists of a search result page, which is hyperlinked to pages that provide details about the search results. The search result page consists of a list of protein candidates ranked by probability as well as a summary of the input data and search parameters. The sequences of the candidate proteins can be retrieved through links and can be further analyzed by sequence analysis tools contained in PROWL, an interactive environment on the World Wide Web for protein MS (Fenyö et al., 1996). Hyperlinked output pages show graphical and text representations of the matched peptides from the protein candidates. For each candidate protein, graphs are provided to allow the user to quickly assess the experimental peptide mass coverage of the protein and the mass measurement errors (Fig. 4).

6) Biochemical procedures and mass spectrometry

The method used for in-gel protein digestions was as described previously [13] except that the gel-band soak time was extended from 4 hours to 24 hours. Trypsin digestion of membrane-bound proteins was as described previously (Zhang et al., 1994). MALDI-time-of-flight (TOF) MS was carried out using a commercial instrument (PerSeptive Biosystems STR, Framingham, MA) operated in the delayed-extraction reflector mode (FWHM resolution ~5000) or an instrument constructed in-house (Beavis & Chait, 1989, 1990) operated in the continuous-extraction linear mode (FWHM resolution ~500). The MALDI-ion trap data were obtained using an instrument constructed in-house and described previously.

EXPERIMENTS

Identification of single isolated proteins

Figure 2 shows a delayed-extraction reflectron MALDI-TOF spectrum of the mixture of peptides produced by in-gel trypsin digestion of a 30 kDa SDS-PAGE protein band from a Saccharomyces cerevisiae nuclear extract. Thirty-five monoisotopic masses derived from Figure 2 were submitted to ProFound in order to identify the protein. Other search parameters were: S. cerevisiae for the taxonomic category; a protein mass range of 0-3,000 kDa; unmodified cysteines; a maximum of 2 missed cleavage sites; and a mass tolerance of 0.1 Da. The specified taxonomic category and protein mass range include the complete set of proteins (or open reading frames (ORFs)) in the S. cerevisiae genome. Table 1 lists the top 4 protein candidates (ranked by normalized probability) found by the search. The top-ranked protein, the ribosomal protein S7A, has a probability of 1 and is readily distinguished from the next ranked candidates, which have probabilities of 2 x 10^-51, 8 x 10^-53 and 5 x 10^-53, respectively. We plot the probabilities of the top 20 candidates in Figure 3. The probability is observed to make a large transition from the first to second candidate, and varies much more slowly for the remaining candidates. This type of probability distribution pattern provides an unambiguous high confidence identification signature for the top ranked protein.

Figures 4A-C show sequence coverage maps and an error map for the top ranked candidate. The segment coverage map (Fig. 4A) (in which a segment represents a peptide resulting from complete digestion of the protein by trypsin) is useful for visualizing digestion patterns indicative of an authentic protein identification. Bona fide identifications are often characterized by the observation of peptides that are adjacent to one another in the sequence and/or that overlap and have a common terminus (while differing by one segment at the other terminus). Examples of these two commonly observed patterns are shown in Figure 4A. Because the observation of such patterns raises our confidence level that a candidate protein is present in the sample, we have empirically included a term in the ProFound probability calculation to incorporate this information. The sequence residue coverage map (Fig. 4B) shows the portion of the ribosomal protein S7A sequence that was observed in the MS peptide mapping experiment. Twenty-three measured masses match 24 theoretical tryptic peptide masses from the ribosomal protein S7A, covering 70% of the sequence. The error map (Fig. 4C) provides a scatter plot showing error (i.e., measured mass - calculated mass) versus mass for each match. The scatter plot is useful for visualizing systematic errors in the mass measurement. When the spectral calibration is free of systematic error, the errors for an authentic hit are normally distributed about zero and are independent of mass value, as in Figure 4C. The bottom portion of Figure 4C is a histogram projection of the scatter plot. In cases where there is a sufficient number of matched peaks, the histogram of errors for an authentic hit shows a peaked distribution (Fig. 4C). By contrast, the error plot for a randomly hit protein (e.g., the second candidate, myol, Fig. 4D) is a nearly uniform distribution of mass errors within the plotted range. We note, however, that this observation only holds true when the error of the mass measurement is either quite small (< 0.1 Da) or relatively large (> 0.5 Da) (Appendix 2).

Identification of protein components in binary mixtures

The MALDI-TOF mass spectrum shown in Figure 5 was obtained from the products of in-gel digestion of a 105 kDa SDS-PAGE protein band. Peptide masses consisting of 47 monoisotopic masses were submitted to ProFound. (Sometimes the resolution or statistics for higher mass peaks are insufficient for unambiguous identification of the monoisotopic component. In such cases, we determine the average mass of the peptide.) Other search parameters were: S. cerevisiae as taxonomic category; protein mass range of 0-3000 kDa; unmodified cysteines; a maximum of 2 missed cleavage sites; mass tolerances of 1 Da for average masses and 0.1 Da for monoisotopic masses. When the search was performed in the 'single protein' mode, the top two candidates (YLR409c and YDL060w) had probabilities of 0.99 and 0.01, respectively, which were considerably higher than the probabilities for all the rest of the candidate proteins (< 10^-22 with slowly decreasing values). The number of peaks matching theoretical tryptic peptide masses from YLR409c and YDL060w is 18 each (respective sequence coverages of 24% and 30%), and the two proteins have no sequence homology. Two such proteins with dominating probabilities provide an indication that the sample may be a binary mixture. To test this hypothesis, ProFound was set up to search for a possible binary mixture using the same data set and other search parameters that were used for the 'single protein only' search. The result of this search identifies, with high confidence, the simultaneous presence of YLR409c and YDL060w (Table 2). This binary protein mixture has a probability of one, while the probabilities for all the other protein candidates are < 10^" , with slowly decreasing values. The validity of this mixture identification can be tested by the use of the subtraction method (Jensen et al., 1997). For this purpose, 18 peaks corresponding to the tryptic peptides from YLR409c were removed from the peptide mass list. The remaining masses, together with the same search parameters, were submitted to ProFound to search for the hypothetical second protein. ProFound identifies YDL060w as the leading candidate with a probability of one (~10 higher than the next protein candidate).

Figure 6 shows a MALDI-TOF mass spectrum obtained from in-gel tryptic digestion of a 30 kDa SDS-PAGE protein band. Thirty-six monoisotopic peptide masses and 1 average peptide mass were submitted to ProFound. Other search parameters were: S. cerevisiae as taxonomic category; protein mass range of 0-3000 kDa; unmodified cysteines; a maximum of 2 missed cleavage sites; mass tolerances of 0.5 Da for average and 0.1 Da for monoisotopic masses. When the search mode was 'single protein only', the top two candidates were RPS1B (40S ribosomal protein, MM=28,681 Da) and RPS1A (40S ribosomal protein, MM=28,612 Da, highly homologous to RPS1B), with probabilities of 1 and 0.001, respectively, while the probabilities for the remaining candidates are < 10^-53. The numbers of peaks that match theoretical tryptic peptide masses from RPS1B and RPS1A are 24 and 23, respectively (with sequence coverages of 24% and 30%). There are two possibilities that can yield two such dominating candidates. The first is that there is only one protein present in the sample (and the second ranked candidate represents a closely similar protein that is not present). The second possibility is that the sample is a binary protein mixture of two highly similar proteins. To test this second hypothesis, ProFound was set up to search for possible binary mixtures with the same data set and other search parameters that were used for the 'single protein only' search. The result of this binary search (Table 3) provides strong evidence that the band is a mixture of RPS1B and RPS1A (probability = 1). The probabilities for all the other single and binary protein candidates are < 10^{" 4}, with slowly decreasing probability values. The two identified proteins are highly homologous, differing by only 7 amino acids in their 254 amino acid sequences. Of the 37 peptide peaks in the mass spectrum, 19 are common to both RPS1B and RPS1A, while 5 correspond to RPS1B only and 4 correspond to RPS1A only. Unlike the previous example, it would be difficult to identify the second protein component using the subtraction method (Jensen et al., 1997) because only 4 peptides (together with 9 peaks that did not match RPS1A) would remain after subtraction of the 24 peaks from the first protein RPS1B.
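The peak accounting in this example can be checked with a few lines of arithmetic. The variable names below are illustrative; the counts are those quoted in the text.

```python
total_peaks = 37
common, only_b, only_a = 19, 5, 4    # shared, RPS1B-only, and RPS1A-only peaks

matched_b = common + only_b          # peaks assigned to RPS1B
matched_a = common + only_a          # peaks assigned to RPS1A
unmatched = total_peaks - (common + only_b + only_a)
left_after_subtracting_b = total_peaks - matched_b

print(matched_b, matched_a, unmatched, left_after_subtracting_b)
```

Of the 13 peaks left after removing the 24 RPS1B peaks, only the 4 RPS1A-specific peaks carry information about the second protein, which is why the subtraction method is poorly suited to this mixture.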

Independent verification of the MS peptide mapping method for protein identification

Two independent strategies for mass spectrometric protein identification are peptide mapping and fragmentation of individual peptide ions (MS/MS) (Qin et al., 1997). Here, we use MS/MS as a method for independently checking the accuracy of the peptide mapping strategy for identifying proteins.

In a previously described experiment (Qin et al., 1997), we used matrix-assisted laser desorption/ionization-ion trap-mass spectrometry to obtain both tryptic map MS data and MS/MS fragment ion data from the same samples. The MS/MS fragmentation information was used to identify proteins with the program PepFrag (Fenyö et al., 1998), which requires an exact match of the peptide mass and the peptide fragment masses with theoretical masses generated from a database-derived peptide. Here, we compare the result obtained with ProFound using the peptide mapping data with the independent identification using MS/MS data from the same sample. Search parameters were: S. cerevisiae as taxonomic category; protein mass range of 0-3000 kDa; a maximum of 4 missed cleavage sites; mass tolerance of 2 Da. Table 4 is a summary of 15 searches using the two independent methods. All the proteins identified with the MS/MS data were confirmed by ProFound using the peptide mapping data, even though the mapping data was of relatively low quality (i.e., resolution 500 FWHM, accuracy ± 2 Da). These findings provide independent assurance of the reliability of ProFound for identifying proteins.

Improvement of the confidence level of protein identification using tag information

Incorporation of amino acid 'tag information' in the ProFound search (see Methods) can reduce the occurrence of database peptides that randomly match the experimental MS data, thereby improving the confidence level of an identification. For example, we have shown previously that inclusion of information regarding the absence or presence of cysteine residues in tryptic peptides from proteins can significantly improve the confidence level of a protein identification (Sechi & Chait, 1998).

Appendix

Appendix 1 Derivation of the Bayesian probability that protein k is the protein under analysis

1) Peptide mapping experiment

A peptide mapping experiment involves enzymatic or chemical cleavage of the protein, using cleavage reagents with high specificity for particular amino acids. The resulting mixture of peptide fragments is subjected to mass analysis. Each detected peptide fragment ion appears as a peak in the mass spectrum. The position of the peak along the mass axis provides a measure of the mass of the peptide ion. Current mass spectrometry technology does not provide reliable quantitative information from the height of the various peaks in the spectrum, so peak intensity information is used only to decide on the presence of a peptide fragment. Thus, the experimental data is limited to the information on the masses of the peptide fragments and can be expressed as a joint hypothesis $D = m_1 \ldots m_n$, where $m_i$ is a logical notation representing the finding that the mass value for the $i$-th peak is $m_i$, and $n$ is the number of peaks within the mass range from $m_{min}$ to $m_{max}$. Ideally, all peptide fragments produced from the protein should be detected. However, in practice, only a subset of the peptide fragments is observed. The reasons for not observing all of the peptides include poor solubility and/or low ionization efficiencies of certain peptides.

2) Applying Bayesian probability theory

We apply principles discussed in Bretthorst (1996) to the present protein identification problem.

Bayes' theorem {$P(AB) = P(A)P(B \mid A) = P(B)P(A \mid B)$, where $P(X \mid Y)$ is the probability of $X$ given that $Y$ is known} is applied to calculate the probability that a protein in a protein sequence database gives rise to the experimental peptide map. Let $k$ designate the hypothesis that 'protein $k$ is the protein being analyzed', where protein $k$ is an entry in the protein sequence database; $D$ is the experimental data; and $I$ is the available background information. By applying Bayes' theorem, the posterior probability that hypothesis $k$ is true, in view of the data $D$ and background information $I$, is

$$P(k \mid DI) = \frac{P(k \mid I)\,P(D \mid kI)}{P(D \mid I)}$$

The posterior probability, $P(k \mid DI)$, depends on three terms. The first term, $P(k \mid I)$, is the prior probability for the hypothesis given only the background information. The second term, $P(D \mid kI)$, is the likelihood probability that the data $D$ would be observed if the hypothesis is true. The third term, $P(D \mid I)$, is independent of hypothesis $k$ and is a normalization constant. Thus, the posterior probability is proportional to the product of the prior probability and the likelihood probability,

$$P(k \mid DI) \propto P(k \mid I)\,P(D \mid kI) \tag{bayes}$$

2.1) Prior probability $P(k \mid I)$

There are two situations with respect to the assignment of the prior probability $P(k \mid I)$. In the first, where the only knowledge is the background information on the species from which the sample protein originated and the assumption about the mass range of the sample protein, the principle of maximum entropy (Bretthorst, 1996) assigns a uniform prior probability to every hypothesis that satisfies the constraints set by the background information and assumption (the prior probability is zero for any hypothesis that does not satisfy the constraints). In the second situation, where we have knowledge concerning the hypothesis $k$ from a previous experiment, the prior probability before considering the current data is the posterior probability from the previous experiment. Thus the prior probability for a hypothesis $k$ that satisfies the background information and assumption is assigned as

$$P(k \mid I) = \begin{cases} \text{constant} & \text{if no previous data are available} \\ P(k \mid D_{prev} I) & \text{if previous data } D_{prev} \text{ are available} \end{cases}$$

In this way, data from multiple digestions are naturally incorporated.
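The use of one digest's posterior as the next digest's prior can be sketched as follows. The candidate names and likelihood values are invented for illustration, and `update` is a hypothetical helper, not part of the described program.

```python
def normalize(weights):
    """Scale the weights so that they sum to one over all candidates."""
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}

def update(prior, likelihood):
    """Posterior is proportional to prior times likelihood, per candidate."""
    return normalize({k: prior[k] * likelihood.get(k, 0.0) for k in prior})

candidates = ["A", "B", "C"]
uniform = normalize({k: 1.0 for k in candidates})         # no previous data
after_digest1 = update(uniform, {"A": 0.6, "B": 0.3, "C": 0.1})
after_digest2 = update(after_digest1, {"A": 0.5, "B": 0.1, "C": 0.4})
print(after_digest2)
```

Because each update is a multiplication followed by normalization, the final posterior does not depend on the order in which the digests are processed.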

2.2) Likelihood probability for data $P(D \mid kI)$

For a given protein $k$, measured peptide masses fall into two subsets: hits and misses. Hits are those measured masses that agree with predicted masses (within the mass tolerance). Misses are the remaining masses that cannot be accounted for by the known protein sequence. In the absence of contaminating proteins, misses arise from several possible sources. These include errors in the database, unknown modifications of the proteins, and unexpected cleavage of the protein. The resulting peptides are termed 'modified peptides'. It can be shown that the experimental data $D$ can also be expressed as

$$D = m_1^H \ldots m_r^H \, m_{r+1}^M \ldots m_{r+w}^M \tag{D-hits/misses}$$

where the superscripts $H$ and $M$ are used to label hits and misses, $r$ is the number of hits, $w$ is the number of misses, and the total number of measured masses is $r+w$. $m_i^H$ ($\equiv H_i m_i$, $i = 1, \ldots, r$) is a logical product stating that the $i$-th hit originates from a particular peptide in protein $k$ and that its measured mass is $m_i$. $m_{r+j}^M$ ($\equiv M_{r+j} m_{r+j}$, $j = 1, \ldots, w$) is likewise a logical product stating that the $j$-th miss originates from a modified peptide and that its measured mass is $m_{r+j}$. $m_{m:n}^H$ and $m_{p:q}^M$ are the logical products $m_m^H m_{m+1}^H \ldots m_n^H$ and $m_p^M m_{p+1}^M \ldots m_q^M$, respectively. When calculating $P(D \mid kI)$, the likelihood probability for data $D$, the product rule is applied to factor $D$ into the probability for hits and the probability for misses given the hits:

$$\begin{aligned} P(D \mid kI) &= P(m_{1:r}^H \, m_{r+1:r+w}^M \mid kI) \\ &= P(m_{1:r}^H \mid kI)\, P(m_{r+1:r+w}^M \mid kI\, m_{1:r}^H) \end{aligned} \tag{hits/misses}$$

In what follows, hits and misses will be considered separately.
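The partition of the measured masses into hits and misses for one candidate protein can be sketched as below. The function name `classify`, the tolerance, and the mass lists are assumptions for illustration; the number of matches recorded for each hit plays the role of the multiplicity of that hit.

```python
def classify(measured, theoretical, tol):
    """Split measured masses into hits (with their matching peptides) and misses."""
    hits, misses = [], []
    for m in measured:
        matches = [t for t in theoretical if abs(m - t) <= tol]
        if matches:
            hits.append((m, matches))   # multiplicity of this hit = len(matches)
        else:
            misses.append(m)
    return hits, misses

# Hypothetical measured and theoretical peptide masses (Da).
measured = [1000.52, 1500.80, 2000.10]
theoretical = [1000.50, 1000.55, 1500.75, 1800.90]
hits, misses = classify(measured, theoretical, tol=0.1)
print(len(hits), len(misses))
```

In this example there are two hits and one miss, and the first hit has a multiplicity of two because two theoretical peptides fall within the tolerance.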

2.2.1) Likelihood probability for hits $P(m_{1:r}^H \mid kI)$

The probability for hits in equation (hits/misses) can be factorized into a product over individual hits by applying the product rule:

$$P(m_{1:r}^H \mid kI) = \prod_{i=1}^{r} P(m_i^H \mid kI\, m_{0:i-1}^H) \tag{hit/hit}$$

where $m_0^H$ is defined as logical one. $m_i^H$ ($\equiv H_i m_i$) is the logical product of two hypotheses (i.e., the $i$-th hit originates from a particular peptide in protein $k$ and its measured mass is $m_i$). Each term can be expressed as

$$\begin{aligned} P(m_i^H \mid kI\, m_{0:i-1}^H) &= P(H_i m_i \mid kI\, m_{0:i-1}^H) \\ &= P(H_i \mid kI\, m_{0:i-1}^H)\, P(m_i \mid kI\, m_{0:i-1}^H H_i) \\ &= P(H_i \mid kI\, m_{0:i-1}^H)\, P(m_i \mid kI\, m_{0:i-1}^H m_{i0}), \quad i = 1, \ldots, r \end{aligned} \tag{hit/mi}$$

where the condition $H_i$ is equivalent to the condition that the theoretical mass value of peptide $i$ is $m_{i0}$.

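$P(H_i \mid kI\, m_{0:i-1}^H)$ in equation (hit/mi) is the probability for the $i$-th measured peptide to be a hit, given protein $k$ and $i-1$ previous hits. Since $i-1$ of the $N$ theoretical peptides are already accounted for by the previous hits, the number of peptides available for the $i$-th comparison is $N-i+1$, and the probability for the $i$-th peptide to be a hit is $1/(N-i+1)$ (using the maximum entropy principle [8]), where $N$ is the total number of theoretical peptides. We thus have

$$P(H_i \mid kI\, m_{0:i-1}^H) = \frac{1}{N-i+1}, \quad i = 1, \ldots, r \tag{prior-H}$$

$P(m_i \mid kI\, m_{0:i-1}^H m_{i0})$ in equation (hit/mi) is the probability for the measured mass value to be $m_i$ given that its theoretical mass is $m_{i0}$. Assuming that the mass measurement is free of systematic error and that the measurement errors are Gaussian distributed,

$$P(m_i \mid kI\, m_{0:i-1}^H m_{i0}) = P(m_i - m_{i0} \mid I) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(m_i - m_{i0})^2}{2\sigma^2}\right], \quad i = 1, \ldots, r \tag{likelihood-mi}$$

where $\sigma$ is the standard deviation of the mass measurement. In the case where $m_i$ matches $g_i$ theoretical masses within the given mass tolerance, the probability for the $i$-th hit is

$$P(m_i^H \mid kI\, m_{0:i-1}^H) = P\Big(\sum_{l=1}^{g_i} H_{il}\, m_i \,\Big|\, kI\, m_{0:i-1}^H\Big) = \frac{1}{N-i+1} \sum_{l=1}^{g_i} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(m_i - m_{il0})^2}{2\sigma^2}\right]$$

where $m_{il0}$ is the calculated mass of the $l$-th theoretical peptide that matches the $i$-th measured mass. Thus the probability for hits is given by

$$P(m_{1:r}^H \mid kI) = \frac{(N-r)!}{N!} \prod_{i=1}^{r} \sum_{l=1}^{g_i} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(m_i - m_{il0})^2}{2\sigma^2}\right] \tag{likelihood-hits}$$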

2.2.2) Likelihood probability for misses $P(m_{r+1:r+w}^M \mid kI\, m_{1:r}^H)$

We assume that misses result from errors in the protein sequence, unknown modification of the protein, or unexpected cleavage of the protein. However, the measured mass alone does not provide information on the identity of the peptide within the protein. The probability for misses depends on the number of 'modified peptides' $J$, which is between $w$ and $N-r$ ($w$ is the number of misses, $r$ is the number of observed hits, and $N$ is the total number of peptides). By applying the sum rule and the product rule, the probability for all misses can be expressed as

$$\begin{aligned} P(m_{r+1:r+w}^M \mid kI\, m_{1:r}^H) &= \sum_{J=0}^{N-r} P(J\, m_{r+1:r+w}^M \mid kI\, m_{1:r}^H) \\ &= \sum_{J=0}^{N-r} P(J \mid kI\, m_{1:r}^H)\, P(m_{r+1:r+w}^M \mid kI\, m_{1:r}^H J) \\ &= \sum_{J=w}^{N-r} P(J \mid kI\, m_{1:r}^H) \prod_{j=1}^{w} P(m_{r+j}^M \mid kI\, m_{1:r}^H J\, m_{r:r+j-1}^M) \\ &= \sum_{J=w}^{N-r} P(J \mid kI\, m_{1:r}^H) \prod_{j=1}^{w} P(M_{r+j}\, m_{r+j} \mid kI\, m_{1:r}^H J\, m_{r:r+j-1}^M) \\ &= \sum_{J=w}^{N-r} P(J \mid kI\, m_{1:r}^H) \prod_{j=1}^{w} P(M_{r+j} \mid kI\, m_{1:r}^H J\, m_{r:r+j-1}^M)\, P(m_{r+j} \mid kI\, m_{1:r}^H J\, m_{r:r+j-1}^M M_{r+j}) \end{aligned} \tag{likelihood-misses-1}$$

In the above equation, $m_r^M$ is defined as a logical one. The lower limit for summation is changed from $J=0$ to $J=w$ because $P(m_{r+1:r+w}^M \mid kI\, m_{1:r}^H J)$, the probability for having $w$ misses given $J$ modified peptides, is 0 when $J$ is smaller than $w$.

$P(J \mid kI\, m_{1:r}^H)$ in equation (likelihood-misses-1) is the probability for there being $J$ modified peptides, given protein $k$ and $r$ observed hits. The probability is assigned by applying the maximum entropy principle,

$$P(J \mid kI\, m_{1:r}^H) = \frac{C_{N-r}^{J}}{\sum_{l=0}^{N-r} C_{N-r}^{l}} = \frac{C_{N-r}^{J}}{2^{N-r}} \tag{prior-J}$$

$P(M_{r+j} \mid kI\, m_{1:r}^H J\, m_{r:r+j-1}^M)$ in equation (likelihood-misses-1) is the probability for observing a modified peptide, given protein $k$, $J$ modified peptides, and $r$ hits plus $j-1$ misses having been observed already. Since the number of available peptides is $N-(r+j-1)$, and the number of remaining unobserved modified peptides is $J-(j-1)$, the probability for observing a modified peptide is assigned, by applying the maximum entropy principle, as follows:

$$P(M_{r+j} \mid kI\, m_{1:r}^H J\, m_{r:r+j-1}^M) = \frac{J-j+1}{N-r-j+1}, \quad j = 1, \ldots, w; \quad N-r \ge J \ge w \tag{prior-M$_{r+j}$}$$

$P(m_{r+j} \mid kI\, m_{1:r}^H J\, m_{r:r+j-1}^M M_{r+j})$ in equation (likelihood-misses-1) is the likelihood probability for the modified peptide to have a measured mass $m_{r+j}$. Since $m_{r+j}$ is a miss and is always within the range of the mass measurement (i.e., between the minimum mass $m_{min}$ and the maximum mass $m_{max}$), using the principle of maximum entropy the probability is assigned

$$P(m_{r+j} \mid kI\, m_{1:r}^H J\, m_{r:r+j-1}^M M_{r+j}) = \frac{1}{m_{max} - m_{min}}, \quad j = 1, \ldots, w \tag{likelihood-m$_{r+j}$}$$

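Substituting equations (prior-J), (prior-M$_{r+j}$) and (likelihood-m$_{r+j}$) into equation (likelihood-misses-1), the probability for all misses is given by

$$P(m_{r+1:r+w}^M \mid kI\, m_{1:r}^H) = \sum_{J=w}^{N-r} \frac{C_{N-r}^{J}}{2^{N-r}} \prod_{j=1}^{w} \frac{J-j+1}{N-r-j+1} \cdot \frac{1}{m_{max} - m_{min}} = \left[\frac{1}{2\,(m_{max} - m_{min})}\right]^{w} \tag{likelihood-misses-2}$$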
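The factor of 2 in the denominator of the miss term can be checked numerically. The sketch below reads (prior-J) as a binomial coefficient divided by 2^(N-r) and (prior-M) as (J-j+1)/(N-r-j+1), which is an interpretation of the damaged equations rather than a quotation; under that reading the sum over J of the prior times the product of the per-miss terms equals 2^(-w). The function name is hypothetical.

```python
from fractions import Fraction
from math import comb

def miss_prior_sum(n_minus_r, w):
    """Sum over J of P(J) times the product of (J-j+1)/(N-r-j+1) for j = 1..w."""
    total = Fraction(0)
    for J in range(w, n_minus_r + 1):
        # P(J): binomial coefficient over 2^(N-r), as read from (prior-J).
        term = Fraction(comb(n_minus_r, J), 2 ** n_minus_r)
        for j in range(1, w + 1):
            term *= Fraction(J - j + 1, n_minus_r - j + 1)
        total += term
    return total

# Exact rational arithmetic: the sum collapses to 1 / 2^w in every case tried.
for n_minus_r, w in [(5, 2), (12, 4), (30, 7)]:
    assert miss_prior_sum(n_minus_r, w) == Fraction(1, 2 ** w)
print(miss_prior_sum(5, 2))
```

Multiplying this result by the uniform mass term, one factor of 1/(m_max - m_min) per miss, gives a contribution of 1/(2(m_max - m_min)) for each miss.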

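2.3) Posterior probability $P(k \mid DI)$

Substituting equations (likelihood-hits) and (likelihood-misses-2) into equation (hits/misses), and then substituting equation (hits/misses) into equation (bayes), the posterior probability is given by

$$P(k \mid DI) \propto P(k \mid I)\, \frac{(N-r)!}{N!} \prod_{i=1}^{r} \left\{ \sum_{l=1}^{g_i} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(m_i - m_{il0})^2}{2\sigma^2}\right] \right\} \left[\frac{1}{2\,(m_{max} - m_{min})}\right]^{w} \tag{bayes-final}$$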

Assuming that the protein being analyzed exists in the protein sequence database, the normalized posterior probability is obtained by applying the normalization condition

$$\sum_{k \in \text{Database}} P(k \mid DI) = 1$$
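The scoring and normalization steps can be sketched together as follows. This is an illustration under stated assumptions, not the ProFound code: it takes a uniform prior, a Gaussian term for each hit, N-(i-1) peptides as still available at the i-th hit, a factor of 1/(2(m_max - m_min)) for each miss, and it omits the empirical pattern factor. The function names and the example numbers are hypothetical.

```python
import math

def log_score(n_theoretical, hit_errors, n_misses, sigma, mass_range):
    """Log of the unnormalized posterior for one candidate under a uniform prior.

    hit_errors: one list per hit, holding (measured - calculated) in Da for
    each theoretical peptide that matches that measured mass.
    """
    r = len(hit_errors)
    # Product over hits of 1/(N - i + 1) equals (N - r)! / N!.
    score = math.lgamma(n_theoretical - r + 1) - math.lgamma(n_theoretical + 1)
    for errors in hit_errors:
        gauss = sum(math.exp(-e * e / (2 * sigma ** 2)) for e in errors)
        score += math.log(gauss / (math.sqrt(2 * math.pi) * sigma))
    # Each miss contributes 1 / (2 * (m_max - m_min)).
    score -= n_misses * math.log(2 * mass_range)
    return score

def normalize(log_scores):
    """Apply the normalization condition, using log-sum-exp for numerical stability."""
    top = max(log_scores.values())
    total = sum(math.exp(s - top) for s in log_scores.values())
    return {k: math.exp(s - top) / total for k, s in log_scores.items()}

# Two hypothetical candidates scored against the same five measured masses.
scores = {
    "good": log_score(40, [[0.01], [-0.02], [0.00], [0.03]], 1, 0.05, 2500.0),
    "random": log_score(40, [[0.09]], 4, 0.05, 2500.0),
}
posterior = normalize(scores)
print(posterior)
```

Working in logarithms matters here because the unnormalized values differ by many orders of magnitude, as the probabilities reported in the Experiments section illustrate.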

3) Extension to incorporate additional information obtained for the measured peptides

The Bayesian algorithm we developed can be naturally extended to incorporate any additional information about the measured peptides. When an experimentally observed peptide is known to have a certain property, such as the number of particular amino acids or a partial amino acid sequence, this knowledge is included in the background information $I$. In calculating the likelihood probability for the data with a particular type of information ($D^A$), the number of available peptides for experimental detection is $N_A$, the theoretical number of peptides containing the specified type of information within the protein sequence $k$. By substituting $N$ with $N_A$ in equation (bayes-final), the likelihood probability for $D^A$ is given by $P(D^A \mid kI)$

where $A$ denotes the specified type of information and $r_A$ is the number of hits which satisfy both the mass and the additional information constraints. In addition to the increase in probability discrimination of the authentic protein obtained by eliminating random hits from random proteins, it can be seen that the probability (for the authentic protein) further improves by a factor of $N/N_A$ for each measured peptide having the additional information.

Appendix 2

Let $\Delta_{exp}$ denote the accuracy of the mass measurement, which is also the width of the error distribution of the peptide mass measurement; $\Delta_{pep}$ the half width of the mass distribution of all possible peptides, encompassing 95% of all amino acid compositions ($\Delta_{pep}$ for 2000 Da has been found to be ~0.1 Da (Mann, 1995)); and $\Delta_{rand}$ the half width of the error distribution of randomly matched peptides. Under the condition that $\Delta_{exp}$ is smaller than the distance between neighboring peptide mass distributions, where no peptide mass is possible,

if $\Delta_{exp} \ll \Delta_{pep}$, then $\Delta_{exp} \ll \Delta_{pep} \approx \Delta_{rand}$.

I.e., when the experimental mass accuracy ($\Delta_{exp}$) is considerably smaller than the width of the peptide mass distribution ($\Delta_{pep}$), the error distribution of matched peptides for the hit protein ($\Delta_{exp}$) is considerably narrower than that of randomly matched peptides for unrelated (randomly hit) proteins ($\Delta_{rand}$).

Appendix 3

P(m_{r+j} | kI ... M_{r+j}) in equation (likelihood-misses-1) is the likelihood probability for the modified peptide to have a measured mass m_{r+j}. M_{r+j} is a miss, and the likelihood for it to occur should be determined by the mass distribution of the theoretical peptides of proteins in the database. Therefore the probability is assigned using the normalized mass distribution f:

P(m_{r+j} | kI ... M_{r+j}) = f(m_{r+j}),  j = 1, ..., w    (likelihood-m_{r+j})

The normalized mass distribution can be derived from the statistical frequency distribution of the theoretical peptide masses of proteins in the database. The probability can be factored into a form containing the factor 1/(m_max − m_min), which corresponds to a uniform distribution.
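One way to derive a normalized mass distribution from a frequency distribution of theoretical peptide masses is a simple histogram, sketched below in Python. The bin width, the function names, and the example masses are assumptions chosen for illustration; the source does not specify how the frequency distribution is binned.

```python
import math


def normalized_mass_distribution(masses, m_min, m_max, bin_width=1.0):
    """Return histogram densities over [m_min, m_max) that integrate to one."""
    n_bins = math.ceil((m_max - m_min) / bin_width)
    counts = [0] * n_bins
    for m in masses:
        if m_min <= m < m_max:
            index = min(int((m - m_min) / bin_width), n_bins - 1)
            counts[index] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no masses fall inside the mass range")
    # Dividing by total * bin_width turns counts into a probability density.
    return [c / (total * bin_width) for c in counts]


def density_at(densities, mass, m_min, bin_width=1.0):
    """Look up f(mass); masses outside the histogram get density zero."""
    index = math.floor((mass - m_min) / bin_width)
    if 0 <= index < len(densities):
        return densities[index]
    return 0.0


# Hypothetical theoretical peptide masses (Da) from a small database.
masses = [800.4, 800.9, 1500.7, 2200.1]
f = normalized_mass_distribution(masses, 800.0, 2300.0, bin_width=500.0)
uniform = 1.0 / (2300.0 - 800.0)  # the uniform-distribution case
print(density_at(f, 900.0, 800.0, 500.0), uniform)
```

In this toy data the first bin holds half of the masses, so its density is higher than the uniform value 1/(m_max − m_min).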

Substituting equations (prior- J), (prior-M_{r+j}) and (likelihood-m_{r+j}) into equation (likelihood-misses-1), the probability for all misses is given by

[equation (likelihood-misses-2)]

2.3) Posterior probability P(k | DI)

Substituting equations (likelihood-hits) and (likelihood-misses-2) into equation (hits/misses), and then substituting equation (hits/misses) into equation (bayes), the posterior probability P(k | DI) is given by

[equation (bayes-final): the posterior P(k | DI), in which each hit i contributes a Gaussian factor exp(−(m_i − m_i0)² / (2σ_i²)) and the misses contribute terms in 1/(N − r − j + 1)]

Appendix 4

The theoretical peptides of a given protein k can fall into z exclusive multi-subsets, where the probability for the occurrence of peptides can differ between subsets and can be determined, for instance, empirically. For example, in a peptide mapping experiment, given a, a maximum number of missed cleavage sites within peptides, the exclusive multi-subsets would respectively correspond to the number of missed cleavage sites as 1, 2, ..., a. Another example is an MS/MS fragmentation experiment, where the theoretical fragment ions can be classified according to different ion types (b, y, a, a , c", etc) to form exclusive multi-subsets. Designating the subsets as S_1, ..., S_z, it can be shown that the experimental data D can further be expressed as

[equation expressing D as the logical product of the hits m_{1:r}, grouped by subset, and the misses m_{r+1:r+w}]

where r_1, ..., r_z (r_1 + ... + r_z = r) are the numbers of hits corresponding to the respective subsets.
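The missed-cleavage example of exclusive subsets can be sketched in Python as follows. This is an illustration under stated assumptions: cleavage is taken to occur after every K or R (the common exception for a following proline is ignored), a subset for zero missed cleavages is included for completeness, and the function name and sequence are hypothetical.

```python
def tryptic_subsets(sequence, max_missed):
    """Group theoretical tryptic peptides by their number of missed cleavage sites."""
    # Simplified trypsin rule (assumption): cut after each K or R.
    cut_points = [i + 1 for i, aa in enumerate(sequence[:-1]) if aa in "KR"]
    bounds = [0] + cut_points + [len(sequence)]
    subsets = {missed: [] for missed in range(max_missed + 1)}
    for start in range(len(bounds) - 1):
        for missed in range(max_missed + 1):
            end = start + missed + 1
            if end >= len(bounds):
                break
            # A peptide spanning (missed + 1) fragments has `missed` uncut sites.
            subsets[missed].append(sequence[bounds[start]:bounds[end]])
    return subsets


subsets = tryptic_subsets("AKCRDEKF", max_missed=1)
for missed, peptides in subsets.items():
    # len(peptides) is the number of theoretical peptides in that subset.
    print(missed, len(peptides), peptides)
```

Each peptide lands in exactly one subset, so the subset sizes can be counted independently when assigning probabilities per subset.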

Designating the total number of hits corresponding to subsets 1 to q−1 as t (= r_1 + ... + r_{q−1}), a hit originating from the qth subset is the logical product

m_{t+s} H_{t+s} S_q

that the (t+s)th hit originated from a particular peptide (in protein k), that the peptide belongs to the qth subset, and that its measured mass is m_{t+s}.

Likelihood probability for hits

Assuming the number of theoretical peptides in the qth subset is N_q and the probability for an observed peptide to originate from the qth subset is p_q, the probability for the (t+s)th hit can be factored as

P(m_{t+s} H_{t+s} S_q | kI m_{0:t+s−1}) = P(S_q | kI m_{0:t+s−1}) P(H_{t+s} | kI m_{0:t+s−1} S_q) P(m_{t+s} | kI m_{0:t+s−1} S_q H_{t+s}),  q = 1, ..., z; s = 1, ..., r_q

where m_{0:t+s−1} denotes the logical product of the preceding hits 1 to t+s−1 with their subset assignments, and m_{0:0} is defined as logical one.

P(S_q | kI m_{0:t+s−1}) in the above equation is the probability to observe a peptide which originated from the qth subset given the condition, and it is p_q:

P(S_q | kI m_{0:t+s−1}) = p_q

P(H_{t+s} | kI m_{0:t+s−1} S_q) is the probability for the (t+s)th hit to originate from a particular peptide in the qth subset given all hits corresponding to subsets S_1, ..., S_{q−1} and the previous 1st to (s−1)th hits, which are identified as originating from s−1 particular peptides in the qth subset. Since the number of available peptides in the qth subset for the hit is N_q − s + 1, the probability is assigned as

P(H_{t+s} | kI m_{0:t+s−1} S_q) = 1 / (N_q − s + 1)

P(m_{t+s} | kI m_{0:t+s−1} S_q H_{t+s}) is the probability for the measured mass value to be m_{t+s} given that the (t+s)th hit originated from a particular peptide whose mass is known to be m_{(t+s)0}. The probability is assigned as

P(m_{t+s} | kI m_{0:t+s−1} S_q H_{t+s})

In the case where m_{t+s} matches g_{t+s} theoretical masses within the given mass tolerance, the probability for the (t+s)th hit is

Thus, the probability for hits is given by

where f=r,

Tables

Table 1. ProFound search results obtained with the data shown in Figure 2

Table 2. ProFound search results with data obtained from the in-gel tryptic digest shown in Figure 8

Table 3. ProFound search results with data obtained from the in-gel tryptic digest shown in Figure 9

Table 4. Summary of identifications made by ion trap MS/MS and peptide mapping

(a) log(p₁) − log(p₂) is the difference between the logarithms of the probabilities of the first and second candidate proteins. When several top candidates are highly homologous, p₁ is the sum of their probabilities and p₂ is the probability of the candidate that follows them.
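The quantity defined in note (a) can be computed as in the following Python sketch. The base-10 logarithm is an assumption (the note does not state the base), the caller is assumed to know how many top candidates are highly homologous, and the probabilities shown are made-up values, not results from the tables.

```python
import math


def log_probability_gap(probabilities, n_homologous_top=1):
    """Return log10(p1) - log10(p2) for a list of candidate probabilities.

    p1 is the sum over the first n_homologous_top ranked candidates and
    p2 is the probability of the candidate that follows them.
    """
    ranked = sorted(probabilities, reverse=True)
    if not 1 <= n_homologous_top < len(ranked):
        raise ValueError("need at least one candidate after the homologous group")
    p1 = sum(ranked[:n_homologous_top])
    p2 = ranked[n_homologous_top]
    return math.log10(p1) - math.log10(p2)


candidates = [0.6, 0.3, 0.001]  # hypothetical posterior probabilities
print(log_probability_gap(candidates))                      # single top candidate
print(log_probability_gap(candidates, n_homologous_top=2))  # top two are homologous
```

Summing the homologous candidates keeps near-identical sequences from masking how well the group as a whole is separated from the next candidate.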

Claims

WE CLAIM:

1. A method for determining the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental mass data (D) and background information (I), the method comprising,

a) Generating experimental mass data (D) for the experimental biological molecule, wherein D comprises measured masses and standard deviations, σ, associated with the measured masses;
b) Determining a mass range (m_min, m_max) for the experimental mass data;
c) Generating theoretical mass data for the biological molecule k within the mass range (m_min, m_max);
d) Counting the number of masses, N, in the theoretical mass data;
e) Calculating the difference between each measured mass and each theoretical mass, wherein if one or more differences are within a mass tolerance for a particular measured mass, the particular measured mass is considered to be one hit;
f) Designating each measured mass associated with a hit as m_i, wherein i is an ordinal number from 1 to r, wherein r is the total number of hits for a particular biological molecule;
g) Determining the difference between each measured mass, m_i, associated with an ith hit and one of the theoretical masses, m_i0, associated with the hit;
h) Determining whether the measured mass data contains a digestion pattern, wherein each occurrence of a digestion pattern is incorporated into a factor designated as F_pattern;
i) Determining P(k|I) from background information (I) known about the experimental biological molecule, prior to consideration of the experimental mass data (D);
j) Calculating P(k|DI) from the following formula:

wherein x is a function of the measured mass of the ith hit, and wherein P(k|DI) is the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental data (D) and background information (I).

2. A method according to Claim 1, step (h), wherein the one of the theoretical masses, m_i0, associated with the hit is the theoretical mass which produces the smallest difference between the measured mass, m_i, and the theoretical mass.

3. A method according to Claim 1, step (h), wherein the calculated theoretical mass associated with the hit is the average of the theoretical masses associated with the hit.

4. A method according to Claim 1 further comprising:
a) Counting the number of theoretical masses within the mass tolerance for each measured mass, m_i, wherein the total number of such theoretical masses is designated as g_i for a particular m_i;
b) Determining the difference between each measured mass, m_i, associated with an ith hit and each theoretical mass, m_ij0, associated with the hit, wherein j is an ordinal number from 1 to g_i; and
c) Calculating P(k|DI) from the following formula:

5. A method according to Claim 1 wherein

6. A method according to Claim 1 wherein x is defined as

wherein f(m_i) is a normalized distribution of theoretical masses of the database and wherein c is in the range of 0.1 to 100.

7. A method according to Claim 6 wherein c is

8. A method according to Claim 4 wherein the formula further comprises a function of j, designated as y_j, wherein

P(k | DI) ∝ P(k

9. A method according to Claim 8 wherein y_j is (W^J'')^'' , wherein W is a constant equal to or greater than one.

10. A method according to Claim 9 wherein W is four.

11. A method according to Claim 1 wherein the experimental mass data are generated by a computer.

12. A method according to Claim 1 wherein the experimental mass data are generated by a mass spectrometer.

13. A method according to Claim 1 wherein the theoretical mass data are generated by a computer.

14. A method according to Claim 1 wherein the background information (I) of the P(k|I) comprises information about the species of the experimental biological molecule.

15. A method according to Claim 1 wherein the background information (I) of the P(k|I) comprises knowledge or an assumption about the mass of the experimental biological molecule.

16. A method according to Claim 1 wherein the background information (I) of the P(k|I) comprises information about the isoelectric point of the experimental biological molecule.

17. A method according to Claim 1 wherein the P(k|I) is a P(k|DI) obtained from previous experimental data generated for the experimental biological molecule.

18. A method according to Claim 17 wherein the previous experimental data is mass data of the experimental biological molecule.

19. A method according to Claim 1 wherein the mass range (m_min, m_max) is the minimum and maximum measured masses of the experimental biological molecule.

20. A method according to Claim 1 wherein the F_pattern is calculated by taking a number greater than one and less than 1000 to the power of the quantity of occurrences of the pattern.

21. A method according to Claim 20 wherein the number is 2.5.

22. A method according to Claim 1 wherein the experimental mass data (D) is a subset of the experimental mass data (D).

23. A method according to Claim 1 further comprising: a) determining whether the data base includes biological molecules which form a homologous set; b) calculating the P(k|DI) for each of the biological molecules in the homologous set; and c) assigning the highest P(k|DI) to all of the homologous biological molecules in the homologous set.

24. A method according to Claim 23 wherein a homologous set of biological molecules are the biological molecules in the database which have the same theoretical masses associated with the hits for an experimental biological molecule, within a certain percentage.

25. A method according to Claim 24 wherein the percentage is over fifty percent.

26. A method according to Claim 23 wherein a homologous set of biological molecules are the biological molecules in the database which have the same theoretical masses associated with the hits for an experimental biological molecule, within a certain percentage, and which have the same amino acid sequences associated with the hits for an experimental biological molecule, within a certain percentage.

27. A method according to Claim 26 wherein the percentages are over fifty percent.

28. A method according to Claim 1 wherein the experimental and theoretical mass data is fragment mass data.

29. A method according to Claim 1 wherein the experimental biological molecule is a mixture of biological molecules.

30. A method according to Claim 29 wherein the data base comprises additive combinations of biological molecules.

31. The method of Claim 1 wherein the biological molecules are proteins.

32. The method of Claim 1 wherein the biological molecules are nucleic acid molecules.

33. The method of Claim 1 wherein the biological molecules are polysaccharides.

34. A means for determining the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental mass data (D) and background information (I) comprising,
a) a means for generating experimental mass data (D) for the experimental biological molecule, wherein D comprises measured masses and standard deviations, σ, associated with the measured masses;
b) a means for determining a mass range (m_min, m_max) for the experimental mass data;
c) a means for generating theoretical mass data for the biological molecule k within the mass range (m_min, m_max);
d) a means for counting the number of masses, N, in the theoretical mass data;
e) a means for calculating the difference between each measured mass and each theoretical mass, wherein if one or more differences are within a mass tolerance for a particular measured mass, the particular measured mass is considered to be one hit;
f) a means for designating each measured mass associated with a hit as m_i, wherein i is an ordinal number from 1 to r, wherein r is the total number of hits for a particular biological molecule;
g) a means for determining the difference between each measured mass, m_i, associated with an ith hit and one of the theoretical masses, m_i0, associated with the hit;
h) a means for determining whether the measured mass data contains a digestion pattern, wherein each occurrence of such pattern is incorporated into a factor designated as F_pattern;
i) a means for determining P(k|I) from background information (I) known about the experimental biological molecule, prior to consideration of the experimental mass data (D);
j) a means for calculating P(k|DI) from the following formula:

wherein x is a function of the measured mass of the ith hit, and wherein P(k|DI) is the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental data (D) and background information (I).

35. A means for determining the probability that an experimental biological molecule is a biological molecule (k) according to Claim 34 further comprising:
a) a means for counting the number of theoretical masses within the mass tolerance for each measured mass, m_i, wherein the total number of such theoretical masses is designated as g_i for a particular m_i;
b) a means for determining the difference between each measured mass, m_i, associated with an ith hit and each theoretical mass, m_ij0, associated with the hit, wherein j is an ordinal number from 1 to g_i; and
c) a means for calculating P(k|DI) from the following formula:

P(k | DI) ∝

wherein P(k|DI) is the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental data (D) and background information (I).

36. A computer program product comprising: a computer usable medium having computer readable program code means embodied in said medium for determining the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental mass data (D) and background information (I), said computer program product including:
a) computer readable program code means for causing a computer to generate experimental mass data (D) for the experimental biological molecule, wherein D comprises measured masses and standard deviations, σ, associated with the measured masses;
b) computer readable program code means for causing a computer to determine a mass range (m_min, m_max) for the experimental mass data;
c) computer readable program code means for causing a computer to generate theoretical mass data for the biological molecule k within the mass range (m_min, m_max);
d) computer readable program code means for causing a computer to count the number of masses, N, in the theoretical mass data;
e) computer readable program code means for causing a computer to calculate the difference between each measured mass and each theoretical mass, wherein if one or more differences are within a mass tolerance for a particular measured mass, the particular measured mass is considered to be one hit;
f) computer readable program code means for causing a computer to designate each measured mass associated with a hit as m_i, wherein i is an ordinal number from 1 to r, wherein r is the total number of hits for a particular biological molecule;
g) computer readable program code means for causing a computer to determine the difference between each measured mass, m_i, associated with an ith hit and one of the theoretical masses, m_i0, associated with the hit;
h) computer readable program code means for causing a computer to determine whether the measured mass data contains a digestion pattern, wherein each occurrence of such pattern is incorporated into a factor designated as F_pattern;
i) computer readable program code means for causing a computer to determine P(k|I) from background information (I) known about the experimental biological molecule, prior to consideration of the experimental mass data (D);
j) computer readable program code means for causing a computer to calculate P(k|DI) from the following formula:

37. A computer program product comprising: a computer usable medium having computer readable program code means embodied in said medium for determining the probability that an experimental biological molecule is a biological molecule (k) according to Claim 36 further comprising:
a) computer readable program code means for causing a computer to count the number of theoretical masses within the mass tolerance for each measured mass, m_i, wherein the total number of such theoretical masses is designated as g_i for a particular m_i;
b) computer readable program code means for causing a computer to determine the difference between each measured mass, m_i, associated with an ith hit and each theoretical mass, m_ij0, associated with the hit, wherein j is an ordinal number from 1 to g_i; and
c) computer readable program code means for causing a computer to calculate P(k|DI) from the following formula:

38. A method for determining the probability that an experimental biological molecule is a biological molecule (k) described in a database given experimental mass data (D) and background information (I), the method comprising,
a) Generating experimental mass data (D) for the experimental biological molecule, wherein D comprises measured masses and standard deviations, σ, associated with the measured masses;
b) Determining a mass range (m_min, m_max) for the experimental mass data;
c) Generating theoretical mass data for the biological molecule k within the mass range (m_min, m_max);
d) Counting the number of masses, N, in the theoretical mass data;
e) Calculating the difference between each measured mass and each theoretical mass, wherein if one or more differences are within a mass tolerance for a particular measured mass, the particular measured mass is considered to be one hit;
f) Designating each measured mass associated with a hit as m_i, wherein i is an ordinal number from 1 to r, wherein r is the total number of hits for a particular biological molecule;
g) Counting the number of theoretical masses within the mass tolerance for each measured mass, m_i, wherein the total number of such theoretical masses is designated as g_i for a particular m_i;
h) Determining the difference between each measured mass, m_i, associated with an ith hit and one of the theoretical masses, m_i0, associated with the hit or a calculated theoretical mass associated with the hit;
i) Determining whether the measured mass data contains a digestion pattern, wherein each occurrence of such pattern is incorporated into a factor designated as F_pattern;
j) Determining P(k|I) from information (I) known about the experimental biological molecule, prior to consideration of the experimental mass data (D);
k) Calculating P(k|DI) from the following formula:

39. A method according to Claim 30 wherein

Σ_{k ∈ Database} P(k | DI) = 1

40. A method according to Claim 38, step (h), wherein the one of the theoretical masses, m_i0, associated with the hit is the theoretical mass which produces the smallest difference between the measured mass, m_i, and the theoretical mass.

41. A method according to Claim 38, step (h), wherein the calculated theoretical mass associated with the hit is the average of the theoretical masses associated with the hit.

42. A method according to Claim 38 wherein the experimental mass data (D) is a subset of the experimental mass data (D).

43. A method according to Claim 38 wherein the experimental and theoretical mass data is fragment mass data.
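The hit-counting steps recited in Claims 1, 2 and 4 and the pattern factor of Claims 20 and 21 can be illustrated with the following Python sketch. It is not the claimed formula for P(k|DI), which is not reproduced here. The function names, the use of an absolute tolerance in Da, and the example masses are assumptions made for illustration.

```python
def count_hits(measured, theoretical, tolerance):
    """Return one (m_i, m_i0, g_i) tuple per measured mass that is a hit.

    m_i  -- measured mass with at least one theoretical mass within tolerance
    m_i0 -- theoretical mass giving the smallest difference (as in Claim 2)
    g_i  -- number of theoretical masses within tolerance (as in Claim 4)
    """
    hits = []
    for m in measured:
        matches = [t for t in theoretical if abs(m - t) <= tolerance]
        if matches:  # several matches still count as a single hit
            closest = min(matches, key=lambda t: abs(m - t))
            hits.append((m, closest, len(matches)))
    return hits


def pattern_factor(occurrences, base=2.5):
    """F_pattern as a base raised to the number of pattern occurrences."""
    if not 1 < base < 1000:
        raise ValueError("base must be greater than one and less than 1000")
    return base ** occurrences


measured = [1000.5, 1500.0, 2000.2]            # hypothetical measured masses (Da)
theoretical = [1000.4, 1000.55, 1800.0, 2000.0]  # hypothetical masses for protein k
hits = count_hits(measured, theoretical, tolerance=0.3)
N = len(theoretical)  # number of theoretical masses
r = len(hits)         # number of hits
print(N, r, hits, pattern_factor(2))
```

Here the first measured mass matches two theoretical masses but is still counted as one hit, with g_i = 2.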