WO2002027313A1

WO2002027313A1 - Methods for determining approximations of three dimensional polypeptide structure

Info

Publication number: WO2002027313A1
Application number: PCT/US2001/030308
Authority: WO
Inventors: Jeffrey Skolnick
Original assignee: Geneformatics, Inc.
Priority date: 2000-09-26
Filing date: 2001-09-26
Publication date: 2002-04-04
Also published as: AU2001294822A1

Abstract

This invention provides methods for computationally determining protein structure by employing combined threading to optimize threading results. Sequence profiles are employed to generate an initial probe-template alignment that is then used in the evaluation of pair interactions. Two types of sequence profiles are employed: a 'close' set, typically comprised of sequences whose identity lies between about 35% and about 90%; and a 'distant' set, typically comprised of sequences with a FASTA E-score less than 10. Preferably, a total of four scoring functions are used in a hierarchical process to provide an initial alignment of the probe sequence in each of the templates. The same database is then screened with a scoring function comprised of sequence plus secondary structure, and further preferably includes pair interaction profiles. In preferred embodiments, a set of the top scoring sequences (for example, four scoring functions times the top five structures) is used to construct a protein-specific pair potential based on consensus side chain contacts occurring in 25% of the structures. In subsequent threading iterations, the protein-specific pair potential, when combined in a composite fashion, is found to be more sensitive in identifying the correct pairs than when the original statistical potential is used, and it increases the number of recognized structures for the combined scoring functions.

Description

METHODS FOR DETERMINING APPROXIMATIONS OF THREE DIMENSIONAL POLYPEPTIDE STRUCTURE

RELATED APPLICATION This application claims priority to United States provisional patent application serial number 60/235,464, filed September 26, 2000.

FIELD OF THE INVENTION

This invention relates to protein structure and function analysis. More particularly, this invention relates to computational methods of determining approximate protein structures, such as are useful in determining protein biochemical function, for example.

BACKGROUND OF THE INVENTION

The following description includes information that may be useful in understanding the present invention. It is not an admission that any of the information provided herein is prior art, or relevant, to the presently claimed inventions, or that any publication specifically or implicitly referenced is prior art.

There have been numerous methodologies developed for assessing protein biochemical function. For example, "sequence-based" approaches to biochemical function annotation have been used to provide functional assignments for between 40% and 60% of the open reading frames (ORFs) in any given genome. Thus, functional assignments for a large percentage of ORFs about which nothing is known, termed ORFans, still remain to be determined. Moreover, even among those ORFs for which a functional assignment has been developed, many of the assigned functions have not been proven to be accurate. Such methodologies include PSI-BLAST, such as described by Altschul SF, et al., J. Mol. Biol. 215: 403-410 (1990), and Pearson WR., Methods Enzymol. 266: 227-258 (1996); and "sequence motif-based" methods (i.e., use of local sequence descriptors) such as discussed by Bairoch A, et al., The PROSITE Database. Its Status in 1995. Nucleic Acids Res. 24: 189-196 (1995); Henikoff S, and Henikoff JG, Protein Family Classification Based on Searching a Database of Blocks. Genomics 19: 97-107 (1994); Attwood TK, et al., PRINTS - A Database of Protein Motif Fingerprints. Nucleic Acids Res. 22: 3590-3596 (1994); Attwood TK, et al., Novel Developments With the PRINTS Protein Fingerprint Database, Nucleic Acids Res. 25: 212-216 (1997); and Nevill-Manning CG, et al., Highly Specific Protein Sequence Motifs for Genome Analysis. Proc. Natl. Acad. Sci. USA 95: 5865-5871 (1998).

Sequence-based approaches, such as those noted above, fail to provide accurate structure for proteins within protein families as the number of proteins within such families becomes more diverse. To overcome this problem, advances in those approaches attempted to combine one-dimensional sequence information with information about known structure, such as reported by Yu L, et al, Protein Sci. 7: 2499-2510 (1998).

Additionally, alternative structure-based approaches to computational function assignment have been developed that employ the sequence-structure-function paradigm as disclosed by Fetrow JS, and Skolnick J., J. Mol. Biol. 281: 949-968 (1998); Fetrow JS, et al., J. Mol. Biol. 282: 703-711 (1998); Zhang L, et al., Folding and Design 3: 535-548 (1998); Fetrow JS, et al., FASEB J. 13: 1866-1874 (1999); Siew N, et al., Prediction of Disulfide Oxidoreductase Function in Nine Genomes. 2000; in preparation; Zhang B, et al., Protein Sci. 8: 1104-15 (1999); and Skolnick J, and Fetrow J., TIBTECH, 18: 34-39 (2000). In these examples, models predicted by threading are screened for matches to known active sites and, if a match is found, then a biochemical function assignment is made. One key to the success of this sequence-structure-function approach is the use of the best threading algorithm available so that more distant relationships between proteins can be recognized. In this regard, the recent CASP3 results (Third Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction; http://predictioncenter.llnl.gov) have pointed out both the strengths and weaknesses of contemporary threading algorithms.

Following CASP3, substantial and significant improvements in the algorithms for threading have been developed that allow for better use of "pair interactions" in particular and that demonstrably provide increased efficacy over current methodologies, especially over other methods that were used to provide such databases as described by Fischer D, et al., Pac Symp Biocomput 300-18 (1996).

With respect to reported threading algorithms themselves, they are essentially defined by three elements. First, the nature of the interactions between different residues and the functional form of the "energy" calculated for this set of interactions must be selected. By "energy" is meant a collection of terms that assesses the relative fitness of different structures and sequences. In this process, in which the interactions are assigned values or "scores," the terms of any function that comprises a variety of terms must be weighted according to their relative importance. Energy terms used in the previously reported methods have included "local burial status" of residues, "secondary structure propensities" or predicted secondary structure, as well as additional energy penalty terms. By "local burial status" is meant whether the side chains are exposed to solvent or completely covered in the core of the protein. By "secondary structure propensities" is meant terms that reflect the different preferences for local secondary structure (helices, strands, turns or loops). Additionally, the methods have included "pair" or higher order interactions. By "pair interactions" is meant interactions between pairs of amino acid residues in the sequence. By "higher order" is meant interactions between a minimum of three distinct residues. Contemporary algorithms often include an essential term that is related to the sequence identity between a template protein sequence and a probe sequence. This evolutionary sequence component is often designed to improve the ability of a given probe to recognize a template protein and the quality of the predicted structural alignment.

Second, if "pair interactions" are included, then the "type" of "interaction centers" must be selected. By "type" is meant backbones, alpha carbons, side chains, or side chain centers of mass. By "interaction centers" is meant where the interaction is evaluated. Commonly used choices for interaction center types are the Cαs, the Cβs, as discussed by Maiorov VN, and Crippen GM., J Mol. Biol. 277: 876-888 (1992); Tropsha A, et al., in Pacific Symposium on Biocomputing '96 (Hunter, L. and Klein, T. E., eds.), pp. 614-623 (1996); Jones DT, et al., Nature 358: 86-89 (1992); Koretke KK, et al., Protein Sci. 5: 1043-1059 (1996), and the side chain centers of mass, as well as specially defined interaction centers such as described by Bryant SH, and Lawrence CE., Proteins 16: 92-112 (1993); and Lathrop R, and Smith TR, J. Mol. Biol. 255: 641-665 (1996), or additionally, a center type based on any side chain atom as described by Godzik A, et al., J. Mol. Biol. 227: 227-238 (1992). The functional form of the pair energy ranges from "contact potentials" to "continuous distance-dependent potentials" to "interaction environments." By "contact potentials" is meant potentials that are a step function, being constant below a distance cutoff and zero beyond that cutoff. By "continuous distance-dependent potentials" is meant those potentials that are a function of distance and are differentiable. By "interaction environments" is meant the environment of amino acids surrounding a given residue in a protein.

Third, given an energy function, a search procedure that finds the optimal alignment between the probe sequence and each structural template must be employed. When all the interactions are "local" in nature (for example, a fitness score defined by "mutation matrices" and "secondary structure propensities"), then "dynamic programming" is the best choice. By "local" is meant interactions between residues that are nearby in sequence, typically nearest neighbors. By "mutation matrices" is meant terms that reflect the different tendencies of amino acids to mutate during evolution. By "secondary structure propensities" is meant terms reflecting different tendencies of amino acids to adopt different types of secondary structure. By "dynamic programming" is meant an algorithm that finds the best score of an alignment.
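As an illustration of the dynamic programming search just described, the following is a minimal sketch of a global alignment score computed with a Needleman-Wunsch style recursion. It is not the scoring scheme of any method cited above: the function names, the linear gap penalty, and the simple identity score are assumptions chosen only to keep the example short. In practice, the `score` callable would be replaced by a mutation matrix or profile term.

```python
def align_score(probe, template, score, gap=-1.0):
    """Best global alignment score of two sequences by dynamic programming.

    `score(a, b)` is any local residue-residue scoring function and `gap`
    is an assumed linear gap penalty (both illustrative, not from the source).
    """
    n, m = len(probe), len(template)
    # best[i][j] holds the best score for aligning probe[:i] with template[:j].
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * gap
    for j in range(1, m + 1):
        best[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + score(probe[i - 1], template[j - 1]),  # align the two residues
                best[i - 1][j] + gap,  # probe residue against a gap
                best[i][j - 1] + gap,  # template residue against a gap
            )
    return best[n][m]


def identity_score(a, b):
    """Toy local score: +1 for identical residues, -1 otherwise (an assumption)."""
    return 1.0 if a == b else -1.0
```

Because every term depends only on the two residues being aligned, the optimum over all alignments can be found in time proportional to the product of the two sequence lengths, which is why dynamic programming suits purely local scoring functions.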

If a "non-local" scoring function is used (for example, pair interactions), then there must be a mechanism by which the interactions are updated in the template structure to reflect the probe sequence. By "non-local" is meant those residues non-local in sequence (i.e., at least 5 residues apart). Some approaches have employed "dynamic programming" with a "frozen" approximation (i.e., where the interaction partners or a set of local environmental preferences are taken from the template protein in the first threading pass). By "frozen" is meant keeping the partners fixed. In some methods, this might be followed by "iterative updating." By "iterative updating" is meant that the alignment used to calculate the pair interactions is successively updated. Still other workers employ "double dynamic programming," which updates some interactions recognized as being the most important in a first pass of the dynamic programming algorithms. Still other methods evaluate the non-local scoring function directly and search for the optimal probe-template alignment by Monte Carlo methods, such as discussed by Bryant SH, and Lawrence CE., Proteins 16: 92-112 (1993), or by branch-and-bound search strategies, as discussed by Lathrop R, and Smith TR, J. Mol. Biol. 255: 641-665 (1996).
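The frozen approximation can be sketched as follows, assuming a step-function contact potential of the kind defined earlier. Everything specific here is hypothetical: the function names, the 4.5 Å cutoff, the single interaction center per residue, and the `pair_table` dictionary of contact energies are illustrative choices rather than values taken from any cited method. The only figure drawn from the text is the rule that non-local partners are at least 5 residues apart.

```python
import math


def contact_energy(distance, epsilon, cutoff=4.5):
    """Step-function contact potential: `epsilon` below the cutoff, zero beyond it.

    The 4.5 cutoff is an assumed value for illustration.
    """
    return epsilon if distance < cutoff else 0.0


def frozen_pair_energy(probe_residue, position, template_seq, coords,
                       pair_table, cutoff=4.5):
    """Pair energy of a probe residue placed at one template position.

    Frozen approximation: the interaction partners are the residues of the
    template sequence itself, not those of the aligned probe.
    `coords[k]` is the interaction center of template residue k and
    `pair_table` maps frozenset({a, b}) to a contact energy (hypothetical data).
    """
    total = 0.0
    for k, partner in enumerate(template_seq):
        if abs(k - position) < 5:  # keep only non-local partners (at least 5 apart)
            continue
        distance = math.dist(coords[position], coords[k])
        epsilon = pair_table.get(frozenset((probe_residue, partner)), 0.0)
        total += contact_energy(distance, epsilon, cutoff)
    return total
```

Since the partners are fixed, the energy of placing a residue at a position no longer depends on the rest of the alignment, so the term can be fed into ordinary dynamic programming. Iterative updating would repeat the calculation with partners taken from the previous alignment of the probe.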

It should be recognized that almost all search protocols do not allow the actual template structure to adjust in order to reflect the actual structural modifications in the probe structure relative to that of the template. Algorithms such as Monte Carlo and branch-and-bound strategies permit the partner from the probe sequence found in the current alignment to be used, but they do not allow the template's backbone structure to dynamically readjust to reflect the probe sequence. Such readjustments might be quite important when the probe and template structure differ substantially, for example, when a template protein's glycine is replaced by the probe's tryptophan. Unfortunately, this is precisely the realm where threading would be expected to be the most valuable as compared to pure sequence-based methods.

In principle, the advantage of threading over pure sequence-based approaches is that it employs structural rather than evolutionary information. However, as evidenced by CASP3, many of the successful fold-recognition approaches are pseudo one-dimensional in nature and use evolutionary information (typically implemented in the form of sequence profiles) plus predicted secondary structure. Furthermore, the evolutionary component contributes a significant fraction of the selectivity, as discussed by Murzin AG., Proteins 37: 88-103 (1999). Moving to approaches where structure played a more prominent role in CASP3 methodology, Domingues FS., et al., Proteins Suppl: 112-20 (1999) employed a burial energy and a frozen approximation to evaluate pair interactions. However, a single sequence was used rather than sequence profiles. Although this represents a more structure-based approach to threading, all interactions are still implemented at the pseudo one-dimensional level in order to enable the use of dynamic programming. In another example, Panchenko A, et al., Proteins Suppl: 133-40 (1999) were unique among the predictors in CASP3 in that they explicitly examined interactions in a structural core of a protein identified on the basis of evolutionary conservation of the structure across a protein family. In some sense, this approach is closest to the original idea of threading. However, a PSI-BLAST sequence-profile component was also employed, leading to the conclusion that the combination of sequence profiles and contact potentials improved the success rate over that obtained when either term is used alone. Since they employed a non-local scoring function, dynamic programming could not be used to search for the best match of a sequence to a given structure. Rather, a Monte Carlo search procedure would be needed to search for the best sequence-structure fitness, and such calculations take a considerable amount of computer time. 
Therefore, application of the method on a genomic scale would require considerable computer resources. Further, for the identification of the core, a number of structures in the protein family must be solved. Overall, the general consensus is that progress was made in CASP3, with alignment quality having improved since CASP2 but threading is likely to be able to perform better if distant homology recognition targets rather than 'pure' folding recognition targets were used, a bias that likely results from the implementation of 'distant homology' filters.

Thus, inventions that extend the ability of threading techniques to address "pure" fold recognition situations are still required. However, the best results seem to occur when a sequence-profile term is combined with threading potentials. The instant specification describes the technical and scientific advancements required for this purpose by disclosing methods which are termed the "PROSPECTOR" methods (PROtein Structure Predictor Employing Combined Threading to Optimize Results), which provide for pair interactions that improve the sequence-structure specificity over that of sequence-profile terms used alone. As a result of the instant invention, structure-function assignments for novel protein sequences can be made by using multiple scoring functions. When multiple scoring functions are combined, the resulting recognition ability is substantially improved over previously reported approaches.

Definitions. The following terms have the following meanings when used herein and in the appended claims. Other terms are defined elsewhere in the specification. Terms not specifically defined in the specification shall have their art recognized meaning.

As used herein, an "amino acid" is a molecule having the structure wherein a central carbon atom (the alpha (α)-carbon atom, or "Cα") is linked to a hydrogen atom, a carboxylic acid group (the carbon atom of which is referred to herein as a "carboxyl carbon atom"), an amino group (the nitrogen atom of which is referred to herein as an "amino nitrogen atom"), and a side chain group, R. When incorporated into a peptide, polypeptide, or protein, an amino acid loses one or more atoms of its amino and carboxylic groups in the dehydration reaction that links one amino acid to another. As a result, when incorporated into a protein, an amino acid is referred to as an "amino acid residue." In the case of naturally occurring proteins, an amino acid residue's R group differentiates the 20 amino acids from which proteins are synthesized, although one or more amino acid residues in a protein may be derivatized or modified following incorporation into protein in biological systems (e.g., by glycosylation and/or by the formation of cystine through the oxidation of the thiol side chains of two non-adjacent cysteine amino acid residues, resulting in a disulfide covalent bond that frequently plays an important role in stabilizing the folded conformation of a protein, etc.). As those in the art will appreciate, non-naturally occurring amino acids can also be incorporated into proteins, particularly those produced by synthetic methods, including solid state and other automated synthesis methods. 
Examples of such amino acids include, without limitation, α-amino isobutyric acid, 4-amino butyric acid, L-amino butyric acid, 6-amino hexanoic acid, 2-amino isobutyric acid, 3-amino propionic acid, ornithine, norleucine, norvaline, hydroxyproline, sarcosine, citrulline, cysteic acid, t-butylglycine, t-butylalanine, phenylglycine, cyclohexylalanine, β-alanine, fluoro-amino acids, designer amino acids (e.g., β-methyl amino acids, Cα-methyl amino acids, Nα-methyl amino acids) and amino acid analogs in general. In addition, when an α-carbon atom has four different groups (as is the case with the 20 amino acids used by biological systems to synthesize proteins, except for glycine, which has two hydrogen atoms bonded to the α-carbon atom), two different enantiomeric forms of each amino acid exist, designated D and L. In mammals, only L-amino acids are incorporated into naturally occurring polypeptides. Of course, the instant invention envisions proteins incorporating one or more D- and L-amino acids, as well as proteins comprised of just D- or L-amino acid residues.

"Protein" refers to any polymer of two or more individual amino acids (whether or not naturally occurring) linked via a peptide bond, which occurs when the carboxyl carbon atom of the carboxylic acid group bonded to the α-carbon of one amino acid (or amino acid residue) becomes covalently bound to the amino nitrogen atom of the amino group bonded to the α-carbon of an adjacent amino acid. These peptide bond linkages, and the atoms comprising them (i.e., α-carbon atoms, carboxyl carbon atoms (and their substituent oxygen atoms), and amino nitrogen atoms (and their substituent hydrogen atoms)) form the "polypeptide backbone" of the protein. In simplest terms, the polypeptide backbone shall be understood to refer to the amino nitrogen atoms, α-carbon atoms, and carboxyl carbon atoms of the protein, although two or more of these atoms (with or without their substituent atoms) may also be represented as a pseudoatom. Indeed, any representation of a polypeptide backbone that can be used in computationally analyzing the protein, for example, to determine a biochemical function, will be understood to be included within the meaning of the term "polypeptide backbone." The term "protein" is understood to include the terms "polypeptide" and "peptide" (which, at times, may be used interchangeably herein) within its meaning. In addition, proteins comprising multiple polypeptide subunits (e.g., DNA polymerase III, RNA polymerase II) or other components (for example, an RNA molecule, as occurs in telomerase) will also be understood to be included within the meaning of "protein" as used herein. Similarly, fragments and domains of proteins and polypeptides are also within the scope of the invention and may be referred to herein as "proteins." A protein "domain" will be understood to mean a portion of a larger protein which, in isolation, assumes a three dimensional conformation corresponding to the conformation the domain assumes when it exists in the larger protein.

In a protein, the peptide bonds between adjacent amino acid residues are resonance hybrids of two different electron isomeric structures, wherein a bond between a carbonyl carbon (the carbon atom of the carboxylic acid group of one amino acid after its incorporation into a protein) and a nitrogen atom of the amino group of the α-carbon of the next amino acid places the carbonyl carbon approximately 1.33 Å away from the nitrogen atom of the next amino acid, a distance about midway between the distances that would be expected for a double bond (about 1.25 Å) and a single bond (about 1.45 Å). This partial double bond character prevents free rotation of the carbonyl carbon and amino nitrogen about the bond therebetween under physiological conditions. As a result, the atoms bonded to the carbonyl carbon and amino nitrogen reside in the same plane, and provide discrete regions of structural rigidity, and hence conformational predictability, in proteins.

Beyond the peptide bond, each amino acid residue contributes two additional single covalent bonds to the polypeptide chain. While the peptide bond limits rotational freedom of the carbonyl carbon and the amino nitrogen of adjacent amino acids, the single bonds of each residue (between the α-carbon and amino nitrogen (the phi (φ) bond) and between the α-carbon and carbonyl carbon (the psi (ψ) bond) of each amino acid) have greater rotational freedom. Similarly, the single bond between an α-carbon and its attached R-group provides limited rotational freedom. Collectively, such structural flexibility enables a number of possible conformations to be assumed at a given region within a polypeptide. As discussed in greater detail below, the particular conformation actually assumed depends on thermodynamic considerations, with the lowest energy conformation being preferred. In addition to primary structure, proteins also have secondary, tertiary, and, in multisubunit proteins, quaternary structure. Secondary structure refers to local conformation of the polypeptide chain, with reference to the covalently linked atoms of the peptide bonds and α-carbon linkages that string the amino acids of the protein together. Side chain groups are not typically included in such descriptions. Representative examples of secondary structures include α helices, parallel and anti-parallel β structures, and structural motifs such as helix-turn-helix, β-α-β, the leucine zipper, the zinc finger, the β-barrel, and the immunoglobulin fold. Movement of such domains relative to each other often relates to biological function and, in proteins having more than one function, different binding or effector sites can be located in different domains. Tertiary structure concerns the total three-dimensional structure of a protein, including the spatial relationships of amino acid side chains and the geometric relationship of different regions of the protein. 
Quaternary structure relates to the structure and non-covalent association of different polypeptide subunits in a multisubunit protein. A "functional site" refers to any site in a protein that has a function. Representative examples include active sites (i.e., those sites in catalytic proteins where catalysis occurs), protein-protein interaction sites, sites for chemical modification (e.g., glycosylation and phosphorylation sites), and ligand binding sites. Ligand binding sites include, but are not limited to, metal binding sites, co-factor binding sites, antigen binding sites, substrate channels and tunnels, and substrate binding sites. In an enzyme, a ligand binding site that is a substrate binding site may also be an active site.

A "pseudoatom" refers to a position in three dimensional space (represented typically by an x, y, and z coordinate set) that represents the average (or weighted average) position of two or more atoms in a protein or amino acid. Representative examples of a pseudoatom include an amino acid side chain center of mass and the center of mass (or, alternatively, the average position) of an α-carbon atom and the carboxyl atom bonded thereto.
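A pseudoatom of this kind can be computed as a mass-weighted average of atomic positions. The sketch below is illustrative only: the function name and the `(mass, (x, y, z))` input layout are assumptions, and passing equal masses reduces the result to the plain average position mentioned above.

```python
def center_of_mass(atoms):
    """Return the mass-weighted average position of a group of atoms.

    `atoms` is a list of (mass, (x, y, z)) tuples; this layout is an
    assumption made for the example.
    """
    total_mass = sum(mass for mass, _ in atoms)
    return tuple(
        sum(mass * position[axis] for mass, position in atoms) / total_mass
        for axis in range(3)
    )
```

For a side chain center of mass, the list would hold the side chain heavy atoms of one residue; for a backbone pseudoatom, it would hold the α-carbon and the atom bonded to it.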

A "reduced model" refers to a three-dimensional structural model of a protein wherein fewer than all heavy atoms (e.g., carbon, oxygen, nitrogen, and sulfur atoms) of the protein are represented. For example, a reduced model might consist of just the α-carbon atoms of the protein, with each amino acid connected to the subsequent amino acid by a virtual bond. Other examples of reduced protein models include those in which only the α-carbon atoms and side chain centers of mass of each amino acid are represented, or where only the polypeptide backbone is represented.

The term "computer useable medium" is used to generally refer to media such as removable storage devices, and signals. The term also refers to software or program instructions to a computer system. Computer programs (also called computer control logic) are stored in a main memory and/or on a secondary memory and can also be received and transmitted via a communications interface. Such computer programs, when executed, enable the computer system to perform the features of the present invention as discussed herein.

Computational protein structures generated in the course of practicing the invention can be of different quality, and can be used for various purposes. Often, models produced by computational methods are reduced models. Indeed, many reduced models only show the polypeptide backbone of the protein, and such models are preferred in the practice of the invention. Of course, it is understood that once a protein structure based on a reduced model has been generated, all or a portion of it may be further refined to include additional predicted detail, up to including all atom positions.

As those in the art will appreciate, computational methods usually produce lower quality structures than experimental methods, and the models produced by computational methods are often called "inexact models." While not necessary in order to practice the instant methods, the precision of such models can be determined using a benchmark set of proteins whose structures are already known. The difference between the computational model and the experimentally determined structure can be quantified in any suitable manner, with a measure called "root mean square deviation" (RMSD) being preferred. A model having an RMSD of about 2.0 Å or less as compared to a corresponding experimentally determined structure is considered "high quality". Frequently, computationally derived protein models have an RMSD of about 2.0 Å to about 6.0 Å when compared to one or more experimentally determined structures, and are called "inexact models". As those in the art will appreciate, RMSDs can also be determined for one or more atomic positions when two or more experimental structures have been generated for the same protein.
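The RMSD measure can be sketched as below. This is a simplified illustration with an assumed function name and input layout: it presumes that the two coordinate sets list equivalent atoms (for example, α-carbons) in the same order and have already been superimposed. A practical comparison would first find the optimal rigid-body superposition, a step omitted here.

```python
import math


def rmsd(model, reference):
    """Root mean square deviation between two equally long coordinate lists.

    Each list holds (x, y, z) tuples for equivalent atoms in the same order.
    The structures are assumed to be already superimposed (an assumption of
    this sketch; no fitting is performed).
    """
    if len(model) != len(reference) or not model:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(model, reference))
    return math.sqrt(squared / len(model))
```

A value returned by such a function could then be compared against the approximate 2.0 Å and 6.0 Å figures given above to describe a model as high quality or inexact.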

SUMMARY OF THE INVENTION

The object of this invention is to generate one or more computational models of protein structure (including inexact and reduced models) from amino acid sequence data, for example, from deduced primary amino acid sequences derived from the nucleotide sequence of a gene identified in the course of a genome sequencing project. As those in the art will appreciate, multiple computational models can be built for a given amino acid sequence, and given adequate computational resources, computational models may be developed on proteome-wide scales.

In one embodiment the invention provides a method for computationally generating a protein structural model such method comprising using a computer running a first computer program logic to generate first close and distant sequence profiles for a probe amino acid sequence. Such method further using a computer running a second computer program logic to scan a database of template protein structures with the first close and distant sequence profiles to identify a plurality of template protein structures in the database that best match the first close and distant sequence profiles. Such method further using a computer running a third computer program logic to generate second close and distant sequence profiles for a probe amino acid sequence, wherein the second close and distant sequence profiles comprise the first close and distant sequence profiles and secondary interactions between amino acid residues of template protein structures. Such method further using a computer running a fourth computer program logic to scan the database of template protein structures with the second close and distant sequence profiles to identify a plurality of template protein structures in the database that best match the second close and distant sequence profiles. 
In another embodiment, the methods of the invention provide for a method for generating a protein structural model comprising generating first close and distant sequence profiles for a probe amino acid sequence, scanning a database of template protein structures with the first close and distant sequence profiles to identify a plurality of template protein structures in the database that best match the first sequence profiles, generating second close and distant sequence profiles for a probe amino acid sequence, wherein the second close and distant sequence profiles comprise the first close and distant sequence profiles and secondary interactions between amino acid residues of template protein structures, and scanning the database of template protein structures with the second close and distant sequence profiles to identify a plurality of template protein structures in the database that best match the second close and distant sequence profiles.

In still another embodiment, the invention provides a protein structural model produced by the above stated method embodiments of the invention.

In further embodiments of the invention, the various techniques, methods, and aspects of the invention described herein can be implemented in part or in whole using computer-based systems and methods. Additionally, computer-based systems and methods can be used to augment or enhance the functionalities described and increase the speed at which the functions can be performed, and provide additional features and aspects as a part of or in addition to those of the present invention described elsewhere in this document. In related embodiments, where the elements are implemented using software, the software may be stored in, or transmitted via, a computer program product and loaded into a computer system using a removable storage drive, hard drive or communications interface. The control logic (software), when executed by a processor, causes the processor to perform the functions of the invention as described herein. In another embodiment, the elements are implemented primarily in hardware using, for example, hardware components such as PALs, application specific integrated circuits (ASICs) or other hardware components. Implementation of a hardware state machine so as to perform the functions described herein will be apparent to persons skilled in the relevant art(s). In yet another embodiment, elements are implemented using a combination of both hardware and software. In still another aspect of the invention, there is provided a computer program product comprising a computer useable medium having computer program logic recorded thereon for creating first close and distant sequence profiles for a probe amino acid sequence, such computer useable medium and computer program logic used in scanning a database of template protein structures with the first close and distant sequence profiles to identify a plurality of template protein structures in the database that best match the first sequence profiles. 
The computer useable medium and computer program logic are further used for generating second close and distant sequence profiles for the probe amino acid sequence, wherein the second close and distant sequence profiles comprise the first close and distant sequence profiles and secondary interactions between amino acid residues of template protein structures, and for scanning the database of template protein structures with the second close and distant sequence profiles to identify a plurality of template protein structures in the database that best match the second close and distant sequence profiles.

In yet another embodiment, the invention provides a method for determining a biochemical function of a protein comprising generating a protein structural model, according to at least one of the above stated methods, for the protein, and determining that the protein possesses the ability to perform the biochemical function under standard reaction conditions by identifying a sub-structure in the protein that corresponds to the biochemical function.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1A shows a schematic overview of the entire threading approach embodied in PROSPECTOR. In these embodiments, all alignments are generated using dynamic programming. In the upper half of this diagram, a first threading scheme, termed "PROSPECTOR1", is presented in flow chart format. As those of skill in the art will understand, PROSPECTOR1 is a hierarchical approach comprising the use of "close" sequence profiles and "distant" sequence profiles that, for each structure, generate probe-template alignments that are used in the evaluation of pair interactions in a second threading alignment pass. In preferred embodiments a total of twenty structures (comprising four scoring functions times five structures for each scoring function) are reported. The lower portion of the diagram shows a flow chart for a second pass, termed "PROSPECTOR2". As indicated, in PROSPECTOR2 the results of PROSPECTOR1 are pooled and consensus contacts are selected in the set of the (typically twenty) best structures. Using a recently developed formalism described by Skolnick J, et al., Proteins 38: 3-16 (2000), the consensus contacts are converted into a protein-specific pair potential. This converts contacts into a potential energy term by dividing the observed number of contacts by the square of the inverse of the number of residues in the protein and taking the natural logarithm of the resulting ratio. These protein-specific pair potentials are used to evaluate the pair interactions in the second cascade of the threading algorithm employing the sequence-based profile to generate alignments.

DETAILED DESCRIPTION OF THE INVENTION

In a first set of embodiments, the method of the invention uses a hierarchical threading approach comprising at least two stages of making alignment profile determinations. In a preferred embodiment, the protocol for PROSPECTOR1 is the following: first, close and distant sequence profiles are generated; second, each of these sequence profiles is used to scan a structural database. The probe-template alignments provided by the sequence profile scoring function are used to identify the partners for the evaluation of the pair interactions in the probe sequence for use in the next threading iteration that employs sequence plus secondary structure plus pair interaction profiles. Preferably, but not essentially, a plurality of top scoring structures for each scoring scheme are collected and composite results are reported. The number of structures in each plurality of top scoring structures may be the same or different. The number of structures in each plurality is generally at least 2, and is preferably 3, 4, 5, 6, 7, 8, 9, or 10 or more. The upper limit on the total number of structures, and structures in each plurality, could theoretically be infinite, but computational resource limitations (e.g., cpu number and speed, available memory, software considerations, etc.) typically necessitate the use of reasonable numbers of structures such as up to 80 in number per set. Pluralities each containing 5 structures are particularly preferred.
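By way of illustration only, the collection of a plurality of top scoring structures for each scoring scheme can be sketched in Python as follows. The function name `pool_top_structures`, the input layout, and the assumption that a higher score is better are hypothetical choices made for this sketch and are not taken from the disclosure.

```python
from typing import Dict, List, Tuple


def pool_top_structures(
    scores: Dict[str, Dict[str, float]], top_n: int = 5
) -> List[Tuple[str, str, float]]:
    """Collect the top_n templates for each scoring function.

    scores maps a scoring-function name to {template id: score}.
    Assumption for this sketch: a higher score is a better match.
    """
    pooled = []
    for func_name, by_template in scores.items():
        ranked = sorted(by_template.items(), key=lambda kv: kv[1], reverse=True)
        for template_id, score in ranked[:top_n]:
            pooled.append((func_name, template_id, score))
    return pooled
```

With four scoring functions and `top_n=5`, the pooled list holds the twenty structures referred to in the preferred embodiment; the same template may appear more than once if several scoring functions select it.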

In a related preferred embodiment, the protocol for PROSPECTOR2 is the following: first, consensus contacts are identified in the set (e.g., 8, 10, 12, 14, 15, 16, 17, 18, 19, 20, or more) of top scoring structures provided by PROSPECTOR1 (preferably about 20 structures, although the same considerations apply here as with respect to PROSPECTOR1); a consensus contact is one that occurs in structures with Z-scores greater than about 1.2 and is found at least 3 times. Second, these consensus contacts are used to construct a protein-specific pair potential that is used in a subsequent iteration of threading, again based on close and distant sequence profiles. By "protein-specific pair potential" is meant a pair potential specific to a particular protein. Generally, a total of 3 sets of iterations are preferred. Additional iterations may be included to enhance the ability of the algorithm to recognize distantly related proteins.
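The selection of consensus contacts can be illustrated with the following minimal Python sketch. It assumes that each top scoring structure is supplied as a Z-score together with the contacts it predicts, expressed as pairs of probe residue positions; the function name and data layout are hypothetical. The thresholds default to the values stated above (Z-score greater than about 1.2, found at least 3 times) and are exposed as parameters.

```python
from collections import Counter


def consensus_contacts(structures, z_min=1.2, min_count=3):
    """Return {(i, j): count} for contacts seen in enough significant structures.

    structures: list of (z_score, contacts), where contacts is an iterable of
    (i, j) probe residue positions predicted to be in contact.
    """
    counts = Counter()
    for z_score, contacts in structures:
        if z_score <= z_min:
            continue  # skip structures whose score is not significant enough
        # Treat (i, j) and (j, i) as the same contact, counted once per structure.
        unique_pairs = {(min(a, b), max(a, b)) for a, b in contacts}
        for pair in unique_pairs:
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_count}
```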

In a related aspect, the claimed methods use a first stage to generate an initial alignment between a probe amino acid sequence and a template sequence structure, "the partly thawed approximation". Generally, this initial alignment represents an alignment approximation that in one aspect allows for alignment advancement over the "frozen" approximation of previously used methods by providing a "partly thawed" approximation of structure for the probe sequence. (In the frozen approximation, the partners in the evaluation of the pair interactions are taken from the template protein). This approximation results in alignment of the probe sequence in the template structure for use in calculating the sequence partners for the evaluation of alignment pair interactions between the template and probe sequences. In preferred embodiments, the probe sequence itself is used to evaluate the pair interactions.

Previously, in the first iteration of the frozen approximation, the partners were taken from the template structure. In practice, this worked well when the probe and template structures had similar environments. However, more often than not the environments are quite different. For example, the probe sequence might be entirely devoid of any tryptophan residues, but in the frozen approximation, a given residue might be forced to interact with a tryptophan from the template. On successive iterations, in the so-called defrosted approximation where the partners were taken from the previous alignment, there were times when the resulting alignments never converged. This resulted from the poor environment provided by the initial frozen approximation that selected the partners from the template.

In a second set of embodiments, sequence profiles are generated wherein sequences are selected from combined databases, e.g., Swissprot (http://www.expasy.ch/sprot/) and the genome sequence database (ftp://kegg.genome.ad.jp/genomes/genes). In a third set of embodiments, a sequence identity is provided. For example, FASTA (e.g., described by Pearson WR., Methods Mol Biol 24: 307-31 (1994); Pearson WR., J Mol Biol 276: 71-84 (1998)) may be used to select those sequences whose sequence "identity" to the probe sequence lies between about 35% and about 90%. By "identity" is meant the degree of amino acid residue conservation at particular amino acid positions, including all amino acid positions, between two or more amino acid sequences, and can be calculated using any suitable multiple sequence alignment algorithm, including any of a variety of well known algorithms and software packages. In further embodiments, multiple sequence alignments are generated using alignment algorithms, e.g., CLUSTALW (Jeanmougin, et al. (1998), Trends Biochem Sci, 23, 403-5; Thompson, et al. (1997), Nucleic Acids Research, 24:4876-4882; Higgins, et al. (1996), Methods Enzymol., 266, 383-402; Thompson, et al. (1994), Nucleic Acids Research, 22:4673-4680; Higgins, et al. (1992), CABIOS 8, 189-191; Higgins, et al. (1989), CABIOS 5, 151-153; Higgins, et al. (1988), Gene 73, 237-244; http://www.ebi.ac.uk/clustalw/). Such multiple alignments provide for a "close" set of alignments. By "close" set is meant a set of amino acid sequences whose amino acid identity ranges from about 35% to about 90% or more. This alignment may be described by the formula (1a). The sequence profile for the ith position in the probe sequence for amino acid type γ is

$$\phi^{close}(i,\gamma) = \frac{1}{N_{close}} \sum_{\ell=1}^{N_{close}} B(\gamma, a_{i\ell}) \qquad (1a)$$

Here $N_{close}$ is the number of sequences that are aligned in the "close" alignment, $B(\gamma,\eta)$ is the BLOSUM 62 mutation matrix (as described by Henikoff S, and Henikoff JG. Proteins 17: 49-61 (1993)) for residue types γ and η, and $a_{i\ell}$ is the amino acid at position i in the ℓth sequence.
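As a minimal sketch, and assuming that the sequence profile is the average substitution score over the aligned sequences with gaps scored as B=0 but still counted in the average (as this description specifies for both profiles), the calculation might look as follows in Python. The three-residue matrix `TOY_B` is a small illustrative excerpt rather than the full BLOSUM 62 matrix, and all names are hypothetical.

```python
GAP = "-"

# Illustrative excerpt only; a real run would use the full 20x20 matrix.
TOY_B = {
    ("A", "A"): 4, ("A", "G"): 0, ("A", "S"): 1,
    ("G", "G"): 6, ("G", "S"): 0, ("S", "S"): 4,
}


def b(x, y):
    """Substitution score; a gap scores 0 (assumption taken from the text)."""
    if x == GAP or y == GAP:
        return 0
    return TOY_B.get((x, y), TOY_B.get((y, x)))


def sequence_profile(alignment, alphabet="AGS"):
    """Return one {residue type: averaged score} dict per alignment column."""
    n_seq = len(alignment)          # gaps are counted in the averaging
    length = len(alignment[0])
    return [
        {g: sum(b(g, seq[i]) for seq in alignment) / n_seq for g in alphabet}
        for i in range(length)
    ]
```

For the toy alignment `["AG", "A-", "SG"]`, the gap in the second column lowers the averaged score for glycine at that position from 6 to 4, which is the diminishing effect of gapped regions noted in this description.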

In yet a further embodiment, the instant invention provides for the addition of additional sequences to be considered whose E-value in FASTA is less than about 10. In a preferred embodiment, a profile is generated as reported by Gribskov M, et al., Proc. Natl. Acad. Sci. USA 84: 4355-4358 (1987) (i.e., a weighted average of the amino acids at a given position) for these distantly related sequences, such profile comprising a "distant" set of alignments that may be described by the formula (1b).

$$\phi^{dist}(i,\gamma) = \frac{1}{N_{dist}} \sum_{\ell=1}^{N_{dist}} B(\gamma, a_{i\ell}) \qquad (1b)$$

In this formula, $N_{dist}$ represents the number of "distant" sequences that are aligned. By "distant" is meant a set of amino acid sequences whose E-value in FASTA is less than about 10. In yet another embodiment, the method provides for generating at least two sequence profiles: one that is more sensitive to more closely related sequences and another that provides for detecting more distantly related sequences. With respect to the two sequence profiles and the formulas used to provide them, gaps are assigned a value of B=0, but are counted in the averaging process. Thus, if a region has a large number of gaps, then the contribution of that region to the alignment is diminished relative to a gap-free region, where B>0, e.g., favorable mutations may have occurred.

In further embodiments, a first-pass sequence-profile score matrix is provided. By "first-pass sequence-profile score matrix" is meant a score matrix as calculated by formula (2a). The score matrix for the first pass through the structural database is calculated according to formula (2a). This matrix is associated with aligning residue i with the Jth residue in the Kth structure:

$$\Xi_K^{\phi,1}(i,J) = \phi(i, A_{JK}) \qquad (2a)$$

where $A_{JK}$ is the residue at position J in the Kth structure. Here, the shorthand notation

$$\phi = close \ \text{or} \ dist \qquad (2b)$$

is used, which refers to the close or distant set of multiple sequence alignments. In a further embodiment, a second stage of alignment is performed. Such secondary alignment provides secondary structure propensities and pair interactions. By "structure propensities" and "pair interactions" is meant, as defined above, the propensities for secondary structure of a given amino acid, and the interaction potential between residues, respectively. For the secondary structure propensities, a homology averaged secondary structure profile energy is defined using the formula (3),

$$S^{\phi}(\Theta_1,\Theta_2,i) = \frac{1}{N_{\phi}} \sum_{\ell=1}^{N_{\phi}} \varepsilon(\Theta_1,\Theta_2,a_{i\ell},a_{i+1,\ell}) \qquad (3)$$

wherein $\varepsilon(\Theta_1,\Theta_2,a_{i\ell},a_{i+1,\ell})$ represents the energy of a consecutive pair of amino acids $a_{i\ell}, a_{i+1,\ell}$ in consecutive secondary structure environments $\Theta_1$ and $\Theta_2$, respectively, which can be helix, beta, or turn. In these embodiments, three conformational states are considered using multiple-sequence averaging. These embodiments provide for a finer grained alignment description over previously reported methods that use six conformational states of a single sequence. In still further embodiments, the claimed invention uses yet another step wherein the alignment provided by the sequence-only scoring profile is used to generate the "partners" in the evaluation of the "pair potentials". By "partners" is meant those residues interacting with the residue of interest. In this embodiment, the homology averaged pair interaction matrix defined by formula (4) is considered.

$$E^{\phi}(i,j) = -\frac{1}{N_{\phi}} \sum_{\ell=1}^{N_{\phi}} \varepsilon(a_{i\ell}, a_{j\ell}) \qquad (4)$$

In this formula, ε is the arithmetic average of quasichemical pair potentials that describes interactions between side chains of amino acid types γ,η that are in contact (that is, have one pair of heavy atoms within 4.5 Å of each other). This protein-specific pair potential was derived previously using weak local sequence fragment similarity and φ is defined as in formula (2b). The minus sign allows for maximization of the score, that is, gap penalties are negative.
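The contact criterion stated above (at least one pair of side chain heavy atoms within 4.5 Å) can be sketched in Python as follows. The function name and the representation of each side chain as a list of (x, y, z) coordinates in ångströms are assumptions made for illustration.

```python
import math


def side_chains_in_contact(atoms_a, atoms_b, cutoff=4.5):
    """True if any heavy atom of one side chain lies within `cutoff`
    ångströms of any heavy atom of the other.

    atoms_a, atoms_b: lists of (x, y, z) heavy-atom coordinates.
    """
    return any(
        math.dist(p, q) <= cutoff
        for p in atoms_a
        for q in atoms_b
    )
```

A contact map for a template can then be built by applying this test to every pair of residues of interest.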

In still further embodiments, the claimed methods provide for a second-pass scoring matrix that uses the partially thawed approximation. In some of these preferred embodiments, $W_K^{\phi}(J)$ is the alignment between the Jth residue in the Kth structure and the probe sequence generated by the φth sequence profile after the first iteration. In this aspect, there is one of two possible values for $W_K^{\phi}(J)$: either there is a gap in the probe sequence that aligns to the Jth position in the Kth structure, or a probe sequence position aligns to the Jth position. Using these approximations, the score matrix is constructed such that $\Xi_K^{\phi,2}(i,J)$ is associated with aligning the ith probe position with the Jth position in the Kth structure:

$$\Xi_K^{\phi,2}(i,J) = \lambda_1 \Xi_K^{\phi,1}(i,J) + \lambda_2 S^{\phi}(s_J, s_{J+1}, i) + \lambda_3 \sum_{m=1}^{nc_K(J)} E^{\phi}\big(i, W_K^{\phi}(C_{JK}(m))\big) \qquad (5)$$

Here, the first-pass sequence-profile score $\Xi_K^{\phi,1}(i,J)$ is given by formula (2a) or (2b) depending on which sequence profile is used, $S^{\phi}$ is the secondary structure profile energy of formula (3), $E^{\phi}$ is the pair interaction matrix of formula (4), and the {λ} are the weight factors of the various scoring functions (taken on optimization to be 1, 5, and 5, respectively, as this set of parameters gave the best results on the 68 pairs of proteins the analyses of which are described in detail below; with the understanding, however, that other weight factors may also be used, such as different terms for pair interactions, sequence profiles, etc.). Here, $s_J$ and $s_{J+1}$ are the conformations of residues J and J+1 in the Kth structure, $nc_K(J)$ is the number of contacts the Jth residue makes in structure K, $C_{JK}(m)$ is the identity of the mth contact partner that residue J makes in structure K, $W_K^{\phi}(C_{JK}(m))$ is the alignment to the corresponding position in the probe sequence associated with residue $C_{JK}(m)$ that was generated using the first pass, and the sequence-profile score matrix is given by either formula (2a) or (2b) depending on the sequence profile that is used. As before, dynamic programming is used to generate the alignments in the second pass. In a related embodiment, different gap opening and gap propagation penalties are allowed. Table I summarizes the set of values of the gap penalties optimized to select the maximum number of correct structures as compared to the Fischer Database for each of the four scoring functions.

Table I. Compilation of gap penalties for the 4 scoring functions used in PROSPECTOR1 and PROSPECTOR2^a

^a The numbers in parentheses refer to those cases in PROSPECTOR2 that differ from PROSPECTOR1.

In this optimization procedure, gap insertion penalties were allowed to assume all even integer values from 2 to 12, and gap propagation penalties were scanned over all integers from 1 to 6. The set of gaps used in PROSPECTOR2 may also be found in Table 1. Interestingly, when secondary structure propensities and pair interactions were considered, the gap penalties were larger for the close profile cases than the distant profile cases. This reflects the fact that when a distant sequence profile was used, the gap penalties may have to be smaller to allow the favorable alignment regions to be found.
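The scan over gap penalties described above amounts to an exhaustive grid search, which can be sketched in Python as follows. The callback `count_correct`, which would run the threading benchmark for one pair of penalties and return the number of correctly recognized structures, is a hypothetical placeholder supplied by the caller.

```python
from itertools import product


def scan_gap_penalties(count_correct):
    """Return the (insertion, propagation) pair maximizing count_correct.

    Gap insertion penalties take the even integers 2..12 and gap
    propagation penalties take the integers 1..6, as stated in the text.
    """
    insertion_values = range(2, 13, 2)   # 2, 4, 6, 8, 10, 12
    propagation_values = range(1, 7)     # 1, 2, 3, 4, 5, 6
    grid = product(insertion_values, propagation_values)
    return max(grid, key=lambda pair: count_correct(*pair))
```

The grid contains only 36 combinations, so an exhaustive scan per scoring function is inexpensive relative to the threading runs themselves.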

For each of the scoring functions, in the examples below, the top five scoring structures are described, for a total of 20 structures (4 scoring functions times the best five structures for each scoring function). Alignments for each of these probe-template assignments are also described. In yet further embodiments, the method of the invention provides for generation of protein-specific pair potentials from threading. In this aspect, as is known by those of skill in the art, on average about 35-40% of the predicted contacts are correct if they belong to pairs of residues that are at least four residues apart and occur in at least 25% of the top scoring structures having a Z-score greater than 1.3. By "Z-score" is meant the energy relative to the mean divided by the standard deviation of the energy.

For instances where a probe cannot be assigned to a specific template because of poor score significance, there may be fragments of template structures that bear some relationship to the native structure of the probe sequence. For some of these relationships, at least some of the selected structures should have consistent substructure features such as side chain contacts. To access such structures for use in the methods of the invention, the following criteria may be used.

Because the predicted contacts are inexact, rigorously demanding that they all be satisfied would lead to spurious results. Rather, such contacts are converted into a pseudo potential that reflects a bias towards such contacts. By "pseudo potential" is meant that this potential is not a real potential energy but a knowledge based one. This potential is used in a repeat of the second pass of the threading procedure using the newly derived, now protein-specific, pair potential. Because PROSPECTOR2 uses consensus information, which includes a significant number of correct or near correct contacts, it is more specific than when such information is absent, as in PROSPECTOR1. As in PROSPECTOR1, the alignment that assigns the partners is provided by either the close or distant sequence profiles. Dynamic programming is used to evaluate the probe-template fitness with an energy function that now includes the modified pair potentials. The threading method that employs these terms is called PROSPECTOR2. A schematic overview of PROSPECTOR2 is given in the lower half of Figure 1. In a particular preferred embodiment, the protein-specific pair potential is constructed as follows: if there are more than 3 contacts predicted between residues i and j, whose total number is $q_{ij}$, then the pair potential for these positions is calculated as

$$\varepsilon_{ij} = -\ln\left(\frac{q_{ij}}{q_{ij}^{0}}\right) \qquad (6a)$$

where the expected number of contacts, $q_{ij}^{0}$, is given by

$$q_{ij}^{0} = \frac{1}{n^{2}} \sum_{k}\sum_{l} q_{kl} \qquad (6b)$$

and n is the number of residues in the probe sequence. For those pairs of positions where no consensus contacts are found, the profile-based pair potential of formula (4) is used. The arithmetic average of this potential and the profile-based pair potential given by formula (4) is then used in the second threading pass for the close and distant cases of protein-specific pair potentials. The use of pair potentials improves the fold specificity in the prediction of structure.
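The conversion of consensus contact counts into a protein-specific pair potential can be illustrated by the following Python sketch. It rests on two loudly stated assumptions that are one reading of this description rather than a confirmed form: that the potential is the negative natural logarithm of the ratio of observed to expected contacts, and that the expected number is the total number of observed contacts divided by the square of the number of residues. All names are hypothetical.

```python
import math


def protein_specific_potential(q, n, min_contacts=3):
    """Return {(i, j): potential} for pairs with more than min_contacts.

    q: {(i, j): number of times the contact was predicted}
    n: number of residues in the probe sequence
    Assumed form (not confirmed by the source): -ln(q_ij / q0),
    with q0 = (sum of all q) / n**2.
    """
    q0 = sum(q.values()) / n ** 2
    return {
        pair: -math.log(count / q0)
        for pair, count in q.items()
        if count > min_contacts      # "more than 3 contacts" in the text
    }
```

Under these assumptions, a pair observed more often than expected receives a negative (favorable) value, and pairs without sufficient consensus are left to the profile-based pair potential.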

Another aspect of the invention concerns computationally derived structures for amino acid sequences analyzed by the instant invention. A related aspect of the invention concerns the analysis of the computationally derived protein structures derived through application of the instant invention. As those in the art will appreciate, such structures can be used for various purposes. One such purpose is testing, and a number of ways of testing these structures are available. For example, one way of testing is to examine the mean Z-score of the correctly identified template structure as a function of the various potentials used, where the Z-score for the Kth structure having energy $E_K$ is given by formula (7),

$$Z_K = \frac{E_K - \langle E \rangle}{\sigma} \qquad (7)$$

wherein $\langle E \rangle$ and σ are the mean and standard deviation values of the energy of the probe in all templates of the structural database, respectively. Such an analysis provides one measure of the utility of a particular scoring function. Since the sequences are not randomized in the evaluation of formula (7), the reported Z-scores are lower than for randomized sequences. On the other hand, sequence randomization is a computationally expensive process, and avoidance of such randomization is a significant advantage, especially when threading is done on a genomic scale. In another example, the accuracy of computationally derived structures can be tested. One way of assessing accuracy is to examine side chain contact maps in the structures. For example, among the quantities reported herein are $f_c$, the fraction, and $N_c$, the number of correctly determined contacts. The significance of these quantities is measured. One such measure arises by generating random alignments of the probe sequence in the correct template structure. However, this does not necessarily indicate the significance of the contact map determinations. Consider the case where one has a library of homologous structures and 95% of the contact map is correctly determined. By randomizing the contact map, one would conclude that this is a highly significant determination. However, one could just as easily have selected the structure at random. Here, the specificity of the determination is in fact close to zero. In general, if one focuses on a single structure, relative to the library of many or all structures, one has no idea of the significance of the value of $N_c$. To address this issue, it is reasonable to calculate the average number, $N^0$, of correctly determined contacts for the best probe-template alignments of the probe sequence in all template structures as well as the standard deviation of this quantity, $\sigma^0$. This can be done for some, or preferably all, of the entire structural template library.
Then, the Z-score for the number of correctly determined contacts for the correct probe-template pair is calculated using formula (8).

$$Z_{con} = \frac{N_c - N^{0}}{\sigma^{0}} \qquad (8)$$

This quantity more appropriately measures the significance of a given number of determined contacts.
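Both Z-scores discussed above share the same form, namely a value relative to the library mean divided by the library standard deviation, and can be sketched with a single Python helper. The function name is hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since this description does not specify which is intended.

```python
from statistics import mean, pstdev


def z_score(value, library_values):
    """(value - mean) / standard deviation over the template library.

    value: energy (or number of correct contacts) for the template of interest
    library_values: the same quantity for all templates in the library
    Assumption: population standard deviation is used.
    """
    mu = mean(library_values)
    sigma = pstdev(library_values)
    return (value - mu) / sigma
```

For the energy Z-score, `library_values` would hold the probe's score in every template; for the contact Z-score, it would hold the number of correctly determined contacts for the best alignment in every template.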

One of the problems with earlier, pseudo one-dimensional treatments of threading is the problem of correctly selecting the partners for evaluation of the pair interactions. Originally, to address the problem, the frozen approximation was introduced in which the partners from the template structure are used in the evaluation of the pair potentials. If the environments were similar, the approximation worked well. Otherwise, it performed poorly. However, as described herein for the instant invention, it is desirable to retain the advantages of a local scoring function that enables dynamic programming to be used as the search scheme. Here, an iterative approach is provided in which a sequence profile is used to generate the initial alignment of the probe sequence in the template structure. Subsequent iterations use this alignment to evaluate the partners; this is termed the "partly thawed" approximation. This approximation works quite well, not only in the selection of the template, but also in the construction of a protein-specific pair potential whose recognition capabilities are enhanced as assessed by the Z-score of the correctly predicted structures. When the entire hierarchical approach of four scoring functions is used in PROSPECTOR2, this method correctly recognizes 61 proteins in the top position. By "top position" is meant the best score. In addition, the use of pair potentials enhances the number of correctly identified side chain contacts when the correct probe-template pair is considered.

The use of a hierarchical approach to structure/function assignments can provide additional information over the case when just a single scoring function is used. Also, the pair potentials employed have been highly optimized to give the best available results in gapless threading. Given that functional site descriptors (including those for active sites, ligand (e.g., substrate or co-factor) binding sites, etc.; see also PCT/US99/11913 and U.S. patent application serial number 09/322,067, filed May 27, 1999) can correctly select threading structures showing low sequence-structure specificity and can make both structural and functional assignments with a low false positive rate, the demands on a threading algorithm that uses such information are much less stringent than if structure prediction alone is to be done. That is, what one really requires is an algorithm that can get the correct fold near the top with a score of at least moderate significance with a reasonably good alignment, and then a functional site filter can assist in fold as well as biochemical function identification. This is the origin of the preferred hierarchical method of multiple scoring functions that, in combination, recognizes 59 of the 68 Fischer probe pairs in the top position in PROSPECTOR1 and 61 in the top position in PROSPECTOR2. Nevertheless, it is clearly important to have an excellent threading algorithm to ensure that the correct structure is within this threshold in order to be certain that all proteins in a genome having the particular fold and function are identified. Further, very distant sequence profiles possess significant information and can profitably assist in fold recognition. Indeed, quite often it is this set of sequence plus secondary structure pair interactions that provides the best Z-score for the correct probe-template pair on threading.

In summary, a new threading approach has been invented, and a preferred embodiment of this approach, termed PROSPECTOR, has been developed. The threading algorithm embodied in PROSPECTOR is at the state-of-the-art of contemporary threading algorithms, as assessed by its performance on standard benchmarks. Moreover, in the way the algorithm is constructed, known experimental restraints (e.g., disulfide bonds or NMR restraints) can be readily integrated into this threading algorithm. This can be done both by biasing the pair potential toward known contacts and by eliminating structures that do not satisfy the constraints in a post-threading selection step.

Examples

Example 1. Application to the original Fischer benchmark

The studies described in this example focus on the Fischer Database (Fischer D, et al., Pac Symp Biocomput 300-18 (1996)). This database comprises 301 template structures and 68 probe sequences, and represents a standard benchmark in the threading field. A variety of approaches were tried on this database. The results of these earlier studies are summarized below.

For a given scoring function, the Needleman-Wunsch global alignment algorithm (as described in Needleman SB, and Wunsch CD. J Mol Biol 48: 443-53 (1970)) recognized more correct probe-template pairs on average than did the Smith-Waterman local alignment algorithm (as described in Waterman MS, and Eggert M. J Mol Biol 197: 723-8 (1987)). Secondary structure profiles alone were also evaluated as the initial step in generating the probe-template alignment for pair evaluation. Secondary structure profiles alone only recognized 18 cases in the first position, whereas secondary structure profiles plus pair profiles recognized 29 cases. This clear improvement shows the utility of incorporating pair potentials in a threading approach. Nevertheless, 29 correctly recognized pairs represent rather poor performance. The major improvement in fold recognition is achieved when sequence profiles are used. If a sequence-profile-based alignment is used, but the sequence-profile term is ignored in the calculation of the energy (i.e., using formulas (3) and (4), above), then 34 probe-sequence-template structure pairs are matched as the top score. This is to be contrasted to 52 cases (see Table II, below) that are correctly assigned when the entire sequence plus secondary structure plus pair profiles are used. In all cases, when combined in a hierarchical fashion, it has been discovered that inclusion of pair interactions improves the yield of correct probe-template matches.
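For readers unfamiliar with global alignment by dynamic programming, the following Python sketch computes a Needleman-Wunsch alignment score between two sequences. It is deliberately simplified: it uses a match/mismatch score and a single linear gap penalty, whereas the method described herein scores positions with profile-based matrices and allows different gap opening and gap propagation penalties. The function name and default parameter values are hypothetical.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of sequences a and b (linear gap cost)."""
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair_score = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(
                dp[i - 1][j - 1] + pair_score,  # align a[i-1] with b[j-1]
                dp[i - 1][j] + gap,             # gap in b
                dp[i][j - 1] + gap,             # gap in a
            )
    return dp[-1][-1]
```

Because every position of both sequences must be accounted for, end gaps are penalized here, which is the feature that distinguishes global from local alignment.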

Table II. Summary of threading results using different scoring functions for the Fischer Database

Results are reported in both the top 5 (close) and 4 (distant) and top 10 (close) and 8 (distant) positions, with the number in parenthesis given by the UCLA benchmark website (http://www.doe-mbi.ucla.edu/people/fischer/BENCH/table1.html). The top part of Table II summarizes the results using PROSPECTOR1 and its hierarchy of four scoring functions. Note that the "distant" sequence profile recognizes a somewhat greater number of correct pairs (46 pairs or 67%) than does the "close" profile (44 pairs or 65%). This is very interesting in that it shows that these distant profiles contain additional information that can be profitably employed to increase the recognition abilities of these threading algorithms. However, the best single scoring function is the combined distant sequence profile plus secondary structure plus pair interaction scoring function, which recognizes 52 pairs (76%) in the top position. By itself, this single scoring function is a competitive threading algorithm (see below). This is an improvement of 6 correctly matched structures relative to the best distant sequence profile case. Further, it recognizes the most proteins in the top 4, 5, 8, and 10 positions. The performance of the close sequence-profile plus secondary structure plus pair interaction scoring function is also quite good. It recognizes more top scoring proteins than the close sequence-profile case alone (45 versus 44), and also recognizes considerably more proteins in the top 4, 5, 8, and 10 positions, for example, 55 versus 46 proteins in the top five positions. Clearly, the best performance is when all four scoring functions are combined. Then 59, 63, and 65 proteins are recognized in the top, top five, and top ten positions, respectively.

Example 2.

Another way of assessing the utility of a given scoring function is to measure the mean Z-score of the correctly identified proteins. The close sequence profile plus secondary structure plus pair interaction scoring function has a mean Z-score of 3.33, which is the best of all scoring functions and is significantly better than the close sequence profile, which has a mean Z-score of 2.65. Note that the distant profile recognizes 46 proteins in the top position and has a marginally poorer mean Z-score of 2.50 as compared to the close profile value of 2.65. For both close and distant cases, the use of pair interactions plus secondary structure propensities increases the sequence-structure specificity relative to the use of a sequence profile alone, with the mean Z-score of the distant case being 3.06. In other words, the use of structural information confers an advantage over the cases in which pure evolutionary information is used, both in terms of the number of proteins placed in the top position as well as in the sequence-structure specificity as assessed by the Z-score.

One of the best alternative methods reported on the UCLA website as of August 1, 2000 (http://www.doe-mbi.ucla.edu/people/fischer/BENCH/table1.html) is that of Gonnet (which is a pair-wise sequence-alignment method that also uses predicted secondary structure). This method recognizes 52 proteins in the top position. This is the same number that the combined distant sequence profile plus secondary structure plus pair interaction scoring function recognizes. The same number of proteins was recognized in the top four positions (56), and one less protein was recognized in the top eight (57 vs. 58). If any method, in particular a hierarchical method such as PROSPECTOR1, is considered, then that disclosed herein is clearly effective, as 59 proteins are recognized in the top position, with a total of 65 pairs recognized in the top ten positions. It is clearly superior to all of the early efforts in threading as well as to the hybrid method described by Jaroszewski L, et al. Protein Sci. 7: 1431-1440 (1998), or to BLAST or PSIBLAST (described by Altschul SF, et al. Nucleic Acids Res. 25: 3389-3402 (1997); Altschul SF, and Koonin EV. Trends Biochem Sci 23: 444-7 (1998)). In particular, for PSIBLAST, two sets of results are reported. The first is when only sequences from the Fischer Database were used to generate the profiles and the second is when an extended version of the same sequence database that was used to generate the sequence profiles is employed. For the Fischer-only database, only 24 probe-template pairs were correctly identified in the first position. Next, a larger sequence database (consisting of all the sequences in Swiss Prot, the genome sequence database from KEGG, and the TrEMBL database (http://expasy.proteome.org.au/sprot/)) was used to generate position-specific score matrices, PSSM, using the IMPALA package with default settings as described in Schaffer AA, et al. Bioinformatics 15: 1000-11 (1999). 41 cases were then assigned to the top position.
Note that this performance is worse than when either the close or the distant sequence profile was used alone. The close profile placed 46 and 49 proteins in the top five and top ten positions, respectively, and the distant profile placed 53 and 53. In contrast, PSI-BLAST placed 46 and 47 proteins in the top five and ten positions, respectively. It might be argued that since four scoring functions were used and the hybrid threading method only uses three, this was not a fair comparison. If those results obtained from the "distant" sequence profiles were eliminated, however, then 58, 62, and 64 cases were obtained in the top 1, 5, and 10 positions, respectively. Thus, with respect to this test, PROSPECTOR1 was certainly a very competitive algorithm.
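The top-1, top-5, and top-10 tallies quoted above, including the tally obtained after eliminating one family of scoring functions, can be sketched as follows. This is a minimal Python sketch under an assumption that is not stated in the text: a probe counts as recognized within the top k if its correct template lies within the top k for at least one of the retained scoring functions. The function names, the scoring-function labels, and the toy ranks are hypothetical.

```python
def best_rank(ranks):
    """Best (lowest) rank of the correct template over the scoring functions."""
    return min(ranks.values())


def count_recognized(probe_ranks, k, exclude=()):
    """Count probes whose correct template is within the top k.

    `probe_ranks` maps probe name -> {scoring function name: rank}.
    Scoring functions listed in `exclude` are ignored (assumed composite rule).
    """
    count = 0
    for ranks in probe_ranks.values():
        kept = {name: r for name, r in ranks.items() if name not in exclude}
        if kept and best_rank(kept) <= k:
            count += 1
    return count


# Hypothetical ranks of the correct template for three made-up probes.
toy_ranks = {
    "probeA": {"close": 1, "close+ss+pair": 1, "distant": 3, "distant+ss+pair": 2},
    "probeB": {"close": 12, "close+ss+pair": 7, "distant": 1, "distant+ss+pair": 4},
    "probeC": {"close": 40, "close+ss+pair": 25, "distant": 18, "distant+ss+pair": 11},
}

top1 = count_recognized(toy_ranks, 1)
top5 = count_recognized(toy_ranks, 5)
top1_without_distant = count_recognized(toy_ranks, 1, exclude=("distant",))
```

The `exclude` argument makes it possible to repeat the tally with three scoring functions instead of four, which is the kind of like-for-like comparison discussed above.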

Table IIIA analyzes the distant sequence-profile scoring function further. Note that the Z-score for the number of correctly predicted side chain contacts, Z_con (see column 6), was, in general, significantly better than one would expect at random. Indeed, it had a mean value of 7.57. At first glance, it might be argued that this was simply an artifact, in that the sequence profile generates a good probe-template match based on the score significance and, since the two structures are similar, this result is trivial.

Table IIIA. Compilation of results on the Fischer benchmark for the distant sequence-profile scoring function

Probe  Template  Rank  Z-score^a  N_c^b  f^c   Z_con^d  N_0^e
2mnr_  4enl_           5.95       118    0.09  6.48     22.5
1tahA  1tca_     88    0.52       80     0.14  4.34     23.53
1ltsD  1bovA     40    0.82       6      0.04  -1.07    12.98
1mdc_  1ifc_           2.78       0      —     -1.91    12.92
3chy_  4fxn_     60    0.75       46     0.16  2.79     18.8
2sga_  4ptp_           1.34       72     0.19  13.06    6.8
1fc1A  2fb4H           1.52       134    0.24  11.39    17.76
1onc_  7rsa_           4.3        176    0.61  30.37    11.09
1fxiA  1ubq_     19    0.9        20     0.1   2.22     9.44
3hlaB  2rhe_           1.17       54     0.25  8.91     8.52
3rubL  6xia_     3     1.53       46     0.07  1.2      25.56
1chrA  2mnr_           5.33       556    0.44  33.87    24.21
2pia_  1fnr_     175   0.31       42     0.1   1.33     23.78
1aep_  256bA           1.07       34     0.14  1.6      18.23
2ak3A  1gky_           2.06       114    0.27  6.13     26.59
3cd4_  2rhe_           2.24       66     0.29  8.37     11.55
1cauB  1cauA           1.94       198    0.47  23       16.06
1c2rA  1ycc_           4.05       164    0.54  21.19    15.07
1aaj_  1paz_           1.7        102    0.4   14.93    11.55
1gky_  3adk_           1.37       182    0.34  12.67    22.61
1mioC  3minB           7.18       542    0.31  23.18    32.5
1eaf_  4cla_           1.98       174    0.32  13.2     21.89
1pfc_  3hlaB           1.74       62     0.22  12.3     6.27
5fd1_  2fxb_     5     1.2        26     0.41  2.72     10.93
2afnA  1aozA           2.66       54     0.06  3.02     18.62
1hrhA  1rnh_           1.62       82     0.28  8.32     15.3
1npx_  3grs_     1     7.72       492    0.36  26.44    26.53
1bbt1  2plv1     1     1.64       106    0.34  13.55    15.63
1mup_  1rbp_     1     1.56       108    0.29  10.74    17.31
1aba_  1ego_     1     2.08       78     0.32  11.94    10.43
1crl_  1ede_     36    1.1        46     0.07  1.19     25.39
1cpcL  1colA     3     1.06       34     0.1   1.51     18.81
2azaA  1paz_     24    0.82       46     0.21  4.71     13.64
1bgeB  1gmfA     11    0.88       62     0.27  4.28     18.63
1ten_  3hhrB     127   0.52       28     0.12  4.56     7.52
1hip_  2hipA     1     2.03       104    0.57  21.26    7.94
1arb_  4ptp_     1     2.85       22     0.04  0.96     13.9
1atnA  1atr_     1     1.69       178    0.21  9.35     26.14
2sarA  9rnt_     4     1.01       26     0.18  2.98     10.01
1sacA  2ayh_     1     1.32       32     0.12  1.37     19.05
1hom_  1lfb_     1     1.37       70     0.55  8.27     11.86
2snv_  4ptp_     11    1.06       50     0.19  5.02     14.06
1cewI  1molA     15    0.95       86     0.34  13.18    12.23
1cid_  2rhe_     15    0.79       30     0.12  1.7      16.52
2hhmA  1fbpA     1     1.27       128    0.2   9.39     20.15
1tie_  4fgf_     1     1.14       36     0.14  3.76     11.82
1rcb_  1gmfA     1     1.01       38     0.15  2.08     17.37
1tlk_  2rhe_     1     1.14       82     0.39  11.74    10.76
1stfI  1molA     3     0.95       22     0.1   1.74     11.64
2omf_  2por_     1     2.11       38     0.08  1.8      18.78
4sbvA  2tbvA     1     2.49       110    0.21  11.38    15.36
1dxtB  1hbg_     1     1.8        246    0.64  23.88    19.62
2cmd_  6ldh_     1     8.68       430    0.4   23.7     27.2
2fbjL  8fabB     2     1.02       24     0.09  0.49     19.47
2sas_  2scpA     1     2.04       176    0.33  10.43    24.37
2pna_  1shaA     1     1.18       6      0.06  -0.7     9.66
1osa_  4cpv_     1     1.86       142    0.46  8.53     26.71
2hpdA  2cpp_     1     3.64       284    0.25  14.48    27.08
1lgaA  2cyp_     1     3.69       442    0.46  28.53    23.8
1bbhA  2ccyA     1     2.36       118    0.38  12.22    15.12
1isuA  2hipA     1     1.56       42     0.30  8.98     6.33
2mtaC  1ycc_     1     1.13       26     0.08  1.13     16.74
1dsbA  2trxA     5     0.93       66     0.24  3.62     22.83
2sim_  1nsbA     256   -0.49      28     0.06  0.45     22.41
2gbp_  2liv_     16    0.82       58     0.1   1.68     28.46
1gp1A  2trxA     84    0.55       28     0.17  0.5      22.54
8i1b_  4fgf_     1     1.17       60     0.19  6.1      15.18
1gal_  3cox_     1     2.63       322    0.28  17.87    25.42
average                1.95              0.25  7.57

^a Z-score for the score significance is given by formula (7).
^b Number of correctly determined contacts for the correct probe-template pair.
^c Fraction of correctly determined contacts for the correct probe-template pair.
^d Z-score of correctly determined contacts given by formula (8) for the correct probe-template pair.
^e Number of correctly determined contacts averaged over the entire structural template library.

A number of the correctly ranked structures had a rather poor energy Z-score, yet their contact prediction was highly significant; e.g., 1bbt1 in 2plv1 has a Z-score of 1.64 and a Z_con of 13.6. Furthermore, some of the probe-template pairs that did not lie near the top scores could also have a significant Z_con. For example, the score of 1ten_ in 3hhrB was at position 127, yet Z_con is 4.6. Note that 4/11 of the poorly ranked structures (rank ≥ 16) have a Z_con greater than 3, which is much better than one might guess based on the rank of the correct template structure. Of course, there are some cases that are much worse than random as well. This substantiates the observation that a sequence profile can often generate a reasonable set of correct contacts (on average 25% correct) even when the score of the alignment is not significant. However, since there was a substantial fraction of incorrect contacts as well, the pair-potential contribution cannot be made too large, as these incorrect contributions could dominate the score.
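The contact statistics discussed above (N_c, f, and Z_con) can be illustrated with a short Python sketch. It rests on two assumptions that are not confirmed by the text shown here: that f is the number of correct contacts divided by the number of predicted contacts, and that Z_con is the number of correct contacts for the pair minus the library average (N_0 in the tables), divided by the standard deviation over the library. Formula (8) may differ. All names and data are hypothetical.

```python
from statistics import mean, pstdev


def correct_contacts(predicted, native):
    """N_c: predicted side chain contacts that also occur in the native structure.

    Both arguments are sets of (i, j) residue-index pairs with i < j.
    """
    return len(predicted & native)


def contact_fraction(predicted, native):
    """f: fraction of predicted contacts that are correct (assumed denominator)."""
    if not predicted:
        return 0.0
    return correct_contacts(predicted, native) / len(predicted)


def z_con(n_correct, library_counts):
    """Assumed Z_con: (N_c - library mean) / library standard deviation."""
    counts = list(library_counts)
    sigma = pstdev(counts)
    if sigma == 0:
        raise ValueError("all library counts are identical")
    return (n_correct - mean(counts)) / sigma


# Made-up contact sets for one probe-template alignment.
toy_predicted = {(1, 5), (2, 8), (3, 9), (4, 12)}
toy_native = {(1, 5), (3, 9), (6, 10)}
toy_nc = correct_contacts(toy_predicted, toy_native)
toy_f = contact_fraction(toy_predicted, toy_native)
```

Under these assumptions, a pair can have a modest alignment Z-score and still a large Z_con, because the two quantities are normalized against different distributions.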

Table IIIB, below, presents the results for a distant sequence profile plus secondary structure plus pair profile scoring function.

Table IIIB. Compilation of results on the Fischer benchmark for the distant sequence profile plus secondary structure plus pair interactions scoring function in PROSPECTOR1

Probe  Template  Rank  Z-score^a  N_c^b  f^c   Z_con^d  N_0^e
2mnr_  4enl_     1     3.6        150    0.12  5.42     33.17
1tahA  1tca_     20    0.85       142    0.15  4.67     37.55
1ltsD  1bovA     9     1.01       14     0.06  -0.5     17.67
1mdc_  1ifc_     1     3.32       0            -2.21    18.57
3chy_  4fxn_     1     1.33       76     0.2   3.87     26.65
2sga_  4ptp_     1     1.47       86     0.18  11.33    1.1
1fc1A  2fb4H     1     2.5        242    0.4   15.27    27.69
1onc_  7rsa_     1     4.98       176    0.61  22.15    16.41
1fxiA  1ubq_     2     1.39       30     0.15  2.61     13.8
3hlaB  2rhe_     2     1.55       66     0.29  5.95     14.59
3rubL  6xia_     21    1.34       78     0.09  1.69     36.12
1chrA  2mnr_     1     8.28       638    0.46  20.19    45.57
2pia_  1fnr_     50    0.63       148    0.18  4.99     40
1aep_  256bA     1     1.48       18     0.06  -0.36    22.19
2ak3A  1gky_     1     2.66       132    0.24  4.62     41.02
3cd4_  2rhe_     1     3.02       96     0.38  9.97     1.5
1cauB  1cauA           3.28       268    0.57  24.58    20.96
1c2rA  1ycc_           4.75       170    0.56  18.93    19.06
1aaj_  1paz_           2.8        146    0.5   15.8     16.69
1gky_  3adk_           2.33       172    0.32  7.68     37.14
1mioC  3minB           8.01       642    0.37  17.4     46.84
1eaf_  4cla_           3.56       244    0.38  13.59    33.14
1pfc_  3hlaB           2.72       66     0.22  9.11     9.55
5fd1_  2fxb_           1.4        34     0.22  2.28     17.34
2afnA  1aozA           2.5        72     0.07  2.46     30.94
1hrhA  1rnh_           2.38       96     0.34  7.76     22.13
1npx_  3grs_           7.91       536    0.37  16.48    45.61
1bbt1  2plv1     43    0.77       132    0.41  13.07    17.1
1mup_  1rbp_           2.31       208    0.42  14.79    30.07
1aba_  1ego_           2.47       78     0.32  8.41     14.82
1crl_  1ede_           1.94       138    0.13  3.8      39.09
1cpcL  1colA           1.34       36     0.08  0.88     24.96
2azaA  1paz_     2     1.47       72     0.29  5.67     19.61
1bgeB  1gmfA     3     1.39       76     0.21  3.9      25.35
1ten_  3hhrB     37    0.96       40     0.15  5.35     10.11
1hip_  2hipA           2.74       96     0.54  15.43    11.6
1arb_  4ptp_           2.23       36     0.05  1.66     18.89
1atnA  1atr_           2.27       156    0.18  5.02     38.4
2sarA  9rnt_           1.38       70     0.29  7.78     15.1
1sacA  2ayh_           1.28       58     0.13  2.28     29.08
1hom_  1lfb_           2.05       88     0.64  7.77     18.03
2snv_  4ptp_     14    1.15       62     0.23  4.67     20.36
1cewI  1molA           1.48       112    0.38  11.67    18.76
1cid_  2rhe_           1.01       24     0.1   0.16     22.29
2hhmA  1fbpA           2.83       236    0.24  11.05    36.72
1tie_  4fgf_           1.51       66     0.18  4.87     19.52
1rcb_  1gmfA           1.35       90     0.3   4.76     28.3
1tlk_  2rhe_           1.46       140    0.56  16.58    15.8
1stfI  1molA           1.36       48     0.22  4.29     15.92
2omf_  2por_     12    1.17       132    0.2   6.02     29.1
4sbvA  2tbvA           2.52       116    0.21  8.57     23.28
1dxtB  1hbg_           3.4        260    0.68  18.63    27.79
2cmd_  6ldh_           9.54       462    0.42  15.53    43.85
2fbjL  8fabB           2.25       140    0.2   9.84     24.64
2sas_  2scpA           2.52       176    0.33  7.27     38.76
2pna_  1shaA           1.42       8      0.07  -0.77    12.18
1osa_  4cpv_           3.17       152    0.45  6.31     39.87
2hpdA  2cpp_           8.86       652    0.42  20.45    46.73
1lgaA  2cyp_           6.48       502    0.5   21.52    38.11
1bbhA  2ccyA           2.98       176    0.53  14.89    21.18
1isuA  2hipA           1.81       40     0.3   7.68     7.5
2mtaC  1ycc_           1.25       44     0.14  1.77     24.58
1dsbA  2trxA           1.69       88     0.31  3.31     34.67
2sim_  1nsbA     112   0.39       78     0.08  1.91     37.13
2gbp_  2liv_           1.57       164    0.21  4.21     46.57
1gp1A  2trxA     25    0.97       54     0.18  1.56     32.03
8i1b_  4fgf_     1     1.8        66     0.19  4.47     22.4
1gal_  3cox_     1     4.68       All    0.29  17.49    38.49
average                2.59              0.29  8.39

^a Z-score for the score significance is given by formula (7).
^b Number of correctly determined contacts for the correct probe-template pair.
^c Fraction of correctly determined contacts for the correct probe-template pair.
^d Z-score of correctly determined contacts given by formula (8) for the correct probe-template pair.
^e Number of correctly determined contacts averaged over the entire structural template library.

As would be expected, compared to the distant profile case, the overall mean Z-score increased from 1.95 to 2.59. On average, 29% of the contacts were correct, and the mean Z-score of correctly predicted contacts increased from 7.57 to 8.39. The ranking of 16 probe-template pairs improved and that of 6 became worse. All 6 cases that had a worse ranking had Z-scores less than 1.6, a range in which 16/32 cases were correctly assigned. Furthermore, the misassignment of 2omf_ (a membrane protein) by a pair potential derived for water-soluble proteins was understandable. In another two cases (3hlaB and 1sacA), the rank moved from first to second. For 3hlaB, the best-scoring fold, 3cd4_, had its best structural alignment with an RMSD of 2.52 Å on the Cαs over a slightly smaller part of the structure, as compared to the best structural alignment of 2rhe_, with an RMSD of 2.56 Å over a slightly longer piece of structure. For 1sacA, the correct fold (2ayh_) and the misassigned fold (8fabA) were both beta barrels, with the latter having a significant structural superposition over roughly half of the 1sacA native structure. For 3rubL, the rank of 6xia_ moved from third to twenty-first, with 1ipd_ rated as the best match. The best structural superposition of 1ipd_ and 3rubL was 2.78 Å, while that of 6xia_ and 3rubL was 2.52 Å over about two thirds of the structure. Finally, for 1tahA, the rank improved from 88th in the first pass to 20th in the second.
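The comparison of ranks between the two passes can be tallied mechanically. The Python sketch below is illustrative only: it uses just the eight probe-template pairs whose ranks are printed explicitly in both Table IIIA and Table IIIB, so its counts cover that subset and not the full benchmark. The function name is hypothetical.

```python
def classify_rank_changes(ranks):
    """Split probes into improved, worse, and unchanged lists.

    `ranks` maps probe name -> (rank in first pass, rank in second pass);
    a smaller rank is better.
    """
    improved = sorted(p for p, (first, second) in ranks.items() if second < first)
    worse = sorted(p for p, (first, second) in ranks.items() if second > first)
    unchanged = sorted(p for p, (first, second) in ranks.items() if second == first)
    return improved, worse, unchanged


# (Table IIIA rank, Table IIIB rank) for pairs with both ranks printed.
subset = {
    "1tahA": (88, 20),
    "1ltsD": (40, 9),
    "3rubL": (3, 21),
    "2pia_": (175, 50),
    "1ten_": (127, 37),
    "2snv_": (11, 14),
    "2sim_": (256, 112),
    "1gp1A": (84, 25),
}

improved, worse, unchanged = classify_rank_changes(subset)
```

For this subset, six pairs improve and two (3rubL and 2snv_) become worse, which is in line with the direction of the changes described above.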

Example 3

Turning now to PROSPECTOR2, and using the formalism of formulas (6a) and (6b), above, a set of protein-specific potentials was derived by generating consensus contacts in the top threaded structures as provided by PROSPECTOR1. The arithmetic average of this potential, given by formulas (6a) and (6b), was then used, as was the original profile-based pair potential given by formula (4), in the next threading iteration. These cases were termed the "close" and "distant" protein-specific pair potentials. The results of these calculations, as well as the entire composite result of all four scoring functions ("close" sequence profiles, "close" sequence profiles plus secondary structure plus protein-specific pair potentials, "distant" sequence profiles, and "distant" sequence profiles plus secondary structure plus protein-specific pair potentials), are reported in Table II. These results show that the "distant" case alone recognized a total of 51 proteins. This is somewhat worse than in PROSPECTOR1, where 52 proteins were recognized. However, the mean Z-score of the correctly determined proteins increased from 3.06 to 3.84, and the number of proteins in the top five positions increased from 56 to 59. The close sequence profiles plus secondary structure plus protein-specific pair potentials recognized 48 proteins in the top position, as compared to 46 in PROSPECTOR1, with an increase in the mean Z-score of correct cases (3.32 to 3.95) and the recognition of one new protein in the top position. For the composite prediction of PROSPECTOR2, 61, 64, and 65 proteins were recognized in the top, top five, and top ten positions, respectively. Interestingly, as shown in Table IV, below, the average fraction of side chain contacts that were selected in the probe-template structure increased slightly, from 0.29 (see Table IIIB, above) for the pair profiles of PROSPECTOR1 to 0.30, with a slight increase in the average Z_con from 8.39 to 8.66. Finally, the mean threading Z-score for all structures increased from 2.59 to 3.23.
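The consensus-contact step can be sketched in Python as follows. This is a minimal sketch, not the method of formulas (6a) and (6b): it only shows how contacts that recur in at least a given fraction of the top threaded structures might be selected (the abstract mentions contacts occurring in 25% of the structures), and how two pair potentials could be combined by an arithmetic average, which is one possible reading of the passage above. The function names, the dictionary representation of a potential, and the treatment of missing entries as zero are all assumptions.

```python
from collections import Counter


def consensus_contacts(contact_sets, min_fraction=0.25):
    """Contacts present in at least `min_fraction` of the threaded structures.

    `contact_sets` holds one collection of (i, j) probe residue pairs per
    top-scoring alignment.
    """
    if not contact_sets:
        return set()
    counts = Counter(c for contacts in contact_sets for c in set(contacts))
    threshold = min_fraction * len(contact_sets)
    return {c for c, n in counts.items() if n >= threshold}


def averaged_potential(specific, original):
    """Arithmetic average of two pair potentials stored as {(i, j): energy}.

    Assumption: a pair missing from one potential contributes 0.0 there.
    """
    keys = set(specific) | set(original)
    return {k: 0.5 * (specific.get(k, 0.0) + original.get(k, 0.0)) for k in keys}


# Made-up contacts from five hypothetical top-scoring alignments.
toy_sets = [
    {(1, 5), (2, 8)},
    {(2, 8)},
    {(2, 8), (3, 7)},
    {(2, 8)},
    {(4, 9)},
]
toy_consensus = consensus_contacts(toy_sets)
toy_average = averaged_potential({(2, 8): -1.0}, {(2, 8): -0.5, (3, 7): 0.2})
```

In this toy example only the contact (2, 8) passes the 25% threshold, since every other contact appears in just one of the five alignments.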

Table IV. Compilation of results on the Fischer benchmark for the distant sequence profile plus secondary structure plus protein-specific pair profiles scoring function in PROSPECTOR2

^a Z-score for the score significance is given by formula (7).
^b Number of correctly determined contacts for the correct probe-template pair.
^c Fraction of correctly determined contacts for the correct probe-template pair.
^d Z-score of correctly determined contacts given by formula (8) for the correct probe-template pair.
^e Number of correctly determined contacts averaged over the entire structural template library.

Example 4. Application to the second Fischer benchmark

The method of the invention was used to analyze 27 probe sequence pairs scanned against the original Fischer structural database (http://www.doembi.ucla.edu/people/fischer/BENCH/tablepairs2.html), with the following results. PROSPECTOR1 recognized 17 pairs in the top position, as compared to the best reported results of 17 correctly identified pairs in the top position and 21 and 22 in the top four and eight positions, respectively. However, one probe, "stel," which is supposed to be matched to 2azaA, selected 2pcy in the top position; if this match is counted as correct, there are 18, 19 (19), and 20 (20) correct matches in the top position, the top five (four) positions, and the top ten (eight) positions, respectively. Thus, somewhat better results were obtained for the first position than have been reported previously. If PROSPECTOR2 is considered, then a total of 17, 20, and 20 proteins were recognized in the top, top five, and top ten positions, respectively.

All patents and publications mentioned in this specification are indicative of the levels of skill of those skilled in the art to which the invention pertains. All references cited in this disclosure are incorporated by reference to the same extent as if each reference had been incorporated by reference in its entirety individually. One skilled in the art would readily appreciate that the present invention is well adapted for use in determining protein structure assignments.

The specific methods and compositions described herein as presently representative of preferred embodiments are exemplary and are not intended as limitations on the scope of the invention. Changes therein and other uses will occur to those skilled in the art which are encompassed within the spirit of the invention as defined by the scope of the claims. It will be readily apparent to one skilled in the art that varying substitutions and modifications may be made to the invention disclosed herein without departing from the scope and spirit of the invention. The invention illustratively described herein suitably may be practiced in the absence of any element or elements, limitation or limitations which is not specifically disclosed herein as essential. Thus, for example, in each instance herein, in embodiments of the present invention, any of the terms "comprising," "consisting essentially of," and "consisting of" may be replaced with either of the other two terms. The terms and expressions that have been employed are used as terms of description and not of limitation, and there is no intention, in the use of such terms and expressions, of excluding any equivalents of the features shown and described or portions thereof, but it is recognized that various modifications are possible within the scope of the invention claimed. Thus, it should be understood that although the present invention has been specifically disclosed by preferred embodiments and optional features, modification and variation of the concepts herein disclosed may be resorted to by those skilled in the art, and that such modifications and variations are considered to be within the scope of this invention as defined by the appended claims.

While certain embodiments and examples have been used to describe the present invention, many variations are possible and are within the spirit and scope of the invention. Such variations will be apparent to those skilled in the art upon inspection of the specification and claims herein.

Other embodiments are within the following claims.

Claims

What is claimed is:

1. A method for computationally generating a protein structural model, comprising:
a) using a computer running first computer program logic to generate first close and distant sequence profiles for a probe amino acid sequence;
b) using a computer running second computer program logic to scan a database of template protein structures with the first close and distant sequence profiles to identify a plurality of template protein structures in the database that best match the first close and distant sequence profiles;
c) using a computer running third computer program logic to generate second close and distant sequence profiles for a probe amino acid sequence, wherein the second close and distant sequence profiles comprise the first close and distant sequence profiles and secondary interactions between amino acid residues of template protein structures; and
d) using a computer running fourth computer program logic to scan the database of template protein structures with the second close and distant sequence profiles to identify a plurality of template protein structures in the database that best match the second close and distant sequence profiles.

2. A method for generating a protein structural model, comprising:
a) generating first close and distant sequence profiles for a probe amino acid sequence;
b) scanning a database of template protein structures with the first close and distant sequence profiles to identify a plurality of template protein structures in the database that best match the first sequence profiles;
c) generating second close and distant sequence profiles for a probe amino acid sequence, wherein the second close and distant sequence profiles comprise the first close and distant sequence profiles and secondary interactions between amino acid residues of template protein structures; and
d) scanning the database of template protein structures with the second close and distant sequence profiles to identify a plurality of template protein structures in the database that best match the second close and distant sequence profiles.

3. A protein structural model produced in accordance with claim 1 or 2.

4. A computer program product comprising a computer useable medium having computer program logic recorded thereon for:
a) creating first close and distant sequence profiles for a probe amino acid sequence;
b) scanning a database of template protein structures with the first close and distant sequence profiles to identify a plurality of template protein structures in the database that best match the first sequence profiles;
c) generating second close and distant sequence profiles for a probe amino acid sequence, wherein the second close and distant sequence profiles comprise the first close and distant sequence profiles and secondary interactions between amino acid residues of template protein structures; and
d) scanning the database of template protein structures with the second close and distant sequence profiles to identify a plurality of template protein structures in the database that best match the second close and distant sequence profiles.

5. A method for determining a biochemical function of a protein, comprising:
a) generating a protein structural model according to claim 1 for the protein; and
b) determining that the protein possesses the ability to perform the biochemical function under standard reaction conditions by identifying a sub-structure in the protein that corresponds to the biochemical function.