WO2001037191A2

WO2001037191A2 - Method for manipulating protein or dna sequence data (in order to generate complementary peptide ligands)

Info

Publication number: WO2001037191A2
Application number: PCT/GB2000/004418
Authority: WO
Inventors: Gareth Wyn Roberts; Jonathan Richard Heal
Original assignee: Proteom Limited
Priority date: 1999-11-19
Filing date: 2000-11-20
Publication date: 2001-05-25
Also published as: US6721663B1; EP1230615A2; WO2001037191A3; GB2356401A; GB9927485D0; AU1713401A

Abstract

This method enables computational analysis and manipulation of DNA and protein sequence data such as is found in large public databases. The method allows systematic searches of such data to identify portions of sequences which code for key intermolecular surfaces or regions of specific protein targets. In a first example, two amino acid sequences are input (steps 1, 2) to an iterative procedure (steps 4-6). A frame size is selected, in terms of a number of sequence elements. The procedure then compares pairs of frames, one from each sequence, to identify intramolecular and intermolecular regions on the basis of relationships between amino acids according to a predetermined coding scheme. The probability of existence of each region within the coding scheme is then evaluated and those regions for which the probability is greater than a predetermined threshold are discarded. The procedure outputs the remaining regions.

Description

METHOD FOR MANIPULATING PROTEIN OR DNA SEQUENCE DATA (IN ORDER TO GENERATE COMPLEMENTARY PEPTIDE LIGANDS)

Specific protein interactions are critical events in most biological processes and a clear idea of the way proteins interact, their three dimensional structure and the types of molecules which might block or enhance interaction are critical aspects of the science of drug discovery in the pharmaceutical industry.

Proteins are made up of strings of amino acids and each amino acid in a string is coded for by a triplet of nucleotides present in DNA sequences. The linear sequence of DNA code is read and translated by a cell's synthetic machinery to produce a linear sequence of amino acids, which then folds to form a complex three-dimensional protein.

The mechanisms which govern protein folding are multi-factorial and the summation of a series of interactions between biophysical phenomena and other protein molecules. Virtually all molecules signal by non-covalent attachment to another molecule ("binding"). Despite the conceptual simplicity and tremendous importance of molecular recognition, the forces and energetics that govern it are poorly understood. This is owed to the fact that the two primary binding forces (electrostatics and van der Waals interactions) are weak, and roughly of the same order of magnitude. Moreover, binding at any interface is complicated by the presence of solvent (water), solutes (metal ions and salt molecules), and dynamics within the protein, all of which can inhibit or enhance the binding reaction.

In general it is held that the primary structure of a protein determines its tertiary structure. A large volume of work supports this view and many sources of software are available to scientists in order to produce models of protein structures (Sansom 1998). In addition, a considerable effort is underway in order to build on this principle and generate a definitive database demonstrating the relationships between primary and tertiary protein structures. This endeavour is likened to the human genome project and is estimated to have a similar cost (Gaasterland 1998). Despite this assembly of background knowledge it is clear that there are considerable limitations in our abilities to predict protein structures and that these become very apparent when computational methods are applied during drug discovery programs. For many experienced practitioners the use of 'docking' programmes (which seek to examine protein-ligand interactions in detail) is 'disappointing' (Sansom 1998).

Consider this example. A typical growth factor has a molecular weight of 15,000 to 30,000 daltons, whereas a typical small molecule drug has a molecular weight of 300-700. Moreover, X-ray crystal structures of small molecule-protein complexes (such as biotin-avidin) or enzyme-substrates show that they usually bind in crevices, not to flat areas of the protein. Thus relative to enzymes and receptors, protein-protein targets are non-traditional and the pharmaceutical community has had very limited success in developing drugs that bind to them using currently available approaches to lead discovery. High throughput screening technologies in which large (combinatorial) libraries of synthetic compounds are screened against a target protein(s) have failed to produce a significant number of lead compounds.

It is possible that a large portion of the difficulties experienced in attempting to apply such computer programs to drug discovery results from an over-reliance on the consensus dogma that primary structure predicts tertiary structure.

This consensus view of the determinants of protein structure has been re-evaluated in the light of experiments with colicin E1 (Goldstein 1998). This scientific work demonstrated that 'modules of secondary structure that make up a given protein are not rigidly constrained in a single set of interactions that lead to a unique three-dimensional structure' (Goldstein 1998).

The data generated in such studies also present further issues for large structural projects such as that described by Gaasterland (1998). Proteins are identified and their function ascribed by homology searches for particular structural elements associated with a given function (e.g. transmembrane domains, enzyme cleavage sites, β-barrel fold etc.). In effect there exists a circular logic to the way in which protein structures are explored and described, and this hampers our understanding of the true biological significance since we are only searching for those things we already know.

'Given these considerations, structural genomists might consider assigning a high priority to understanding the extent to which protein-protein and other molecular interactions determine native folding patterns before their databases get too full' (Goldstein 1998).

The binding of large proteinaceous signalling molecules (such as hormones) to cellular receptors regulates a substantial portion of the control of cellular processes and functions. These protein-protein interactions are distinct from the interaction of substrates to enzymes or small molecule ligands to seven-transmembrane receptors. Protein-protein interactions occur over relatively large surface areas, as opposed to the interactions of small molecule ligands with serpentine receptors, or enzymes with their substrates, which usually occur in focused "pockets" or "clefts."

Many major diseases result from the inactivity or hyperactivity of large protein signalling molecules. For example, diabetes mellitus results from the absence or ineffectiveness of insulin, and dwarfism from the lack of growth hormone. Thus, simple replacement therapy with recombinant forms of insulin or growth hormone heralded the beginnings of the biotechnology industry. However, nearly all drugs that target protein-protein interactions or that mimic large protein signalling molecules are also large proteins. Protein drugs are expensive to manufacture, difficult to formulate, and must be given by injection or topical administration.

It is generally believed that because the binding interfaces between proteins are very large, traditional approaches to drug screening or design have not been successful. In fact, for most protein-protein interactions, only small subsets of the overall intermolecular surfaces are important in defining binding affinity. 'One strongly suspects that the many crevices, canyons, depressions and gaps, that punctuate any protein surface are places that interact with numerous micro- and macro-molecular ligands inside the cell or in the extra-cellular spaces, the identity of which is not known' (Goldstein 1998).

Despite these complexities, recent evidence suggests that protein-protein interfaces are tractable targets for drug design when coupled with suitable functional analysis and more robust molecular diversity methods. For example, the interface between hGH and its receptor buries approximately 1300 square Angstroms of surface area and involves 30 contact side chains across the interface. However, alanine-scanning mutagenesis shows that only eight side chains at the center of the interface (covering an area of about 350 square Angstroms) are crucial for affinity. Such "hot spots" have been found in numerous other protein-protein complexes by alanine-scanning, and their existence is likely to be a general phenomenon.

The problem therefore is to define the small subset of regions that define the binding or functionality of the protein.

The commercial importance of this is that a more efficient way of identifying these regions would greatly accelerate the process of drug development.

These complexities are not insoluble problems and newer theoretical methods should not be ignored in the drug design process. Nonetheless, in the near future there are no good algorithms that allow one to predict protein binding affinities quickly, reliably, and with high precision.

The invention described herein provides a method and a software tool for processing sequence data and a method and a software tool for protein structure analysis, and the data forming the product of each method, as defined in the appended independent claims to which reference should be made. Preferred or advantageous features of the invention are set out in dependent subclaims. The invention provides a method and a software tool for use in analysing and manipulating sequence data (e.g. both DNA and protein) such as is found in large databases (see EXAMPLE 1). Advantageously it may enable the conducting of systematic searches to identify the sequences that code for key intermolecular surfaces or "hot spots" on specific protein targets.

This technology may advantageously have significant applications in the application of informatics to sequence databases in order to identify lead molecules for important pharmaceutical targets.

THE ORIGINS OF COMPLEMENTARY PEPTIDE THEORY

DNA is composed of two helical strands of nucleotides (see FIG. 11). The concepts governing the genetic code and the fact that DNA codes for protein sequences are well known. The 'sense' strand codes for the protein, and as such, attracts all the attention of molecular biologists and protein chemists alike. The purpose of the other 'anti-sense' strand is more elusive. To most, its function is relegated to that of a molecular 'support' for the 'sense' strand, which is used when DNA is replicated but is of little immediate functional significance for the day-to-day activities of cellular processes.

Some research would suggest a greater role of the antisense strand of DNA above that of the basic conceptual model of replication. In particular, it had been noticed that there appeared to be a potential functional relationship between sense and anti-sense strands in viruses. Mekler (1969) observed that several minus stranded virus complexes contained protein components translated from the mRNA complementary to the RNA of the viral gene. Mekler postulated that the significance of this finding was that because this viral protein interacts strongly with the RNA from which the mRNA was generated, a peptide chain may associate specifically with the coding strand of its own gene. It was later thought that this might provide a rationale for the ability of a protein to regulate the transcription of its own gene. Mekler's original theory was supported by studies on antigen processing pathways. Specifically, an antibody-synthesizing RNA complex was found to bind to its antigen with high affinity (Fishman and Adler, 1967). Mekler contended that these results demonstrated the ability of a protein antigen to regulate its own synthesis by binding to the mRNA encoding the antibody (Mekler, 1969). As the binding between the active centre of the antibody and the antigenic determinant is well known to be based on associations of polypeptide chains, he proposed that two interacting polypeptides may be encoded in complementary strands of DNA (FIG. 11).

Mekler also analysed the proposed interacting regions of pancreatic ribonuclease A and recorded that reading the complementary RNA of one of the interacting chains in the 5'-3' direction yielded the sequence of the other interactant. From these observations he suggested that there existed a specific code of interaction between amino acid side chains encoded by complementary codons at the RNA level (EXAMPLE 2).

Collectively, these observations represented the first predictions of a sense-complementary peptide-binding complex.

• One key feature of Mekler's theory was that, due to the degeneracy of the genetic code, one amino acid may be complementarily related to as many as four others, allowing for a large variety of possible interacting sequences (EXAMPLE 2).

FURTHER THEORETICAL DEVELOPMENTS

In 1981, Mekler revised his original theory and described a 'general stereochemical genetic code' (Mekler and Idlis, 1981) in which it was reported that the complementary pairings detailed in the above table formed three distinct groupings (FIG. 12).

Mekler noted that, in general, amino acids with non-polar side chains were related by complementary code to amino acids with polar side chains. He did not provide an explanation for this.

Further theoretical considerations on the possibility of complementary-sense peptide recognition were independently developed by Biro (1981), Root-Bernstein (1982) and Blalock and Smith (1984). Biro (1981) conducted a computational comparison of DNA sequences encoding protein ligand-receptor segments and showed that there were many complementary regions between them, giving rise to complementary related polypeptides.

Blalock and Smith (1984) observed that the hydropathic character of an amino acid residue is related to the identity of the middle letter of the triplet codon from which it is translated. Specifically, a triplet codon with thymine (T) as its middle base codes for a hydrophobic residue whilst one with adenine (A) as its middle base codes for a hydrophilic residue. Triplet codons with cytosine (C) or guanine (G) as the middle base encode residues which are relatively neutral and have similar hydropathy scores. Hydropathy is an index of the affinity of an amino acid for a polar environment, hydrophilic residues yielding a more negative score, whilst hydrophobic residues exhibit more positive scores. Kyte and Doolittle (1982) conceived the most widely used scale of this type. The observed relationship between the middle base of a triplet codon and residue hydropathy entails that peptides encoded by complementary DNA will exhibit complementary, or inverted, hydropathic profiles.
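The middle-base relationship can be checked directly against the standard genetic code. The following is a minimal sketch, not part of the described software: it averages the Kyte and Doolittle hydropathy index over all sense codons sharing a middle base (so each amino acid is weighted by its number of codons, which is an assumption of this illustration). The function name mean_hydropathy_by_middle_base is hypothetical.

```python
from collections import defaultdict

# Standard genetic code, DNA alphabet, bases ordered T, C, A, G ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Kyte and Doolittle (1982) hydropathy index.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy_by_middle_base():
    """Average hydropathy of the residues encoded by codons with each middle base."""
    scores = defaultdict(list)
    for codon, residue in CODON_TABLE.items():
        if residue != "*":  # stop codons carry no hydropathy score
            scores[codon[1]].append(KD[residue])
    return {base: sum(values) / len(values) for base, values in scores.items()}


if __name__ == "__main__":
    for base, mean in sorted(mean_hydropathy_by_middle_base().items()):
        print(base, round(mean, 2))
```

Because base pairing swaps T with A and C with G at the middle position, a strongly positive average for T and a strongly negative one for A is what produces the inverted profile of a complementary peptide.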

It was proposed that because two peptide sequences encoded in complementary DNA strands display inverted hydropathic profiles, they may form amphipathic secondary structures and bind to one another (Bost et al., 1985).

Complementary peptides have been reported to form binding complexes with their 'sense' peptide counterparts (Root-Bernstein and Holsworthy, 1998). Evidence of such an interaction has now been reported for over forty different systems from many different authors (EXAMPLE 3).

The reports listed cite experiments showing specific interactions between complementary peptide pairs. As such they demonstrate a variety of ways in which these peptide ligands may be utilised. The scope of this analysis for explaining the interactions between proteins was further developed by Blalock to propose a Molecular Recognition Theory (MRT) (Bost et al., 1985, Blalock 1995, FIG. 13). This theory suggests that a 'molecular recognition' code of interaction exists between peptides encoded by complementary strands of DNA based on the observation that such peptides will exhibit inverted hydropathic profiles.

Blalock suggested that it is the linear pattern of amino acid hydropathy scores in a sequence (rather than the combination of specific residue identities) that defines the secondary structure environment. Furthermore, he suggested that sequences with inverted hydropathic profiles are complementary in shape by virtue of inverse forces determining their steric relationships.

DERIVING A COMPLEMENTARY PEPTIDE IN THE 3'-5' READING FRAME

As a corollary to his original work, Blalock contended that as well as reading a complementary codon in the usual 5'-3' direction, reading a complementary codon in the 3'-5' direction would also yield amino acid sequences that displayed opposite hydropathic profiles (Bost et al., 1985). This follows from the observation that the middle base of a triplet codon determines the hydropathy index of the residue it codes for, and thus reading a codon in the reverse direction may change the identity, but not the hydropathic nature of the coded amino acid (EXAMPLE 4).
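The argument rests on a simple property of triplets: reversing a three-letter codon leaves its middle letter in place. A short sketch, using hypothetical helper names antisense_53 and antisense_35 and the standard genetic code, confirms this for all 64 codons and shows one case where the two readings give different residues.

```python
# Standard genetic code, DNA alphabet, bases ordered T, C, A, G ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def antisense_35(codon):
    """Complementary codon read 3'->5' (base-by-base complement, no reversal)."""
    return codon.translate(COMPLEMENT)


def antisense_53(codon):
    """Complementary codon read 5'->3' (the reverse complement)."""
    return codon.translate(COMPLEMENT)[::-1]


# The middle base is the same in both readings for every codon.
same_middle = all(antisense_35(c)[1] == antisense_53(c)[1] for c in CODON_TABLE)

if __name__ == "__main__":
    codon = "CTT"  # one of the leucine codons
    print(same_middle)
    print(CODON_TABLE[codon],
          CODON_TABLE[antisense_53(codon)],
          CODON_TABLE[antisense_35(codon)])
```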

Statistical studies at the DNA level must take into account the degeneracy of the genetic code as it allows for the existence of larger inter- or intramolecular complementary sequences without maintaining complementarity at the DNA level. In this vein, recent work by Baranyi et al. (1995) details a new protein structural motif called the Antisense Homology Box (AHB). Following an analysis of a protein sequence data bank for possible intramolecular complementary pairs, it was noted that there are many more regions of peptide complementarity within the structures than statistically expected. The reported frequency of these motifs is, on average, one per fifty residues. AHB areas have already been shown to be able to act as molecular recognition sites by studies involving function inhibition with peptide complements. Specifically, the endothelin peptide (ET-1) was inhibited by a 14 amino acid fragment of the endothelin A receptor in a smooth muscle relaxation assay (Baranyi et al., 1996), whilst complementary encoded regions of the C5a receptor antagonize C5a anaphylatoxin (Baranyi et al., 1996). These studies suggest that many interactions in nature may result from contacts between complementary related polypeptides.

A MODEL OF RECOGNITION BASED ON HYDROPATHY

Several investigations have been directed at gaining an understanding of how hydropathic profiles and binding constants between complementary peptides are connected. The most comprehensive of these was carried out by Fassina et al. (1989) who studied the relationship between a complementary peptide designed on a computer to maximize complementary hydropathy against a thirteen-residue section of a glycoprotein. The study demonstrates a positive correlation between binding constants, as determined by an affinity binding column assay, and the degree of hydropathic complementarity, implying that a peptide's hydropathic character is inextricably linked to the binding mechanism.

This interesting result suggests that binding between two complementary related peptides is determined solely by the hydropathicity. Importantly, it also suggests that the steric nature of the side chain alone does not directly influence the ability of peptides to recognise each other, for, in general, residues with similar hydropathic character display a wide distribution of side chain shapes and sizes.

APPROACHES TO PREPARING COMPLEMENTARY PEPTIDES

The generation of a complementary peptide is straightforward in cases where the DNA sequence information is available. The complementary base sequence is read in either the 5'-3' or 3'-5' direction and translated to the peptide sequence according to the genetic code.
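When the sense DNA is known, the derivation reduces to a complement, an optional reversal, and a table lookup. The sketch below assumes the standard genetic code and an in-frame coding sequence whose length is a multiple of three; the function name complementary_peptide and the example sequence are illustrative only.

```python
# Standard genetic code, DNA alphabet, bases ordered T, C, A, G ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate an in-frame DNA string codon by codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def complementary_peptide(sense_dna, direction="5'-3'"):
    """Peptide encoded by the strand complementary to sense_dna.

    direction selects how the complementary strand is read:
    "5'-3'" reads the reverse complement, "3'-5'" reads the plain complement.
    """
    sense_dna = sense_dna.upper()
    if len(sense_dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    strand = sense_dna.translate(COMPLEMENT)  # written 3'->5', aligned to sense
    if direction == "5'-3'":
        strand = strand[::-1]
    elif direction != "3'-5'":
        raise ValueError("direction must be \"5'-3'\" or \"3'-5'\"")
    return translate(strand)


if __name__ == "__main__":
    sense = "ATGCTTGAA"  # hypothetical sense sequence encoding M-L-E
    print(translate(sense))
    print(complementary_peptide(sense, "5'-3'"))
    print(complementary_peptide(sense, "3'-5'"))
```

Note that the 5'-3' complementary peptide runs antiparallel to the sense peptide: its first residue pairs with the last sense codon.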

In the absence of knowledge of the nucleotide sequence of the sense peptide, many possible permutations of complementary sequences exist, in accordance with the degeneracy of the genetic code (as shown in EXAMPLES 2 and 4).

Several approaches to define complementary sequences in such instances have been proposed:

• One such approach makes a series of educated guesses based on the use of preferred codon usage tables (Aota et al., 1988), which allows one to assess the probability of a particular codon being used for each amino acid in a given sequence.

• Another approach, where applicable, is to assign to each amino acid the complementary residue that is the most frequent of all its theoretical complementary residues.

Thus, in a situation where the DNA sequence is unknown, the possible complementary amino acids for a leucine residue are glutamine (2 possible codons), stop (2 possible codons), glutamic acid (1 possible codon) and lysine (1 possible codon). In this case glutamine would be chosen on the basis of statistical weight. Information such as this, along with the use of codon usage tables, leads to a consensus approach to limiting the number of possible combinations of complementary sequences. Bost and Blalock (1989) and Shai et al. (1989) have employed methods of this type.
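The statistical-weight rule can be expressed as a count over the codons of the sense residue. The following sketch assumes the standard genetic code and the 5'-3' reading of the complementary strand, and it ignores stop codons when choosing a residue; codon usage tables are not modelled. The names complement_counts and most_frequent_complement are hypothetical.

```python
from collections import Counter

# Standard genetic code, DNA alphabet, bases ordered T, C, A, G ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement_counts(residue):
    """Count, per complementary residue, the codons of `residue` that yield it."""
    counts = Counter()
    for codon, encoded in CODON_TABLE.items():
        if encoded == residue:
            antisense = codon.translate(COMPLEMENT)[::-1]  # 5'->3' reading
            counts[CODON_TABLE[antisense]] += 1
    return counts


def most_frequent_complement(residue):
    """Complementary residue with the greatest codon count, stops excluded."""
    counts = complement_counts(residue)
    counts.pop("*", None)
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    print(dict(complement_counts("L")))
    print(most_frequent_complement("L"))
```

Where two residues tie, this sketch simply returns the first one encountered; a fuller treatment would break ties with codon usage data, as the consensus approach above suggests.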

A number of studies have demonstrated the value of this type of approach to designing peptides with real functional utility.

Although some very high affinities have been reported for these peptides (Kd ~ 10^-9 M), most are of moderate affinity (Kd ~ 10^-3 to 10^-7 M).

Their potential applications therefore would depend on the affinity attained in a particular system. Lower affinity complementary peptides may be useful for diagnostic tests or for purification of ligands. Higher affinity peptides may serve a purpose in the development of therapeutics; for example, a complementary peptide to a coat protein of a virus may interfere with the virus-host interaction at the molecular level, thus providing a strategy to manage this type of disorder.

Although the importance of inverted hydropathy in protein-protein interactions has long been recognized (Blalock and Smith, 1984), there has been little activity to apply this method on a large scale to investigate the complementary peptide partners of many proteins. One such attempt is recorded in the literature: "In the design of computer-based mining tools, no attention has been paid to a unique feature in the genetic code that determines the basic physico-chemical character of the encoded amino acids" (Kohler and Blalock, 1998). They proposed a method to scan DNA sequence banks using the hydropathic binary code, Patent US5523208. The method described differs from the current invention as outlined below.

• The current invention finds regions of potentially interacting amino acid sequences by using the relationships outlined in EXAMPLES 2 and 4. Patent US5523208 determines regions of potentially interacting peptides by an altogether different method, that of hydropathy scoring. The results of analyses are thus completely different.

• The process (algorithms) by which sequences are analysed is different in the current invention from that described in patent US5523208. In particular, the current invention describes different algorithms for the analysis of complementary regions between proteins, or within proteins.

PROBLEMS ADDRESSED BY THE INVENTION

The current problems associated with design of complementary peptides are:

• A lack of understanding of the forces of recognition between complementary peptides

• An absence of software tools to facilitate searching and selecting complementary peptide pairs from within a protein database

• A lack of understanding of statistical relevance/distribution of naturally encoded complementary peptides and how this corresponds to functional relevance

Based on these shortfalls, embodiments of the invention describe the following technological advances in this field:

• A mini library approach to define forces of recognition between human Interleukin (IL) 1β and its complementary peptides

• A high throughput computer system to analyse an entire database for intra/inter-molecular complementary regions

• A novel (computational) method of analyzing X-ray crystal files for potential discontinuous complementary binding sites.

THE INNOVATION

Studies into preferred complementary peptide pairings between IL-1β and its complementary ligand reveal the importance of both the genetic code and complementary hydropathy for recognition. Specifically, for our example, the genetic code for a region of protein codes for the complementary peptide with the highest affinity. An important observation is that this complementary peptide maps spatially and by residue hydropathic character to the interacting portion of the IL-1R receptor, as elucidated by the X-ray crystal structure Brookhaven reference pdb2itb.ent.

• Using these novel observations as guiding principles for analysis, we have developed a computational analysis system to evaluate the statistical and functional relevance of intra/inter-molecular complementary sequences.

This invention provides significant benefits for those interested in:

• The analysis and acquisition of peptide sequences to be used in the understanding of protein-protein interactions.

• The development of peptides or small molecules that could be used to manipulate these interactions.

The advantages of this invention over previous work in this field include:

• Using a valid statistical model. Previously, complementary mappings within protein structures have been statistically validated by assuming that the occurrence of individual amino acids is equally weighted at 1/20 (Baranyi et al., 1995). Our statistical model takes into account the natural occurrence of amino acids and thus generates probabilities dependent on sequence rather than content per se.

• Facilitation of batch searching of an entire database. Previously, investigations into the significance of naturally encoded complementary related sequences have been limited to small sample sizes with non-automated methods. The invention allows for analysis of an entire database at a time, overcoming the sampling problem, and providing for the first time an overview or 'map' of complementary peptide sequences within known protein sequences.

• The ability to map complementary sequences as a function of frame size and percentage antisense amino acid content. Previously, no consideration has been given to the significance of the frame length of complementary sequences. Our invention produces a statistical map as a function of frame size and percentage complementary residue content such that the statistical importance of how nature selects these frames may be evaluated.
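The contrast between an equal 1/20 weighting and a weighting by natural amino acid occurrence can be illustrated numerically. This section does not give the formulae of the invention's statistical model, so the sketch below is only an assumption-laden stand-in: residue frequencies are estimated from the two input sequences themselves, the antisense relation is taken from the 5'-3' reading of the standard genetic code, positions are treated as independent, and a binomial tail gives the chance of at least k antisense pairs in a frame of n. All function names are hypothetical.

```python
from collections import Counter
from math import comb

# Standard genetic code, DNA alphabet, bases ordered T, C, A, G ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Antisense residue pairs implied by the 5'->3' reading (stop codons ignored).
PAIRS = set()
for codon, residue in CODON_TABLE.items():
    partner = CODON_TABLE[codon.translate(COMPLEMENT)[::-1]]
    if "*" not in (residue, partner):
        PAIRS.add((residue, partner))

UNIFORM = {aa: 1 / 20 for aa in set(CODON_TABLE.values()) - {"*"}}


def composition(seq):
    """Relative frequency of each residue in seq."""
    counts = Counter(seq)
    return {aa: n / len(seq) for aa, n in counts.items()}


def pair_probability(freq1, freq2):
    """Chance that two independently drawn residues form an antisense pair."""
    return sum(freq1.get(a, 0.0) * freq2.get(b, 0.0) for a, b in PAIRS)


def frame_tail_probability(p, n, k):
    """Binomial probability of at least k antisense pairs in a frame of n."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    seq1, seq2 = "ATRGRDSRDERSDERTD", "GTFRTSREDSTYSGDTDFDE"
    p_uniform = pair_probability(UNIFORM, UNIFORM)
    p_weighted = pair_probability(composition(seq1), composition(seq2))
    for label, p in (("uniform", p_uniform), ("weighted", p_weighted)):
        print(label, p, frame_tail_probability(p, n=10, k=8))
```

Evaluating frame_tail_probability over a grid of n and k values gives a simple table of the kind of frame-size versus percentage-content map described in the last bullet.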

BRIEF DESCRIPTION OF DRAWINGS

The present invention is described with reference to accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements.

FIG. 1 shows a block diagram illustrating one embodiment of a method of the present invention;

FIG. 2 shows a block diagram illustrating one embodiment for carrying out Step 4 in FIG. 1;

FIG. 3 shows a block diagram illustrating one embodiment for carrying out Step 5 in FIG. 1;

FIG. 4 shows a block diagram illustrating one embodiment for carrying out Step 8 in FIG. 2 and 3;

FIG. 5 shows a block diagram illustrating one embodiment for carrying out Step 8 in FIG. 2 and 3;

FIG. 6 shows a block diagram illustrating one embodiment for carrying out Step 6 in FIG. 1;

FIG. 7 shows a block diagram illustrating a second embodiment of a method of the present invention;

FIG. 8 shows a block diagram illustrating one embodiment for carrying out Step 29 in FIG. 7;

FIG. 9 shows a block diagram illustrating one embodiment for carrying out Step 30 in FIG. 7;

FIG. 10 shows a diagram illustrating one embodiment of software design required to implement the ALS program;

FIG. 11 shows a diagram illustrating the principle of complementary peptide derivation;

FIG. 12 shows a diagram to illustrate antisense amino acid pairings inherent in the genetic code;

FIG. 13 shows a representation of the Molecular Recognition Theory;

FIG. 14 shows a graph and text illustrating biological data as an example of the utility of the ALS program.

A DESCRIPTION OF THE ANALYTICAL PROCESS OF THE INVENTION.

The software, ALS (antisense ligand searcher), performs the following tasks:

• Given the input of two amino acid sequences, calculates the position, number and probability of the existence of intra- (within a protein) and inter- (between proteins) molecular antisense regions. 'Antisense' refers to relationships between amino acids specified in EXAMPLES 2 and 4 (both 5'->3' derived and 3'->5' derived coding schemes).

• Allows sequences to be inputted manually through a suitable user interface (UI) and also through a connection to a database such that automated, or batch, processing can be facilitated.

• Provides a suitable database to store results and an appropriate interface to allow manipulation of this data.

• Allows generation of random sequences to function as experimental controls.
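The last task, generating random sequences as controls, is not specified further in this section. Two common ways of doing it are sketched below as assumptions rather than as the ALS implementation: shuffling a real sequence (which preserves its exact composition) and sampling residues from a frequency table. The function names and the use of a seeded random.Random are illustrative choices.

```python
import random


def shuffled_control(seq, rng):
    """Random permutation of seq: same length and composition, scrambled order."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def sampled_control(length, frequencies, rng):
    """Sequence of `length` residues drawn independently from `frequencies`."""
    residues = list(frequencies)
    weights = [frequencies[r] for r in residues]
    return "".join(rng.choices(residues, weights=weights, k=length))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so that a control set can be regenerated
    print(shuffled_control("ATRGRDSRDERSDERTD", rng))
    print(sampled_control(20, {"A": 0.5, "G": 0.3, "L": 0.2}, rng))
```

Running the same search on such controls gives a baseline count of antisense frames against which counts from real sequences can be compared.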

Diagrams describing the algorithms involved in this software are shown in FIG. 1-6.

DETAILED DESCRIPTION OF THE EMBODIMENTS

1. OVERVIEW

The present invention is directed toward a computer-based process, a computer-based system and/or a computer program product for analysing antisense relationships between protein or DNA sequences. A scheme of software architecture of a preferred embodiment is shown in FIG. 10.

The method of the embodiment provides a tool for the analysis of protein or DNA sequences for antisense relationships. This embodiment covers analysis of DNA or protein sequences for intramolecular (within the same sequence) antisense relationships or inter-molecular (between 2 different sequences) antisense relationships. This principle applies whether the sequence contains amino acid information (protein) or DNA information, since the former may be derived from the latter.

The overall process of the invention is to facilitate the batch analysis of an entire genome (collection of genes and/or protein sequences) for every possible antisense relationship of both inter- and intra-molecular nature. For the purpose of example it will be described here how a protein sequence database, SWISS-PROT (Bairoch and Apweiler, 1999), may be analysed by the methods described.

SWISS-PROT contains a list of protein sequences. The current invention does not specify the format in which the input sequences are held; for this example we used a relational database to allow access to this data.

The program runs in two modes. The first mode (Intermolecular) is to select the first protein sequence in SWISS-PROT and then analyse the antisense relationships between this sequence and all other protein sequences, one at a time. The program then selects the second sequence and repeats this process. This continues until all of the possible relationships have been analysed. The second mode (Intramolecular) is where each protein sequence is analysed for antisense relationships within the same protein, and thus each sequence is loaded from the database and analysed in turn for these properties. Both operational modes use the same core algorithms for their processes. The core algorithms are described in detail below.
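The two modes differ only in how sequence pairs are drawn from the database before being handed to the shared core algorithm. A minimal sketch follows, assuming the database has been read into a dictionary of identifier to sequence and that the core analysis is supplied as a callable; run_batch, the mode strings and the analyse parameter are hypothetical names, not the ALS interface.

```python
def intermolecular_pairs(database):
    """Each sequence in turn against every other sequence, one at a time."""
    for id1, seq1 in database.items():
        for id2, seq2 in database.items():
            if id1 != id2:
                yield (id1, seq1), (id2, seq2)


def intramolecular_pairs(database):
    """Each sequence paired with itself."""
    for seq_id, seq in database.items():
        yield (seq_id, seq), (seq_id, seq)


def run_batch(database, mode, analyse):
    """Apply `analyse(seq1, seq2)` to every pair selected by `mode`."""
    if mode == "intermolecular":
        pairs = intermolecular_pairs(database)
    elif mode == "intramolecular":
        pairs = intramolecular_pairs(database)
    else:
        raise ValueError("mode must be 'intermolecular' or 'intramolecular'")
    return {(a[0], b[0]): analyse(a[1], b[1]) for a, b in pairs}


if __name__ == "__main__":
    # Toy database; the identifiers are placeholders, not SWISS-PROT entries.
    db = {"P1": "ATRGRDSRDERSDERTD", "P2": "GTFRTSREDSTYSGDTDFDE", "P3": "LLQQ"}
    lengths = run_batch(db, "intermolecular", lambda s1, s2: (len(s1), len(s2)))
    print(len(lengths))
```

As written, the intermolecular mode visits each unordered pair twice, once in each order, mirroring the description above; whether the second visit is redundant depends on whether the core comparison is symmetric.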

An example of the output from this process is shown in EXAMPLES 5 and 6. EXAMPLE 5 shows a list of proteins in the SWISS-PROT database that contain highly improbable numbers of intramolecular antisense frames of size 10 (a frame is a section of the main sequence; it is described in more detail below). In EXAMPLE 5 the total number of antisense frames is shown. Another way of representing this data is to list the actual sequence information itself, as shown in EXAMPLE 6 and in the Sequence Listing (Seq ID Nos 1-32). An example of the biological relevance of peptides derived from this method is shown in FIG. 14. The embodiment can output the data in either of these formats as well as many others.

2. METHOD OF THE PRESENT INVENTION

For the purpose of example protein sequence 1 is ATRGRDSRDERSDERTD and protein sequence 2 is GTFRTSREDSTYSGDTDFDE (universal 1 letter amino acid codes used).

In step 1 (see FIG. 1), a protem sequence, Sequence 1, is loaded. The protein sequence consists of an array of universally recognised ammo acid one letter codes, e.g. 'ADTRGSRD'. The source of this sequence can oe a database, or any other file type. Step 2, is the same operation as for step 1, except Sequence 2 is loaded. Decision step 3 involves comparing the two sequences and determining whether they are identical, or whether they differ. If they differ, processing continues to step 4, described in FIG. 2, otherwise processing continues to step 5, described in FIG. 3.

Step 6 analyses the data resulting from either step 4, or step 5, and involves an algorithm described m FIG. 6. Description of parameters useo m FIG 2

In Step 7, a 'frame' is selected for each of the proteins selected in steps 1 and 2. A 'frame' is a specific section of a protein sequence. For example, for Sequence 1, the first frame of length '5' would correspond to the characters 'ATRGR'. The user of the program decides the frame length as an input value. This value corresponds to parameter (n) in FIG. 2. A frame is selected from each of the protein sequences (Sequence 1 and Sequence 2). Each pair of frames that is selected is aligned and the frame position parameter (f) is set to 0. The first pair of amino acids is 'compared' using the algorithm shown in FIG. 4 and 5. The score output from this algorithm (y, either 1 or 0) is added to an aggregate score for the frame (iS). In decision step 9 it is determined whether the aggregate score (iS) is greater than the Score Threshold value (x). If it is, then the frame is stored for further analysis. If it is not, then decision step 10 is implemented. In decision step 10, it is determined whether it is still possible for the frame to yield the Score Threshold (x). If it can, the frame processing continues and (f) is incremented such that the next pair of amino acids is compared. If it cannot, the loop exits and the next frame is selected.

The position from which the frame is selected in each protein sequence is determined by the parameter (ip1) for Sequence 1 and (ip2) for Sequence 2 (refer to FIG. 2). Each time steps 7 to 10 or 7 to 11 are completed, the value of (ip1) is zeroed and then incremented until all frames of Sequence 1 have been analysed against the chosen frame of Sequence 2. When this is done, (ip2) is incremented and (ip1) is again incremented until all frames of Sequence 1 have been analysed against the newly chosen frame of Sequence 2. This process repeats and terminates when (ip2) is equal to the length of Sequence 2. Once this process is complete, Sequence 1 is reversed programmatically and the same analysis as described above is repeated.

The overall effect of repeating steps 7 to 11 using each possible frame from both sequences is to facilitate step 8, the antisense scoring matrix for each possible combination of linear sequences at a given frame length.
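The frame-scanning loop of steps 7 to 11 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the program itself. The pairing rule `toy_pair` is a hypothetical stand-in for the FIG. 4 and 5 comparison. The loops stop at the last position where a whole frame still fits. A frame is stored as soon as its running score exceeds the threshold, which is a literal reading of decision step 9.

```python
def scan_frames(seq1, seq2, n, x, is_pair):
    """Compare every frame of length n in seq1 with every frame in seq2.

    is_pair(a, b) returns the score y (1 or 0). A frame pair is stored when
    its aggregate score iS exceeds the Score Threshold x.
    """
    hits = []
    for ip2 in range(len(seq2) - n + 1):        # frame start in Sequence 2
        for ip1 in range(len(seq1) - n + 1):    # frame start in Sequence 1
            score = 0                           # aggregate score iS
            for f in range(n):                  # frame position f
                score += is_pair(seq1[ip1 + f], seq2[ip2 + f])
                if score > x:                   # decision step 9
                    hits.append((ip1, ip2, score))
                    break
                remaining = n - f - 1
                if score + remaining <= x:      # decision step 10: threshold unreachable
                    break
    return hits

def scan_both_orientations(seq1, seq2, n, x, is_pair):
    """Repeat the scan with Sequence 1 reversed, as described above.

    Reverse hits are indexed against the reversed copy of seq1.
    """
    return {"forward": scan_frames(seq1, seq2, n, x, is_pair),
            "reverse": scan_frames(seq1[::-1], seq2, n, x, is_pair)}

# Hypothetical pairing rule for demonstration only (not the real coding scheme).
TOY_PAIRS = {("A", "S"), ("L", "K"), ("V", "D")}

def toy_pair(a, b):
    return 1 if (a, b) in TOY_PAIRS or (b, a) in TOY_PAIRS else 0

hits = scan_frames("AALV", "SKD", n=3, x=2, is_pair=toy_pair)
```

With `x = n - 1` every residue in the frame must pair, so the inner loop abandons a frame at its first mismatch.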

FIG. 3 shows a block diagram of the algorithmic process that is carried out under the conditions described in FIG. 1. Step 12 is the only difference between the algorithms in FIG. 2 and FIG. 3. In step 12, the value of (ip2) (the position of the frame in Sequence 2) is set to at least the value of (ip1) at all times since, as Sequence 1 and Sequence 2 are identical, if (ip2) is less than (ip1) then the same sequences are being searched twice.
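The saving from step 12 can be made concrete by counting frame-position pairs. The Python sketch below is illustrative only and its helper names are hypothetical. It enumerates all (ip1, ip2) combinations for the general case and only those with ip2 not less than ip1 for the self-comparison case.

```python
from itertools import combinations_with_replacement, product

def frame_starts(length, n):
    """All start positions at which a frame of length n fits."""
    return range(length - n + 1)

def all_position_pairs(length, n):
    """FIG. 2 style: every (ip1, ip2) combination."""
    return list(product(frame_starts(length, n), repeat=2))

def self_position_pairs(length, n):
    """FIG. 3 style: only pairs with ip2 >= ip1, so none is visited twice."""
    return list(combinations_with_replacement(frame_starts(length, n), 2))

# Example sequence 1 from this section has 17 residues; frame length 5.
full = all_position_pairs(17, 5)
restricted = self_position_pairs(17, 5)
```

For P possible frame positions, the restriction reduces the number of frame pairs from P squared to P(P+1)/2.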

FIG. 4 and 5 describe the process in which a pair of amino acids (FIG. 4) or a pair of triplet codons (FIG. 5) is assessed for an antisense relationship. The antisense relationships are listed in EXAMPLES 2 and 4. In step 13, the currently selected amino acid from the current frame of Sequence 1 and the currently selected amino acid from the current frame of Sequence 2 (determined by parameter (f) in FIG. 2 and 3) are selected. For example, the first amino acid from the first frame of Sequence 1 would be 'A' and the first amino acid from the first frame of Sequence 2 would be 'G'. In step 14, the ASCII character codes for the selected single uppercase characters are determined and multiplied and, in step 15, the product is compared with a list of pre-calculated scores, which represent the antisense relationships in EXAMPLES 2 and 4. If the amino acids are deemed to fulfil the criteria for an antisense relationship (the product matches a value in the pre-calculated list) then an output parameter (T) is set to 1; otherwise the output parameter is set to 0 (see FIG. 4).
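Steps 13 to 15 can be illustrated with a short Python sketch. As an assumption, the pairing list is derived here from the standard genetic code, pairing each codon with its reverse complement read 5'-3'. The pre-calculated list actually used by the program is the one given in EXAMPLES 2 and 4. All names below are illustrative.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one letter amino acid ('*' marks a stop).
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def antisense_pairs():
    """Amino acid pairs whose codons are reverse complements of each other."""
    pairs = set()
    for codon, aa in CODON_TABLE.items():
        partner = CODON_TABLE[codon.translate(COMPLEMENT)[::-1]]
        if aa != "*" and partner != "*":
            pairs.add((aa, partner))
    return pairs

PAIRS = antisense_pairs()
# Pre-calculated list of ASCII code products (step 15).
PRODUCTS = {ord(a) * ord(b) for a, b in PAIRS}

def antisense_score(a, b):
    """Steps 13-15: T = 1 if the ASCII product is in the list, else 0."""
    return 1 if ord(a) * ord(b) in PRODUCTS else 0

def partners(aa):
    return {b for a, b in PAIRS if a == aa}
```

For the example in the text, 'A' and 'G' give the product 65 × 71 = 4615. Note that two different letter pairs can in principle share a product, so a lookup on the (a, b) pair itself would be a stricter test than the product.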

Steps 16-21 relate to the case where the input sequences are DNA/RNA code rather than protein sequence. For example, Sequence 1 could be AAATTTAGCATG and Sequence 2 could be TTTAAAGCATGC. The domain of the current invention includes both of these types of information as input values, since the protein sequence can be decoded from the DNA sequence, in accordance with the genetic code. Steps 16-21 determine antisense relationships for a given triplet codon. In step 16, the currently selected triplet codon for both sequences is 'read'. For example, for Sequence 1 the first triplet codon of the first frame would be 'AAA', and for Sequence 2 this would be 'TTT'. In step 17, the second character of each of these strings is selected. In step 18, the ASCII codes are multiplied and compared, in decision step 19, to a list to find out if the bases selected are 'complementary', in accordance with the rules of the genetic code. If they are, the first bases are compared in step 20, and subsequently the third bases are compared in step 21. Step 18 then determines whether the bases are 'complementary' or not. If the comparison yields a 'non-complementary' value at any step the routine terminates and the output score (T) is set to 0. Otherwise the triplet codons are complementary and the output score (T) = 1.
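The codon comparison of steps 16 to 21 can be sketched as below. This Python example is a simplification under stated assumptions. It tests base complementarity with a lookup table rather than an ASCII product, it handles DNA letters only, and the `antiparallel` option is a hypothetical addition for reversed reading. The comparison order (second base, then first, then third) follows the text.

```python
BASE_PARTNER = {"A": "T", "T": "A", "G": "C", "C": "G"}

def codon_score(codon1, codon2, antiparallel=False):
    """Return T = 1 if the two triplets are complementary, else 0.

    Bases are compared in the order second, first, third. With
    antiparallel=True, base i of codon1 is compared with base 2 - i of
    codon2 (an assumption about how a reversed reading would be handled).
    """
    for i in (1, 0, 2):
        j = 2 - i if antiparallel else i
        if BASE_PARTNER.get(codon1[i]) != codon2[j]:
            return 0              # stop at the first non-complementary pair
    return 1

def codons(seq):
    """Split a DNA string into successive triplets, dropping any remainder."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

first1 = codons("AAATTTAGCATG")[0]   # first codon of example Sequence 1
first2 = codons("TTTAAAGCATGC")[0]   # first codon of example Sequence 2
t = codon_score(first1, first2)
```

Checking the second base first allows most non-complementary codon pairs to be rejected after a single comparison.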

FIG. 6 illustrates the process of rationalising the results after the comparison of 2 protein or 2 DNA sequences. In step 22, the first 'result' is selected. A result consists of information on a pair of frames that were deemed 'antisense' in FIG. 2 or 3. This information includes location, length, score (i.e. the sum of scores for a frame) and frame type (forward or reverse, depending on the orientation of the sequences with respect to one another). In step 23, the frame size, the score values and the length of the parent sequence are then used to calculate the probability of that frame existing. The statistics which govern the probability of any frame existing are described in the next section and refer to equations 1-4. If the probability is less than a user chosen value (p), then the frame details are 'stored' for inclusion in the final result set (step 24).

STATISTICAL BASIS OF PROGRAM OPERATION

The number of complementary frames in a protein sequence can be predicted from appropriate use of statistical theory.

The probability of any one residue fitting the criteria for a complementary relationship with any other is defined by the groupings illustrated in EXAMPLE 2. Thus, depending on the residue in question, there are varying probabilities for the selection of a complementary amino acid. This is a result of an uneven distribution of possible partners. For example, possible complementary partners for a tryptophan residue include only proline, whilst glycine, serine, cysteine and arginine all fulfil the criteria as complementary partners for threonine. The probabilities for these residues aligning with a complementary match are thus 0.05 and 0.2 respectively. The first problem in fitting an accurate equation to describe the expected number of complementary frames within any sequence is integrating these uneven probabilities into the model. One solution is to use an average value weighted by the relative abundance of the different amino acids in natural sequences. This is calculated by (equation 1):

Where (v) = probability sum, (R) = fractional abundance of amino acid in E. coli proteins, (N) = number of complementary partners specified by the genetic code.

This value (v) is calculated as 2.98. The average probability (p) of selecting a complementary amino acid is thus 2.98/20 = 0.149.
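The body of equation 1 is not legible in this text. One form that agrees with the symbol definitions above and with the quoted values is the abundance-weighted sum of partner counts over the 20 amino acids. It is offered here only as an assumption:

```latex
v = \sum_{i=1}^{20} R_i N_i \approx 2.98,
\qquad
p = \frac{v}{20} = \frac{2.98}{20} = 0.149
```

Under this reading a residue with N_i partners has an individual pairing probability of N_i/20, which matches 1/20 = 0.05 for tryptophan and 4/20 = 0.2 for threonine.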

For a single 'frame' of size (n) the probability (C) of pairing a number of complementary amino acids (r) can be described by the binomial distribution (equation 2):

$$C = \frac{n!}{(n-r)!\,r!}\,p^{r}(1-p)^{n-r} \qquad (2)$$
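The binomial form, C = n!/((n-r)! r!) p^r (1-p)^(n-r), can be evaluated directly. The Python sketch below uses the average probability p = 0.149 quoted above. The function name is illustrative.

```python
from math import comb

def frame_probability(n, r, p=0.149):
    """Probability that exactly r of n aligned residue pairs are complementary."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

# Frame of size 10 in which every residue must pair: reduces to p**10.
full_match = frame_probability(10, 10)

# Frame of size 10 in which at least 8 residues pair.
at_least_8 = sum(frame_probability(10, r) for r in range(8, 11))
```

When r equals n the binomial coefficient and the (1 - p) term both become 1, so the probability reduces to p raised to the power n.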

With this information we can predict the expected number (Ex) of complementary frames in a protein to be (equation 3):

Where (S) = protein length, (n) = frame size, (r) = number of complementary residues required for a frame and (p) = 0.149. If (r) = (n), representing that all amino acids in a frame have to fulfil a complementary relationship, the above equation simplifies to (equation 4):

For a population of randomly assembled amino acid chains of a predetermined length we would expect the number of frames fulfilling the complementary criteria in the search algorithm to vary in accordance with a normal distribution.

Importantly, it is possible to standardise results such that, given a calculated mean (μ) and standard deviation (σ) for a population, it is possible to determine the probability of any specific result occurring. Standardisation of the distribution model is facilitated by the following relation (equation 5):

$$Z = \frac{X - \mu}{\sigma} \qquad (5)$$

Where (X) is a single value (result) in a population and (Z) is the corresponding standardised value.
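Standardisation and the resulting tail probability can be sketched in Python with the standard library. The frame counts below are hypothetical numbers chosen for illustration, and the function names are not taken from the described software.

```python
from statistics import NormalDist, mean, pstdev

def standardise(x, mu, sigma):
    """Number of standard deviations between the result x and the mean."""
    return (x - mu) / sigma

def upper_tail_probability(x, mu, sigma):
    """Chance of a result at least as large as x under a normal model."""
    return 1.0 - NormalDist().cdf(standardise(x, mu, sigma))

# Hypothetical frame counts for a population of random chains.
counts = [3, 5, 4, 6, 2, 4, 5, 3]
mu, sigma = mean(counts), pstdev(counts)

# A protein showing 9 frames, compared against that population.
z = standardise(9, mu, sigma)
p_value = upper_tail_probability(9, mu, sigma)
```

A small tail probability marks a result as improbable under the random model, which is the basis on which frames are retained.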

If we are considering complementary frames within a single protein structure then the above statistical model requires further analysis. In particular, the possibility exists that a region may be complementary to itself, as indicated in the diagram below.

Reverse turn motifs within proteins. A region of protein may be complementary to itself. In this scenario, A-S, L-K and V-D are complementary partners. A six amino acid wide frame would thus be reported (in reverse orientation). A frame of this type is only specified by half of the residues in the frame. Such a frame is called a reverse turn. In this scenario, once half of the frame length has been selected with complementary partners, there is a finite probability that those partners are the sequential neighbouring amino acids to those already selected. The probability of this occurring in any protein of any sequence is (equation 6):

$$Ex = p^{f/2}(S - f) \qquad (6)$$

Where (f) is the frame size for analysis, (S) is the sequence length and (p) is the average probability of choosing an antisense amino acid.
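A reverse turn can be recognised by checking a frame against its own mirror image. The Python sketch below uses only the partner set named in the scenario above (A-S, L-K, V-D). The function names and the restriction to even frame lengths are assumptions made for illustration.

```python
def is_reverse_turn(frame, is_pair):
    """True if residue i pairs with residue (len - 1 - i) across the frame.

    Only the first half is checked, since the second half is its mirror
    image. Odd lengths are rejected here because the middle residue would
    have to pair with itself (an assumption, not stated in the text).
    """
    if len(frame) % 2:
        return False
    half = len(frame) // 2
    return all(is_pair(frame[i], frame[-1 - i]) for i in range(half))

# Partner set taken from the scenario described above.
SCENARIO = {frozenset("AS"), frozenset("LK"), frozenset("VD")}

def scenario_pair(a, b):
    return frozenset((a, b)) in SCENARIO

six_wide = is_reverse_turn("ALVDKS", scenario_pair)
```

Because only half of the residues are free choices, such a frame is much more likely to arise by chance than an ordinary frame of the same width.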

The software of the embodiment incorporates all of the statistical models reported above such that it may assess whether a frame qualifies as a forward frame, reverse frame, or reverse turn.

ANTISENSE X-RAY STRUCTURE ANALYSIS (AXRA) SOFTWARE

Currently over 20 prokaryote and 1 eukaryote genomes have been completely sequenced and more than 3 times that number are in progress or nearing completion, including the human genome. The wealth of information generated is providing the foundation for an important new initiative in structural biology. Protein fold assignment and homology modelling of related protein structures have become important research tools, providing structural insights for many different areas of biology and medicine (Burley et al., 1999). At present, however, despite large-scale protein structure analyses, only a fraction of a protein can usually be modelled, e.g. 18% of all residues, or domains, in yeast proteins.

"The obvious solution to this problem is to obtain complete three-dimensional structural information for each distinct protein fold. De novo prediction of a protein structure from its sequence is simply not feasible at present", Burley et al. 1999.

The current invention provides a novel method for aiding the determination of three-dimensional structure. This software performs the following tasks:

• Reads an X-Ray structure file;

• Determines regions of complementary hydropathy and/or antisense pairings in 3D space, between:

1) 2 discontinuous protein sequences

2) 1 discontinuous and 1 linear protein sequences

3) 2 linear protein sequences.

INVENTIVE ASPECT OF SOFTWARE

The observation that many receptor-ligand contact points within the IL-1β/IL-1R X-ray crystal structure involve an interchange of residues of opposite polarity suggests that this may represent a general principle of protein contact points. In this vein, AXRA was designed to analyse X-ray data for regions of complementary hydropathy and/or antisense relationships between proximal residues. This software confers significant advantages in:

• Prediction of tertiary and quaternary protein structures;

• Prediction of intermolecular contact points.

AXRA overcomes previous limitations of analysing protein sequences for antisense interactions by recognising for the first time that antisense pairings also exist in discontinuous regions of proteins, and thus antisense sequence searching can be expanded to 3 dimensional structures.

PROGRAM OPERATION

In overview, the program functions by:

• Reading an X-ray data file;

• Calculating which sets of residues, or 'frames' of user defined length, represent the greatest area of complementary hydropathy and/or antisense relationships.

User options allow control over searching parameters such as frame length, minimum distance for partner and number of neighbouring residues from the same chain to exclude from analysis.

Description of parameters used in AXRA process

Decision steps 25 to 30 are shown in FIG. 7. In step 25, the program reads a file containing the cartesian x, y, z co-ordinates of a protein structure and these are stored by conventional programmatic means (step 26). The protein sequence (1 letter amino acid codes) is also read from this file and stored in memory as an array of characters. In step 27 the distances between each alpha-carbon atom (denoted CA in Brookhaven databank format) and all other carbon atoms that make up each amino acid (CB, c1, c2, cn) are calculated by vector mathematics from the cartesian co-ordinates. The program user chooses (through the UI) which atom types (e.g. CB, c1 etc.) are used in the calculation of the distances between two amino acids. The (x) closest amino acids for each residue are stored for further analysis. The value (x), the number of nearest amino acids to interrogate, is provided by the user from a suitable user interface (UI). For each amino acid in the protein structure we now have a list of proximal amino acids within distances (mind) and (maxd) between any carbon atoms that constitute the structure of that residue. The default maximum distance in this process is 15 angstroms; if fewer than (x) amino acids fall within this distance then only those within this distance will be stored. The user may change this value through the UI. This list is known as the Nearest Neighbour Sphere (NNS). In decision step 28, the program flow follows the user's choice (input through the UI) as to whether the analysis should be based on hydropathy (step 29) or on antisense relationships (step 30).
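Construction of the Nearest Neighbour Sphere can be sketched in Python as follows. As a simplifying assumption, each residue is represented by one co-ordinate triple (for example its CA or CB atom) instead of every carbon atom. The function name and the `exclude` parameter for skipping sequence neighbours are illustrative.

```python
from math import dist

def nearest_neighbour_sphere(coords, x, maxd=15.0, mind=0.0, exclude=0):
    """For each residue, list up to x closest residues within [mind, maxd].

    coords  -- one (px, py, pz) tuple per residue, in angstroms
    exclude -- number of sequence neighbours on either side to skip
    Returns a list (one entry per residue) of (distance, index) tuples,
    nearest first.
    """
    nns = []
    for i, a in enumerate(coords):
        found = []
        for j, b in enumerate(coords):
            if abs(i - j) <= exclude:     # also skips the residue itself
                continue
            d = dist(a, b)
            if mind <= d <= maxd:
                found.append((d, j))
        nns.append(sorted(found)[:x])
    return nns

# Hypothetical co-ordinates for four residues.
toy_coords = [(0, 0, 0), (3, 0, 0), (20, 0, 0), (0, 4, 0)]
toy_nns = nearest_neighbour_sphere(toy_coords, x=2)
```

Residues with fewer than x neighbours inside the maximum distance simply receive a shorter list, as described above.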

Decision steps 31 to 35 are shown in FIG. 8. In step 31, the antisense relationships between the first amino acid in the protein sequence (stored in step 25) and the list of amino acids stored as the nearest neighbour sphere (NNS) are determined. (Programmatically, the NNS is a list of arrays, with one array for each position in the protein sequence.) To do this, each amino acid in the sequence is selected in turn and compared with each member of its NNS (stored in step 27) using the algorithm depicted in FIG. 5. If none of the NNS members for a particular amino acid shows an antisense relationship (i.e. an output value of 1 from FIG. 5) then a zero value is scored at this position in a Result Array (R); otherwise the details (sequence index) of the closest amino acid fulfilling an antisense relationship are stored in the Result Array (R) for further analysis. The user may specify input values determining the maximum (maxd) and minimum (mind) distances that antisense relationships must fall within to be accepted. This process is repeated for all amino acids in the protein sequence, generating a Result Array (R) containing the sequence indexes of all amino acids that fulfil an antisense criterion within the NNS. The overall process here is to define which proximal amino acids have antisense relationships.
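The construction of the Result Array (R) in step 31 can be sketched as below. This Python example assumes the NNS is supplied as a list of (distance, index) tuples sorted nearest first. It uses 1-based sequence indexes so that zero can keep its stated meaning of 'no antisense partner'. The pairing rule in the usage lines is a hypothetical stand-in.

```python
def result_array(sequence, nns, is_pair, mind=0.0, maxd=15.0):
    """For each residue, record its closest antisense neighbour.

    nns[i] holds (distance, index) tuples for residue i, nearest first.
    Entries of R are 1-based sequence indexes; 0 means that no neighbour
    within [mind, maxd] has an antisense relationship with the residue.
    """
    R = []
    for i, residue in enumerate(sequence):
        partner = 0
        for d, j in nns[i]:
            if mind <= d <= maxd and is_pair(residue, sequence[j]):
                partner = j + 1
                break                      # nearest qualifying neighbour wins
        R.append(partner)
    return R

# Hypothetical three-residue structure; only W and P are treated as partners.
toy_sequence = "WAP"
toy_nns = [[(3.0, 1), (5.0, 2)], [(3.0, 0)], [(5.0, 0)]]
R = result_array(toy_sequence, toy_nns, lambda a, b: {a, b} == {"W", "P"})
```

The resulting array is the input to the frame searches of steps 33 to 35.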

Decision step 32 routes the user's selection (from the UI) of whether to find regions of antisense relationships between 2 continuous parts of the same sequence (step 33), 1 continuous and 1 discontinuous part of the same sequence (step 34) or 2 discontinuous parts of the same sequence (step 35).

In step 33, the first 'frame' of length (n) of the protein sequence is selected. The frame is a section of the total sequence, and the length of this frame (n) is chosen by the user through the UI. Also chosen through the UI is a Score Threshold (ST) parameter. For each amino acid in this frame the NNS is analysed. If any continuous combinations of antisense relationships within the NNS are found where the aggregate Score (S) is greater than the user chosen Score Threshold (ST) then the amino acids' sequence locations are stored as a 'hit' frame. This is repeated for each frame in the protein sequence. When the process has finished the 'hit frame' results are then listed in an appropriate UI format.

In step 34, the first 'frame' of length (n) of the protein sequence is selected. The frame is a section of the total sequence, and the length of this frame (n) is chosen by the user through the UI. Also chosen through the UI is a Score Threshold (ST) parameter. For each amino acid in each frame the NNS is analysed. If any discontinuous combinations of antisense relationships within the NNS are found where the aggregate score (S) is greater than the user chosen Score Threshold (ST) then the amino acids' sequence locations are stored as a 'hit' frame. This is repeated for each frame of the protein sequence. When the process has finished the 'hit frame' results are then listed in an appropriate UI format.

In step 35, the first amino acid of the protein sequence is selected. The list of antisense relationships determined in step 31 is listed in an appropriate UI format.

Decision steps 36 to 40 are shown in FIG. 9. In step 40, the hydropathic comparison scores between the first amino acid in the protein sequence (stored in step 25) and the list of amino acids stored as the nearest neighbour sphere (NNS) are determined using the following equation:

Where (a₁) and (a₂) are the hydropathy scores of the amino acids selected, as scored on the Kyte and Doolittle scale (Kyte and Doolittle, 1982). This equation is evaluated for each pair of amino acids specified by the currently selected amino acid and its partners in the NNS and the resulting (H) values are stored.

The user may specify input values determining the maximum (maxd) and minimum (mind) distances that relationships must fall within to be processed further. This process is repeated for all amino acids in the protein sequence. The overall process here is to define the hydropathic relationships between proximal amino acids. Programmatically, we end up with a list of arrays where each array contains a list of hydropathic scores for amino acids neighbouring the amino acid specified by the index in the main list. This list of arrays (LR) is then used for steps 37, 38 or 39.

Decision step 36 routes the user's selection (from the UI) of whether to find regions of complementary hydropathy between 2 continuous parts of the same sequence (step 37), 1 continuous and 1 discontinuous part of the same sequence (step 38) or 2 discontinuous parts of the same sequence (step 39).

In step 37, the frame is a section of the total sequence, and the length of this frame (n) is chosen by the user through the UI. Also chosen through the UI is a Hydropathy Score Threshold (HST) parameter. The first 'frame' of length (n) of the protein sequence is selected. In this first frame the first amino acid is selected. The LOWEST value of the list of hydropathy scores formed in step 40 is taken and written to a Result Frame (RF). (The sequence indexes of the amino acids that are responsible for the lowest scores are written to another list (SL) such that a link between amino acid location and hydropathy is created.) This is repeated for each amino acid in the frame until we have a completed Result Frame (RF) that contains a list of the lowest hydropathy scores available for the specified amino acids. The average hydropathy for this frame is then determined by the following (equation 8):

Where (H) is defined in the equation above and (L) is the frame length, denoting the length of the amino acid sequence that is used for the comparison. The lower the score (Ω), the greater the degree of hydropathic complementarity for the defined region.

Once the average hydropathy score is calculated, if that score is LOWER than the (HST) parameter the sequence indexes of the amino acids that were responsible for the hydropathy values used in equation 10 are analysed for continuity (i.e. are these amino acids continuous, such as position 10, position 11, position 12 etc.). If continuity is found, the frame is stored for further analysis.

This is repeated for each frame of the protein sequence (i.e. of frame length 7, 1-7, 2-8, 3-9 etc). When the process has finished the results are then listed in an appropriate UI format.
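The step 37 search can be sketched in Python as follows. Several points are assumptions and are not taken from the text: the pair score H is modelled as the absolute value of the sum of the two Kyte and Doolittle hydropathy values (small when the hydropathies are opposite), the frame score is the mean of the lowest H values, a run of partner indexes counts as continuous whether it ascends or descends, and the NNS is given as plain lists of neighbour indexes. All function names are illustrative.

```python
# Kyte and Doolittle (1982) hydropathy values.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def pair_score(a1, a2):
    """ASSUMED form of H: |a1 + a2|, near zero for opposite hydropathies."""
    return abs(KYTE_DOOLITTLE[a1] + KYTE_DOOLITTLE[a2])

def frame_result(sequence, nns, start, n):
    """Build RF (lowest H per residue) and SL (index of the partner used)."""
    rf, sl = [], []
    for i in range(start, start + n):
        h, j = min((pair_score(sequence[i], sequence[j]), j) for j in nns[i])
        rf.append(h)
        sl.append(j)
    return rf, sl

def is_continuous(indexes):
    """True when the partner indexes run consecutively (up or down)."""
    steps = {b - a for a, b in zip(indexes, indexes[1:])}
    return steps in ({1}, {-1})

def continuous_hits(sequence, nns, n, hst):
    """Frames whose average score is below HST and whose partners are continuous."""
    hits = []
    for start in range(len(sequence) - n + 1):
        if any(not nns[i] for i in range(start, start + n)):
            continue                        # a residue has no neighbours
        rf, sl = frame_result(sequence, nns, start, n)
        omega = sum(rf) / n                 # average hydropathy score
        if omega < hst and is_continuous(sl):
            hits.append((start, sl, omega))
    return hits

# Hypothetical six-residue chain: residues 0-2 lie next to residues 3-5.
toy_seq = "IAFRPH"
toy_nns = [[3, 4, 5]] * 3 + [[0, 1, 2]] * 3
toy_hits = continuous_hits(toy_seq, toy_nns, n=3, hst=0.5)
```

Dropping the `is_continuous` test gives the behaviour described for the discontinuous search, where every frame below the threshold is reported.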

In step 39, the frame is a section of the total sequence, and the length of this frame (n) is chosen by the user through the UI. Also chosen through the UI is a Hydropathy Score Threshold (HST) parameter. The first 'frame' of length (n) of the protein sequence is selected. In this first frame the first amino acid is selected. The LOWEST value of the list of hydropathy scores formed in step 40 is taken and written to a Result Frame (RF). (The sequence indexes of the amino acids that are responsible for the lowest scores are written to another list (SL) such that a link between amino acid location and hydropathy is created.) This is repeated for each amino acid in the frame until we have a completed Result Frame (RF) that contains a list of the lowest hydropathy scores available for the specified amino acids. The average hydropathy for this frame is then determined by equation 8.

Once the average hydropathy score is calculated, if that score is LOWER than the (HST) parameter the sequence indexes of the amino acids that were responsible for the hydropathy values used in equation 10 are stored in a suitable programmatic container to display as results.

This is repeated for each frame of the protein sequence (i.e. of frame length 7, 1-7, 2-8, 3-9 etc.). When the process has finished the results are then listed in an appropriate UI format.

In step 38, all hydropathic relationships (equation 8) between each amino acid and its NNS counterparts are written out to a display for further analysis. The program flow is illustrated in FIG. 7.

SPECIFIC EXAMPLE OF AXRA OUTPUT

The software was used to select regions of complementary hydropathy within the IL-1β/IL-1R crystal structure. The program was run on the X-ray file (pdb2itb) and selected the most complementary region between the ligand and receptor as consisting of residues 47-54 of IL-1β (sequence QGEESND) and residues 245, 244, 303, 298, 242, 249, 253 of the receptor (sequence W, S, V, I, G, Y, I). This demonstrates two things. Firstly, it shows that the software functions properly, in that it can locate regions of hydropathic complementarity between a receptor-ligand pair. Secondly, it proves that the region of IL-1β which has the closest residues of greatest hydropathic inversion to the IL-1 type I receptor is the trigger loop region of IL-1β, to which we have previously designed antisense peptides. The receptor-ligand contact pairs analysed by the software as displaying the largest difference in hydropathic indices are illustrated below.

Region of complementary hydropathy within the X-ray crystal structure of IL-1β complexed with its type I receptor (pdb file 2itb). C-alpha traces of the proteins are displayed with regions picked out by the 3D-hydropathy map tool highlighted in white.

UTILITY OF THE INVENTION

This invention presents a novel informatics technology that greatly accelerates the pace for initial identification and subsequent optimization of small peptides that bind to protein-protein targets. Using this technology an operator can systematically produce large numbers or 'catalogues' of small peptides that are very useful and specific agonists/antagonists of protein-protein interactions.

These peptides are ideally suited for use in drug discovery programs as biological tools for probing gene function, or as a basis for configuring drug discovery screens or as a molecular scaffold for medicinal chemistry. In addition, peptides with a high affinity for a protein could form drugs in their own right.

Finally, these peptides are amenable to dramatic further improvement through various methods in addition to traditional medicinal chemistry.

EXAMPLE 1

Protein and nucleotide sequence databases amenable for analysis using the invention

Major Nucleic acid databases

Major Protein Sequence databases

EXAMPLE 2

The amino acid pairings resulting from reading the anticodon for naturally occurring amino acid residues in the 5'-3' direction.

EXAMPLE 3

Literature regarding generation of complementary peptides with biological effects

System tested                   Reference index

ACTH                            Bost et al. (1985)
Anaphylatoxin C5a               Baranyi et al. (1996)
Angiogenin                      Gho et al. (1997)
Angiotensin II                  Elton et al. (1988), Soffer et al. (1987)
Arginine vasopressin            Johnson et al. (1988)
                                Lu et al. (1991)
γ-endorphin                     Shahabi et al. (1992)
Big Endothelin                  Fassina et al. (1992a)
Bradykinin                      Fassina et al. (1992b)
Calcium mimetic peptide         Dillon et al. (1991)
c-Raf protein                   Fassina et al. (1989a)
Cystatin                        Ghiso et al. (1990)
Dopamine receptor               Nagy et al. (1991)
Enkephalin                      Carr et al. (1989)
Fibrinogen                      Pasqualini et al. (1989),
                                Gartner et al. (1991a)
Fibronectin                     Brentani et al. (1988)
γ-Endorphin                     Carr et al. (1986)
Gastrin terminal peptide        McGuigan et al. (1992),
                                Jones (1972)
GH-RH                           Grosvenor and Balint (1989)
Idiotypic antibodies            Bost and Blalock (1989)
Insulin                         Knutson (1988)
Integrin                        Derrick et al. (1997)
Interferon β                    Johnson et al. (1982)
Interferon γ                    Scalpol et al. (1992)
Interleukin 2                   Weigent et al. (1986)
                                Fassina et al. (1995)
Laminin receptor                Castronovo, V et al. (1991)
LH-RH                           Mulchahey et al. (1986)
Melanocyte stimulating hormone  Al-Obeidi, F. A. et al. (1990)
Mosquito oostatic receptor      Borovsky et al. (1994)
Myelin protein antibody         Zhou et al. (1994)
Nicotinic receptor              Radding et al. (1992)
Neurophysin II                  Fassina et al. (1989b)
Ovine prolactin                 Bajpai et al. (1991)
Opiate receptor                 Carr et al. (1987)
Prion protein                   Martins et al. (1997)
Ribonuclease S peptide          Shai et al. (1989)
Somatostatin                    Campbell-Thompson (1993)
Substance P                     Bret-Dibat et al. (1994)
T15 autoreactive antibody       Kang et al. (1988)
Vasopressin 1 receptor          Kelly et al. (1990)
Vitronectin                     Gartner et al. (1991b)

EXAMPLE 4

The relationships between amino acids and the residues encoded in the complementary strand reading 3'-5'

EXAMPLE 5

Examples of proteins to which complementary peptides can be identified by Antisense Ligand Searcher (ALS) in the SWISS-PROT database

Frame Size 10: Swiss-Prot DB: 50 significant proteins

Accession No.    Description                   Length   No.  No.RF   No.  Total  Ex()

SHEEP (P50415)   BACTENECIN 7 PRECURSOR           190     8      8     4     16  9.89E-05
CHICK (Q98937)   TRANSCRIPTION FACTOR BF-2        440    22     26     0     48  0.00053
HUMAN (P55316)   TRANSCRIPTION FACTOR BF-2        469    12      4     0     16  0.000603
MOUSE (Q61345)   TRANSCRIPTION FACTOR BF-2        456    22     18     1     40  0.00057
HUMAN (Q12837)   BRAIN-SPECIFIC HOMEOBOX          410    40     53     1     93  0.000461
MOUSE (Q63934)   BRAIN-SPECIFIC HOMEOBOX          411   108    127     1    235  0.000463
HUMAN (P2026)    BRAIN-SPECIFIC HOMEOBOX          500   102    103     1    205  0.000685
MOUSE (P31361)   BRAIN-SPECIFIC HOMEOBOX          495    82     83     1    165  0.000671
DROME (Q24266)   TRANSCRIPTION FACTOR BTD         644    28     32     0     60  0.001136
GVCL (P 1726)    DNA-BINDING PROTEIN               58    48     54    10    102  9.22E-06
HUMAN (P02452)   PROCOLLAGEN ALPHA 1(I)          1464     6     62     4     68  0.005873
HUMAN (P02458)   PROCOLLAGEN ALPHA 1(II)         1418     6     22     2     28  0.005509
MOUSE (P28 81)   PROCOLLAGEN ALPHA 1(II)         1459     8     31     3     39  0.005833
BOVIN (P04258)   COLLAGEN ALPHA 1(III)           1049     8     17     3     25  0.003015
HUMAN (P02461)   PROCOLLAGEN ALPHA 1(III)        1466     8     28     2     36  0.005889
BOVIN (Q28083)   COLLAGEN ALPHA 1(XI) CHAIN       911     8      4     0     12  0.002274
MOUSE (Q01149)   PROCOLLAGEN ALPHA 2(I)          1373    14     84     4     98  0.005165
MOUSE (Q99020)   CARG-BINDING FACTOR-A            285     6      9     3     15  0.000223
HUMAN (P22681)   PROTO-ONCOGENE C-CBL             906    12      2     0     14  0.002249
HUMAN (Q13319)   CYCLIN-DEPENDENT KINASE 5        367     6      1     1      7  0.000369
DROME (P17970)   VOLTAGE-GATED POTASSIUM CH       924   162    162    18    324  0.002339
DROME (Q02280)   POTASSIUM CHANNEL PROTEIN       1174    38     63     9    101  0.003776
RAT (Q09167)     INSULIN-INDUCED GROWTH           269    32     42     8     74  0.000198
CHICK (Q90611)   72 KD TYPE IV COLLAGENASE        663     8      8     4     16  0.001204
HPBVF (P29178)   CORE ANTIGEN                     195    14     15     3     29  0.000104
DROME (P32027)   FORK HEAD DOMAIN PROTEIN         508    50     40     0     90  0.000707
CRYPA (P52753)   CRYPARIN PRECURSOR               118    18     18     6     36  3.82E-05
CANFA (P30803)   ADENYLATE CYCLASE, TYPE V       1184    55     34     3     89  0.003841
RABIT (P 01 4)   ADENYLATE CYCLASE, TYPE V       1264    25     21     3     46  0.004378
DICDI (P54639)   CYSTEINE PROTEINASE 4            442    84     82    14    166  0.000535
ORYSA (P22913)   DEHYDRIN RAB 16D                 151     8     22     0     30  6.25E-05
ORYSA (P12253)   WATER-STRESS INDUCIBLE           163    14     12     4     26  7.28E-05
RAPSA (P21298)   LATE EMBRYOGENESIS               184    24     20     0     44  9.28E-05
DROME (P23792)   DISCONNECTED PROTEIN             568    27     28     8     55  0.000884
DROME (Q24563)   DOPAMINE RECEPTOR 2              539     8      0     0      8  0.000796
DICDI (Q04503)   PRESPORE PROTEIN DP87            555    22     17     1     39  0.000844
DROME (P23022)   DOUBLESEX PROTEIN                427    56     68     0    124  0.0005
DROME (P23023)   DOUBLESEX PROTEIN, MALE-         549    70     88     0    158  0.000826
DROME (Q27368)   TRANSCRIPTION FACTOR E2F         805     6     11     1     17  0.001776
DROME (P20105)   ECDYSONE-INDUCED PROTEIN 7       829     8      5     1     13  0.001883
DROME (P11536)   ECDYSONE-INDUCED PROTEIN 7       883    80     83     1    163  0.002136
EBV (P12978)     EBNA-2 NUCLEAR PROTEIN           487   174    178     0    352  0.00065
HUMAN (P18146)   EARLY GROWTH RESPONSE            543    12     17     1     29  0.000808
MOUSE (P49749)   HOMEOBOX EVEN-SKIPPED            475   223    208     0    431  0.000618
HUMAN (Q12947)   FORKHEAD-RELATED TRANSCR..       408    10     20     0     30  0.000456
HUMAN (Q16676)   FORKHEAD-RELATED TRANSCR..       465    14      7     1     21  0.000592
DROME (P33244)   NUCLEAR HORMONE RECEPTOR        1043   104    118    12    222  0.002981
BURCE (P24127)   FUSARIC ACID RESISTANCE          142     6      0     0      6  5.52E-05
SCHPO (P41891)   GAR2 PROTEIN                     500     8      8     0     16  0.000685
HUMAN (P43694)   TRANSCRIPTION FACTOR GATA-4      442    12     12     2     24  0.00053

EXAMPLE 6 (SEQ ID NOS. 1-32)

DROME (P17970) VOLTAGE-GATED POTASSIUM CHANNEL PROTEIN SHAB (SHAB 11)

REVERSE  GSGAGAGAGA  157-166  GAGSGSGSGA  185-194
REVERSE  GSGAGAGAGA  157-166  GSGSGAGTGT  172-181

HUMAN (P22681) PROTO-ONCOGENE C-CBL

REVERSE  SSGAGGGTGS  8-17     GSGPAASAAT  857-866
REVERSE  GSGPAASAAT  857-866  SSGAGGGTGS  8-17
FORWARD  AASAATASPQ  861-870  GSGGSGSGGL  16-25
FORWARD  GPAASAATAS  859-868  TGSGGSGSGG  15-24

SCHPO (P41891) GAR2 PROTEIN

REVERSE  SSSESSSSSE  138-147  FGGRGGFGGG  469-478
REVERSE  SSSESSSSSE  138-147  FGGRGGFGGR  463-472
REVERSE  SSESSSSSES  139-148  GFGGRGGFGG  468-477
REVERSE  SSESSSSSES  139-148  GFGGRGGFGG  462-471

MOUSE (P49749) HOMEOBOX EVEN-SKIPPED HOMOLOG PROTEIN 2

REVERSE  APPSGSSAPC  387-396  GGGAGTAGGS  432-441
REVERSE  PPSGSSAPCS  388-397  GGGGAGTAGG  431-440
REVERSE  PSGSSAPCSC  389-398  GGGGGAGTAG  430-439
REVERSE  AALGSRGGGG  416-425  SAAAPRSESG  446-455
REVERSE  SAAAPRSESG  446-455  AALGSRGGGG  416-425
FORWARD  ALGSRGGGGS  417-426  SQSAAAAAAA  404-413

REFERENCES

Aota S, Gojobori T, Ishibashi F, Marvyama T and Ilkamea T. 1988. Codon usage tabulated from the GenBank Genetic Sequence Data. Nucleic Acid Res. 16 : 315-391

Bairoch A and Apweiler R. 1999. The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999. Nucleic Acids Research. 27:49-54.

Biro J. 1981. Comparative analysis of specificity in protein-protein interactions. Part II.: The complementary coding of some proteins as the possible source of specificity in protein-protein interactions. Med. Hypotheses 7: 981-993

Blalock JE. 1995. Genetic origins of protein shape and interaction rules. Nature Medicine 1: 876-878.

Blalock JE and Smith EM. 1984. Hydropathic anti-complementarity of amino acids based on the genetic code. Biochem Biophys Res Commun. 121: 203-7.

Baranyi L, Campbell W, Ohshima K, Fujimoto S, Boros M and Okada H. 1995. The antisense homology box: a new motif within proteins that encodes biologically active peptides. Nature Medicine. 1:894-901.

Baranyi L, Campbell W and Okada H. 1996. Antisense homology boxes in C5a receptor and C5a anaphylatoxin: a new method for identification of potentially active peptides. J Immunol. 157: 4591-601.

Bost KL, Smith EM and Blalock JE. 1985. Similarity between the corticotropin (ACTH) receptor and a peptide encoded by an RNA that is complementary to ACTH mRNA. Proc. Natl. Acad. Sci. USA 82: 1372-1375.

Bost KL and Blalock JE. 1989. Production of anti-idiotypic antibodies by immunization with a pair of complementary peptides. J. Molec. Recognit. 1: 179-183.

Burley SK, Almo SC, Bonanno JB, Capel M, Chance MR, Gaasterland T, Lin D, Sali A, Studier FW and Swaminathan S. 1999. Structural genomics: beyond the Human Genome Project. Nature Genetics 23: 151-157.

Fassina G, Zamai M, Burke MB and Chaiken IM. 1989. Recognition properties of antisense peptides to Arg8-vasopressin/bovine neurophysin II biosynthetic precursor sequences. Biochemistry 28: 8811-8818.

Fishman M and Adler FL. 1967. Cold Spring Harbour Symp. Quant. Biol. 32: 343-350.

Gaasterland T. 1998. Structural genomics: Bioinformatics in the driver's seat. Nature Biotechnology 16: 625-627.

Goldstein DJ. 1998. An unacknowledged problem for structural genomics? Nature Biotechnology 16: 696-697.

Kohler H and Blalock JE. 1998. The hydropathic binary code: a tool in genomic research? Nature Biotechnology 16: 601.

Kyte J and Doolittle RF. 1982. A simple method for displaying the hydropathic character of a protein. J Mol Biol 157: 105-132.

Mekler LB. 1969. Specific selective interaction between amino acid groups of polypeptide chains. Biofizika 14: 581-584.

Mekler LB and Idlis RG. 1981. Deposited Doc. VINITI 1476-81.

Root-Bernstein RS and Holsworthy DD. 1998. Antisense peptides: a critical mini-review. J. Theor. Biol. 190: 107-119.

Root-Bernstein RS. 1982. Amino acid pairing. J Theor Biol. 94: 885-94.

Sansom C. 1996. Extending the boundaries of molecular modelling. Nature Biotechnology 16: 917-918.

Shai Y, Brunck TK and Chaiken IM. 1989. Antisense peptide recognition of sense peptides: sequence simplification and evaluation of forces underlying the interaction. Biochemistry. 28: 8804-11.

Claims

1. A method for processing sequence data comprising the steps of: selecting a first protein sequence and a second protein sequence; selecting a frame size corresponding to a number of sequence elements such as amino acids or triplet codons, a score threshold, and a frame existence probability threshold; comparing each frame of the first sequence with each frame of the second sequence by comparing pairs of sequence elements at corresponding positions within each such pair of frames to evaluate a complementary relationship score for each pair of frames; storing details of any pairs of frames for which the score equals or exceeds the score threshold; evaluating for each stored pair of frames the probability of that complementary pair of frames existing, on the basis of the number of possible complementary sequence elements existing for each sequence element in the pair of frames; and discarding any stored pairs of frames for which the evaluated probability is greater than the probability threshold.

2. A method according to claim 1, in which the first sequence is identical to the second sequence and a frame at a given position in the first sequence is only compared with frames in the second sequence at the same given position or at later positions in the second sequence, in order to eliminate repetition of comparisons.

3. A method according to claim 1 or 2, in which the sequence elements at corresponding positions within each of a pair of frames are compared sequentially, each such pair of sequence elements generating a score which is added to an aggregate score for the pair of frames.

4. A method according to claim 3, in which if the aggregate score reaches the score threshold before all the pairs of sequence elements in the pair of frames have been compared, details of the pair of frames are immediately stored and a new pair of frames is selected for comparison.

5. A method according to any preceding claim, in which the sequence elements are amino acids and pairs of amino acids are compared by using an antisense score list.

6. A method according to claim 5, in which the antisense score list is as illustrated in EXAMPLE 2 or 4 herein.

7. A method according to any of claims 1 to 4, in which the sequence elements are triplet codons and pairs of codons in corresponding positions within each of the pairs of triplet codons are compared by using an antisense score list.

8. A method for processing sequence data substantially as described herein with reference to Figures 1 to 6.

9. A method for controlling a computer by means of a computer program for implementing the method of any of claims 1 to 8.

10. A computer-readable medium carrying a computer program for implementing the method of any of claims 1 to 8.

11. A computer program for implementing the method of any of claims 1 to 8.

12. A pair of frames or a list of pairs of frames being the product of the method of any of claims 1 to 9, optionally carried on a computer-readable medium.

13. A list of pairs of frames being the product of the method of any of claims 1 to 9, as set out in EXAMPLE 5 or EXAMPLE 6 and Sequence Listing (Seq ID Nos 1-32) herein.

14. A method for protein structure analysis comprising the steps of: reading protein structure data including carbon atom positions; selecting a frame size corresponding to a number of amino acids in the protein structure, a complementary relationship score threshold, and a predetermined number of amino acids and/or a nearest neighbour sphere radius; for each frame in the protein structure evaluating the distances between the frame and carbon atoms making up other amino acids to assemble a list of either the predetermined number of amino acids nearest to the frame or all of the amino acids within the nearest neighbour sphere centred on the frame; for each frame comparing the amino acids in the frame with each of the corresponding listed amino acids to evaluate a complementary relationship score; and storing each frame for which the complementary relationship score equals or exceeds the score threshold.

15. A method according to claim 14, in which the complementary relationship score assesses antisense relationships.

16. A method according to claim 14, in which the complementary relationship score assesses hydropathy relationships.

17. A method according to claim 14, 15 or 16, in which the relationships are between two discontinuous sequences of amino acids, one continuous sequence and one discontinuous sequence of amino acids, or two continuous sequences of amino acids.

18. A method according to any of claims 14 to 17, in which a maximum distance between the frame and amino acids to be listed can be selected.

19. A method for protein structure analysis substantially as described herein with reference to Figures 7 to 9.

20. A method for controlling a computer by means of a computer program for implementing the method of any of claims 14 to 19.

21. A computer-readable medium carrying a program for implementing the method of any of claims 14 to 19.

22. A computer program for implementing the method of any of claims 14 to 19.

23. A pair of frames or list of frames being the product of the method of any of claims 14 to 20, optionally carried on a computer-readable medium.
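Purely as an illustrative sketch, and not as the implementation described with reference to Figures 7 to 9, the steps of claim 14 might be arranged as below. Several points are assumptions: one carbon position per amino acid is supplied as a coordinate triple; the sphere is centred on the centroid of the frame; and the score counts the frame residues that have at least one complementary partner among the listed neighbours. The names neighbours and scan_structure, the toy coordinates and the small pairing table are hypothetical.

```python
import math

# Hypothetical toy data: two facing strands joined by a turn (coordinates in angstroms).
TOY_RESIDUES = "GAGDDSGS"
TOY_COORDS = [(0, 0, 0), (4, 0, 0), (8, 0, 0), (12, 3, 0),
              (12, 7, 0), (8, 10, 0), (4, 10, 0), (0, 10, 0)]
# Illustrative subset of antisense pairings (reverse-complement codons, standard code).
TOY_PARTNERS = {"G": set("SPTA"), "A": set("CRSG"), "S": set("RGTA"), "D": set("IV")}


def neighbours(coords, start, frame, radius=None, nearest=None):
    """Indices of residues outside the frame, nearest first, limited by radius and/or count."""
    members = range(start, start + frame)
    # ASSUMPTION: the nearest neighbour sphere is centred on the frame centroid.
    centre = [sum(coords[i][k] for i in members) / frame for k in range(3)]
    others = sorted((math.dist(centre, coords[i]), i)
                    for i in range(len(coords)) if i not in members)
    if radius is not None:
        others = [(d, i) for d, i in others if d <= radius]
    if nearest is not None:
        others = others[:nearest]
    return [i for _, i in others]


def scan_structure(residues, coords, frame, score_threshold, partners,
                   radius=None, nearest=None):
    """Return (1-based start, frame, score) for frames meeting the score threshold."""
    hits = []
    for start in range(len(residues) - frame + 1):
        near = {residues[i] for i in neighbours(coords, start, frame, radius, nearest)}
        # ASSUMPTION: one point per frame residue with a complementary neighbour.
        score = sum(bool(partners.get(residues[i], set()) & near)
                    for i in range(start, start + frame))
        if score >= score_threshold:
            hits.append((start + 1, residues[start:start + frame], score))
    return hits


if __name__ == "__main__":
    print(scan_structure(TOY_RESIDUES, TOY_COORDS, 3, 3, TOY_PARTNERS, radius=12.0))
```

Passing nearest instead of radius selects the predetermined number of nearest amino acids, and passing both applies the radius as a maximum distance in the manner of claim 18. A hydropathy-based score, as in claim 16, could be substituted by replacing the pairing table and the scoring line.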