US20040006559A1

US20040006559A1 - System, apparatus, and method for user tunable and selectable searching of a database using a weighted quantized feature vector

Info

Publication number: US20040006559A1
Application number: US10/448,168
Authority: US
Inventors: David Gange; Bomi Framroze
Original assignee: ROW2 TECHNOLOGIES Inc
Current assignee: ROW2 TECHNOLOGIES Inc
Priority date: 2002-05-29
Filing date: 2003-05-28
Publication date: 2004-01-08
Also published as: US20060074859A1; EP1631925A1; EP1631925A4; US7251643B2; WO2004107217A1

Abstract

The invention disclosed herein concerns a data processing means for user tunable and selectable searching of a database wherein the data contained therein have associated descriptive properties capable of being expressed in numeric form. A quantized vector representative of the descriptive properties is created for each item in the database. This quantized vector becomes the fingerprint for each data item. The user submits a query item to be matched against the database for similarity. A fingerprint is calculated for the query item. The user may then assign weights to the individual descriptive properties based upon perceived importance. A newly weighted fingerprint for the query item is then compared with the weighted fingerprints for all the data in the database. A list of results sorted in order of decreasing similarity is presented to the user. The user may then change the previously assigned weights and then re-run the similarity search. This may be done as often as necessary to achieve the desired results. The invention describes similarity searching in a generic database. However, this invention is particularly desirable in databases containing chemical compound structure data or biological response screening result data. The process described herein may be run stand alone or as a preliminary screening search in a large database. If used for screening, it can greatly reduce the amount of data required for exactly matching a query item to the data in the database.

Description

CROSS REFERENCE TO RELATED APPLICATIONS

This is a U.S. nonprovisional utility patent application that is also described in and claims the benefit of both U.S. provisional patent application Nos. 60/383,952 filed on May 29, 2002, entitled MACHINE, METHOD AND ARTICLE OF MANUFACTURE FOR A SELECTIVELY SEARCHING A DATABASE OF CHEMICAL COMPOUNDS, and 60/384,305 filed on May 30, 2002, entitled MACHINE, METHOD AND ARTICLE OF MANUFACTURE FOR SEARCHING A DATABASE OF BIOLOGICAL ACTIVITY SCREENING RESULTS, said provisional applications being incorporated by reference in their entirety herein.[0001]

REFERENCE TO AN APPENDIX

Accompanying this patent application is a CD-R, bearing the electronic title “Gange & Framroze,” the contents of which comprise a program listing in ASCII text file format entitled LISTING.TXT, being of size 86 KB and having been created on May 29, 2003. The contents of said CD-R are incorporated by reference herein. The CD-R is hand labeled as follows:

Non-Provisional Patent Application Dr. David M. Gange & Dr. Bomi P. Framroze Filed: May 29, 2003 Docket No.: 51900-ROW2-01-001

Attached to this application and made an integral part hereof is an APPENDIX comprising the identical program listing as that found on said CD-R.

BACKGROUND

1. Field of the Invention

This invention relates to data processing and specifically enabling highly efficient searching of a database wherein the entries can be characterized using a set of one or more descriptive properties that can be expressed in numeric form.

2. Description of the Prior Art

Modern database management systems have been used since the early 1970's. Commercial database systems mostly concentrate on finding exact matches. Searches are performed either to find a specific entry, or to find multiple entries having the same characteristics. Attributes of the data often become fields. An exact search can be made to find a specific person by looking up his name or social security number. A search can be performed to find multiple individuals having the same occupation or place of birth. Alternatively, one may locate all people born before a particular date. Whether a single entry or multiple entries are found, this type of query constitutes an exact search. Exact searches try to exactly or relationally match one or more fields in different data records.

Similarity searching of databases has been around for several years. A similarity search compares two or more entries in their entirety to determine how closely they match one another. Consider the following simple database containing entries of various animals that fly:

a house-fly

a bat

a hummingbird

a dragonfly

a flying fish

a hawk

The question: “Which are most similar?” is not meaningful without additional input. A proper answer requires input of the key dimension. If “feathers” represent the key dimension, then the hawk and the hummingbird are most similar. If “the ability to fly stationary” is the key dimension, then the dragonfly and the hummingbird are most similar. Other possible key dimensions could be metabolism, life span, body temperature, etc. Therefore, the answer to the question: “Which are most similar?” is subjective depending upon the preferences of the person supplying the answer.

For a more complicated residential real estate database, a potential buyer would be looking to buy a home by expressing preferences that become the parameters for a similarity search. Such parameters might include number of bedrooms, type of house, asking price, neighborhood, quality of the local school system, property taxes, age restrictions on residents, home-owners' associations, etc. Currently, a real estate agent would first screen for homes having a specific most desirable characteristic (e.g., neighborhood or number of bedrooms). Then, the agent would look for the next desirable characteristic. The process would be repeated for each parameter, each search yielding a number of homes for consideration by the buyer. Where a particular home appears in the search results multiple times, it is more likely that the agent can make a sale. However, a binary feature vector may be created using these and other parameters, and a similarity search can be performed to match a potential buyer's preferences. This search would generate a list of homes approximating these preferences. A binary vector could indicate whether or not the buyer is interested in a particular feature. The homes can then be compared in their entirety by computing the mathematical distance between their feature vectors. In the rare instance where an exact match is found, the distance between the vectors would be zero. However, if the distance is not zero, the smaller the distance between the feature vectors of an ideal home and an available home, the more similar they are.

This technique has been found to be particularly useful for searching in databases containing chemical structures. Databases of organic chemical compounds can contain millions of records. An atom-by-atom and bond-by-bond search becomes more difficult as the size of the molecule increases. Even were the organic molecules to be pre-classified according to specific features, queries to find exact matches of these features might still yield questionable and non-useful results. Furthermore, in large databases, exact match searching can be extremely time consuming. Similarity searching in a large chemical structure database is a method of screening for compounds which are closely related to one another but may not exactly match. Such a screening query can also be used to shorten the list of compounds to be matched thereby resulting in greatly reducing the overall query time. In fact, several screening searches using different algorithms may be performed that would yield a manageable list of chemical compounds that would then be exactly matched in an atom-by-atom and bond-by-bond search.

A chemical structure similarity search may be performed by creating a chemical fragment dictionary or by using an algorithm that generates chemical structure fragments. A fragment consists of a grouping of atoms attached to one another by specific chemical bonds. All of the compounds in the database are parsed to determine whether or not a particular fragment is present. Associated with each compound is a binary vector. Each element of the vector represents the presence or absence of a specific fragment. This binary vector then serves as an index for that compound in the database. Now a search can be made to find a compound in the database that is similar to a substance that interests the user. The distance between the vector for the new substance and the vectors of compounds in the database can be calculated. The results can then be returned in order of decreasing similarity.

In another application, chemical compounds, natural products, fermentation broths, and other substances are often tested for biological activity, or pharmacological activity. The results of these tests are often stored in electronic databases. Biologists and chemists are often interested in searching a database of biological screening results for substances with an activity profile similar to a given biological activity profile. For example, in the development of an antibiotic a scientist might be interested in substances showing good activity against gram-positive bacteria and one species of gram-negative bacteria. The profile of such a substance would have strong activity values for the several gram-positive and one gram-negative bacteria under consideration and weak activity values for the rest of the gram-negative species tested. In addition, physical properties of the substances, such as LogP, molecular weight, molecular size, pKa, and other physical properties may be considered. One method that can be used to examine biological screening results and property data is similarity searching.

In this type of database, a feature vector or a vector of test results can be formed where binary values would not be used. In this case, it would be desirable to create a vector where a specific element would refer to a particular feature or test, and the vector would contain numeric values other than one or zero. The distance between vectors may be measured, and distances would represent the degree of similarity between entries in the database.

Similarity searching using quantized vectors is prior art. However, prior art searches have been performed according to a fixed searching algorithm. In a chemical compound database, the user might wish to perform a similarity search based upon substructure comparisons, and the data processing system would provide an answer as a sorted list of compounds. When process development chemists search for similarity in chemical compounds, some parts of the molecule are more important to them than other parts. Therefore, when performing a search, they would be interested in establishing a higher search priority to the important substructures and a lower search priority to the less important substructures. Assignment of search priorities is arbitrary and based upon user preference. If priorities of substructure preferences can be dynamically assigned, then should the results of the search not be what the user desires, the user can reassign substructure priorities, thereby refining the search results. The units of assigned priority or weights can be arbitrary, and only their ratio to each other is important.

In the previously mentioned residential real estate database, the similarity search revealed homes having all of the features that interested a potential buyer. Yet, for some potential buyers, certain items are more important than others. For example, for a family with four children, purchase of a house with five bedrooms and the quality of the school system might be more important than asking price and property taxes. Yet these latter features could also serve as influencing factors. In such a case, being able to assign higher priorities to certain features and lower priorities to other features would result in a more meaningful search.

The underlying mathematics for this search is very broadly applicable. It can be used inter alia in biology and medical databases, in physiology databases, in anthropology databases, in photography databases, and in taxonomy databases. It is practical where a characterization vector can be applied to the description of the data.

It is an object of the invention described herein to create a computerized system that will perform similarity searches in an electronic database where the entries have a set of one or more descriptive properties capable of being expressed in numeric form and wherein the user can assign weights or priorities to the descriptive properties so as to influence the similarity searches.

SUMMARY OF THE INVENTION

The invention disclosed herein is a data processing product and method that permits computerized similarity searching of an electronic database using a quantization vector. The quantization vector, a linear array of descriptive properties of the entries in the database, is maintained by the system. Different datatype representations of the quantization vector may be implemented. The system examines the structure of a query item in terms of its known descriptive properties. During examination, the quantization vector is established. This vector represents the query item's “fingerprint.” The system then searches the entire database for identity or similarity to the query item by comparing the vectors. The system further permits the user to set numeric priorities for the descriptive properties in a user friendly environment, said priorities to be used in the search for entries that are similar to the query item. An object of the invention is to provide a simplified searching system for naive and infrequent users. In one of the embodiments presented herein, a computerized user tunable system is disclosed that selectively searches a database of chemical compounds. In another embodiment presented herein, a computerized user tunable system is disclosed that selectively searches a database of biological activity screening test results.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an overview program flowchart showing a current computerized method of performing similarity searches for a generalized database. The method shown is prior art. [0027]
FIG. 2 is an overview program flowchart showing the computerized method of the invention disclosed herein being used to perform similarity searches for a generalized database. [0028]
FIG. 3 is an overview program flowchart showing the computerized method of the invention disclosed herein being used to perform similarity searches in a database of organic chemical compounds. [0029]
FIG. 4 shows the screen view of the chemical structure of the query compound, Trovafloxacin, as drawn by the user with one of the standard chemical drawing software packages and input into the search program. [0030]
FIG. 5 shows the structure of the query compound having been parsed or fingerprinted according to chemical structure fragments in a fragment dictionary. Only twelve fragments are shown on the screen in the figure. However, a slider on the right edge of the screen may be used to display additional fragments. An adjustable slider with a numeric scale is associated with each fragment shown. [0031]
FIG. 6 is the screen of FIG. 5 after the user adjusted some of the sliders so as to assign weights to their associated fragments. [0032]
FIG. 7 shows the results of the similarity search for the query compound with compounds in the database. The figure shows ten out of fifty compounds returned as part of the search. [0033]
FIG. 8 shows a program flow chart for a specific implementation of the program in FIG. 3 in which the screens shown in FIG. 4 through FIG. 7 are used. [0034]
FIG. 9 is an overview program flowchart showing the computerized method of the invention disclosed herein being used to perform similarity searches in a database of biological responses to various compounds. [0035]
FIG. 10 is a MICROSOFT EXCEL spreadsheet divided into three parts, FIGS. 10(a), (b), and (c), because the entire spreadsheet could not conveniently fit on a single drawing sheet. The data in the spreadsheet (the LEWI Data) represent the biological response test data of rats to various tranquilizers. [0036]
FIG. 11 represents four screen prints of the program shown in FIG. 9 operating on the data shown in FIG. 10. [0037]

DESCRIPTION OF THE PREFERRED AND ALTERNATE EMBODIMENTS

It is feasible to perform similarity searches in an electronic database of items, wherein said items possess a set of one or more descriptive properties (related to the items) that can be expressed in numeric form. Similarity searching in such a generalized database according to current technology may be performed in a computer using the method shown in FIG. 1. [0038]
1. A user submits a query to the system. The query may be submitted using different formats, but a query item must be able to be classified according to its descriptive properties. The descriptive properties may have inherent numeric values (e.g., test results, characteristic values, prices, ASCII values, checksums, etc.). Alternatively, they may have binary values (‘one’ indicating the presence of a feature and ‘zero’ indicating the absence of the feature). [0039]
2. The query item is parsed according to its descriptive properties. The descriptive properties are analyzed by comparing various elements of the query item in sequence to standardized descriptive properties previously entered electronically into the computer. The characteristics of these descriptive properties may be pre-stored in an electronic dictionary or be generated dynamically by some program algorithm. Regardless of how these descriptive properties are presented for comparison with the query item, the query item is analyzed for the presence or absence of a particular property, and its numeric value is noted. A quantized vector is formed wherein each element in the vector represents a value for a specific descriptive property. The quantized vector can be thought of as a “fingerprint” for the query item. [0040]
3. The database contains entries of similarly describable items, each such item having been similarly pre-parsed into quantized vectors. The quantized vector (or “fingerprint”) for each entry is stored in the database and associated with its entry. Therefore, a distance may be computed between the vector representing the query item and each vector representing each and every item in the database. The closer the query item vector is to a vector representing an entry in the database, the more similar the query item is to that database entry. [0041]
4. The results are sorted in order of similarity. [0042]
5. The sorted results may be stored for future use at the user's discretion. [0043]
6. The sorted list of database entries is then presented to the user. [0044]
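The six steps above can be sketched in a few lines of Python. This is a minimal illustration and not the program listing from the appendix: the property dictionary, the item names, and the function names `fingerprint` and `similarity_search` are all hypothetical, and Euclidean distance stands in for whichever measure a real system would use.

```python
import math

# Hypothetical dictionary of descriptive properties (assumed for illustration).
PROPERTIES = ["feathers", "hovers", "insect", "warm_blooded"]

def fingerprint(features):
    """Step 2: parse an item into a binary quantized vector keyed to PROPERTIES."""
    return [1 if p in features else 0 for p in PROPERTIES]

def similarity_search(query_features, database):
    """Steps 3-6: compare the query fingerprint with every stored fingerprint
    and return (name, distance) pairs, smallest distance (most similar) first."""
    q = fingerprint(query_features)
    scored = [(name, math.dist(q, fp)) for name, fp in database.items()]
    return sorted(scored, key=lambda pair: pair[1])

# Every database entry is pre-parsed into its fingerprint (step 3).
DATABASE = {
    "hawk": fingerprint({"feathers", "warm_blooded"}),
    "hummingbird": fingerprint({"feathers", "hovers", "warm_blooded"}),
    "dragonfly": fingerprint({"hovers", "insect"}),
    "bat": fingerprint({"warm_blooded"}),
}

results = similarity_search({"feathers", "hovers", "warm_blooded"}, DATABASE)
print(results[0])  # the exact match comes first, at distance 0.0
```

Note that the ranking is fixed by the algorithm: the user has no way to say that one property matters more than another.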
The computation of vector distances in step 3 above may be calculated, inter alia, as the standard Euclidean Distance, the Tanimoto Coefficient, the Hamming Distance, the Soergel Distance, the Dice Coefficient, or the Cosine Coefficient. Other types of similarity measurement may also be used. [0045]
The most familiar method for computing the distance between two vectors, thereby comparing their overall similarity, is to measure the Euclidean distance between them. This is done according to the well-known equation: [0046]
$$D_{A,B} = \left[\sum_{j=1}^{n} \left(x_{jA} - x_{jB}\right)^{2}\right]^{1/2} \qquad [1]$$
where: [0047]
$D_{A,B}$ = the distance between vectors A and B; [0048]
$j$ = the index to a specific vector element; [0049]
$n$ = the number of elements in the vector; [0050]
$x_{jA}$ = the value of the jth element in the A vector; and, [0051]
$x_{jB}$ = the value of the jth element in the B vector. [0052]
This is the familiar process of obtaining the difference between each of the elements in the same position in each vector, squaring that difference, and then taking the square root of the sum of the squares. Using this method, the distance between two identical vectors would be zero. The smaller the distance between two vectors, the greater their degree of similarity. The Euclidean Distance can be normalized to the range of 0 to 1 if the values of all attributes are normalized to this range and the results divided by n. [0053]
To illustrate computation of the distance, assume two binary vectors of dimension 5: A=1 1 0 1 1 and B=0 1 1 1 0. Using Equation [1], the calculation of Euclidean distance from A to B is as follows: [0054]

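| A | B | C = A − B | C * C | Sum of C * C | Distance |
|---|---|-----------|-------|--------------|----------|
| 1 | 0 | 1 | 1 | 3 | 1.73 |
| 1 | 1 | 0 | 0 | | |
| 0 | 1 | −1 | 1 | | |
| 1 | 1 | 0 | 0 | | |
| 1 | 0 | 1 | 1 | | |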
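The worked example can be checked with a short Python sketch of Equation [1]. The function name `euclidean_distance` is an illustrative choice, not a name taken from the appendix listing.

```python
import math

def euclidean_distance(a, b):
    """Equation [1]: square root of the sum of squared element differences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((xa - xb) ** 2 for xa, xb in zip(a, b)))

A = [1, 1, 0, 1, 1]
B = [0, 1, 1, 1, 0]

d = euclidean_distance(A, B)
print(round(d, 2))  # 1.73, the square root of 3
```

Three positions differ, each contributing 1 to the sum of squares, so the distance is the square root of 3.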
Another method for comparing similarity is to compute the Tanimoto Coefficient of the two vectors. This is done using the equation: [0055]
$$S_{A,B} = \frac{\sum_{j=1}^{n} x_{jA} x_{jB}}{\sum_{j=1}^{n} (x_{jA})^{2} + \sum_{j=1}^{n} (x_{jB})^{2} - \sum_{j=1}^{n} x_{jA} x_{jB}} \qquad [2]$$
where: [0056]
$S_{A,B}$ = the Tanimoto Coefficient. [0057]
The Tanimoto Coefficient is the quotient of two quantities. The numerator is the sum of the products of corresponding elements of the two vectors (their dot product). The denominator is the sum of the squares of the elements of the first vector, plus the sum of the squares of the elements of the second vector, less that same dot product. Another name for the Tanimoto Coefficient is the Jaccard Coefficient. [0058]
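As a rough check of Equation [2], the sketch below computes the Tanimoto Coefficient for the same two vectors used in the Euclidean example. The function name is illustrative, and the code assumes at least one vector has a non-zero element so that the denominator is not zero.

```python
def tanimoto(a, b):
    """Equation [2]: dot(a, b) / (|a|^2 + |b|^2 - dot(a, b)).

    Assumes the vectors are not both all-zero (the denominator would be 0).
    """
    dot = sum(xa * xb for xa, xb in zip(a, b))
    sum_sq_a = sum(xa * xa for xa in a)
    sum_sq_b = sum(xb * xb for xb in b)
    return dot / (sum_sq_a + sum_sq_b - dot)

A = [1, 1, 0, 1, 1]
B = [0, 1, 1, 1, 0]

s = tanimoto(A, B)
print(s)  # 0.4, i.e. 2 / (4 + 3 - 2)
```

Unlike a distance, this coefficient grows with similarity: identical non-zero vectors score 1.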
Other distance computations such as the Hamming Distance, the Soergel Distance, the Dice Coefficient and the Cosine Coefficient are sometimes used to perform similarity searches and are prior art. The Hamming Distance is computed as: [0059]
$$D_{A,B} = \sum_{j=1}^{n} \left| x_{jA} - x_{jB} \right| \qquad [3]$$
The Soergel Distance is computed as: [0060]
$$D_{A,B} = \frac{\sum_{j=1}^{n} \left| x_{jA} - x_{jB} \right|}{\sum_{j=1}^{n} \max\left(x_{jA}, x_{jB}\right)} \qquad [4]$$
The Dice Coefficient (also known as the Czekanowski Coefficient and the Sørensen Coefficient) is computed as: [0061]
$$S_{A,B} = \frac{2 \sum_{j=1}^{n} x_{jA} x_{jB}}{\sum_{j=1}^{n} (x_{jA})^{2} + \sum_{j=1}^{n} (x_{jB})^{2}} \qquad [5]$$
The Cosine Coefficient is computed as: [0062]
$$S_{A,B} = \frac{\sum_{j=1}^{n} x_{jA} x_{jB}}{\left[\sum_{j=1}^{n} (x_{jA})^{2} \sum_{j=1}^{n} (x_{jB})^{2}\right]^{1/2}} \qquad [6]$$
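The four measures above can be sketched in Python as follows. The function names are illustrative, and the Dice Coefficient is written in its standard form, with the two sums of squares added in the denominator. The code assumes the vectors are not all-zero, since the ratio forms would otherwise divide by zero.

```python
import math

def hamming(a, b):
    """Sum of absolute element differences."""
    return sum(abs(xa - xb) for xa, xb in zip(a, b))

def soergel(a, b):
    """Hamming distance divided by the sum of element-wise maxima."""
    return hamming(a, b) / sum(max(xa, xb) for xa, xb in zip(a, b))

def dice(a, b):
    """Twice the dot product over the sum of the two sums of squares."""
    dot = sum(xa * xb for xa, xb in zip(a, b))
    return 2 * dot / (sum(x * x for x in a) + sum(x * x for x in b))

def cosine(a, b):
    """Dot product over the square root of the product of the sums of squares."""
    dot = sum(xa * xb for xa, xb in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(x * x for x in b))

A = [1, 1, 0, 1, 1]
B = [0, 1, 1, 1, 0]

print(hamming(A, B))           # 3 positions differ
print(soergel(A, B))           # 3 / 5
print(round(dice(A, B), 4))    # 4 / 7, about 0.5714
print(round(cosine(A, B), 4))  # 2 / sqrt(12), about 0.5774
```

Hamming and Soergel are distances (smaller means more similar), while Dice and Cosine are coefficients (larger means more similar), so a search must sort in the direction appropriate to the chosen measure.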
The foregoing comparison methodologies represented by Equations [1] through [6] are only a few prior art techniques for similarity measurement between two quantized vectors. Of course, the measure of similarity depends upon the method of measurement. Changing the “fingerprint” changes the similarity. The results are dictated by the algorithm of the system. For the aforementioned prior art similarity measurement methods, there is generally no feedback, no user control over the results, and no possibility of iteratively improving the answer. [0063]
The present invention improves the quality of the results obtained from similarity searching in the type of database discussed above. The results obtained from a search using the methodology disclosed herein should be more meaningful to the user. FIG. 2 is an overview program flowchart showing the computerized method of the invention disclosed herein being used to perform similarity searches for a generalized database. The methodology is as follows: [0064]
1. The user submits a query item to the system. [0065]
2. The query item is parsed according to its descriptive properties using the same method that is used to calculate the descriptive properties of the entries stored in the database. A quantized vector (or “fingerprint”) for the query item is formed. [0066]
3. The user is permitted to assign a weight or priority to each descriptive property of the quantized vector. A quantized weight vector is then formed in this manner. The weight vector has the same dimension (or number of elements) as the quantized vector representing the “fingerprint” of the query item. The assignment of weights can be done by presenting to the user a computer screen showing the query item, the descriptive properties of the query item, and a means to adjust weighting to assign importance values to the descriptive properties. The means to adjust weighting may be adjustable sliders, dials, text boxes, or any other controls that permit the user to interactively assign weights to the descriptive properties. Alternatively, the weight values representing the elements of the quantized weight vector may be obtained from a file created by the user. [0067]
4. The user adjusts the descriptive property weightings to suit his or her individual preferences. [0068]
5. Using the query item properties and weightings, similarity values between the query item and all of the items in the database are calculated using one of the standard similarity algorithms (Euclidean Distance, Tanimoto Coefficient, etc.). [0069]
6. Using the calculated similarity values, the database items are sorted. [0070]
7. The sorted results may be stored for future use at the user's discretion. [0071]
8. The sorted list of database items is presented to the user. [0072]
9. If the user so desires, the process may be repeated until the desired outcome is achieved. [0073]
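A minimal sketch of this loop is given below, using the weighted Euclidean distance that the document gives as Equation [7]. The three-property home fingerprints, the weight settings, and the function names are hypothetical. In a real system the weights would come from sliders or a file rather than from a hard-coded list.

```python
import math

def weighted_distance(a, b, w):
    """Weighted Euclidean distance: each squared difference is scaled by w_j."""
    return math.sqrt(sum(wj * (xa - xb) ** 2 for wj, xa, xb in zip(w, a, b)))

def weighted_search(query_fp, weights, database):
    """Steps 5-6: score every entry against the query, then sort so the
    most similar entry (smallest distance) comes first."""
    scored = [(name, weighted_distance(query_fp, fp, weights))
              for name, fp in database.items()]
    return sorted(scored, key=lambda pair: pair[1])

# Hypothetical properties: [five bedrooms, good schools, low price]
DATABASE = {
    "home_a": [1, 1, 0],
    "home_b": [0, 0, 1],
    "home_c": [0, 0, 0],
}
QUERY = [1, 1, 1]  # the buyer's ideal home

# Step 9: the user re-runs the search with different weights.
rankings = {}
for weights in ([1, 1, 1], [1, 1, 10]):
    ranked = [name for name, _ in weighted_search(QUERY, weights, DATABASE)]
    rankings[tuple(weights)] = ranked
    print(weights, ranked)
```

With equal weights, home_a ranks first. Once price is weighted ten times as heavily as the other properties, home_b moves ahead of it, which is the kind of user-driven refinement the steps describe.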
The units of assigned priority or weights can be arbitrary, and only their ratio to each other is important. In the system represented by reduction to practice of the present invention, the weights are unitless integers between zero and ten. However, a logarithmic scale may also be used. In that case, “1” would be the inflection point. Fractional weights (between “0” and “1”) should be in tenths. Fractional weights downscale priorities, while weights above “1” upscale priorities. [0074]
In this type of system, using a weight vector, w, the Euclidean distance between the two vectors would be computed as: [0075]
$$D_{A,B} = \left[\sum_{j=1}^{n} w_{j} \left(x_{jA} - x_{jB}\right)^{2}\right]^{1/2} \qquad [7]$$
where: [0076]
$D_{A,B}$ = the distance between vectors A and B; and, [0077]
$w_{j}$ = the weight assigned to vector element j. [0078]
To illustrate the new computation of the Euclidean distance as influenced by the assigned weights ($w_{j}$), consider again the two previous binary vectors of dimension 5, A=1 1 0 1 1 and B=0 1 1 1 0, with the weights 3, 1, 3, 1, 3. The calculation of the new Euclidean distance from A to B is as follows: [0079]

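| A | B | C = A − B | C * C | Weight | Sum of Weight * C * C | Distance |
|---|---|-----------|-------|--------|-----------------------|----------|
| 1 | 0 | 1 | 1 | 3 | 9 | 3 |
| 1 | 1 | 0 | 0 | 1 | | |
| 0 | 1 | −1 | 1 | 3 | | |
| 1 | 1 | 0 | 0 | 1 | | |
| 1 | 0 | 1 | 1 | 3 | | |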
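The weighted example can be reproduced with a short sketch of Equation [7]. The function name and the check for negative weights are illustrative additions, not taken from the appendix listing.

```python
import math

def weighted_euclidean(a, b, w):
    """Equation [7]: sqrt of the sum of w_j * (x_jA - x_jB)^2."""
    if not (len(a) == len(b) == len(w)):
        raise ValueError("vectors and weights must have the same dimension")
    if any(wj < 0 for wj in w):
        raise ValueError("weights are assumed to be non-negative")
    return math.sqrt(sum(wj * (xa - xb) ** 2 for wj, xa, xb in zip(w, a, b)))

A = [1, 1, 0, 1, 1]
B = [0, 1, 1, 1, 0]
W = [3, 1, 3, 1, 3]

d_weighted = weighted_euclidean(A, B, W)
d_uniform = weighted_euclidean(A, B, [1] * 5)
print(d_weighted)           # 3.0, the square root of 3 + 0 + 3 + 0 + 3
print(round(d_uniform, 2))  # 1.73, the unweighted result
```

Because the three differing positions all carry weight 3, the weighted sum of squares is 9 and the distance rises from 1.73 to 3. Setting every weight to 1 recovers Equation [1].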
The new weighted Tanimoto Coefficient derived from Equation [2] would be computed according to Equation [8]: [0080]
$$S_{A,B} = \frac{\sum_{j=1}^{n} w_{j} x_{jA} x_{jB}}{\sum_{j=1}^{n} w_{j} (x_{jA})^{2} + \sum_{j=1}^{n} w_{j} (x_{jB})^{2} - \sum_{j=1}^{n} w_{j} x_{jA} x_{jB}} \qquad [8]$$
Likewise, the new weighted Hamming Distance derived from Equation [3] would be computed using Equation [9]: [0081]
$$D_{A,B} = \sum_{j=1}^{n} w_{j} \left| x_{jA} - x_{jB} \right| \qquad [9]$$
the new weighted Soergel distance derived from Equation [4] would be computed using Equation [10]: [0082]
$$D_{A,B} = \frac{\sum_{j=1}^{n} w_{j} \left| x_{jA} - x_{jB} \right|}{\sum_{j=1}^{n} w_{j} \max\left(x_{jA}, x_{jB}\right)} \qquad [10]$$
the new weighted Dice coefficient derived from Equation [5] would be computed using Equation [11]: [0083]
$$S_{A,B} = \frac{2 \sum_{j=1}^{n} w_{j} x_{jA} x_{jB}}{\sum_{j=1}^{n} w_{j} (x_{jA})^{2} + \sum_{j=1}^{n} w_{j} (x_{jB})^{2}} \qquad [11]$$
and the new weighted Cosine coefficient derived from Equation [6] would be computed using Equation [12], in which the two weighted sums of squares are multiplied, as in Equation [6]: [0084]
$$S_{A,B} = \frac{\sum_{j=1}^{n} w_{j} x_{jA} x_{jB}}{\left[\sum_{j=1}^{n} w_{j} (x_{jA})^{2} \sum_{j=1}^{n} w_{j} (x_{jB})^{2}\right]^{1/2}} \qquad [12]$$
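The weighted measures can be sketched together in Python. The function names are illustrative. The cosine here multiplies the two weighted sums of squares in the denominator, mirroring the unweighted Equation [6]; treat that choice as an assumption of this sketch. The code also assumes the weighted sums in the denominators are non-zero.

```python
import math

def _wsum(w, u, v):
    """Weighted sum of element-wise products: sum_j w_j * u_j * v_j."""
    return sum(wj * uj * vj for wj, uj, vj in zip(w, u, v))

def weighted_tanimoto(a, b, w):
    ab = _wsum(w, a, b)
    return ab / (_wsum(w, a, a) + _wsum(w, b, b) - ab)

def weighted_hamming(a, b, w):
    return sum(wj * abs(xa - xb) for wj, xa, xb in zip(w, a, b))

def weighted_soergel(a, b, w):
    denom = sum(wj * max(xa, xb) for wj, xa, xb in zip(w, a, b))
    return weighted_hamming(a, b, w) / denom

def weighted_dice(a, b, w):
    return 2 * _wsum(w, a, b) / (_wsum(w, a, a) + _wsum(w, b, b))

def weighted_cosine(a, b, w):
    # Product form of the denominator, as in Equation [6] (an assumption here).
    return _wsum(w, a, b) / math.sqrt(_wsum(w, a, a) * _wsum(w, b, b))

A = [1, 1, 0, 1, 1]
B = [0, 1, 1, 1, 0]
W = [3, 1, 3, 1, 3]
W_SCALED = [10 * wj for wj in W]  # same ratios, different units

print(weighted_tanimoto(A, B, W))         # 2 / 11
print(weighted_tanimoto(A, B, W_SCALED))  # unchanged: only ratios matter
print(weighted_hamming(A, B, W))          # 9
```

For the ratio-based coefficients (Tanimoto, Soergel, Dice, Cosine), multiplying every weight by the same constant leaves the value unchanged. For the Hamming and Euclidean distances the values scale, but the resulting ranking of database entries does not change.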
As previously mentioned, one of the preferred uses for this methodology implemented as a computerized system is as a means to selectively search a database of chemical compounds. All chemical compounds can be structurally decomposed into recognizable fragments. Inorganic molecules are composed of atoms, and these atoms are bound to each other in a limited number of ways. The elements making up these molecules span the entire periodic table. However, their structures are simple. On the other hand, organic molecules comprise very few elements usually on the lower end of the periodic table (e.g., carbon, hydrogen, oxygen, nitrogen, etc.), but their structures are complex. Due to structural complexity and the ability of these elements to form large molecules, the number of possible organic molecules is virtually limitless. During product development of organic compounds, it is often important to search for other compounds having a similar molecular structure in an effort to adjust the new structure so as to predict its chemical, biological, and physical properties. Such a search is also necessary to ensure that the new product does not infringe on patented products previously developed by others. [0085]
Computerized searching of organic chemical compound databases has been around for decades. Many of these databases store molecular information according to their recognizable fragments. The data processing systems maintain a fragment dictionary, and all compounds input into the database are parsed so as to establish a relationship between fragments in the dictionary. The dictionary is instituted with a limited number of fragments well known to those skilled in the art. Many database searching tools use fragment dictionaries with a large number of entries, and others use fragment dictionaries with a smaller number of entries. A larger number of fragments makes it easier to define a complex molecule, but it increases the search time. A more rapid search engine requires fewer fragments in the dictionary. [0086]
As the number of atoms in a complex organic molecule increases, the search time for identity and similarity in these databases grows exponentially. There is no upper bound as to the time required to secure a match. Therefore, a search should be done in two stages. The first stage is a screening search. This stage eliminates most of the compounds in the database (possibly up to 99%). In order to determine whether one structure is a substructure of another, traditionally one performs an atom-by-atom match. Atoms are graphically superimposed upon one another to make sure that all the atoms match and that all the bonds between the atoms match. If one is a subset of the other, then there is a substructure match. However, this is a slow process. In order to minimize the number of times that this process is performed, it is important to first apply a filter in order to perform a screening search. If a substructure match is found, all of the atoms and bonds between the atoms of the smaller structure will be contained within the larger structure. If there is a fragment dictionary, all of the fragments in the molecule to be matched must also be in the target molecule. The target molecule may contain other fragments as well, but every fragment of the substructure must be present in both molecules. So after performing the search using a binary vector of fragments, most of the molecules are eliminated. Then, an atom-by-atom search is performed on the remainder of the database. [0087]
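The screening stage can be sketched with fingerprints held as integer bit masks, one bit per dictionary fragment. The compound names and bit patterns below are hypothetical, and the slower atom-by-atom match that would follow is not shown.

```python
def passes_screen(query_bits, target_bits):
    """True when every fragment bit set in the query is also set in the target.

    A target that fails this test cannot contain the query as a substructure,
    so it can be discarded without an atom-by-atom comparison.
    """
    return (query_bits & target_bits) == query_bits

def screen(query_bits, database):
    """Return the names of the compounds that survive the screening search."""
    return [name for name, bits in database.items()
            if passes_screen(query_bits, bits)]

# Hypothetical five-fragment fingerprints, one bit per dictionary fragment.
DATABASE = {
    "compound_1": 0b10111,
    "compound_2": 0b00101,
    "compound_3": 0b01010,
}

query = 0b00101
candidates = screen(query, DATABASE)
print(candidates)  # ['compound_1', 'compound_2']
```

Only the surviving candidates would then go on to the atom-by-atom and bond-by-bond comparison. Passing the screen is necessary for a substructure match but not sufficient.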
One possible representation of a complex molecule would be to parse it into a binary fragment vector. Each bit represents the presence or absence of a particular fragment in the dictionary. The vector element order is keyed to the fragment dictionary. Molecular parsing is performed by analyzing the chemical structure atom-by-atom, together with each bond that is associated with each atom. A search of the fragment dictionary is performed to find a match. When a match is found, the element of the molecular descriptor vector corresponding to the matched fragment is set to 1. The binary vector may be represented logically as a string of bits or bytes or may have any convenient representation. These binary vectors then form a fingerprint for the chemical structure of the molecule. Each bit or fragment in the fingerprint is a dimension representing one row in the vector. Equal weighting is applied to all dimensions. Data processing systems that use this type of fingerprint implement searches for similarity of new compounds with known existing compounds. [0088]
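A minimal sketch of this representation follows. The five-entry fragment dictionary and the helper names are hypothetical, and the chemistry itself (deciding which fragments occur in a structure) is assumed to have been done already, so the input is simply the set of fragments found.

```python
# Hypothetical fragment dictionary; the element order of every fingerprint
# is keyed to the order of this list.
FRAGMENT_DICTIONARY = ["C=O", "C-N", "C-F", "aromatic ring", "COOH"]

def fragment_vector(fragments_found):
    """Binary vector: 1 where the dictionary fragment is present, else 0."""
    return [1 if frag in fragments_found else 0 for frag in FRAGMENT_DICTIONARY]

def pack_bits(vector):
    """Pack the vector into an integer, element 0 in the least significant bit."""
    bits = 0
    for position, flag in enumerate(vector):
        if flag:
            bits |= 1 << position
    return bits

found = {"C=O", "C-F", "COOH"}  # assumed output of the structure parser
vector = fragment_vector(found)
bits = pack_bits(vector)
print(vector)     # [1, 0, 1, 0, 1]
print(bin(bits))  # 0b10101
```

The list form is convenient for the distance formulas, while the packed integer form is a compact way to store the fingerprint and supports fast bitwise screening.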
Searching using a fragment dictionary is common in chemical database technology. Chemical Abstracts (CAS/STN) uses a dictionary of two thousand fragment keys for a database of approximately ten million chemical compounds. Most commercial databases use a dictionary of between five hundred and one thousand keys for a database of approximately one million to two million chemical compounds. The inventors have reduced the current system to practice. Said system uses a dictionary of 230 fragment keys for a database of approximately seventy thousand compounds. The performance of said system is excellent. [0089]
FIG. 3 is an overview program flowchart showing the computerized method of the invention disclosed herein being used to perform similarity searches in a database of organic chemical compounds. In designing a search query system for a chemical compound database, the following steps must be performed: [0090]
1. Draw the query:—A user draws a chemical structure using a chemical structure drawing package such as ChemDraw, ISISDraw, or CASDraw. The resulting chemical structure, the query structure, is transferred to the program implementing the search. [0091]
2. Fingerprint the query:—Use the dictionary of chemical structure fragments or an algorithm that generates the fragments to characterize the chemical structure. The searching program determines which structure fragments, from the fragment dictionary, are present in the query structure. [0092]
3. Allow the user to adjust the fragment weighting:—The user is shown an electronic form displaying the structure fragments, from the fragment dictionary, that are present in the query structure. For each structure fragment, the form provides a control that allows the user to define the importance of the fragment. The control could be a slider with a numeric scale, a dial with a numeric scale, a text box allowing numeric value entry, or any graphic or text based system that permits the user to interactively assign a weight to a particular structure fragment. Alternatively, the fragment weights may be input from a file. [0093]
4. Run the similarity search:—After the user has assigned the structure fragment weights, the similarity search is performed using a Euclidean distance, the Tanimoto coefficient, or other method of comparing the similarity between two vectors. [0094]
5. Return Results:—The results of the similarity search may be stored for future use. The results are then displayed to the user. In the preferred embodiment, they are shown as a graphical series of compounds sorted in order of decreasing similarity. However, any user-informative display method could be used. [0095]
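Steps 3 through 5 above can be illustrated with a small sketch. It assumes binary fingerprints stored as lists of 0 and 1, and a weighted Tanimoto coefficient in which each product term is multiplied by the user-assigned weight, the same pattern used by the Tanimoto routine in the program listing of this application. The function names and the sample data are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def weighted_tanimoto(a: Sequence[float], b: Sequence[float], w: Sequence[float]) -> float:
    """Weighted Tanimoto coefficient: sum(w*a*b) / (sum(w*a*a) + sum(w*b*b) - sum(w*a*b))."""
    ab = sum(wi * ai * bi for ai, bi, wi in zip(a, b, w))
    aa = sum(wi * ai * ai for ai, wi in zip(a, w))
    bb = sum(wi * bi * bi for bi, wi in zip(b, w))
    denominator = aa + bb - ab
    # Two all-zero vectors share no fragments; report zero similarity instead of dividing by zero.
    return ab / denominator if denominator else 0.0


def similarity_search(query: Sequence[int], weights: Sequence[float],
                      database: Dict[str, Sequence[int]],
                      top_n: int = 50) -> List[Tuple[str, float]]:
    """Score every compound against the query and return the top_n, most similar first."""
    scored = [(cid, weighted_tanimoto(query, fp, weights)) for cid, fp in database.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]
```

With all weights left at 1 the coefficient reduces to the ordinary Tanimoto coefficient. Raising the weight of one fragment moves compounds that share that fragment toward the top of the list, which is the tuning effect the steps describe; the user can change the weights and call the search again as often as needed.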
Using the above method of searching, the search may be biased in a direction defined by the user. The above tunable search process applied to organic chemical compounds is illustrated in FIG. 4 through FIG. 7. FIG. 4 illustrates a computer monitor screen display of the chemical structure of the query compound Trovafloxacin (C₂₀H₁₅F₃N₄O₃) as input through one of the standard chemical drawing packages. FIG. 5 shows the structure of the query compound having been “fingerprinted” using the twelve fragments [fragment structure images not reproduced]. These are shown graphically on the lower portion of the screen. Sliders are shown next to each fragment, all preset to their default values of 1. FIG. 6 shows the same screen where the user has set the sliders for three of the fragments [fragment structure images not reproduced] to 6.5, 7, and 7, respectively. FIG. 7 shows the results of the similarity search. In the figure, the first ten compounds (of fifty) found to be similar to Trovafloxacin are shown arranged in order of decreasing similarity. For example, the compound labeled 1/50 is deemed by the search criteria to be most similar. It differs only by substitution of fluorine (F) for the ethyl (CH₂) grouping. [0101]
FIG. 8 shows a program flow chart for a specific implementation of the program shown in FIG. 3, in which the screens shown in FIG. 4 through FIG. 7 are used. A printed program listing for this system can be found in the APPENDIX attached hereto. The system comprises a MICROSOFT VISUAL BASIC program and an associated ORACLE database. In addition, the ACCORD CHEMISTRY TOOLKIT available from ACCELRYS is used for certain chemistry related functions (primarily substructure matching). [0102]
The ORACLE database requires at least two tables in this implementation of the method: [0103]
Fragment dictionary table containing: [0104]
ID numbers [0105]
Chemical fragment structures in MOLFILE or other chemistry structure format [0106]
Main compound table containing: [0107]
ID numbers [0108]
Chemical structures in MOLFILE or other chemistry structure format [0109]
Chemical structure fingerprints (stored as binary bit string or other numeric format) [0110]
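A minimal sketch of these two tables follows, using SQLite through Python's standard `sqlite3` module as a stand-in for ORACLE. The table names, column names, and column types are assumptions chosen for illustration; the text above specifies only what each table must contain.

```python
import sqlite3

# Hypothetical schema: one row per dictionary fragment, one row per compound.
SCHEMA = """
CREATE TABLE fragment_dictionary (
    id        INTEGER PRIMARY KEY,
    structure TEXT NOT NULL      -- fragment structure, e.g. MOLFILE text
);
CREATE TABLE compound (
    id          INTEGER PRIMARY KEY,
    structure   TEXT NOT NULL,   -- compound structure, e.g. MOLFILE text
    fingerprint TEXT NOT NULL    -- bit string, one character per dictionary fragment
);
"""


def create_database(path: str = ":memory:") -> sqlite3.Connection:
    """Create both tables in a new SQLite database and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Storing the fingerprint next to the structure means a similarity search can read fingerprints alone, without parsing any structure data.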
The VISUAL BASIC program is comprised of Forms, Modules, and Class Modules. [0111]
Forms: [0112]
1. Search (SearchAgent.frm)—This is the main form used in the application. Query input and function execution are primarily handled from this form. [0113]
2. frmTune (Tune.frm)—The form used for tuning the fragment weights used in the chemical tunable search. [0114]
3. frmLogin (Login.frm)—This is a small form used to take database name, user name, and password input from the user, and then use the information to open the ORACLE database. [0115]
Modules: [0116]
1. AccordSDK (ACCSDK50.BAS)—Module from Accelrys containing chemical structure handling routines. [0117]
2. AccordSDK Constraints (ACCSDK50CNST.BAS)—Definitions of constraints used by the chemical structure toolkit. [0118]
3. AccordSDK Fingerprints (ACCSDK50FP.BAS)—Fingerprint handling routines. [0119]
4. AccordSDKOld (ACCSDK50OLD.BAS)—Older versions of routines included for backward compatibility. [0120]
5. AccordSDKX (ACORDX50.BAS)—ActiveX controls to use on forms in conjunction with the rest of the toolkit routines. [0121]
6. Utilities (Utilities.BAS)—General purpose utility functions. [0122]
Class Modules: [0123]
1. cChemDb (cChemDb.cls)—Class for handling chemistry related functions of the program. [0124]
2. cChemUtils (cChemUtils.cls)—Class containing chemistry utilities. [0125]
3. cError (cError.cls)—Error handling and logging class. [0126]
The detailed program execution follows: [0127]
1. User starts the program. [0128]
2. Search.form_load( ) executes: [0129]
Error handler is set up; [0130]
Accelrys Accord license is checked and a new Accord session is created to allow use of the toolkit functions; [0131]
New database connections are set up and an Accord chemistry object is created; [0132]
The active form is displayed to the user. [0133]
3. User clicks “Open DB Connection”: [0134]
cmdOpenDbConnection_click( ) executes; [0135]
A new login form is created and displayed; [0136]
User enters database connection information, username, and password and then clicks OK. [0137]
4. frmLogin.cmdOK_click( ) executes: [0138]
User supplied information is loaded into variables and form is closed. [0139]
5. Search.fLogin_close(Cancel as Integer) executes: [0140]
ORACLE database is opened using the Open method of the mOraCnn(ORACLE connection) object. [0141]
6. Search.mOraCnn_ConnectionComplete executes: [0142]
Status of the ORACLE connection is returned; [0143]
User is notified that DB is open; [0144]
Database record sets are opened and initialized; [0145]
User clicks OK button on status notice. [0146]
7. User clicks Tune button. [0147]
8. Search.cmdTune_click( ) executes: [0148]
Strings containing (fragment) key status information are initialized; [0149]
Tuning form is loaded. [0150]
9. frmTune.Form_load( ) executes: [0151]
Form checks (fragment) key status information and initializes sliders (user weighting controls) if needed. [0152]
10. User double clicks on a structure box—ISIS Draw starts. [0153]
11. User draws or reads a structure into ISIS Draw. [0154]
12. User clicks return box on ISIS Draw and returns structure to program. [0155]
13. frmTune.chmTune_changes( ) event fires: [0156]
The arrays containing (fragment) key information are initialized; [0157]
Any pictures of keys already present on the form are removed; [0158]
If it does not already exist, a chemistry object is created; [0159]
The TunableKeys method of the chemistry object (cChem.TunableKeys) is called; [0160]
For every key found in the query the appropriate members of the key arrays are set; [0161]
Key arrays are returned to the calling routine; [0162]
For every key that has been set in the key arrays, a picture and a slider are loaded and displayed on the form. [0163]
14. The user adjusts the settings of the sliders to adjust the weightings used in the similarity calculations. [0164]
15. The user clicks the Search button. [0165]
16. frmTune.cmdTunableSearch_click( ) executes: [0166]
The values of the sliders are loaded into the tunable key arrays; [0167]
The structure contained in the Tune form is loaded into the query box of the Search form. [0168]
17. Search.cmdTunableProductAnalogySearch_click( ) is called by frmTune: [0169]
If it does not already exist, a chemistry object is created; [0170]
When the object is created, the database connection is established and the record sets are opened; [0171]
The TunableProductAnalogySearch method of the chemistry object is called using the tunable key arrays as input; [0172]
cChem.TunableProductAnalogySearch initialization routines are fired; [0173]
Query structures are searched for certain heterocycles—If the heterocycles are present, copies of the query are made and edited to generate related molecules whose syntheses are related to the initial query—Similarity search will be performed on the query and related synthetically equivalent structures; [0174]
Calculate the similarity values between the query compound(s) and the molecules in the database, sort, and store the top 50 results; [0175]
SearchDone event is raised; [0176]
Search.cChem_SearchDone executes; [0177]
Search complete message is displayed to the user; [0178]
Answers are extracted from the database and displayed on the form. [0179]
18. User clicks “Done” button on Tune form: [0180]
Tune form unloads. [0181]
19. User browses answers and runs another search at his or her discretion. [0182]
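Step 17 above searches with the query and with related, synthetically equivalent structures, then keeps the top 50 results. The text does not say how scores from several query variants are combined, so the sketch below assumes the best score over all variants is kept for each database compound. The names `best_over_variants` and `top_hits` are hypothetical, and the similarity function is passed in by the caller.

```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple

Similarity = Callable[[Sequence[float], Sequence[float], Sequence[float]], float]


def best_over_variants(variants: Sequence[Sequence[float]], target: Sequence[float],
                       weights: Sequence[float], similarity: Similarity) -> float:
    """Assumed combination rule: score a compound by its best match to any query variant."""
    return max(similarity(variant, target, weights) for variant in variants)


def top_hits(variants: Sequence[Sequence[float]], weights: Sequence[float],
             database: Dict[str, Sequence[float]], similarity: Similarity,
             n: int = 50) -> List[Tuple[float, str]]:
    """Return the n highest scoring (score, compound id) pairs, best first."""
    scored = (
        (best_over_variants(variants, fp, weights, similarity), cid)
        for cid, fp in database.items()
    )
    # nlargest keeps only n entries in memory instead of sorting the whole database.
    return heapq.nlargest(n, scored)
```

Using `heapq.nlargest` avoids a full sort when only fifty answers are stored from a much larger database.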
The aforementioned data processing system was also implemented in the C++ and JAVA programming languages in addition to the MICROSOFT VISUAL BASIC implementation shown in the APPENDIX. As described above, several prior art software packages were used in the implementation of the system shown in the APPENDIX. The ORACLE database is well known to those with ordinary skill in the art. It was used only for the implementation discussed herein, and any comparable database management system may be substituted therefor. Similarly, ACCORD allows a user to search through documents and files for chemical structures and reactions. ACCORD CHEMEXPLORER recognizes a wide range of formats—ISIS/DRAW, SKETCHFILES or CHEMDRAW files, MOLFILES, RXNFILES, SD and RD files, MICROSOFT WORD documents, EXCEL spreadsheets or the like. It looks, works and feels like the WINDOWS Finder. The ACCORD utilities are well known. They were used only for the implementation demonstrated herein, and the data processing routines contained in the ACCORD utilities are prior art. Similarly functioning routines may be easily substituted therefor. Finally, ISIS/DRAW was used in this implementation as a means to input a chemical structure into the program. This program is available from MDL®. It is one of many programs of this type. The data processing routines contained in ISIS/DRAW are prior art, and an equivalent utility may be substituted therefor. [0183]
The above-mentioned method may also be used to search for biological data. In this case, the values of the elements in the vector might not be binary. A biological response is a continuous variable. For example, the binding strength of a drug to a particular receptor would have a specific numeric value, and it would be important to express that value in the vector. These measurements are important for drug competition experiments where relative binding strengths are relevant. They are also important for antibody and monoclonal antibody research that involves binding to specific epitope sites. However, the priority or weight that a user would apply to a characteristic such as binding strength for a particular receptor when performing a similarity search is independent of the actual data. Compounds can be described based upon biological response. Plotting the biological response over a series of tests produces a graph possessing a characteristic shape. A database of biological compounds may be probed for those having similar characteristic shapes. Compounds having a similar profile often have similar modes of action. In this case, a weighted search would provide a significant advantage. The inventors have reduced this technique to practice with excellent results by performing such a search with a highly descriptive biological compound model based upon biological response. [0184]
FIG. 9 is an overview program flowchart showing the computerized method of the invention disclosed herein being used to perform similarity searches in a database of biological responses to various compounds. The implementation of the method described herein is in the form of a MICROSOFT EXCEL Spreadsheet with macros performing all of the necessary functions. A source code listing for this implementation appears in a section entitled, “COMPUTER PROGRAM LISTING—TUNABLE BIOLOGICAL SEARCH,” at the end of this application. The data shown in the spreadsheet of FIG. 10 has been separated into three parts, viz., FIGS. 10(a), (b), and (c). The source of the data (hereinafter the LEWI Data) is the paper: Janssen, Paul A. J.; Niemegeers, Carlos J. E.; and Schellekens, Karel H. L.; “Is it Possible to Predict the Clinical Effects of Neuroleptic Drugs (Major Tranquillizers) from Animal Data?—Part I: ‘Neuroleptic activity spectra’ for rats”; from the Janssen Pharmaceutica n.v., Research Laboratoria, Beerse (Belgium), Drug Research, Vol 15, Heft 2, 1965, pp 104-117. A copy of this paper is provided with this application, and is incorporated by reference as non-essential material in its entirety herein. [0185]
The following features are needed: [0186]
1. A row of target biological activity scores to use as a target in the similarity search. In the LEWI Data, there are twelve measured responses. [0187]
2. A row of weighting values to apply to the target biological responses. Weightings are input by the user to indicate the relative importance that the user places on the associated biological test. [0188]
3. A collection of data related to individual compounds and their associated biological responses. For the purposes of this implementation, the data are contained within the same spreadsheet as the target input scores and the target weightings. The LEWI Data set contains data on 40 compounds. [0189]
The program works as follows: [0190]
1. After the user has entered the biological response values, and the associated biological response weightings, the user initiates the calculation by pressing the button for Euclidean Distance or Tanimoto Coefficient. [0191]
2. Using the user-supplied biological activity target values, and user-supplied target weightings, the similarity values are calculated for each compound in the data set. [0192]
3. The calculated similarity values for each compound in the data set are stored. [0193]
4. After the similarity values for all the compounds in the data set have been calculated, the data are sorted in order of decreasing similarity. [0194]
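The four steps above can be sketched for the Euclidean Distance option. As in the Euclidean routine of the program listing, each difference is multiplied by its weight before it is squared. The function names and the sample values are hypothetical, and the data set is held in a plain dictionary instead of spreadsheet rows.

```python
import math
from typing import Dict, List, Sequence, Tuple


def weighted_euclidean(target: Sequence[float], row: Sequence[float],
                       weights: Sequence[float]) -> float:
    """Weighted Euclidean distance: sqrt(sum((w * (t - r)) ** 2)) over all responses."""
    return math.sqrt(sum((w * (t - r)) ** 2 for t, r, w in zip(target, row, weights)))


def rank_by_distance(target: Sequence[float], weights: Sequence[float],
                     data_set: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Compute the distance for every compound, then sort with the smallest distance first."""
    scored = [(name, weighted_euclidean(target, row, weights)) for name, row in data_set.items()]
    # A smaller distance means a closer match, so ascending distance is decreasing similarity.
    scored.sort(key=lambda item: item[1])
    return scored
```

A weight of 0 removes a test from the comparison entirely, while a weight above 1 makes disagreement on that test count more heavily against a compound.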
For the convenience of the user, other features have been added: [0195]
To simplify the entry of target biological activity values, a control box has been set up to allow the user to select a compound from the data set to use as a starting point for data entry. Biological activity values from the selected data set compound are loaded, and the user can then modify the values to suit his or her needs. [0196]
When scrolling through the sorted output data, a graph showing the relationship between the input target and the data set compound data currently selected can be shown. [0197]
The routines are as follows: [0198]
1. cboCompoundNames_change( )—A combo box, cboCompoundNames, is loaded with the names of the compounds in the data set contained within the EXCEL worksheet. When the user selects a compound name from the combo box, the biological data associated with the compound is loaded into the Target area at the top of the spreadsheet. This is purely a convenience for the user, not a critical feature. [0199]
2. cmdEuclidLC50_click( )—This routine calculates the Euclidean Distance between the user-supplied target biological data values, and the biological data values for the compounds in the data set using the appropriate user-supplied weights. Biological data values are sorted according to the calculated Euclidean Distances. [0200]
3. cmdEuclidSpec_click( )—Not used. [0201]
4. cmdTanimotoLC50_click( )—This routine calculates the Tanimoto Coefficient between the user-supplied target biological data values, and the biological data values for the compounds in the data set, using the appropriate user-supplied weights. Biological data values are sorted according to the calculated Tanimoto coefficients. [0202]
5. cmdTanimotoSpec_click( )—Not used. [0203]
6. Worksheet_activate( )—Loads the combo box with the compound names from the data set and resets all weightings to 1. This routine fires when the user opens the spreadsheet. [0204]
7. Worksheet_SelectionChange(ByVal Target As Range)—This routine checks whether the selection is within the data set range. If it is, a chart showing the biological responses of the Target and the selected compound is shown. This routine uses the makeChart routine to create the chart. [0205]
8. makeChart(ByVal row As Long)—This routine creates a chart using the Target biological responses and the biological data from a compound in the data set. This routine is called by Worksheet_SelectionChange. [0206]
FIG. 11 represents four screen prints of the program shown in FIG. 9 operating on the data shown in FIG. 10. The figure is divided into four parts, viz., FIGS. 11(a), (b), (c), and (d). The data for all versions of FIG. 11 are those shown in FIG. 10(a). [0207]
In FIG. 11(a), the cursor is positioned on the Target compound, Aceperone (R3248) butyr (Row 17, Column A). The chart shows the Tanimoto fingerprint for twelve test results on rats on a logarithmic scale. Note the “Euclid LC50” and “Tanimoto LC50” radio buttons. Since the target compound is only being compared with itself, only one fingerprint is shown. [0208]
In FIG. 11(b), the cursor is positioned on Promazine phen (Row 18, Column A). Here, the chart compares two fingerprints. The darker graph is the fingerprint of Aceperone (R3248) butyr while the lighter graph is the fingerprint of Promazine phen. Note how closely the fingerprints of these adjacently sorted compounds resemble each other. [0209]
In FIG. 11(c), the cursor is positioned on Levomepromazine phen (Row 25, Column A). Once again there are two fingerprints being compared where the darker graph is the fingerprint of Aceperone (R3248) butyr and the lighter graph is the fingerprint of Levomepromazine phen. Note here that the two graphs are far less similar than those of FIG. 11(b). [0210]
In FIG. 11(d), the cursor is positioned on Trabuton (R1516) butyr (Row 29, Column A). The darker graph is the fingerprint of Aceperone (R3248) butyr and the lighter graph is the fingerprint of Trabuton (R1516) butyr. Here, the two fingerprint graphs are far less similar than those of FIGS. 11(b) and (c). [0211]
The systems, methods, and programs disclosed herein may be implemented in hardware or software, or a combination of both. Preferably, the techniques are implemented in computer programs executing on programmable computers that each comprise a processor, a storage medium readable by said processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Program code is applied to data entered using the input device to perform the functions described and to generate output information. The output information is routed to one or more output devices. [0212]
Each such computer program is preferably stored on a storage medium or device (e.g., CD-ROM, hard disk, magnetic tape, or magnetic diskette) that is readable by a general or special purpose programmable computer. Said computer program configures and operates the computer when the storage medium or device is read by the computer to perform the procedures described in this application. The system may also be considered to be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner. The present invention may be embodied in computer-readable media, such as floppy disks, ZIP or JAZ disks, conventional hard disks, optical media, CD-ROMS, Flash ROMS, nonvolatile ROM, RAM and any other equivalent computer memory device. It will be appreciated that the system, method of operation and product may vary as to the details of its configuration and operation without departing from the basic concepts disclosed herein. [0213]
Some or all of the functionality may be implemented on an analog computer or device or on a hybrid digital/analog computer. User tuning (i.e., the process whereby weights are assigned to the specific descriptive properties) is an area of the computerized process most applicable to analog processing. The analog processing devices used may be inter alia electrical, mechanical, optical, hydraulic, or any other means for analog processing. Analog-to-digital or digital-to-analog conversion may take place at any step of the process. [0214]

Based upon the disclosure of the systems, processes, methods, and computer programs herein, as well as the foregoing discussion of apparatus considerations, it is apparent that one skilled in the art would be able to implement the present invention on any of the apparatuses or devices mentioned above without undue experimentation.



COMPUTER PROGRAM LISTING - TUNABLE BIOLOGICAL SEARCH

Written to be executed within a Microsoft Excel Spreadsheet, using MS Excel Visual Basic.

Private Sub cboCompoundNames_Change()
    'Copy the data into the appropriate boxes
    Dim i As Long, rowNumber As Long, rng As Range, name As String
    If cboCompoundNames.ListIndex = -1 Then
        'Do nothing, empty box
    Else
        'Get LC50 data
        name = cboCompoundNames.Text
        Set rng = Range("A17:A56").Find(name)
        'Leave the sheet unchanged if the name is not in the data set
        If rng Is Nothing Then Exit Sub
        rowNumber = rng.Row
        'Set the title for the rows in header area
        Me.Cells(3, 1) = CStr(Me.Cells(rowNumber, 1)) & " LC50"
        Me.Cells(5, 1) = "Weighting (0 - 9)"
        For i = 2 To 13
            'Copy LC50 values into row 3
            Me.Cells(3, i) = Me.Cells(rowNumber, i)
            'Copy standard weights into row 5
            Me.Cells(5, i) = 1#
        Next
    End If
End Sub

Private Sub cmdEuclidLC50_Click()
    Dim valueRow As Long, rowCount As Long, columnCount As Long
    Dim SumOfSquares As Double, targetCellValue As Double, testCellValue As Double
    Dim Difference As Double, EuclideanDistance As Double, weight(1 To 20) As Double
    Dim i As Long, j As Long, weightRow As Long, rng As Range
    valueRow = 3
    weightRow = 5
    columnCount = 14
    'Data in rows 17 to 56
    For i = 17 To 56
        SumOfSquares = 0
        'Loop over the twelve response columns (B to M)
        For j = 2 To 13
            weight(j) = Me.Cells(weightRow, j)
            targetCellValue = Me.Cells(valueRow, j)
            testCellValue = Me.Cells(i, j)
            'Weight the difference before squaring it
            Difference = weight(j) * (targetCellValue - testCellValue)
            SumOfSquares = SumOfSquares + (Difference * Difference)
        Next j
        'Take the square root
        EuclideanDistance = Sqr(SumOfSquares)
        'Store the distance in column O (column 15)
        Me.Cells(i, columnCount + 1) = EuclideanDistance
    Next i
    'Now sort the results, smallest distance (most similar) first
    Set rng = Range("A16:P56")
    rng.Select
    rng.Sort Key1:=Range("O16"), Order1:=xlAscending, Header:=xlYes, _
        MatchCase:=False, OrderCustom:=1, Orientation:=xlRows
    Set rng = Range("A17")
    rng.Select
End Sub

Private Sub cmdTanimotoLC50_Click()
    Dim valueRow As Long, rowCount As Long, columnCount As Long
    Dim SumASquared As Double, SumBSquared As Double, SumAtimesB As Double
    Dim i As Long, j As Long, tanimoto As Double, weight(1 To 20) As Double
    Dim weightRow As Long, rng As Range
    valueRow = 3
    weightRow = 5
    columnCount = 14
    'Calculate the Tanimoto Coefficient
    'Use all columns when operating on the untransformed data.
    SumASquared = 0
    For j = 2 To 13
        weight(j) = Me.Cells(weightRow, j)
        SumASquared = SumASquared + weight(j) * _
            (Me.Cells(valueRow, j) * Me.Cells(valueRow, j))
    Next j
    For i = 17 To 56
        SumBSquared = 0
        SumAtimesB = 0
        For j = 2 To 13
            weight(j) = Me.Cells(weightRow, j)
            SumBSquared = SumBSquared + weight(j) * _
                (Me.Cells(i, j) * Me.Cells(i, j))
            SumAtimesB = SumAtimesB + weight(j) * _
                (Me.Cells(valueRow, j) * Me.Cells(i, j))
        Next j
        tanimoto = SumAtimesB / (SumASquared + SumBSquared - SumAtimesB)
        'Store the coefficient in column P (column 16)
        Me.Cells(i, columnCount + 2) = tanimoto
    Next i
    'Now sort the results, largest coefficient (most similar) first
    Set rng = Range("A16:P56")
    rng.Select
    rng.Sort Key1:=Range("P16"), Order1:=xlDescending, Header:=xlYes, _
        MatchCase:=False, OrderCustom:=1, Orientation:=xlRows
    Set rng = Range("A17")
    rng.Select
End Sub

Private Sub Worksheet_Activate()
    Dim i As Long
    cboCompoundNames.Clear
    'Fill the combo box with the compound names
    If cboCompoundNames.ListCount = 0 Then
        For i = 3 To 42
            cboCompoundNames.AddItem Sheet1.Cells(i, 1).Value
        Next i
    End If
    'Fill the weighting cells with the standard value (1.00)
    'Note: the search routines in this listing read their weights from row 5
    For i = 2 To 14
        Me.Cells(4, i) = 1#
    Next
End Sub

Private Sub Worksheet_SelectionChange(ByVal Target As Range)
    'Look to see if we are in the LC50 rows or spectral rows and then put up a chart
    'Static keeps the previously selected row between calls
    Static oldrow As Long
    Dim newrow As Long
    newrow = Target.Row
    If newrow <> oldrow Then
        'Make a chart
        makeChart newrow
    End If
    oldrow = newrow
End Sub

Private Sub makeChart(ByVal row As Long)
    Dim co As ChartObject, cw As Long, rh As Long
    Dim rng As Range, oCell As Range, selection As String
    Dim MinimumValue As Double
    'Get rid of old charts
    If ActiveSheet.ChartObjects.Count > 0 Then
        Do
            ActiveSheet.ChartObjects.Delete
        Loop Until ActiveSheet.ChartObjects.Count = 0
    End If
    If row < 57 And row > 16 Then
        'Charts for LC50 similarities
        selection = "A3:M3, " & "A" & row & ":M" & row
        'Rows("2:2").Select
        'Rows(CStr(row) & ":" & CStr(row)).Select
        'Create column width and row height units
        cw = Columns(2).Width 'In points
        rh = Rows(1).Height
        'Place chart with respect to upper left corner of A1
        ' ( Left, Top, Width, Height )
        Set co = ActiveSheet.ChartObjects.Add(cw * 7.5, rh * 5.5, cw * 7, rh * 18)
        co.Name = "Test Chart"
        'Set the chart type
        'co.Chart.ChartType = xlXYScatterSmooth
        'co.Chart.ChartType = xlLine
        co.Chart.ChartType = xlLineMarkers
        co.Chart.HasLegend = False
        'Attach the data to the chart
        ' Source:=ActiveSheet.Range("B1;I1", selection),
        co.Chart.SeriesCollection.Add _
            Source:=ActiveSheet.Range(selection), _
            Rowcol:=xlRows
        co.Chart.HasTitle = False
        'These are the standard default values
        With co.Chart
            .HasAxis(xlCategory, xlPrimary) = True
            .HasAxis(xlCategory, xlSecondary) = False
            .HasAxis(xlValue, xlPrimary) = True
            .HasAxis(xlValue, xlSecondary) = False
        End With
        'Get the names from the first row, category names missing from scatter plot
        co.Chart.Axes(xlCategory).CategoryNames = _
            ActiveSheet.Range("B1:M1")
        'co.Chart.Axes(xlValue).CrossAt = xlAxisCrossesMinimum 'This doesn't work here
        'MinimumValue = co.Chart.Axes(xlValue, xlPrimary).MinimumScale
        co.Chart.Axes(xlValue, xlPrimary).MinimumScale = -4#
        co.Chart.Axes(xlValue, xlPrimary).MaximumScale = 2#
        co.Chart.Axes(xlValue).CrossesAt = -4#
        co.Chart.Axes(xlValue).HasTitle = True
        'co.Chart.Axes(xlValue).AxisTitle.Orientation = xlHorizontal
        'co.Chart.Axes(xlValue).AxisTitle.Orientation = xlVertical
        co.Chart.Axes(xlValue).AxisTitle.Orientation = xlUpward
        'co.Chart.Axes(xlValue).AxisTitle.Orientation = xlDownward
        co.Chart.Axes(xlValue).AxisTitle.Text = "Log(1/C)"
        'Add a data table to the bottom
        'Data table doesn't appear in scatter plots, does appear in line graphs
        'co.Chart.HasDataTable = True
        'Doesn't affect a line graph, does affect scatter plot
        co.Chart.SeriesCollection(1).MarkerSize = 5
        co.Chart.SeriesCollection(1).MarkerStyle = xlMarkerStyleDiamond
    End If
End Sub

[0216]

Claims

We claim:

1. A method for searching an electronic database of data,

wherein said data has associated with them a set of one or more calculated descriptive properties related to said data; and

wherein said descriptive properties are capable of being expressed in numeric form;

said method comprising:

a) accepting a query datum submitted electronically by a user;

b) electronically calculating a set of one or more descriptive properties of said query datum wherein the descriptive properties of said query datum are capable of being expressed in numeric form, are of the same number and arrangement, and are calculated in the same manner as said descriptive properties of the data in said database;

c) allowing the user to electronically examine the calculated descriptive properties of said query datum;

d) electronically setting a weight for every descriptive property to unity, said weight being an importance value for that particular descriptive property;

e) allowing the user to change said weights for any or all of the descriptive properties to other numeric values, said other numeric values being set at the user's discretion;

f) electronically calculating a similarity value to the query datum for all data in the database according to a method comprising:

factoring in the user assigned weights of the descriptive properties to both the query datum and each datum in the database thereby forming a weighted query datum and a weighted database datum;

computing a quantized vector distance, or equivalent indicator or coefficient, between said weighted query datum and said weighted database datum; and,

assigning said quantized vector distance, or equivalent indicator or coefficient, to the similarity value;

g) presenting a list of data from said database to the user wherein said data is sorted in order of their similarity values;

h) repeating steps f) and g) of this method at the user's discretion as many times as the user desires.

2. The method according to claim 1 wherein a user can assign weights to said descriptive properties by manipulating objects on a computer screen.

3. The method according to claim 2 wherein said objects are sliders with numeric scales.

4. The method according to claim 2 wherein said objects are dials with numeric scales.

5. The method according to claim 2 wherein said objects are text boxes allowing numeric entry.

6. The method according to claim 1 wherein said similarity value is calculated using a weighted Euclidean Distance between said quantized vectors of the descriptive properties.

7. The method according to claim 1 wherein said similarity value is calculated using a weighted Hamming Distance between said quantized vectors of the descriptive properties.

8. The method according to claim 1 wherein said similarity value is calculated using a weighted Soergel Distance between said quantized vectors of the descriptive properties.

9. The method according to claim 1 wherein said similarity value is calculated using a weighted Tanimoto Coefficient between said quantized vectors of the descriptive properties.

10. The method according to claim 1 wherein said similarity value is calculated using a weighted Dice Coefficient between said quantized vectors of the descriptive properties.

11. The method according to claim 1 wherein said similarity value is calculated using a weighted Cosine Coefficient between said quantized vectors of the descriptive properties.
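The measures named in claims 6 to 11 can be sketched as follows. The formulas are the commonly used continuous-valued forms of each measure, applied to vectors that have already been multiplied by the user weights; whether the claimed method weights the vectors in exactly this way is an assumption, and all function names are hypothetical. The coefficient functions assume the weighted vectors are not all zero.

```python
from math import sqrt

def weighted(vector, weights):
    # Assumption: weighting multiplies each property value by its weight.
    return [w * x for w, x in zip(weights, vector)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming(a, b):
    # Continuous form of the Hamming distance (sum of absolute differences).
    return sum(abs(x - y) for x, y in zip(a, b))

def soergel(a, b):
    return hamming(a, b) / sum(max(x, y) for x, y in zip(a, b))

def tanimoto(a, b):
    ab = dot(a, b)
    return ab / (dot(a, a) + dot(b, b) - ab)

def dice(a, b):
    return 2 * dot(a, b) / (dot(a, a) + dot(b, b))

def cosine(a, b):
    return dot(a, b) / sqrt(dot(a, a) * dot(b, b))

# Hypothetical example: three properties, the first weighted double.
weights = [2, 1, 1]
q = weighted([1, 1, 0], weights)   # weighted query datum
d = weighted([1, 0, 1], weights)   # weighted database datum

scores = {
    "euclidean": euclidean(q, d),
    "hamming": hamming(q, d),
    "soergel": soergel(q, d),
    "tanimoto": tanimoto(q, d),
    "dice": dice(q, d),
    "cosine": cosine(q, d),
}
```

The first three measures are distances, where smaller values indicate greater similarity; the last three are coefficients, where larger values indicate greater similarity. A sort by similarity would need to account for that difference.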

12. The method according to claim 1 wherein said data in said database is representative of structures of chemical compounds and wherein said query datum is also representative of the structure of a chemical compound.

13. The method according to claim 12 wherein said query datum is generated by a chemical structure drawing package.

14. The method according to claim 12 wherein said descriptive properties of said database data and query datum are characterized by assigned structure fragments.

15. The method according to claim 14 wherein the numeric values of said descriptive properties are set to either one or zero, a one representing the presence of the structure fragment associated with a particular descriptive property, and a zero representing the absence of said structure fragment.

16. The method according to claim 14 wherein the structure fragments are contained within and referenced in an electronic dictionary.

17. The method according to claim 14 wherein the structure fragments are generated by an algorithm.
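The presence/absence encoding of claim 15, combined with the fragment dictionary of claim 16, can be sketched as follows. The fragment names and the `fingerprint` function are hypothetical, and the sketch assumes that the fragments present in a compound have already been identified by some other means.

```python
# Hypothetical fragment dictionary; each position is one descriptive property.
FRAGMENT_DICTIONARY = ["hydroxyl", "carbonyl", "amine", "phenyl"]

def fingerprint(fragments_present, dictionary=FRAGMENT_DICTIONARY):
    # One for each dictionary fragment found in the compound, zero otherwise.
    present = set(fragments_present)
    return [1 if fragment in present else 0 for fragment in dictionary]

# Example: a compound in which hydroxyl and phenyl fragments were identified.
example = fingerprint({"hydroxyl", "phenyl"})
```

Here `example` is `[1, 0, 0, 1]`: the first and fourth dictionary fragments are present and the other two are absent.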

18. The method according to claim 1 wherein said data in said database is representative of biological activity screening results and wherein said query datum is also representative of biological activity screening results.

19. The method according to claim 18 further comprising:

a) a user entering biological response values for known screening results;

b) a user entering the biological response values for a target query item;

c) a user entering weights for each of the biological response values;

d) a user selectively designating a method to be used to calculate similarity.

20. The method according to claim 1 further comprising storing the sorted calculated similarity results for further use.