WO2003019183A1

WO2003019183A1 - Process for the informative and iterative design of a gene-family screening library

Info

Publication number: WO2003019183A1
Application number: PCT/US2002/026842
Authority: WO
Inventors: Peter D. J. Grootenhuis; Michelle L. Lamb; Erin K. Bradley; Peter L. Myers; William A. Shirley; Daniel Rogers; Angelo J. Castellino; Jennifer L. Miller
Original assignee: Deltagen Research Laboratories, L.L.C.
Priority date: 2001-08-23
Filing date: 2002-08-22
Publication date: 2003-03-06
Also published as: WO2003019183A9

Abstract

A method for designing a gene-family screening library includes defining a gene-family source set including source molecules and/or target structures. The source molecules are selected based on activity towards a pre-determined gene-family, while the target structures include target structures of a predetermined gene-family. Members of a class of the structurally-abstract molecule descriptor are generated. Then an active molecule descriptor space that includes these members and is present in a predetermined number of source molecules in the gene-family source set is analyzed for correlation with target structures in order to identify a group of candidate molecules. Library molecules are then selected creating a gene-family screening library with molecules likely to exhibit activity against any designed targets.

Description

PROCESS FOR THE INFORMATIVE AND ITERATIVE DESIGN OF A GENE-FAMILY SCREENING LIBRARY

CROSS-REFERENCE TO RELATED APPLICATION

[01] The instant patent application is a nonprovisional application of U.S. provisional patent application 60/314,616, filed August 23, 2001 and incorporated by reference for all purposes herein.

BACKGROUND OF THE INVENTION

1. Field of the Invention

[02] The present invention, in general, relates to methods for the design of molecule libraries and, in particular, to methods for the design of gene-family screening libraries.

2. Description of the Related Art

[03] A variety of conventional techniques are known for selecting libraries of molecules for subsequent biological activity screening. These conventional techniques select molecules using a broad range of criteria, some being driven solely by chemical accessibility and reagent availability for combinatorial synthesis, others by molecular "diversity" (sampling of broad chemistry space) with no consideration of medicinal chemistry knowledge, and still others based on a similarity to molecule substructures (e.g., privileged substructures with an affinity for a receptor or enzyme), fragments or ligands that are known to be associated with a desired medicinal property. Such conventional methods for designing molecule libraries are described in, for example, Spellmeyer et al., "Recent Developments in Molecular Diversity: Computational Approaches to Combinatorial Chemistry", Annual Reports in Medicinal Chemistry Review, 34, 287-296 (1999); Meyers et al., "Rapid, Reliable Drug Discovery", Today's Chemist at Work, 6(7), 46-53 (1997); Mason et al., "New 4-Point Pharmacophore Method for Molecular Similarity and Diversity Applications: Overview of the Method and Applications, Including a Novel Approach to the Design of Combinatorial Libraries Containing Privileged Substructures", Journal of Medicinal Chemistry, Volume 42, Number 17, 3251-3264 (1999); and Ajay, G. W. Bemis and M. A. Murcko, "Designing Libraries with CNS Activity", Journal of Medicinal Chemistry, Volume 42, Number 24, 4942-4951 (1999), each of which is hereby fully incorporated by reference.

[04] A drawback of conventional molecule library design techniques, however, is that the broad chemistry spaces used therein are often not unique to a specific target or family of targets and, therefore, the molecules selected for inclusion in a molecule library are not specific to any target or family of targets. Alternatively, libraries may be designed to be focused on a single target, to the exclusion of related targets within a particular gene family. In addition, the use of structurally definite substructures and/or fragments limits the structural breadth, and thus the novelty, of the molecules selected for inclusion in the molecule library.

[05] Still needed in the art is a method for selecting molecules for a library that are unique to a specific target(s), such as all target substances within a given gene-family. In addition, the method should not be limited to the selection of molecules that include structurally definite substructures or fragments.

SUMMARY OF THE INVENTION

[06] Embodiments in accordance with the present invention provide a method for selecting molecules for inclusion in a gene-family screening library that are unique to multiple target substances (e.g., enzymes and/or receptors) of a specific gene-family. In addition, the process may utilize structurally-abstract molecule descriptors, rather than structurally definite substructures or fragments, and is, therefore, expected to provide for the design and/or selection of library molecules with increased structural novelty.

[07] An underlying concept of gene-family libraries is that target substances (e.g., enzymes and receptors) belonging to a predetermined gene family are likely to have similarities in amino acid sequence and, therefore, structure and mechanism of action. Because of this, it is expected that molecules which interact with (i.e., have activity towards) target substances of a predetermined gene family may have certain characteristics (e.g., structurally-abstract molecule descriptors) in common. A library of molecules that share such characteristics would then be expected to exhibit activity toward a multitude of target substances of the predetermined gene family.

[08] An embodiment of a method in accordance with the present invention includes defining a gene-family source set that includes a plurality of source molecules and/or a plurality of target structures (e.g., enzyme and/or receptor structures). The plurality of source molecules that can be included in the gene-family source set are selected using selection criteria that includes activity towards a predetermined gene-family. The target structures that can be included in the gene-family source set are target structures of a predetermined gene-family.

[09] Next, at least one class of structurally-abstract molecule descriptor (e.g., a class of pharmacophore descriptors, a class of shape-feature descriptors, or a class of subshape-feature descriptors) is chosen. In accordance with one embodiment of the present invention, all members of at least one class of structurally-abstract molecule descriptor are then generated. An active molecule descriptor space is then established. Such an active molecule descriptor space includes each of the members of the at least one class of structurally-abstract molecule descriptor that is present in a predetermined number of the plurality of source molecules of the gene-family source set and/or correlated with a predetermined number of the plurality of target structures.

[10] In accordance with an alternative embodiment of the present invention, the active molecule descriptor space may be established by first identifying a predetermined number of the plurality of source molecules of the gene-family source set and/or a predetermined number of the plurality of target structures. Members of the chosen class of the structurally-abstract molecule descriptor which correlate to the predetermined number of gene family source molecules or target structures are then identified.

[11] Once the active molecule descriptor space is established, a group of candidate molecules is subsequently identified. Library molecules for inclusion in a gene-family screening library are then selected from the group of candidate molecules using a specific library design technique, called informative library design, that optimizes coverage of the active molecule descriptor space. As a result of this process, a gene-family screening library that includes library molecules that are very likely to exhibit activity against any of multiple gene-family target substances is designed.

[12] An embodiment of a method for selecting molecules for inclusion in a gene-family screening library comprises defining a gene-family source set that includes a plurality of source molecules selected using selection criteria that includes the criterion of activity towards a predetermined gene-family. At least one class of structurally-abstract molecule descriptor is chosen, and members of the at least one class of structurally-abstract molecule descriptor are generated. An active molecule descriptor space is established, the active molecule descriptor space including each of the members of the at least one class of structurally-abstract molecule descriptor that is present in a predetermined number of the plurality of source molecules of the gene-family source set. A group of candidate molecules is identified, and library molecules are selected for inclusion in a gene-family screening library from the group of candidate molecules, thereby designing a gene-family screening library.

[13] An alternative embodiment of a method for selecting molecules for inclusion in a gene-family screening library comprises defining a gene-family source set that includes a plurality of source molecules, the plurality of source molecules selected from a drug-like molecule database using selection criteria that includes the criterion of in vivo activity towards a predetermined gene-family. At least one class of structurally-abstract molecule descriptor is chosen. Members of the at least one class of structurally-abstract molecule descriptor are generated. An active molecule descriptor space is established utilizing a technique wherein the presence/absence of each of the members of the at least one class of structurally-abstract molecule descriptor in the source molecules is encoded in a matrix of source molecule bit strings, the active molecule descriptor space including each of the members of the at least one class of structurally-abstract molecule descriptor that is present in a predetermined number of the plurality of source molecules of the gene-family source set. A group of candidate molecules is identified. Library molecules are selected for inclusion in a gene-family screening library from the group of candidate molecules using an informative library design technique. This technique may include encoding in an active space bit string members of the at least one class of structurally-abstract molecule descriptor that are included in the active molecule descriptor space; encoding in a candidate bit string the presence/absence of each of the members of the at least one class of structurally-abstract molecule descriptor in each of the group of candidate library molecules; and then ascertaining the overlap of the candidate bit string and the active space bit string.

[14] Another embodiment of a method for selecting molecules for inclusion in a gene-family screening library comprises defining a gene-family source set that includes a plurality of target structures of a predetermined gene family, choosing at least one class of structurally-abstract molecule descriptor, and generating members of the at least one class of structurally-abstract molecule descriptor. An active molecule descriptor space is established, the active molecule descriptor space including each of the members of the at least one class of structurally-abstract molecule descriptor that is correlated with a predetermined number of the plurality of target structures of the gene-family source set. A group of candidate molecules is identified, and library molecules are selected for inclusion in a gene-family screening library from the group of candidate molecules, thereby designing a gene-family screening library.

[15] An embodiment of a method for examining suitability of a candidate molecule as a drug lead comprises defining a gene-family source set that includes a plurality of molecules selected using selection criteria relating to a predetermined gene-family, choosing at least one class of structurally-abstract molecule descriptor, and generating members of the at least one class of structurally-abstract molecule descriptor. An active molecule descriptor space is established, the active molecule descriptor space including members of the at least one class of structurally-abstract molecule descriptor that correlate with a predetermined number of the plurality of molecules of the gene-family source set. A group of the candidate molecules is identified, and library molecules are selected for inclusion in a gene-family screening library from the group of the candidate molecules, thereby designing a gene-family screening library.

[16] A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[17] FIG. 1 is a flow chart illustrating steps in a process according to one exemplary embodiment of the present invention;

[18] FIG. 2 illustrates a portion of a matrix of encoded source molecule bit strings wherein each row is an encoded bit string associated with a source molecule of the gene-family source set and each column is assigned to a member of a class of structurally-abstract molecule descriptors, wherein the light circles represent a "1" bit (when the structurally-abstract descriptor is present in the source molecule) and the darker circles represent a "0" bit (when the structurally-abstract descriptor is absent in the source molecule);

[19] FIG. 3 is a diagram depicting the "space" defined by all members of a class of structurally-abstract molecule descriptors (depicted as a rectangle) and an active molecule descriptor space established therefrom (depicted as a circle disposed within the rectangle);

[20] FIG. 4 is a bar chart illustrating the results of a step of defining a gene-family source set in a process according to one exemplary embodiment of the present invention;

[21] FIG. 5A is a simplified diagram of a computing device for processing information according to an embodiment of the present invention; and

[22] FIG. 5B is an illustration of basic subsystems in the computer system of FIG. 5A.

[23] FIG. 6 is a simplified flowchart showing the steps for applying a gene family screening library created in accordance with the present invention to identify drug candidates.

[24] FIG. 7 shows a simplified schematic diagram contrasting the selection of two screening libraries using Shannon entropy.

DESCRIPTION OF EXEMPLARY EMBODIMENTS OF THE INVENTION

[25] To be consistent throughout the present specification and for clear understanding of the present invention, the following definitions are used:

[26] The term "gene-family" refers to target substances (e.g., receptors or enzymes) for which a family of genes (i.e., a group of genes with similar sequences) codes; and

[27] The term "structurally-abstract molecule descriptors" refers to molecule descriptors (e.g., pharmacophore descriptors, shape-feature descriptors, and subshape-feature descriptors) that are structurally indefinite. The term, therefore, does not include definite molecule substructures (i.e., a completely defined portion of a molecule's structure) or definite molecule fragments.

[28] FIG. 1 is a flow chart illustrating steps in a process 10 for selecting library molecules for inclusion in a gene-family screening library according to an exemplary embodiment of the present invention. Process 10 involves first defining a gene-family source set (as shown at step 12). The gene-family source set includes either a plurality of source molecules, a plurality of target structures, or a combination of source molecules and target structures.

[29] The plurality of target structures that can be included in the gene-family source set includes enzyme and/or receptor structures from a predetermined gene family. The target structures can, for example, be derived experimentally or from homology models of target structures. Examples of experimental techniques useful for deriving target structures include, but are not limited to, x-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and circular dichroism (CD) spectroscopy. The use of homology models is described by Bikker et al., "G-Protein Coupled Receptors: Models, Mutagenesis, and Drug Design", Journal of Medicinal Chemistry Vol. 41, No. 16, pp. 2912-2927 (1998), and by Marti-Renom et al., "Comparative Protein Structure Modeling of Genes and Genomes", Annu. Rev. Biophys. Biomol. Struct. 29, 291-325 (2000), both of which are hereby incorporated by reference for all purposes.

[30] The plurality of source molecules that can be included in the gene-family source set are selected using selection criteria that include, at least, the criterion of activity toward a predetermined gene-family. Such source molecules can be selected, for example, from any suitable corporate, commercial or academic compound collection. The molecules can also be derived from natural sources, or be synthesized as peptides or other molecular forms. Embodiments of gene family screening libraries in accordance with the present invention are typically composed of small molecules having molecular weights of less than 1000 and preferably less than 700, although the present invention is not to be interpreted as limited to any particular maximum size, and may incorporate larger-sized molecules. Furthermore, the molecules can be selected from published literature or other public documents. The use of databases containing drug-like molecules (e.g., the MDL Drug Data Report database ("MDDR"), the World Drug Index [WDI] database and the Comprehensive Medicinal Chemistry database) can beneficially serve to minimize the time required to select the source molecules and provide descriptions of molecule activity. Particularly noteworthy is the MDDR, which is produced by MDL Information Systems, Inc. of San Leandro, California (http://www.mdli.com/mddr.html). A summary of the general and gene family classes of compounds included in the MDDR is presented by Schuffenhauer et al. in "An Ontology for Pharmaceutical Ligands and Its Application for in Silico Screening and Library Design", J. Chem. Inf. Comput. Sci. 42, 947-955 (2002), hereby incorporated by reference for all purposes.
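By way of illustration only, the selection of source molecules by gene-family activity and molecular weight can be sketched as a simple filter. The record layout, the field names and the `select_source_set` helper below are hypothetical and do not reflect the format of the MDDR or of any other database; the 700 and 1000 cutoffs are the weight limits mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRecord:
    """Hypothetical database record; field names are illustrative assumptions."""
    name: str
    gene_families: frozenset  # gene families the molecule is annotated as active toward
    mol_weight: float


def select_source_set(records, gene_family, max_mol_weight=700.0):
    """Keep records annotated as active toward gene_family and lighter than the cutoff."""
    return [
        r for r in records
        if gene_family in r.gene_families and r.mol_weight < max_mol_weight
    ]


# Toy records, invented purely for the example.
records = [
    SourceRecord("mol_a", frozenset({"GPCR"}), 412.5),
    SourceRecord("mol_b", frozenset({"kinase"}), 388.0),
    SourceRecord("mol_c", frozenset({"GPCR", "ion channel"}), 845.2),
]

source_set = select_source_set(records, "GPCR")
print([r.name for r in source_set])
```

Relaxing `max_mol_weight` to 1000 would admit the heavier GPCR-annotated record as well.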

[31] The predetermined gene-family can be any gene-family known to one skilled in the art including, but not limited to: the G-Protein Coupled Receptor (GPCR) gene-family including the chemokine receptor sub-family, the ion channel gene-family including the potassium channel and sodium channel sub-families, the serine protease gene-family, the phosphodiesterase gene-family, the nuclear receptor gene-family or the kinase gene-family. Since an objective of process 10 is the creation of a gene-family screening library, the definition of the gene-family source set distinctively involves either source molecule selection criteria that includes activity towards a predetermined gene-family or target structures of a predetermined gene-family. By defining the gene-family source set in this manner, the gene-family source set, and the active molecule descriptor space that will be subsequently established using the gene-family source set, capture properties that are unique and specific to the predetermined gene-family. This uniqueness and specificity enable processes according to the present invention to create a gene-family screening library that includes a reasonable number of library molecules, that contains library molecules with a high likelihood of having activity toward multiple target substances of the predetermined gene-family of interest, while excluding molecules that are less likely to have such activity. Compiling a gene family screening library in accordance with the present invention may also capture other, non-specific properties not currently known or associated with a specific gene family, but an objective of embodiments of the invention is to enrich the screening family in what is unique and specific to the gene family relative to a variety of possible molecular properties.

[32] Source molecule selection criteria in addition to activity toward a predetermined gene family can, if desired, be used to select a plurality of source molecules for the gene-family source set. For example, the selection criteria can include either a specific level of or general requirement for in vivo or in vitro activity toward the predetermined gene family, and/or molecular weight of less than 1000 or 700. Still another possible selection criterion is molecules that have passed Phase 2 clinical trials. Yet another possible selection criterion is molecules that pass additional filters such as the "rule of five" described by Lipinski et al., in "Experimental and Computational Approaches to Estimate Solubility and Permeability in Drug Discovery and Development Settings", Advanced Drug Delivery Reviews 23, 3-25 (1997), hereby incorporated by reference for all purposes.

[33] Next, at least one class of structurally-abstract molecule descriptor is selected, as shown at step 14. The selection of the class of structurally-abstract molecule descriptors can be based on any suitable structurally-abstract characteristic, feature or property of a molecule. Therefore, structurally-abstract molecule descriptor classes include pharmacophore descriptors, shape-feature descriptors, subshape-feature descriptors, atom path-length descriptors, BCUT descriptors, and other biophysical descriptors (e.g., calculated solubility or logP) known to one skilled in the art. A discussion of BCUT descriptors is given by Pearlman et al. in "Metric Validation and the Receptor-Relevant Subspace Concept", J. Chem. Inf. Comput. Sci. 39, 28-35 (1999), hereby incorporated by reference for all purposes. A variety of other descriptors are presented by Todeschini et al., "Handbook of Molecular Descriptors", volume 11 in the series "Methods and Principles in Medicinal Chemistry", Mannhold et al. Eds., Wiley-VCH, Weinheim, Germany (2000), incorporated by reference in its entirety for all purposes herein.

[34] One skilled in the art will recognize that a pharmacophore comprises a set of relative positions in space which should be occupied by atoms of a specific type. For further description about pharmacophores, reference may be had to Bradley et al., "A Rapid Computational Method for Lead Evolution: Description and Application to alpha-Adrenergic Antagonists", J. Med. Chem. 43, 2770-2774 (2000), and to Y. Martin, J. Med. Chem. V. 35 pp. 2145-2154 (1992), both of which are incorporated herein by reference for all purposes.

[35] Shape matching describes comparison of representations of the overall three dimensional shapes of molecules. A detailed description of approaches to shape matching is presented by Srinivasan et al. in "Evaluation of a Novel Shape-Based Computational Filter for Lead Evolution: Application to Thrombin Inhibitors", J. Med. Chem. 45, 2494-2500 (2002), hereby incorporated by reference for all purposes. One skilled in the art will recognize that shape-features represent a category of structurally-abstract molecule descriptors that can be employed in processes according to the present invention.

[36] Subshape features are three-dimensional representations of specific subshapes that fit within a larger molecular shape. One approach to subshape feature matching is described in a co-pending application titled "Method for Molecular Subshape Similarity Matching", application no. / , (Atty. Docket No. 018590-005710US), which is hereby fully incorporated by reference. Once apprised of the present disclosure, one skilled in the art will recognize that subshape features represent a category of structurally-abstract molecule descriptors that can be employed in processes according to the present invention.

[37] One skilled in the art will also recognize that structurally-abstract molecule descriptors do not explicitly employ known ligand connectivities. Processes in accordance with the present invention are, therefore, expected to select an increased number of novel library molecules in comparison to conventional library design methods that utilize known ligand connectivities and/or fragments derived from known ligands. If structurally-definite molecular substructures (e.g., privileged substructures), structurally-definite molecular fragments, structurally-definite chemical scaffolds or structurally-definite side chains were to be chosen as the molecular descriptors, the active molecular descriptor space established in a subsequent step would be undesirably limited. However, it is to be understood that embodiments in accordance with the present invention may utilize known ligand connectivities or fragments, such as privileged substructures derived from known ligands, to create the gene family screening library. A description of the concept of privileged substructures is presented by Patchett et al., in "Chapter 26: Privileged Structures - An Update", Ann. Rep. Med. Chem. 35, 289 (2000), hereby incorporated by reference in its entirety for all purposes.

[38] Next, in accordance with one embodiment of the present invention, all members of the at least one class of structurally-abstract molecule descriptor are generated, as shown at step 16. For example, if the selected class of structurally-abstract descriptor is the class of 3-point pharmacophores wherein the 3 points are either aromatic ring features, hydrophobic features or positive charge features, with inter-feature distances of 1-15 Angstroms (Å), all possible members of that class are generated. By generating all the members of the class, and taking all these members into consideration during the subsequent establishment of an active molecule descriptor space, processes according to the present invention avoid being undesirably limited to previously recognized and structurally-definite molecular substructures.
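By way of illustration only, the exhaustive generation of the 3-point pharmacophore class described above can be sketched as follows. Several details are assumptions made for the example and are not taken from the specification: the inter-feature distances are discretized into 1 Å bins, geometrically impossible triangles are discarded using the bin values (an approximation), and members that differ only in the order in which the three points are listed are merged into one canonical key. All function names are hypothetical.

```python
from itertools import permutations, product

FEATURES = ("aromatic", "hydrophobic", "positive")
DISTANCE_BINS = range(1, 16)  # assumed 1 Angstrom bins spanning 1-15 Angstroms


def canonical(features, distances):
    """Return an ordering-independent key for a 3-point pharmacophore.

    features  = (f0, f1, f2)
    distances = (d01, d02, d12), the binned distances between the points.
    """
    dist = {(0, 1): distances[0], (0, 2): distances[1], (1, 2): distances[2]}
    keys = []
    for p in permutations(range(3)):
        feats = tuple(features[i] for i in p)
        dists = tuple(
            dist[tuple(sorted((p[a], p[b])))] for a, b in ((0, 1), (0, 2), (1, 2))
        )
        keys.append((feats, dists))
    return min(keys)


def satisfies_triangle(d01, d02, d12):
    """True if three lengths can form a (possibly degenerate) triangle."""
    return d01 <= d02 + d12 and d02 <= d01 + d12 and d12 <= d01 + d02


def enumerate_three_point_pharmacophores():
    """Generate every canonical member of the class under the stated assumptions."""
    members = set()
    for feats in product(FEATURES, repeat=3):
        for dists in product(DISTANCE_BINS, repeat=3):
            if satisfies_triangle(*dists):
                members.add(canonical(feats, dists))
    return sorted(members)


members = enumerate_three_point_pharmacophores()
print(len(members), "members generated")
```

The sorted list of members gives each descriptor a fixed position, which is convenient when the members are later assigned to columns of a bit-string matrix.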

[39] Once all members of the at least one class of structurally-abstract molecule descriptor are generated, an active molecule descriptor space is established, as shown at step 18. In the circumstance that the gene-family source set includes a plurality of source molecules, the established active molecule descriptor space includes each member of the class of structurally-abstract molecule descriptor that is present in (represented in) a predetermined number (e.g., more than ten) of the plurality of source molecules of the gene-family source set. In other words, the active molecule descriptor "space" is defined by the members of the structurally-abstract molecule descriptor class (e.g., pharmacophores and/or subshape-features) that are frequently present in the source molecules. In the circumstance that the gene-family source set includes a plurality of target structures (i.e., enzymes and/or receptor structures), the established active molecule descriptor space includes each member of the class of structurally-abstract molecule descriptor that is correlated with a predetermined number (e.g., two) of the plurality of target structures of the gene-family source set. For example, a member of the class of structurally-abstract molecule descriptors can be considered "correlated" with a receptor of the gene-family source set if the member is correlated with (i.e., corresponds to or is "present" in) a binding site of the receptor. This target-structure approach is further described by Eksterowicz et al., "Coupling Structure-Based Design with Combinatorial Chemistry: Application of Active Site Derived Pharmacophores with Informative Library Design", J. Mol. Graphics Model. 20, 469-477 (2002), hereby incorporated by reference in its entirety for all purposes.

[40] The embodiment illustrated above in connection with FIG. 1 describes establishing an active molecule descriptor space by first generating all members of the at least one class of structurally-abstract molecule descriptor, and then correlating these members with gene family source molecules or target structures. With such an embodiment of a method in accordance with the present invention, it is possible and advantageous to generate all members of a certain class of structurally abstract descriptor, for example a pharmacophore descriptor that is sufficiently constrained by distance and/or feature parameters.

[41] However, it may be difficult or impossible to generate all members of other classes of structurally abstract descriptors, for example more complex pharmacophores or other descriptors based upon shapes or subshapes. Thus in accordance with one alternative embodiment of the present invention, fewer than all members of the class of structurally abstract descriptors may be generated prior to their correlation with gene family source molecules or target structures. Where some but not necessarily all members of the structurally-abstract molecule descriptors are to be generated from the gene family source molecules to establish the active molecule descriptor space, by defining parameters of the requisite correlation between the gene family source molecules/target structures and the members of the descriptor class(es), a relevant limitation on the number of structurally abstract descriptors can be imposed.
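By way of illustration only, the alternative embodiment in which fewer than all members are generated can be sketched by tallying only those descriptors actually observed in the source molecules and then applying the requisite-correlation parameter as a count threshold. The function name and the toy descriptor labels are hypothetical, and the step that perceives descriptors in each molecule is assumed to exist elsewhere.

```python
from collections import Counter


def active_descriptors_from_observed(observed_by_molecule, min_molecules):
    """Return descriptors observed in at least min_molecules source molecules.

    observed_by_molecule maps a molecule name to the descriptors found in it
    (assumed to be produced by a separate perception step, not shown here).
    """
    counts = Counter()
    for descriptors in observed_by_molecule.values():
        counts.update(set(descriptors))  # count each descriptor once per molecule
    return {d for d, n in counts.items() if n >= min_molecules}


# Toy data, invented for the example.
observed = {
    "m1": {"shape_a", "shape_b"},
    "m2": {"shape_a", "shape_c"},
    "m3": {"shape_a", "shape_b", "shape_d"},
}
active = active_descriptors_from_observed(observed, min_molecules=2)
print(sorted(active))
```

Because only observed descriptors are ever stored, the size of the working set is bounded by the source molecules rather than by the full descriptor class.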

[42] With any of the embodiments described above, since the active molecule descriptor space is derived from (a) the source molecules of a specific gene-family source set and/or target structures of a specific gene-family and (b) the members of the class of structurally-abstract molecule descriptors, it is a molecule design space specifically targeted toward multiple target substances within the predetermined gene family, rather than a broad general design space. The active molecule descriptor space is, furthermore, the molecule design space that the library molecules of the gene-family screening library need to satisfy and cover.

[43] An exemplary two-step technique for establishing an active molecule descriptor space is described with reference to FIGs. 2 and 3. First, the presence/absence of each of the members of the at least one class of structurally-abstract molecule descriptor in the source molecules of a gene- family source set is encoded in a matrix of source molecule bit strings, with a bit of " 1 " representing the presence of a member and a bit code of "0" representing the absence of that member. Such a matrix of source molecule bit strings is depicted in FIG. 2 (where the bit code of "1" is depicted as a light circle and the bit code of "0" is depicted as a dark circle). In this matrix, each column is assigned to one of the members of the at least one class of structurally-abstract molecule descriptors. [44] Next, the matrix of source molecule bit strings is analyzed to resolve each of the members of the at least one class of structurally-abstract molecule descriptor that is present in a predetermined number (e.g., more than ten) of the plurality of source molecules. Since the matrix is in an encoded bit format, this procedure can be readily automated using computer- automated and software-based techniques. Each of the members thus resolved establishes (i.e., defines) the active molecule descriptor space. FIG. 3 depicts the relationship between the active molecule descriptor space 100 and a "space" 120 defined by some or all the members of the at least one class of structurally-abstract molecule descriptor. [45] Next, at step 20, a group of candidate molecules is identified. The group of candidate molecules is a subset of all possible molecules that can be chosen by any known means including medicinal chemistry intuition, selection from molecule inventories, synthetic accessibility, or computer design (i.e., virtual libraries of candidate molecules). [46] In certain embodiments of the present invention, the group of candidate molecules is encoded in candidate bit strings. 
As described in detail below, encoding the candidate molecules in this manner allows the presence/absence of each of the members of the at least one class of structurally-abstract molecule descriptor in each of the group of candidate molecules to be encoded in the candidate bit strings. Overlap of the candidate bit string and the active space bit string may subsequently be ascertained and analyzed to aid in the selection of library molecules providing optimal coverage. [47] Subsequently, library molecules for inclusion in a gene-family screening library are selected from the group of candidate molecules using a library design technique that optimizes coverage of the active molecule descriptor space (see step 22 of FIG. 1). The result of this selection is the creation of a gene-family screening library that includes a plurality of library molecules specifically directed toward multiple target substances of the predetermined gene family.
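The two-step technique of paragraphs [43] and [44] can be illustrated with a minimal sketch. The molecule names, the five-column matrix, and the threshold of three are hypothetical toy values, and the function name `active_descriptor_space` is an assumption made for this illustration. The sketch also assumes that "present in a predetermined number" means "present in at least that many source molecules".

```python
from typing import Dict, List


def active_descriptor_space(source_bits: Dict[str, List[int]], min_count: int) -> List[int]:
    """Return the column indices of descriptors present in at least
    `min_count` source molecules (assumed reading of the threshold)."""
    n_descriptors = len(next(iter(source_bits.values())))
    counts = [0] * n_descriptors
    for bits in source_bits.values():
        for column, bit in enumerate(bits):
            counts[column] += bit  # a bit of 1 marks presence of the descriptor
    return [column for column, count in enumerate(counts) if count >= min_count]


# Hypothetical matrix of source molecule bit strings: one row per source
# molecule, one column per structurally-abstract descriptor.
source_bits = {
    "mol_1": [1, 0, 1, 1, 0],
    "mol_2": [1, 1, 0, 1, 0],
    "mol_3": [1, 0, 0, 1, 1],
    "mol_4": [0, 0, 1, 1, 0],
}

active = active_descriptor_space(source_bits, min_count=3)
print(active)
```

In this toy matrix only the first and fourth columns occur in three or more rows, so only those two descriptors would enter the active molecule descriptor space.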

[48] Use of an informative library design technique is especially beneficial in terms of efficiently providing optimal coverage of the active molecule descriptor space and selecting library molecules that will provide the maximum amount of information when used in screening studies. If desired, the coverage of the active molecule descriptor space obtained using an informative library design technique can be beneficially optimized using Shannon entropy, as is discussed in detail below in connection with FIG. 7.

[49] An informative library design technique selects library molecules from the group of candidate molecules that cover the active molecule descriptor space in such a way that a maximum number of conclusions can be drawn from the results of subsequent screening experiments regarding which structurally-abstract molecule descriptors are preferred for molecules to have activity toward the predetermined gene family (hence the term "informative design"). Informative design in this unique context means that the library molecules selected for inclusion in the gene-family screening library sample multiple combinations of the members in the active molecule descriptor space in an overlapping fashion. This overlap allows a maximum number of conclusions to be drawn from gene-family screening experiments conducted using the library molecules. Moreover, knowledge of preferred descriptors provided by gene-family screening experiments on targets may afford the researcher a head-start in designing more specific libraries for screening against the target.

[50] One informative library design technique that can be utilized in processes according to the present invention includes encoding, in active space bit strings, the members of the at least one class of structurally-abstract molecule descriptor that are included in the active molecule descriptor space. Next, the presence/absence of each of the members of the at least one class of structurally-abstract molecule descriptor in each of the group of candidate molecules is encoded in candidate bit strings. The overlap of the candidate bit string and the active space bit string, as well as the overlap of the active bit strings themselves, is subsequently ascertained. These overlaps are subsequently analyzed to aid in the selection of library molecules that provide an optimal coverage of the active molecule descriptor space.
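The overlap step of paragraph [50] can be sketched with bit strings packed into integers. This is only an illustration: the six-bit strings and candidate names are hypothetical, and the pairwise quantity computed here (descriptors of the active space shared by two candidates) is one assumed reading of "the overlap of the active bit strings themselves".

```python
from itertools import combinations
from typing import Dict, List, Tuple


def to_mask(bits: List[int]) -> int:
    """Pack a list of 0/1 bits into an integer, first bit most significant."""
    mask = 0
    for bit in bits:
        mask = (mask << 1) | bit
    return mask


def overlap(a: int, b: int) -> int:
    """Number of descriptors set in both bit strings."""
    return bin(a & b).count("1")


# Hypothetical active space bit string and candidate bit strings.
active_mask = to_mask([1, 1, 0, 1, 1, 0])
candidates: Dict[str, int] = {
    "cand_1": to_mask([1, 0, 0, 1, 0, 1]),
    "cand_2": to_mask([0, 1, 0, 1, 1, 0]),
    "cand_3": to_mask([0, 0, 1, 0, 0, 1]),
}

# Overlap of each candidate bit string with the active space bit string.
active_overlap = {name: overlap(mask, active_mask) for name, mask in candidates.items()}

# Overlap between pairs of candidates, restricted to the active space.
pairwise: Dict[Tuple[str, str], int] = {
    (a, b): overlap(candidates[a] & active_mask, candidates[b])
    for a, b in combinations(sorted(candidates), 2)
}

print(active_overlap)
print(pairwise)
```

A candidate with zero overlap against the active space (here `cand_3`) contributes nothing to coverage, while a nonzero pairwise overlap indicates two candidates that sample the same active descriptor.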
[51] Details regarding informative library design and its application in different contexts and manners than in the present invention can be found in S.L. Teig, "Informative Libraries are More Useful than Diverse Ones," J. Biomolecular Screening, Volume 3, No. 2, pp. 85-88 (1998), incorporated by reference herein for all purposes. The use of an informative library design technique in the selecting step of methods in accordance with the present invention is unique, however, in that it is applied to an active molecule descriptor space and, therefore, results in the selection of library molecules directed toward multiple target substances (e.g., enzymes or receptors) of a predetermined gene-family.

[52] As mentioned above, coverage of the active molecule descriptor space obtained using an informative library design technique can be beneficially optimized using Shannon entropy. Shannon entropy is described generally by Shannon, "A Mathematical Theory of Communication", The Bell System Technical Journal 27, 379-423, 623-656 (1948), incorporated herein by reference for all purposes. Shannon entropy enables quantification of the amount of information present in a collection of data in the form of bit strings. As described above, in embodiments of the present invention the correlation between a structurally-abstract descriptor and a candidate molecule can be expressed in the form of a bit string. Accordingly, coverage of the active molecule descriptor space and the amount of information returned by the gene family library molecules can be maximized when the Shannon entropy (H) of a set of candidate molecules is maximized according to equation (1):

$$H = -\sum_{i=1}^{C} \frac{|c_i|}{E} \ln\frac{|c_i|}{E} \qquad (1)$$

where:

H = the Shannon entropy of the set of molecules with respect to the active molecule descriptor space; E = the number of descriptors in the active molecule descriptor space; C = the number of distinct clusters (i.e., unique descriptor bit string patterns); and |c_i| = the size of cluster i (the number of descriptors having the unique bit string pattern of the cluster).

[53] A simplified example of optimization of coverage of the active molecule descriptor space using Shannon entropy is illustrated schematically in Figure 7. Figure 7 shows two potential libraries 700 and 702 comprising three molecules (rows) selected to cover a molecular descriptor space comprising six structurally-abstract descriptors (columns) A-G.

[54] Each row thus represents the bit string for a molecule, with a "1" indicating the presence of a particular descriptor and a "0" indicating the absence of a particular descriptor in that molecule. Each column represents a bit string for a particular descriptor, again using "1" and "0" to denote the presence and absence of that descriptor in a particular molecule.

[55] The descriptors A-G can be grouped based on the patterns in their bit strings (i.e., columns). Clusters of descriptors are defined based on the unique column bit patterns, with descriptors having the same pattern belonging to the same cluster. Thus in library 700 of Figure 7, molecules 704, 706, and 708 create four unique patterns or clusters 700a-d for the six descriptors A-G. Descriptors B and C have the same pattern, forming cluster 700b having a cluster size (|c_i|) of 2. Descriptors A and D each have unique patterns, forming clusters 700a and 700d, each having a cluster size of 1.

[56] Figure 7 illustrates the difference between optimizing for coverage alone, and optimizing for information and coverage with Shannon entropy. Both library 700 and library 702 cover the molecular descriptor space (i.e., each of the six descriptors A-G occurs at least once in the set of molecules). However, the Shannon entropy of the two libraries 700 and 702 is not the same. The Shannon entropy for library 700 can be calculated using equation (1) based on the four clusters and their sizes as follows:

H = - [(1/6) ln(1/6)] - [(2/6) ln(2/6)] - [(1/6) ln(1/6)] - [(2/6) ln(2/6)]

where the terms in (1/6) ln(1/6) are the contributions of the clusters having a cluster size of one (e.g., clusters 700a and 700d), and the terms in (2/6) ln(2/6) are the contributions of the clusters having a cluster size of two (e.g., cluster 700b). The resulting Shannon entropy for library 700 is 1.33.

[57] The value of H reaches a maximum when each descriptor (column) has a unique pattern, thereby returning the most information. For the active molecular descriptor space shown in Figure 7, the maximum entropy is achieved when the set of molecules 704, 710, and 712 is selected such that there are six unique clusters of size 1, as is demonstrated in library 702. According to equation (1), the entropy for library 702 is then -6 x (1/6) ln(1/6) = 1.79. The increase in entropy of library 702 over library 700 was achieved by selecting different molecules.
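The cluster-based entropy calculation can be sketched as follows. The bit patterns are hypothetical: they are not taken from Figure 7, and were chosen only so that the first library has four column clusters of sizes 1, 2, 1 and 2 and the second has six clusters of size 1. The natural logarithm is assumed, and the names `library_700` and `library_702` are used purely for readability.

```python
import math
from collections import Counter
from typing import List


def shannon_entropy(library: List[List[int]]) -> float:
    """Shannon entropy of a library over its descriptor columns.

    Rows are molecule bit strings; columns with identical bit patterns
    are grouped into one cluster, and each cluster of size |c_i|
    contributes -(|c_i|/E) ln(|c_i|/E), where E is the column count.
    """
    columns = list(zip(*library))            # one tuple per descriptor
    n_descriptors = len(columns)
    cluster_sizes = Counter(columns).values()
    return -sum(
        (size / n_descriptors) * math.log(size / n_descriptors)
        for size in cluster_sizes
    )


# Hypothetical libraries of three molecules over six descriptors.
library_700 = [
    [1, 0, 0, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 0],
]
library_702 = [
    [1, 0, 0, 0, 1, 1],
    [0, 1, 0, 1, 1, 0],
    [0, 0, 1, 1, 0, 1],
]

h_700 = shannon_entropy(library_700)
h_702 = shannon_entropy(library_702)
print(round(h_700, 2), round(h_702, 2))
```

Both toy libraries cover every column at least once, so coverage alone does not distinguish them; only the entropy does, and it is highest when every descriptor column has its own pattern.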

[58] While library coverage and information may be maximized using Shannon entropy alone, it may also be desirable to constrain the library molecules to satisfy certain distributions of desirable molecular properties. Discussion of molecular property distributions associated with desirable pharmacological traits is found in Hann et al., "Molecular Complexity and its Impact on the Probability of Finding Leads for Drug Discovery", J. Chem. Inf. Comput. Sci. (2001) 41:856-864, and Oprea et al., "Is there a difference between Leads and Drugs? A Historical Perspective", J. Chem. Inf. Comput. Sci. (2001) 41:1308-1315, hereby incorporated by reference for all purposes.

[59] For example, it is known that drug-like molecules typically comprise atoms connected by fifteen or fewer rotatable bonds. Members of a particular gene family source set may also be known to comprise at least two negatively charged groups. Thus where a gene family screening library comprising drug-like molecules is to be created in accordance with an embodiment of the present invention, in evaluating the coverage of the library it may be valuable to examine the distribution of these properties (numbers of rotatable bonds, negatively charged groups) amongst the library molecules.

[60] Thus in accordance with one embodiment of the present invention, the effect of molecular distributions on the coverage of the selected library may be expressed by maximizing the library property (M) rather than the Shannon entropy (H). This library property M is calculated according to equation (2):

$$M = H - D \qquad (2)$$

where

M = library property;

H = Shannon entropy of the library; and

D = total cost of the distributions.

The library property M thus reflects both the Shannon entropy of the library (H), and how well the library matches specified property distribution constraints (D). The term D may also be characterized as a penalty where the library fails to match specified property distribution constraints.

[61] The total cost term (D) may be calculated as a summation (Σ) of terms according to equation (3) as follows:

$$D = \sum_{j=1}^{p} \omega_j \sum_{i=1}^{b} (p_d - p_c)^2 \qquad (3)$$

[62] The inner summation of equation (3):

$$\sum_{i=1}^{b} (p_d - p_c)^2$$

calculates the cost of matching a specific property distribution, where p_d = desired fraction of distribution (user-defined); p_c = current fraction of distribution (calculated for selected molecules); and b = number of different bins or property ranges considered in a particular distribution.

[63] The outer summation of equation (3):

$$\sum_{j=1}^{p} \omega_j$$

calculates the relative importance of matching one out of several possible property distributions, where: ω = weight term; and p = number of different properties included in the optimization.

[64] Table 1 gives a simplified example of the calculation of M for a library:

Table 1:

[65] There are two possible desired distributions, number of rotatable bonds and molecular weight (p = 2). There are contributions from each range. For example, there are four ranges (b) of each of the number of rotatable bonds and of the molecular weight, and the desired fraction for each range (p_d) is indicated by the user (in the table). The current value of the distribution, p_c, is determined for the set of selected molecules and also shown in Table 1.

[66] Thus for the example of Table 1, the contribution/penalty for not matching the rotatable bond distribution would be: (0.35-0.25)² + (0.55-0.55)² + (0.10 - 0.10)² + (0.0 - 0.10)² = 0.02

The contribution/penalty for not matching the molecular weight distribution would be: (0.35-0.25)² + (0.50-0.55)² + (0.10 - 0.10)² + (0.05 - 0.10)² = 0.015

[67] Each distribution may be scaled by a weight term (ω) as shown in the outer summation of equation (3). This weight term (ω) allows the user to control the relative importance of the various properties. For example, it might be much more significant to match the molecular weight distribution than to match the number of rotatable bonds. Thus where ω = 0.75 for molecular weight and ω = 0.25 for rotatable bonds, the total cost (D) for Table 1 of matching both of the distributions can be calculated using equation (3) as follows:

D = 0.25(0.02) + 0.75(0.015)

where the first term represents the contribution from the rotatable bonds distribution and the second term represents the molecular weight.
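The cost arithmetic of the Table 1 example can be checked with a short sketch. The per-bin fractions and the weights are the ones quoted in the worked example; which value of each pair is the desired fraction and which is the current fraction is an assumption (the squared difference is the same either way). The function names and the entropy value of 1.0 used for M are hypothetical.

```python
from typing import Dict, List, Tuple

Distribution = Tuple[List[float], List[float]]  # (desired, current) per bin


def distribution_cost(desired: List[float], current: List[float]) -> float:
    """Inner summation: squared mismatch summed over the b bins."""
    return sum((d - c) ** 2 for d, c in zip(desired, current))


def total_cost(distributions: Dict[str, Distribution], weights: Dict[str, float]) -> float:
    """Outer summation: weighted sum of the per-property costs (D)."""
    return sum(
        weights[name] * distribution_cost(desired, current)
        for name, (desired, current) in distributions.items()
    )


def library_property(entropy: float, distributions: Dict[str, Distribution],
                     weights: Dict[str, float]) -> float:
    """Library property M = H - D."""
    return entropy - total_cost(distributions, weights)


# Fractions quoted in the worked example (assignment to desired/current assumed).
distributions = {
    "rotatable_bonds": ([0.35, 0.55, 0.10, 0.00], [0.25, 0.55, 0.10, 0.10]),
    "molecular_weight": ([0.35, 0.50, 0.10, 0.05], [0.25, 0.55, 0.10, 0.10]),
}
weights = {"rotatable_bonds": 0.25, "molecular_weight": 0.75}

d_total = total_cost(distributions, weights)
m_value = library_property(1.0, distributions, weights)  # H = 1.0 is hypothetical
print(round(d_total, 5), round(m_value, 5))
```

With these weights the total cost evaluates to 0.25(0.02) + 0.75(0.015) = 0.01625, so the penalty is small relative to typical entropy values unless the distributions are badly mismatched or the weights are scaled up.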

[68] It is often desirable to know whether the active molecule descriptor space has been adequately covered by the library molecules included in the gene-family screening library and then to expand the number of library molecules if coverage is insufficient. Therefore, at step 24 of FIG. 1, the coverage of the active molecule descriptor space by the library molecules is calculated. If the coverage is deemed sufficient, the process can be halted.

[69] If the coverage is deemed insufficient, then at step 26, another group of candidate molecules not yet represented in the screening library is identified. This group of candidate molecules is identified such that "holes" (i.e., portions of the active molecule descriptor space that are not represented by any of the library molecules) in the active molecule descriptor space are likely to be filled by the proper selection of library molecules therefrom.

Thereafter, additional library molecules for inclusion in a gene-family screening library are selected from this group of candidate molecules using a library design technique (for example, an informative library design technique). This selection of additional library molecules creates a gene-family screening library with an improved (e.g., increased or more informative) coverage of the active molecule descriptor space and therefore a high likelihood of activity versus a family of target substances.
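The coverage calculation and the filling of "holes" can be sketched as below. Molecules are represented as sets of descriptor identifiers, all names and values are hypothetical, and the greedy rule (always add the unused candidate that fills the most holes) is a simple stand-in chosen for illustration, not the informative library design technique itself.

```python
from typing import Dict, Set, Tuple


def covered_descriptors(library: Dict[str, Set[int]], active: Set[int]) -> Set[int]:
    """Active descriptors represented by at least one library molecule."""
    hit: Set[int] = set()
    for descriptors in library.values():
        hit |= descriptors & active
    return hit


def coverage(library: Dict[str, Set[int]], active: Set[int]) -> float:
    """Fraction of the active molecule descriptor space that is covered."""
    return len(covered_descriptors(library, active)) / len(active)


def fill_holes(library: Dict[str, Set[int]], candidates: Dict[str, Set[int]],
               active: Set[int], target: float = 1.0) -> Tuple[Dict[str, Set[int]], Set[int]]:
    """Greedily add candidates until coverage reaches `target` or no
    remaining candidate fills a hole. Returns the library and leftover holes."""
    library = dict(library)
    holes = active - covered_descriptors(library, active)
    while 1 - len(holes) / len(active) < target:
        unused = [name for name in candidates if name not in library]
        best = max(unused, key=lambda name: len(candidates[name] & holes), default=None)
        if best is None or not (candidates[best] & holes):
            break  # nothing left that improves coverage
        library[best] = candidates[best]
        holes = holes - candidates[best]
    return library, holes


active_space = {1, 2, 3, 4, 5, 6}
initial_library = {"lib_1": {1, 2}}
candidate_pool = {"c1": {2, 3}, "c2": {3, 4, 5}, "c3": {6, 9}, "c4": {1}}

new_library, remaining_holes = fill_holes(initial_library, candidate_pool, active_space)
print(sorted(new_library), remaining_holes)
```

In this toy case the initial library covers one third of the active space, and two added candidates close all remaining holes; candidates that only repeat covered descriptors are never selected.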

[70] The steps of calculating the coverage, identifying another group of candidate molecules and selecting additional library molecules can, if desired, be repeated in an iterative manner until a sufficient coverage of the active molecule descriptor space is obtained. This iterative aspect of process 10 is illustrated by the arrow linking steps 28 and 24 in FIG. 1.

[71] Identification of available descriptors from the chosen class which have not been utilized to select the gene family screening library from the candidate molecules may indicate additional molecules to be considered for inclusion in the gene family screening library. For example, if any pharmacophores of the class remain to be covered in the active molecule descriptor space, it is possible to determine the combinations of features contained in each of the remaining pharmacophores. If the majority of these pharmacophores contain a negative-charge feature, then to cover the remaining active molecule descriptor space, only candidate molecules having a negative-charge feature will be useful. Thus a large pool of candidate molecules, such as a virtual library of millions of compounds, could be filtered rapidly to provide additional coverage for the gene family screening library, so that candidate molecules would comprise a smaller pool of negative charge-containing molecules.

[72] If the originally identified pool of candidate molecules cannot provide features of the not-yet-covered active molecule descriptor space, additional candidate molecules meeting this criterion could be sought. For example, where the candidate molecules are drawn from synthetic combinatorial libraries, one criterion for identifying additional candidate molecules is that the candidate molecule must reflect the product of reaction between a combinatorial template and a reagent containing carboxylic acid or a bioisostere thereof (i.e., another functional group recognized as acting biologically like a carboxylic acid group).
Such an additional candidate molecule with the appropriate feature types should still present those features in the appropriate relative locations for the pharmacophore to be "covered". However, knowledge of the pharmacophores and their associated features can direct the search for candidate molecules.
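The feature-based pre-filtering described in paragraph [71] can be sketched in a few lines. The pharmacophores, feature labels, and candidate names below are hypothetical, and the sketch checks only feature types; as noted above, a real candidate would still have to present those features in the appropriate relative locations.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def most_common_feature(pharmacophores: List[Tuple[str, ...]]) -> str:
    """Feature type contained in the largest number of uncovered pharmacophores."""
    counts = Counter(feature for ph in pharmacophores for feature in set(ph))
    return counts.most_common(1)[0][0]


def filter_by_feature(pool: Dict[str, Set[str]], required_feature: str) -> List[str]:
    """Keep only candidates that contain the required feature type."""
    return [name for name, features in pool.items() if required_feature in features]


# Hypothetical pharmacophores not yet covered by the library.
uncovered = [
    ("negative", "donor", "aromatic"),
    ("negative", "acceptor", "hydrophobic"),
    ("negative", "positive", "donor"),
    ("positive", "acceptor", "aromatic"),
]

# Hypothetical candidate pool, keyed by feature types present in each molecule.
pool = {
    "v1": {"negative", "aromatic"},
    "v2": {"positive", "donor"},
    "v3": {"negative", "acceptor", "hydrophobic"},
}

needed = most_common_feature(uncovered)
shortlist = filter_by_feature(pool, needed)
print(needed, shortlist)
```

Because the test is a simple set membership, it could be applied to a very large virtual pool before any more expensive geometric matching is attempted.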

[73] A benefit of processes according to the present invention is that the minimum size of the gene-family screening library can be estimated as the process progresses. For example, if 3000 source molecules are included in the gene-family source set, then it can be deduced that 3000 library molecules could conceivably completely cover the active molecule descriptor space, if those 3000 library molecules include the requisite structurally-abstract molecule descriptors.

[74] Alternatively, the iterative aspect of the present invention also provides an opportunity to estimate the size of the gene-family screening library. For example, if it is calculated that 5000 library molecules provide 29% coverage of the active molecule descriptor space, it can be estimated that if additional candidate molecules covering a similar number of structurally-abstract molecule descriptor characteristics are selected, then the final screening library should contain approximately 17,240 molecules.

[75] In yet another alternative method of estimating the size of the gene-family screening library, if the exact nature of the candidate molecules is known, then the average number of structurally-abstract molecule descriptors represented by each of the candidate molecules can be calculated. The estimated number of library molecules that need to be selected for a gene-family screening library in order to cover the active molecule descriptor space would then be equal to the number of structurally-abstract molecule descriptors in the active molecular descriptor space divided by the average number of structurally-abstract molecule descriptors represented by each of the candidate molecules.
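The two size estimates of paragraphs [74] and [75] reduce to simple ratios, sketched below. The function names are assumptions, and the figures of 1,800,000 active descriptors and 100 descriptors per candidate in the second estimate are hypothetical inputs used only to exercise the formula.

```python
import math


def size_from_coverage(n_selected: int, coverage_fraction: float) -> float:
    """Extrapolate library size, assuming further molecules add coverage
    at the same average rate as those already selected."""
    return n_selected / coverage_fraction


def size_from_average(n_active_descriptors: int, avg_descriptors_per_molecule: float) -> int:
    """Active descriptors divided by the average number of descriptors
    represented by each candidate molecule, rounded up."""
    return math.ceil(n_active_descriptors / avg_descriptors_per_molecule)


# 5000 molecules giving 29% coverage, as in the example above.
estimate_a = size_from_coverage(5000, 0.29)

# Hypothetical inputs for the second estimate.
estimate_b = size_from_average(1_800_000, 100)

print(round(estimate_a), estimate_b)
```

Both estimates implicitly assume that newly selected molecules do not overlap in the descriptors they cover, so in practice they are better read as lower bounds than as predictions.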

[76] Once a gene family screening library has been compiled utilizing embodiments of the present invention, it may be employed to identify lead candidates for drug discovery. FIG. 6 is a flow chart showing the steps of a method 600 for applying a gene family screening library in accordance with the present invention to identify drug leads.

[77] In the first step 602, a gene-family screening library in accordance with an embodiment of the present invention is constructed as described in detail above. The first step 602 of the method 600 shown in FIG. 6 thus corresponds to step 22 or 28 of FIG. 1.

[78] In a second step 604, members of the gene family screening library are procured. Members of the library can be procured in a number of ways. One approach is to synthesize in the laboratory one or more of the molecules comprising the library. Such synthesis can comprise conventional techniques, or more efficiently can employ combinatorial synthesis strategies wherein large numbers of organic compounds are created in parallel by linking chemical building blocks in all possible combinations. Such combinatorial synthesis approaches may involve solid phase synthesis wherein the molecules are anchored to beads, or may involve solution phase synthesis wherein the molecules are present in solution. Either or both solid or solution phase combinatorial synthesis techniques could be utilized to procure members of a gene family screening library created in accordance with embodiments of the present invention.

[79] Another alternative approach for procuring members of the gene family screening library is to purchase existing molecules from commercial sources. Examples of commercial sources of molecules suitable for procuring members of a gene family screening library created in accordance with an embodiment of the present invention include, but are not limited to, Pharmacopeia Inc. of Princeton, New Jersey, Sigma-Aldrich Corp. of St. Louis, Missouri, Maybridge Plc. of Tintagel, Cornwall U.K., Chembridge Corp. of San Diego, CA, and Albany Molecular Research, of Albany, NY.

[80] Moreover, as described in detail below, certain virtual or in silico screening techniques do not require that the molecule of the gene family library be physically created in order for screening to occur. Instead, such in silico screening techniques rely upon an electronic representation of the molecule, which may include a three dimensional orientation of one or more conformers of a molecule. The generation of molecular conformers is described by Smellie et al., "Conformational Analysis by intersection: Ring conformation", Proc. of the 217th Meeting of the ACS, Anaheim (1999), and by Smellie et al., "Conformational Analysis by Intersection", J. Comput. Chem. Vol. _, No. _, pp. (accepted for publication 2002), both of which are incorporated by reference herein.

[81] Two- and three-dimensional molecular representations of gene family library members generated utilizing various software packages may be stored in a number of standardized formats, including but not limited to the SMILES format from Daylight Chemical Information Systems, Mission Viejo, California, and described by Weininger, in "SMILES 1. Introduction and Encoding Rules", J. Chem. Inf. Comput. Sci. 28, 31 (1988), incorporated herein by reference, the MOL2 format by Tripos Inc. of St. Louis, Missouri, the MOL and SDF formats of MDL of San Leandro, California, and the PDB format of the Protein Data Bank, http://www.rcsb.org/pdb/, incorporated herein by reference.

[82] In a third step 606 of flowchart 600, some or all of the procured members of the gene family library in accordance with the present invention can be screened for activity. Examples of screening wherein novel drug leads have been successfully generated from combinatorial libraries include the identification of novel cholecystokinin receptor antagonists, herpes simplex virus inhibitors, carbonic anhydrase II inhibitors, and peroxisome proliferator-activated receptor ligands. See Bunin et al., "Chapter 27. Application of Combinatorial and Parallel Synthesis to Medicinal Chemistry", Ann Rep Med Chem, 34, 267-286 (1999), incorporated by reference herein for all purposes.

[83] Screening of the created gene family library can take the form of biological assays conducted outside of living tissue (in vitro). As is well known to one of skill in the art, examples of assay formats for measurement of enzyme activity or receptor binding include, but are not limited to, electrophoresis, scintillation proximity, ELISAs, immunoprecipitation, western blotting, and bead-based methods.
Examples of detection techniques for application with biological assays include, but are not limited to, the use of time-resolved fluorescence, fluorescence resonance energy transfer (FRET), fluorescence polarization, radioisotopic tracers, and chemiluminescent or colorimetric substrates. Other in vitro screening techniques for use in conjunction with gene family screening libraries created in accordance with the present invention include, but are not limited to, binding assays, enzyme activity assays, and cell-based assays such as functional assays and metabolism assays.

[84] One or more of the screening techniques described above can be performed with different levels of throughput. High-throughput screening of compound libraries is a standard approach in pharmaceutical research to discover new lead compounds for drug design. High-throughput screening typically involves the use of ninety-six or a greater number of wells per plate. Such high-throughput screening methods have discovered novel molecules, dissimilar to known ligands, that nevertheless bind to the target receptor at micromolar or submicromolar concentrations. Examples of the use of high throughput in vitro screening to identify active molecules of a screening library are described by McGovern et al., "A Common Mechanism Underlying Promiscuous Inhibitors from Virtual and High-Throughput Screening", J. Med. Chem. 45, 1712-1722 (2002), and Golebiowski et al., "Lead compounds discovered from libraries", Curr. Opin. Chem. Biol., 5, 273-284 (2001), both of which are incorporated by reference herein for all purposes. Medium or low-throughput formats can also be utilized to screen the gene family libraries created in accordance with embodiments of the present invention.

[85] Alternatively, or in conjunction with in vitro testing, members of a gene family library created in accordance with embodiments of the present invention can be subjected to screening in living tissue (in vivo).
Such in vivo assays include but are not limited to evaluation of a gene family screening library member activity in rodents, dogs, primates, or any other species. This evaluation may include testing of the library molecules in a suitable pharmacological model of a particular disease state, wherein physiological or behavioral changes in an animal are monitored. Such animals may be normal (wild-type) or genetically-modified, or may be subject to a particular experimental protocol. Data produced from in vivo assays may include but is not limited to physical examination, histological (organ/tissue) or behavioral observations, post-mortem examinations, and gene-expression analyses from tissue samples of animals exposed to library molecules. For example, library molecules may effectively reduce the size, weight and/or adipose tissue density of animals fed a high-fat diet, as a model for human obesity and diabetes, or may produce a response associated with reduced anxiety in a behavioral test, or may alter normal gene-expression in a given tissue as a result of interacting with an appropriate biological target.

[86] In addition to in vitro and in vivo testing of members of a gene family screening library created in accordance with embodiments of the present invention, screening "in silico" (within the silicon of the integrated circuits comprising a computer processor or memory) is emerging as an increasingly useful technique. In silico screening, also known as virtual screening, relies upon electronic representations of the molecules in two or three dimensions, rather than upon the physical molecules themselves. In silico screening may permit a researcher to rapidly compare and evaluate similarity between candidate molecules from the library and other structures, such as receptors or other molecules with previously-demonstrated activity against a particular receptor.
While not entirely replacing bioassays that attempt to reproduce the in vitro and in vivo behavior of a molecule in chemical and biological environments, respectively, in silico screening has emerged as a useful tool for drug development. In silico screening is described in general by Terstappen et al. in "In silico research in Drug Discovery", Trends in Pharmacological Sciences, Vol. 22 No. 1 (2001), incorporated by reference herein for all purposes. An example of in silico screening of combinatorial libraries across a gene-family has been described by Aronov et al., "Virtual Screening of Combinatorial Libraries Across a Gene Family: in Search of Inhibitors of Giardia lamblia Guanine Phosphoribosyltransferase", Antimicrob. Agents Chemother., 45, 2571-6 (2001), incorporated by reference herein for all purposes.

[87] Compounds from the gene family screening library evidencing desirable activity in vitro, in vivo, in silico, or in some combination thereof, against one or more members of the gene-family are designated as 'hits', and may be validated and further optimized to identify leads and ultimately, drug candidates and drugs. A typical sequence of screening utilizing maximum efficiency of resources is initial screening of library members in silico, followed by in vitro screening of library members revealed as promising in silico, followed by in vivo screening of library members revealed as promising in vitro. However, this order of testing is not required, and the various techniques could be employed in any order to screen a gene family library created in accordance with one embodiment of the present invention.

[88] While the above-referenced discussion of screening of the members of a gene family library created in accordance with the present invention has focused upon activity regarding enzymes or receptors belonging to the original gene family upon which the library was constructed, this is not required by the present invention. In alternative embodiments, once a gene family screening library has been created according to the present invention, some or all of its members may be screened in vitro, in vivo, and/or in silico against enzymes or receptors not belonging to the gene family around which the screening library was designed. Although hits in such an alternative embodiment may be less frequent, such hits as do occur may be of interest to a researcher and provide a novel starting point for further lead generation and optimization.

[89] Processes according to the present invention can be implemented using computer-automated techniques that involve custom and/or commercial software routines implemented in a single application program or implemented as multiple programs in a distributed computing environment, such as a workstation, personal computer or remote terminal in a client-server relationship.

[90] FIG. 5A is a simplified diagram of a computing device for processing information according to an embodiment of the present invention. This diagram is merely an example which should not limit the scope of the claims herein. One skilled in the art would recognize many other variations, modifications and alternatives. Embodiments according to the present invention can be implemented in a single application program such as a browser, or can be implemented as multiple programs in a distributed computing environment, such as a workstation, personal computer or a remote terminal in a client-server relationship.

[91] FIG. 5A shows a computer system 510 including a display device 520, a display screen 530, a cabinet 540, a keyboard 550, and a mouse 570. Mouse 570 and keyboard 550 are representative "user input devices." Mouse 570 includes buttons 580 for selection of buttons on a graphical user interface device. Other examples of user input devices are a touch screen, light pen, track ball, data glove, microphone, and so forth. FIG. 5A is representative of but one type of system for embodying the present invention. It will be readily apparent to one of ordinary skill in the art that many system types and configurations are suitable for use in conjunction with the present invention. In a preferred embodiment, computer system 510 includes a Pentium class based computer, running Windows NT operating system by Microsoft Corporation. However, the apparatus is easily adapted to other operating systems and architectures by those skilled in the art without departing from the scope of the present invention.

[92] As noted, mouse 570 can have one or more buttons such as buttons 580. Cabinet 540 houses familiar computer components such as disk drives, a processor, storage device, etc. Storage devices include, but are not limited to, disk drives, magnetic tape, solid state memory, bubble memory, etc. Cabinet 540 can include additional hardware such as input/output (I/O) interface cards for connecting computer system 510 to external devices, external storage, other computers or additional peripherals, further described below. FIG. 5B is an illustration of basic subsystems in computer system 510 of FIG. 5A. This diagram is merely an illustration and should not limit the scope of the claims herein. One skilled in the art will recognize other variations, modifications and alternatives. In certain embodiments, the subsystems are interconnected via a system bus 575. Additional subsystems such as a printer 574, a keyboard 578, a fixed disk 579, a monitor 576, which is coupled to a display adapter 582, and others are shown. Peripherals and input/output (I/O) devices, which couple to an I/O controller 571, can be connected to the computer system by any number of means known in the art, such as a serial port 577. For example, serial port 577 can be used to connect the computer system to a modem 581, which in turn connects to a wide area network such as the Internet, a mouse input device, or a scanner. The interconnection via system bus allows a central processor 573 to communicate with each subsystem and to control the execution of instructions from system memory 572 or the fixed disk 579, as well as the exchange of information between subsystems. Other arrangements of subsystems and interconnections are readily achievable by those skilled in the art. System memory and the fixed disk are examples of tangible media for storage of computer programs; other types of tangible media include floppy disks, removable hard disks, optical storage media such as CD-ROMs and bar codes, and semiconductor memories such as flash memory, read-only memories (ROM) and battery-backed memory.

EXAMPLES

[93] To illustrate the present invention, the following exemplary process for the selection of library molecules for inclusion in a G-protein coupled receptor (GPCR) gene-family screening library is detailed.

[94] First, a GPCR gene-family source set including 3321 source molecules was defined. The 3321 source molecules were selected from version 2000.1 of the MDL Drug Data Report database using the selection criteria of (i) membership in a class that exhibits activity towards the GPCR gene-family; (ii) in vivo activity towards the GPCR gene-family; and (iii) a molecular weight between zero and 700.

[95] Activity toward the GPCR gene-family was assumed if a molecule was listed as belonging to any of 114 well-defined GPCR classes. FIG. 4 is a bar chart illustrating the total number of molecules with activity towards a given GPCR gene-family receptor grouping (i) and those that were selected based on criteria (ii) and (iii) above.

[96] Next, two classes of pharmacophore descriptors were chosen. The two classes were 3-point and 4-point pharmacophore descriptors, each of which includes either 3 or 4 features, selected from: (i) at most 2 positive charge features; (ii) at most 2 negative charge features; (iii) hydrogen bond donor features; (iv) hydrogen bond acceptor features; (v) at most 2 hydrophobic features; and (vi) aromatic ring features. The two classes of pharmacophore descriptors each included 25 distance bins in the range of 1.6 Angstroms to 24 Angstroms.

[97] All members of these two classes of pharmacophore descriptors were then generated (enumerated). The members numbered approximately 35,000,000.
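The feature-selection rules of paragraph [96] can be illustrated with a minimal sketch that enumerates only the unordered feature combinations allowed for 3-point and 4-point descriptors. The feature labels and the function name `feature_combinations` are illustrative assumptions, not part of the original disclosure. The sketch deliberately omits the assignment of inter-feature distances to the 25 distance bins and any geometric validity checks, so it does not reproduce the approximately 35,000,000 members reported in paragraph [97].

```python
from itertools import combinations_with_replacement

# Feature types listed in paragraph [96]; the short labels are illustrative.
FEATURES = ("positive", "negative", "donor", "acceptor", "hydrophobic", "aromatic")

# Per-descriptor caps stated in paragraph [96]; features not listed are uncapped.
MAX_COUNT = {"positive": 2, "negative": 2, "hydrophobic": 2}


def feature_combinations(n_points):
    """Return all unordered feature combinations of size n_points that respect the caps."""
    allowed = []
    for combo in combinations_with_replacement(FEATURES, n_points):
        if all(combo.count(feature) <= cap for feature, cap in MAX_COUNT.items()):
            allowed.append(combo)
    return allowed


three_point = feature_combinations(3)
four_point = feature_combinations(4)
print(len(three_point), len(four_point))  # prints: 53 108
```

Each feature combination would then be expanded over the binned inter-feature distances (three distances for a 3-point descriptor, six for a 4-point descriptor) to obtain individual descriptor members.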

[98] An active molecule descriptor space was then established. This active molecule descriptor space included any of the members of the two classes of pharmacophore descriptors that were present in any eleven or more of the 3321 source molecules of the gene-family source set. The establishment of the active molecule descriptor space was facilitated by encoding the presence/absence of each of the members of the two classes of pharmacophore descriptors in the 3321 source molecules in a matrix of source molecule bit strings and then conducting a computer-based analysis of the matrix to establish the active molecule descriptor space. The active molecule descriptor space established in this manner included approximately 1,800,000 members (i.e., individual pharmacophore descriptors) of the two classes of pharmacophore descriptors.
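The thresholding step of paragraph [98] can be sketched as follows, assuming each source molecule is encoded as a Python integer whose bit j is set when descriptor j is present. The function name `active_descriptor_space` and the toy data are hypothetical; the default threshold of eleven molecules comes from the text.

```python
def active_descriptor_space(source_bitstrings, n_descriptors, min_molecules=11):
    """Return indices of descriptors present in at least min_molecules source molecules.

    source_bitstrings: one integer per source molecule; bit j set means
    descriptor j is present in that molecule (an assumed encoding).
    """
    counts = [0] * n_descriptors
    for bits in source_bitstrings:
        for j in range(n_descriptors):
            if (bits >> j) & 1:
                counts[j] += 1
    return {j for j, count in enumerate(counts) if count >= min_molecules}


# Toy matrix: 3 molecules, 4 descriptors, with a lowered threshold of 2.
toy_sources = [0b0111, 0b0101, 0b1001]
toy_active = active_descriptor_space(toy_sources, n_descriptors=4, min_molecules=2)
print(sorted(toy_active))  # prints: [0, 2]
```

For a matrix on the scale described in the example (3321 molecules by roughly 35,000,000 descriptors), a packed or sparse representation would likely be needed; the loop above is meant only to show the logic.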

[99] Next, a group of candidate molecules was defined. These candidate molecules were identified by one skilled in the art from the potential chemistries available. From one set of ~160,000 candidate molecules, a matrix of 5000 library molecules (10 reagent1 x 20 reagent2 x 25 reagent3) was selected to optimize coverage in the active molecule descriptor space. These 5000 library molecules provided 29% coverage of the active molecule descriptor space (calculated as the percent of active molecule descriptor space pharmacophore members that were represented in greater than 10 of the 5000 molecules).

[100] At this point 71% of the active molecule descriptor space remains uncovered, and a second iteration of the process would be performed. In the second iteration, informative design would be used to select another set of library molecules (e.g., 5000) which optimized the coverage of the remaining 71% of the active molecule descriptor space. The candidate pool can be the same or supplemented with additional compounds. After the second selection, the cumulative 10,000 member library would be checked for coverage of the active molecule descriptor space. Another iteration of design would be pursued if the coverage was not complete.
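The coverage calculation of paragraph [99] and the iterative loop of paragraph [100] can be sketched as below. This is a sketch under stated assumptions: each molecule is represented as a frozenset of descriptor indices, and a simple greedy heuristic stands in for the informative library design technique, which is not specified here in enough detail to reproduce. All function names are hypothetical.

```python
def coverage(library, active_space, more_than=10):
    """Fraction of active descriptors present in more than `more_than` library molecules."""
    if not active_space:
        return 0.0
    covered = sum(
        1 for d in active_space
        if sum(1 for mol in library if d in mol) > more_than
    )
    return covered / len(active_space)


def select_batch(candidates, active_space, library, batch_size, more_than=10):
    """Greedy stand-in (an assumption, not informative design): repeatedly pick the
    candidate that hits the most active descriptors still at or below the threshold."""
    counts = {d: sum(1 for mol in library if d in mol) for d in active_space}
    pool = list(candidates)
    batch = []
    for _ in range(min(batch_size, len(pool))):
        best = max(
            pool,
            key=lambda mol: sum(1 for d in mol if d in counts and counts[d] <= more_than),
        )
        pool.remove(best)
        batch.append(best)
        for d in best:
            if d in counts:
                counts[d] += 1
    return batch


def design_iteratively(candidates, active_space, batch_size,
                       target=1.0, max_rounds=10, more_than=10):
    """Add batches until the target coverage is reached or the candidates run out."""
    library = []
    pool = list(candidates)
    for _ in range(max_rounds):
        if coverage(library, active_space, more_than) >= target or not pool:
            break
        batch = select_batch(pool, active_space, library, batch_size, more_than)
        library.extend(batch)
        for mol in batch:
            pool.remove(mol)
    return library


# Toy run: a descriptor counts as covered once any library molecule contains it.
toy_active = {0, 1, 2}
toy_candidates = [frozenset({0}), frozenset({1, 2}), frozenset({0, 1}), frozenset({3})]
toy_library = design_iteratively(toy_candidates, toy_active, batch_size=1, more_than=0)
print(coverage(toy_library, toy_active, more_than=0))  # prints: 1.0
```

In the toy run the first batch covers two of three descriptors and the second batch covers the remainder, mirroring the check-then-iterate pattern described above on a much smaller scale.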

[101] A second experimental example illustrating the utilization of a gene family screening library created in accordance with an embodiment of the present invention is as follows. A GPCR targeted gene family library comprising 13,769 molecules was constructed in a manner similar to the procedure outlined in the first Example. Specifically, the known GPCR ligands collected in the MDDR were used to derive a pharmacophore space comprising 3- and 4-point pharmacophores. Small combinatorial libraries were selected, synthesized, and purified on the basis of 50-60 chemical scaffolds.

[102] Each of the 13,769 molecules of the GPCR gene family library was then screened against the μ-opioid receptor. Of the molecules of the gene family screening library, 357 exhibited activity against the μ-opioid receptor, with activity defined as greater than 50% inhibition at a concentration of 10 μM (micromolar) or less. This translated to a hit rate percentage of (357/13,769) x 100 = 2.6%.

[103] The percentage hit rate obtained utilizing a gene family screening library created in accordance with an embodiment of the present invention may fairly be contrasted with a conventional diverse, drug-like library of 10,560 molecules reported by Poulain et al., "From Hit to Lead, Combining Two Complementary Methods for Focused Library Design, Application to μ Opiate Ligands", Journal of Medicinal Chemistry, 44, 3378-3390 (2001), hereby incorporated by reference for all purposes. Activity of molecules in the conventional library against the μ-opioid receptor resulted in a hit rate of only 1.7%.

[104] Moreover, the enhanced accuracy of screening utilizing the GPCR library created in accordance with an embodiment of the present invention was affirmed by subsequent research. Specifically, the largest number of active molecules of the GPCR gene family library of this example corresponded to a combinatorial synthesis template or scaffold present in the known compound spiroxatrine.
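The hit-rate arithmetic of paragraphs [102] and [103] can be checked with a few lines. The function name `hit_rate_percent` is illustrative; the 1.7% figure is taken as reported in the text rather than recomputed, since the number of hits in the conventional library is not given here.

```python
def hit_rate_percent(hits, screened):
    """Hit rate as a percentage of the molecules screened."""
    return 100.0 * hits / screened


gene_family_rate = hit_rate_percent(357, 13769)
conventional_rate = 1.7  # percent, as reported for the conventional library

print(round(gene_family_rate, 1))  # prints: 2.6
print(round(gene_family_rate / conventional_rate, 2))  # prints: 1.53
```

On these figures the gene family library's hit rate is roughly one and a half times that of the conventional library.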

[105] The MDDR represents a compilation of a consensus in the scientific and patent literature regarding chemical and biological activity of the compounds listed therein. In a prior (year 2000) edition of the MDDR, spiroxatrine was designated as exhibiting biological activity as an antagonist at the dopamine D2 receptor, and activity against the serotonin 5HT1A receptor, both of which are G-protein coupled receptors (GPCR). Therefore, spiroxatrine or a compound of similar structure might be expected to exhibit activity against other members of the GPCR family.

[106] Indeed, a subsequent (year 2001) edition of the MDDR designates spiroxatrine as having activity against the μ-opioid receptor, another member of the GPCR family. The additional biological activity for spiroxatrine-like molecules would thus have been indicated utilizing the embodiment of the gene family screening library constructed in accordance with the present invention. Thus a drug-discovery team focused on developing therapeutic molecules active against the μ-opioid receptor would find the GPCR screening family developed in accordance with an embodiment of the present invention to be a useful starting point for further optimization of molecules active against the μ-opioid receptor. Such a molecule of the gene family screening library that is revealed as active against a particular receptor may perhaps demonstrate other desirable properties or satisfy other necessary criteria as well, leading to the ultimate goal of further developing the candidates revealed from the screening library into viable drugs.

[107] It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. For example, some of the steps can be separated, combined or re-ordered. It is intended that the following claims define the scope of the invention and that methods within the scope of these claims and their equivalents be covered thereby.

Claims

WHAT IS CLAIMED IS:

1. A method for selecting molecules for inclusion in a gene-family screening library, the method comprising: defining a gene-family source set that includes a plurality of source molecules selected using selection criteria that includes the criterion of activity towards a predetermined gene-family; choosing at least one class of structurally-abstract molecule descriptor; generating all members of the at least one class of structurally-abstract molecule descriptor; establishing an active molecule descriptor space, the active molecule descriptor space including each of the members of the at least one class of structurally-abstract molecule descriptor that is present in a predetermined number of the plurality of source molecules of the gene-family source set; identifying a group of candidate molecules; and selecting library molecules for inclusion in a gene-family screening library from the group of candidate molecules, thereby designing a gene-family screening library.

2. The method of claim 1 further including, after the selecting step, the steps of: calculating the coverage of the active molecule descriptor space by the library molecules; identifying another group of candidate molecules; and selecting additional library molecules for inclusion in a gene-family screening library from the another group of candidate molecules using the library design technique, thereby creating a gene-family screening library with an improved coverage of the active molecule descriptor space; and repeating the steps of calculating the coverage, identifying another group of candidate molecules and selecting additional library molecules in an iterative manner until a predetermined coverage of the active molecule descriptor space is obtained.

3. The method of claim 1, further comprising, during the defining step, selecting the plurality of source molecules from a molecule database.

4. The method of claim 3, further comprising, during the defining step, selecting the plurality of source molecules from a molecule database containing drug-like molecules.

5. The method of claim 1, further comprising, during the defining step, selecting the plurality of source molecules using selection criteria that further includes demonstrated in vivo activity towards the gene-family.

6. The method of claim 5, further comprising, during the defining step, selecting the plurality of source molecules using selection criteria that further includes a molecular weight of no more than 700.

7. The method of claim 1, further comprising, during the choosing step, choosing a class of pharmacophore descriptors.

8. The method of claim 1, further comprising, during the choosing step, choosing a class of subshape-feature descriptors.

9. The method of claim 1, wherein the establishing step utilizes a technique wherein the presence/absence of each of the members of the at least one class of structurally-abstract molecule descriptor in the source molecules is encoded in a matrix of source molecule bit strings.

10. The method of claim 9, wherein the selecting step utilizes an informative library design technique wherein the members of the at least one class of structurally-abstract molecule descriptor that are included in the active molecule descriptor space are encoded in an active space bit string; and wherein the presence/absence of each of the members of the at least one class of structurally-abstract molecule descriptor in each of the group of candidate library molecules is encoded in a candidate bit string and the overlap of the candidate bit string and the active space bit string is ascertained.

11. The method of claim 10, wherein the establishing step and the selecting step are accomplished using computer-automated techniques.

12. The method of claim 1, wherein the selecting step uses an informative library design technique that optimizes coverage of the active molecule descriptor space.

13. The method of claim 12, wherein the selecting step optimizes coverage of the active molecule descriptor space based on Shannon entropy.

14. A method for selecting molecules for inclusion in a gene-family screening library, the method comprising: defining a gene-family source set that includes a plurality of source molecules, the plurality of source molecules selected from a drug-like molecule database using selection criteria that includes the criterion of in vivo activity towards a predetermined gene-family; choosing at least one class of structurally-abstract molecule descriptor; generating all members of the at least one class of structurally-abstract molecule descriptor; establishing an active molecule descriptor space utilizing a technique wherein the presence/absence of each of the members of the at least one class of structurally-abstract molecule descriptor in the source molecules is encoded in a matrix of source molecule bit strings, the active molecule descriptor space including each of the members of the at least one class of structurally-abstract molecule descriptor that is present in a predetermined number of the plurality of source molecules of the gene-family source set; identifying a group of candidate molecules; and selecting library molecules for inclusion in a gene-family screening library from the group of candidate molecules using an informative library design technique that includes: encoding, in an active space bit string, the members of the at least one class of structurally-abstract molecule descriptor that are included in the active molecule descriptor space; encoding, in a candidate bit string, the presence/absence of each of the members of the at least one class of structurally-abstract molecule descriptor in each of the group of candidate library molecules; and ascertaining the overlap of the candidate bit string and the active space bit string.

15. A method for selecting molecules for inclusion in a gene-family screening library, the method comprising: defining a gene-family source set that includes a plurality of target structures of a predetermined gene family; choosing at least one class of structurally-abstract molecule descriptor; generating all members of the at least one class of structurally-abstract molecule descriptor; establishing an active molecule descriptor space, the active molecule descriptor space including each of the members of the at least one class of structurally-abstract molecule descriptor that is correlated with a predetermined number of the plurality of target structures of the gene-family source set; identifying a group of candidate molecules; and selecting library molecules for inclusion in a gene-family screening library from the group of candidate molecules, thereby designing a gene-family screening library.

16. The method of claim 15, wherein the defining step defines a gene-family source set that includes a plurality of enzyme structures of a predetermined gene family.

17. The method of claim 15, wherein the defining step defines a gene-family source set that includes a plurality of receptor structures of a predetermined gene family.

18. The method of step 15, wherein the selecting step uses an informative library design technique that optimizes coverage of the active molecule descriptor space.

19. A method for examining suitability of a candidate molecule as a drug lead, the method comprising: defining a gene-family source set that includes a plurality of molecules selected using selection criteria relating to a predetermined gene-family; choosing at least one class of structurally-abstract molecule descriptor; generating members of the at least one class of structurally-abstract molecule descriptor; establishing an active molecule descriptor space, the active molecule descriptor space including members of the at least one class of structurally-abstract molecule descriptor that correlate with a predetermined number of the plurality of molecules of the gene-family source set; identifying a group of the candidate molecules; and selecting library molecules for inclusion in a gene-family screening library from the group of the candidate molecules, thereby designing a gene-family screening library.

20. The method of claim 19 wherein the defining step comprises identifying at least one of a target structure and a source molecule, the target structure defined by the criterion of structural similarity to the predetermined gene family based upon at least one of experimental results and homology studies, and the source molecule defined by the criterion of activity toward the predetermined gene family.

21. The method of claim 20 wherein the experimental results are selected from the group consisting of x-ray crystallization studies, nuclear magnetic resonance (NMR) imaging studies, and circular dichroism (CD) spectroscopy studies.

22. The method of claim 20 wherein defining the target structure comprises defining an enzyme or a receptor structure.

23. The method of claim 20 wherein: the defining step comprises identifying a plurality of target structures and source molecules, the target structures defined by the criterion of structural similarity to the predetermined gene family based upon at least one of experimental results and homology studies, and the source molecules defined by the criterion of activity toward the predetermined gene family; and the step of establishing an active molecule descriptor space comprises including each member of the class of structurally-abstract molecule descriptor that is correlated with a predetermined number of the target structures, and including each member of the class of structurally-abstract molecule descriptor that is present in a predetermined number of source molecules.

24. The method of claim 20, further comprising, during the defining step, selecting the source molecule using selection criteria that further includes a molecular weight of 1000 or less.

25. The method of claim 20, further comprising, during the defining step, selecting the source molecule using selection criteria that further includes demonstrated in vivo activity towards the gene family.

26. The method of claim 19 further including, after the selecting step, the steps of: calculating a coverage of the active molecule descriptor space by the library molecules; identifying another group of candidate molecules; and selecting additional library molecules for inclusion in a gene-family screening library from the other group of candidate molecules using the library design technique, thereby creating a gene-family screening library with an improved coverage of the active molecule descriptor space; and repeating the steps of calculating the coverage, identifying another group of candidate molecules and selecting additional library molecules in an iterative manner until a predetermined coverage of the active molecule descriptor space is obtained.

27. The method of claim 19 further including, after the selecting step, the steps of: identifying portions of the active molecule descriptor space not represented by the library molecules; and selecting additional library molecules for inclusion in the gene-family screening library from candidate molecules correlating with descriptors of the class not previously substantially relied upon, thereby creating a gene-family screening library with an improved coverage of the active molecule descriptor space; and repeating the steps of identifying portions of the active molecule descriptor space and selecting additional library molecules in an iterative manner until a predetermined coverage of the active molecule descriptor space is obtained.

28. The method of claim 27 wherein the candidate molecules correlating with the descriptors not substantially relied upon are supplemental to the original group of candidate molecules.

29. The method of claim 19, further comprising, during the choosing step, choosing a class selected from the group consisting of pharmacophore descriptors, atom path-length descriptors, biophysical descriptors, BCUT descriptors, shape descriptors, subshape descriptors, and shape-feature descriptors.

30. The method of claim 19, wherein the establishing step utilizes a technique wherein the presence/absence of each of the members of the at least one class of structurally-abstract molecule descriptor in the source molecules is encoded in a matrix of source molecule bit strings.

31. The method of claim 30, wherein the selecting step utilizes an informative library design technique wherein the members of the at least one class of structurally-abstract molecule descriptor that are included in the active molecule descriptor space are encoded in an active space bit string; and wherein the presence/absence of each of the members of the at least one class of structurally-abstract molecule descriptor in each of the group of candidate library molecules is encoded in a candidate bit string and the overlap of the candidate bit string and the active space bit string is ascertained.

32. The method of claim 19, wherein the selecting step uses an informative library design technique that optimizes coverage of the active molecule descriptor space.

33. The method of claim 32, wherein the selecting step optimizes coverage of the active molecule descriptor space based on Shannon entropy.

34. The method of claim 33, wherein the selecting step optimizes coverage of the active molecule descriptor space based on Shannon entropy discounted by consideration of an actual distribution of molecular properties of the library as compared with a desired distribution of the molecular properties of the library.

35. The method of claim 19, wherein the identifying step comprises identifying a subset of all possible molecules by a criterion selected from the group consisting of medicinal chemistry intuition, availability in existing molecule inventories, synthetic accessibility, and computer design.

36. The method of claim 19, further comprising procuring at least one of the library molecules.

37. The method of claim 36, wherein the procuring step comprises obtaining a physical sample of the molecule.

38. The method of claim 37, wherein the physical sample of the molecule is obtained through the technique selected from the group consisting of purchasing the physical sample from a commercial vendor, isolating the physical sample from a natural source, and synthesizing the molecule in the laboratory.

39. The method of claim 38, wherein the molecule is synthesized utilizing combinatorial chemistry techniques.

40. The method of claim 36, wherein the procuring step comprises: generating a three-dimensional representation of a conformer of the library molecule in space; and storing the three-dimensional representation of the conformer in a computer-readable storage medium.

41. The method of claim 36 further comprising screening the procured gene family library molecule for activity.

42. The method of claim 41, wherein the screening step comprises in vitro testing of the procured gene family library molecule for activity toward the predetermined gene family.

43. The method of claim 41, wherein the screening step comprises in vitro testing of the procured gene family library molecule for activity toward other than the predetermined gene family.

44. The method of claim 41, wherein the screening step comprises in vivo testing of the procured gene family library molecule for activity toward the predetermined gene family.

45. The method of claim 41, wherein the screening step comprises in vivo testing of the procured gene family library molecule for activity toward other than the predetermined gene family.

46. The method of claim 41, wherein the screening step comprises comparison of a three-dimensional representation of a conformer of the gene family library molecule with a three-dimensional representation of a member of the predetermined gene family.

47. The method of claim 41, wherein the screening step comprises comparison of a three-dimensional representation of a conformer of the gene family library molecule with a three-dimensional representation of a member of other than the predetermined gene family.