US20060161407A1 - Modeling biological effects of molecules using molecular property models
- Publication number
- US20060161407A1 (US 11/304,209)
- Authority
- US
- United States
- Prior art keywords
- model
- training
- test molecule
- molecular property
- molecule
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Abstract
Description
- This application claims the benefit of U.S. provisional application, Ser. No. 60/636,645 filed on Dec. 16, 2004, which is incorporated herein by reference in its entirety. Also, this application is related to the following commonly assigned U.S. patent applications: “Methods For Molecular Property Modeling Using Virtual Data” (Ser. No. 11/074,587 filed on Mar. 8, 2005), “Estimating the Accuracy of Molecular Property Models and Predictions” (Ser. No. 11/172,216 filed on Jun. 29, 2005), and “Molecular Property Modeling using Ranking” (Ser. No. 11/172,215 filed on Jun. 29, 2005), each of which is incorporated herein by reference in its entirety.
- 1. Field of the Invention
- Embodiments of the present invention are generally related to machine learning. More specifically, embodiments of the present invention are related to machine learning techniques used to predict the biological effects of a molecule.
- 2. Description of the Related Art
- Molecules are continually being introduced into the marketplace or environment (e.g., industrial detergents, industrial discharge, pharmaceuticals and cosmetics). Sometimes, such molecules may have unknown or undesirable biological effects (e.g., they may have some level of toxicity on humans, flora or fauna). It is of great benefit for organizations introducing such molecules, and for society in general, to anticipate such effects as early as possible. In this way it may be possible to take remedial action (e.g., not introducing the molecule, re-designing the molecule to remove the effect, or limiting the introduction of the molecule). Also, it is possible to identify molecules that have desirable biological effects, and the pharmaceutical industry spends billions of dollars each year to test and identify potentially useful molecules.
- The high-level effects of a molecule, both desirable (e.g., anti-inflammatory) and undesirable (e.g., toxicity), are overwhelmingly related to some lower-level biochemical pathway. More specifically, high-level effects often result from the interaction of a molecule with a binding site on a protein present in some biochemical pathway. A high-level effect of a molecule may also result from the interaction of the molecule with multiple proteins in multiple pathways. In many cases, the particular protein(s) and pathway(s) may not be fully known or understood, even though the correlation between a high-level effect and the molecule may be well documented. For example, gabapentin (Neurontin™) from Parke-Davis/Pfizer is used to treat epilepsy and neuropathic pain; however, the protein targets underlying the actions of this compound are unknown.
- Currently, two general approaches are used for identifying the high-level effects of a molecule. The first is to perform laboratory experiments using the molecule. The effects of the molecule may also be analyzed in various clinical trials, including trials with human subjects. For example, the pharmaceutical testing required by the United States Food and Drug Administration requires that a variety of clinical studies be performed before a molecule may be distributed for medical purposes. However, one drawback to this approach is that physical laboratory experiments and clinical trials are typically both costly and time consuming, making them prohibitive to perform for more than a limited number of candidate molecules. Accordingly, this approach is often used only after identifying a candidate molecule as being potentially beneficial.
- A second approach is to perform in silico simulations configured to generate predictions about the properties of a molecule. The term “in silico” is used to reference simulations performed using computer software applications that model the real-world behavior of the molecule. The simulation may be based on the physical characteristics of the molecule (e.g., structure, molecular weight, electron density, etc.) and the characteristics of the simulated environment (e.g., the shape, position and characteristics of a particular protein receptor). Thus, an in silico simulation may be used to simulate the interaction between a molecule and a single protein target. The output of the simulation may include a prediction regarding a biological effect or property of the molecule, e.g., the binding affinity of the molecule against the protein target. Models have been developed that can predict these kinds of low-level properties with a reasonable degree of accuracy. However, the accuracy of in silico simulations used to predict high-level effects has typically been very poor. Thus, even though some protein/molecule interaction may be known to be related to an observed high-level effect, no one has yet been able to bridge the gap between using an in silico simulation to predict a low-level activity regarding a molecule and using an in silico simulation to predict whether a molecule is likely to have a given high-level effect when introduced into a biological system (e.g., a human individual).
- The state of the art in in silico prediction for low-level effects is to construct models based on a topological representation of a molecule, or based on simple three-dimensional models of a molecule. For example, current in silico simulations typically rely on data that may include the position, orientation, or electrostatic properties of the molecule in 3D space. This approach, however, has typically resulted in inaccurate predictions regarding high-level biological effects. A number of reasons may account for this. For example, the representation of the molecule may be too high-dimensional for the high-level effect being modeled; too few data points may be used to model a high-level effect; or the representation may fail to capture the relevant information (e.g., the “cause” of the biological effect is not a property, or function, of the orientation or electrostatic properties of the molecule). These and other shortcomings may all contribute to the poor results obtained from current in silico simulations.
- Accordingly, there remains a need for improved techniques for predicting the biological effects of molecules in general, and for modeling biological effects that may result from the interaction between a test molecule and a biological system.
- The present invention generally provides methods, systems, and articles of manufacture for modeling the biological effects of molecules. Embodiments of the invention predict the biological effect of a molecule of interest using a molecular properties model configured using machine learning techniques.
- One embodiment of the invention provides a method for using a machine-learned meta-model to generate a prediction regarding a biological effect of a test molecule. The method generally includes training a plurality of molecular property models using a first set of training data, wherein each trained molecular property model is configured to generate a prediction regarding a property of interest of a test molecule modeled by each respective molecular properties model. In one embodiment of the invention, the molecular property model is a single target activity model. The method generally further includes training the meta-model using the set of training data, wherein the trained meta-model is configured to generate the prediction regarding the biological effect of the test molecule from the predictions generated for the test molecule by each of the plurality of trained molecular property models.
- Once trained, the meta-model may be used to generate a prediction for a test molecule. Generally, such a prediction is obtained by selecting the test molecule, generating a representation of the test molecule appropriate for the plurality of molecular property models, and providing the representation of the test molecule to the molecular property models to obtain the prediction regarding the test molecule from each of the molecular property models. The predictions are then supplied to the meta-model, which generates a prediction for the test molecule regarding the biological effect.
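The prediction flow just described can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from the specification: the function name `predict_effect`, the two stand-in property models, and the stand-in meta-model are all hypothetical, and each "model" is simply a callable that has already been trained elsewhere.

```python
from typing import Callable, Sequence

Features = Sequence[float]
PropertyModel = Callable[[Features], float]       # e.g. a predicted binding affinity
MetaModel = Callable[[Sequence[float]], float]    # predicted biological effect


def predict_effect(features: Features,
                   property_models: Sequence[PropertyModel],
                   meta_model: MetaModel) -> float:
    """Run every property model, then feed their outputs to the meta-model."""
    property_predictions = [model(features) for model in property_models]
    return meta_model(property_predictions)


# Hypothetical stand-ins for already-trained models (not real affinity models).
target_1 = lambda f: 0.8 * f[0] + 0.2 * f[1]
target_2 = lambda f: 0.5 * f[1]
effect_model = lambda preds: 1.0 if sum(preds) > 1.0 else 0.0

test_molecule = [1.0, 1.0]   # toy descriptor vector for one test molecule
effect = predict_effect(test_molecule, [target_1, target_2], effect_model)
```

The meta-model in this sketch never sees the molecule directly; it only sees the list of property predictions, which is the simplest reading of the summary above.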
- Each of the molecular property models and the meta-model may be “trained” by performing a selected machine learning algorithm, although the same algorithm need not be performed by each model. Representative machine learning algorithms include a classification learning algorithm, a kernel-based learning algorithm, a Boosting algorithm, a RankBoost algorithm, an Alternating Decision Trees algorithm, a Support Vector Machines algorithm, a Perceptron algorithm, Winnow, a Hedge algorithm, decision trees, neural networks, genetic algorithms, or a genetic programming algorithm.
- So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. The appended drawings, however, illustrate only typical embodiments of this invention and are, therefore, not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.
-
FIG. 1 illustrates a computing environment that may be used to implement an embodiment of the invention. -
FIG. 2 is a block diagram illustrating a molecular properties model, according to one embodiment of the invention. -
FIG. 3 illustrates a meta-model configured to learn from a hierarchy of single-target activity models, according to one embodiment of the invention. -
FIG. 4 illustrates a method for training a meta-model using the output from a plurality of single target activity models, according to one embodiment of the invention. -
FIG. 5 illustrates a method for generating a prediction related to a biological effect of a molecule using a meta-model trained according to the method illustrated in FIG. 4, according to one embodiment of the invention.
- Generally, machine learning techniques may be used to develop a software application (referred to as a model) that improves its ability to perform a task as it analyzes more data related to the task. Often, the task is to predict an unknown attribute or quantity from known information (e.g., the binding affinity of a molecule against a specific protein target). Typically, a machine learning model is trained using a set of training examples. Each training example may include an example of an object, along with a value for the otherwise unknown property of the object (e.g., a representation of a molecule and a known binding affinity for the molecule and a protein). By processing a set of training examples that includes both an object and a property value for the object, the model “learns” what attributes or characteristics of the object are associated with a particular property value. This “learning” may then be used to predict the property or to predict a classification for other objects.
- In the fields of bioinformatics and computational chemistry, machine learning applications may be used to develop models of various molecular properties. Oftentimes, such models may be developed to predict whether a particular molecule will exhibit a property of interest being modeled. For example, models may be developed to model biological properties such as pharmacokinetic or pharmacodynamic properties, physiological or pharmacological activity, toxicity or selectivity. Other examples of properties of interest that may be modeled include models that predict chemical properties such as reactivity, binding affinity, or properties of specific atoms or bonds in a molecule, e.g. bond stability. Similarly, models may be developed that predict physical properties such as the melting point or solubility of a substance. Further, molecular models may be developed that predict properties useful in physics-based simulations such as force-field parameters or the free energy states of different possible conformations of a molecule.
- The training examples used to train a molecular properties model may include a description for a molecule (e.g., the atoms and bond structure of a particular molecule) and data regarding a property of interest for the molecule. Collectively, the training examples are referred to as a “training set” or as “training data.” Data regarding the property of interest may include (i) a value from a continuous range (e.g., the solubility of a molecule at a solute temperature or the known binding affinity between the molecule in the example and the protein target being modeled), or (ii) a label asserting presence or absence of the property of interest relative to the molecule included in the training example. Another form of a training example includes a ranking of two or more molecules. A ranking is used to order two or more molecules relative to the property being modeled. Detailed examples of ranking techniques used in a machine learned molecular property model are described in commonly assigned U.S. patent application, filed on [date] titled “Molecular Property Modeling Using Ranking.”
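The three forms of training example named above (a continuous value, a presence/absence label, and a ranking) could be represented as follows. This is only a sketch: the class names are hypothetical, a SMILES string is assumed as the molecule description, and the numbers and labels are placeholders rather than measured data.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class ValueExample:
    """Molecule paired with a value from a continuous range (e.g. solubility)."""
    molecule: str      # assumption: a SMILES string stands in for the description
    value: float


@dataclass(frozen=True)
class LabelExample:
    """Molecule paired with a label asserting presence or absence of the property."""
    molecule: str
    has_property: bool


@dataclass(frozen=True)
class RankingExample:
    """Two or more molecules, ordered relative to the property being modeled."""
    ordered_molecules: Tuple[str, ...]

    def __post_init__(self):
        if len(self.ordered_molecules) < 2:
            raise ValueError("a ranking needs at least two molecules")


TrainingExample = Union[ValueExample, LabelExample, RankingExample]

training_set = [
    ValueExample("CCO", 0.42),              # placeholder number, not measured data
    LabelExample("c1ccccc1", False),        # placeholder label
    RankingExample(("CCO", "CCN", "CCC")),  # placeholder ordering
]
```

Keeping the three forms as separate types makes it explicit which kind of supervision a given learning algorithm is expected to consume.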
- The molecules included in the training data may be selected from molecules with a known value for the property being modeled. The known value may be based on experimentation, simulation, analysis, or even reasonable assumptions regarding the property being modeled. In one embodiment, assumed values may be used for one or more of the molecules represented in the training data. Detailed examples of using assumed values for some activity measurements are described in a commonly owned co-pending U.S. patent application, Ser. No. 11/074,587 titled “Methods for Molecular Property Modeling Using Virtual Data.”
- The training set is then used to train a molecular properties model. In one embodiment, the model performs a selected machine learning algorithm using the training set. Once trained, the model may be used to generate a prediction about a test molecule, relative to the property of interest. For example, the model may be configured to predict the binding affinity of a test molecule with a protein target represented by the model. In this example, the binding affinity is the property of interest. When a representation of the test molecule is supplied to the trained model, the output may comprise a prediction regarding the value of the property being modeled for the test molecule. The predictions may take the form of a value from a continuous range of values, a discrete value, or a ranking of two or more molecules, relative to the property of interest.
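As a concrete illustration of training one such model and then querying it, the sketch below uses the Perceptron algorithm, one of the algorithms the specification lists. The three-bit "fingerprints" and their labels are made up for the example, and the function names `train_perceptron` and `predict` are hypothetical.

```python
from typing import List, Sequence, Tuple


def train_perceptron(examples: Sequence[Tuple[Sequence[float], int]],
                     epochs: int = 20) -> List[float]:
    """Learn weights (bias last) from (features, label) pairs, labels in {-1, +1}."""
    n_features = len(examples[0][0])
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        for features, label in examples:
            x = list(features) + [1.0]                  # append constant bias input
            score = sum(w * xi for w, xi in zip(weights, x))
            if label * score <= 0:                      # misclassified: update weights
                weights = [w + label * xi for w, xi in zip(weights, x)]
    return weights


def predict(weights: Sequence[float], features: Sequence[float]) -> int:
    """Return +1 or -1 for a molecule representation."""
    x = list(features) + [1.0]
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else -1


# Toy training set: made-up 3-bit fingerprints; +1 = "binds", -1 = "does not bind".
training_set = [
    ([1, 0, 0], +1),
    ([1, 1, 0], +1),
    ([0, 1, 1], -1),
    ([0, 0, 1], -1),
]
learned = train_perceptron(training_set)
prediction = predict(learned, [1, 1, 1])   # a test molecule not in the training set
```

This variant outputs a discrete label; a model predicting a continuous value or a ranking would need a different learning algorithm and output type.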
- Embodiments of the invention harness the predictions generated by a plurality of these models (referred to herein as “single target activity models”) by using the output of these models as input for a meta-model. The single target activity models are a type of molecular properties model. As used herein, a single target activity model refers to a molecular properties model configured to predict properties such as the activation or inhibition properties of a molecule against a protein, whether a molecule will bind to any (or to a specific) receptor on a protein, or combinations of these properties. Other forms of molecular property models may be used to generate predictions that are used by the meta-model.
- Oftentimes, a biological effect may have many different underlying causes. For example, a risk of heart attack may be affected by interfering with the HERG K+ protein, increasing blood pressure, or by increasing the risk of blood clots. At the molecular level, these biological effects are overwhelmingly caused by the interaction of a molecule with a protein target, or targets, present in a biochemical pathway. A meta-model configured according to an embodiment of the invention, however, may be able to predict whether a molecule will have the biological effect, without having to identify the particular protein(s) involved in the interaction, or the mechanism of action underlying the biological effect.
- The more single target activity models used to generate input data for the meta-model, the more the single target activity models may become representative of the complete set of proteins in a given biological system. That is, even though a protein that is responsible for a given biological effect of a molecule may not be modeled by one of the single target activity models, the meta-model may still accurately predict that the molecule possesses the biological effect. Thus, the models that are included may act as a surrogate for proteins that are not represented by a single target activity model. Accordingly, broad biological effects, such as toxicity, or increased potential for both desirable and undesirable effects may be modeled, even though models of the actual protein targets responsible for the high-level effect may not even exist. For example, the anti-tuberculosis effects of a molecule may be modeled without using any models of tuberculosis proteins.
- In one embodiment, the meta-model is configured to generate a prediction regarding the biological effect of a molecule. For example, a prediction may specify whether a particular test molecule has, does not have, causes, or does not cause, the biological effect property being modeled. Any relevant biological effect may be modeled. For example, among others, the meta-model may be configured to predict undesirable effects such as increasing risk of heart attack, toxicity, or carcinogenic properties of a molecule, or may be configured to predict desirable properties such as the analgesic, anti-inflammatory, anti-cancer, antibacterial or antiviral properties of the molecule.
- Once both the plurality of single target activity models and the meta-model have been trained, the output predictions generated by the plurality of single target activity models are used to generate an input to the meta-model. In one embodiment, the input data includes a representation of the test molecule appropriate for the meta-model and the predictions of each of the single target activity models. Thus, embodiments of the invention provide a hierarchy of models wherein the meta-model is trained using the outputs of the plurality of single target activity models. Although described herein using a two-level hierarchy, the techniques of the present invention may be extended to create deeper hierarchies of models. For example, the output of a plurality of meta-models may be used as input for a second-order meta-model.
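A two-level hierarchy of this kind can be sketched as follows. The example assumes the single target activity models are already trained (two hypothetical lambdas stand in for them), builds one composite row per molecule, and fits a depth-one decision tree (a "stump") as the meta-model. The stump is chosen here only for brevity; the molecules, descriptors, and effect labels are made up.

```python
from typing import Callable, List, Sequence, Tuple

Features = Sequence[float]


def composite_row(features: Features,
                  property_models: Sequence[Callable[[Features], float]]) -> List[float]:
    """Molecule representation followed by one score per single target activity model."""
    return list(features) + [m(features) for m in property_models]


def train_stump(rows: Sequence[Sequence[float]],
                labels: Sequence[int]) -> Tuple[int, float]:
    """Pick the (column, threshold) whose rule row[column] > threshold makes fewest errors."""
    best = (0, 0.0)
    best_errors = len(rows) + 1
    for col in range(len(rows[0])):
        for threshold in sorted({r[col] for r in rows}):
            errors = sum((r[col] > threshold) != bool(y) for r, y in zip(rows, labels))
            if errors < best_errors:
                best, best_errors = (col, threshold), errors
    return best


# Hypothetical, already-trained single target activity models.
models = [lambda f: f[0] + f[1], lambda f: f[0] * f[1]]

# Made-up molecules (2 descriptors each) and made-up effect labels (1 = has effect).
molecules = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.8], [0.7, 0.9]]
effect_labels = [0, 0, 1, 1]

rows = [composite_row(m, models) for m in molecules]
column, threshold = train_stump(rows, effect_labels)


def meta_predict(features: Features) -> int:
    """Predict the modeled effect for a new molecule via its composite row."""
    return int(composite_row(features, models)[column] > threshold)
```

With this toy data neither raw descriptor separates the labels on its own, so the stump ends up splitting on a column produced by one of the property models. A second-order meta-model could be built the same way by treating several `meta_predict`-style functions as the next level's inputs.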
- Embodiments of the invention may be implemented using any available computer system and adaptations are contemplated for both known and later developed computing platforms and hardware. Accordingly, the methods described below may be carried out by software applications configured to execute on computer systems ranging from single-user workstations to client-server networks, large distributed systems employing peer-to-peer techniques, or clustered grid systems. In one embodiment, a high-speed computing cluster such as a Beowulf cluster or other clustered configuration may be used. Those skilled in the art will recognize that clustering is a method for creating a high-performance computing environment by connecting inexpensive personal computer systems over high-speed network paths.
- Further, the computer systems used to practice the methods of the present invention may be geographically dispersed across local or national boundaries using a data communications network such as the Internet. Moreover, predictions generated for a test molecule at one location may be transported to other locations using well known data storage and transmission techniques, and predictions may be verified experimentally at the other locations. For example, a computer system may be located in one country and configured to generate predictions about the property of interest for a selected group of molecules; this data may then be transported (or transmitted) to another location, or even another country, where it may be the subject of further investigation, e.g., laboratory confirmation of the prediction or further computer-based simulations.
- An Exemplary Computing Environment
- Embodiments of the invention may be implemented as computer software products (programs) for use with computer systems like the one illustrated in FIG. 1. Such programs may be contained on a variety of signal-bearing media. Examples of signal-bearing media include (i) information permanently stored on non-writable storage media (e.g., a CD or DVD disk); (ii) alterable information stored on writable storage media (e.g., floppy disks within a diskette drive or hard-disk drive); and (iii) information conveyed to a computer by a communications network, including wireless communications. The latter embodiment specifically includes information made available on the Internet and other networks. Such signal-bearing media, when carrying computer-readable instructions that implement the methods of the invention, represent embodiments of the invention.
- Referring now to FIG. 1, a computing environment 100 is shown. In general, the environment 100 includes a first computer system 105 and a plurality of client computer systems each connected via network 175. The computer system 105 may represent any type of computer, computer system or other programmable electronic device, including a client computer, a server computer, a portable computer, an embedded controller, a PC-based server, a minicomputer, a midrange computer, a mainframe computer, a grid-based or clustered computer system, and other computers adapted to support embodiments of the invention.
- Illustratively, the computer system 105 comprises a networked system. However, the computer system 105 may also comprise a standalone device. In any case, it is understood that FIG. 1 illustrates one possible configuration for a computer system 105.
- The embodiments of the present invention may also be practiced in distributed computing environments in which tasks are performed by remote processing devices that are linked through communications network 175. The computer system 105 may include a number of operators and peripheral systems as shown, for example, by a mass storage interface 140 connected to a direct access storage device 155 containing a database 185, by a video interface 145 operably connected to a display 165, and by a network interface to network 175 (e.g., WAN, LAN). The display 165 may be any video output device for outputting viewable information.
- Computer system 105 is shown comprising at least one processor 135, which obtains instructions and data via a bus 120 from a main memory 115. The processor 135 could be any processor adapted to support the methods of the invention.
- The main memory 115 is any memory sufficiently large to hold the necessary programs and data structures. Main memory 115 could be one or a combination of memory devices, including Random Access Memory and nonvolatile or backup memory (e.g., programmable or Flash memories, read-only memories, etc.). In addition, memory 115 may be considered to include memory physically located elsewhere in a computer system 105, for example, any storage capacity used as virtual memory or stored on a mass storage device (e.g., direct access storage device 155) or on another computer coupled to the computer system 105 via bus 120 or network 175.
- The memory 115 is shown configured with an operating system 130. The operating system 130 is the software used for managing the operation of the computer system 105. As shown, the memory includes a plurality of single target activity models 205 in communication with a meta-model 310, both of which are described in greater detail below.
-
FIG. 2 is a block diagram 200 illustrating a single target activity model 205, according to one embodiment of the invention. As shown, FIG. 2 includes a single target activity model 205, a set of test molecules 230, a set of training data 220, and output predictions 240. In one embodiment, the single target activity model 205 may be configured to perform a machine learning algorithm 210 using training data 220 to generate the learned model 215. Once generated, the learned model 215 may be used to generate output predictions 240 for test molecules 230.
- The machine learning algorithm 210 performed by the model 205 may include both currently known and later developed machine learning algorithms. For example, the learning algorithm 210 may include at least one of Boosting, a variant of Boosting, Alternating Decision Trees, Support Vector Machines, the Perceptron algorithm, Winnow, the Hedge Algorithm, an algorithm constructing a linear combination of features or data points, Decision Trees, Neural Networks, Genetic Algorithms, Genetic Programming, logistic regression, Bayes nets, log linear models, Perceptron-like algorithms, Gaussian processes, Bayesian techniques, probabilistic modeling techniques, regression trees, ranking algorithms, Kernel Methods, Margin based algorithms, or linear, quadratic, convex, conic or semi-definite programming techniques or any modifications of the foregoing.
- The training data 220 may be selected according to any of the techniques described above at paragraphs 20-28. However selected, training data 220 is used by the machine learning algorithm 210 to train the single target activity model 205 and generate learned model 215. Illustratively, each training example 225 provides a vector that includes a representation of the molecule appropriate for the machine learning algorithm 210 and a value (labeled as “Activity_score”) for the property being modeled by single target activity model 205. Illustratively, training data 220 includes three vectors, each representing a training example 225. Specifically, <mol_A, Activity_score>, <mol_B, Activity_score>, and <mol_C, Activity_score> illustrate three training examples, each represented as a vector. As shown, each training example 225 includes a molecule representation element and an activity score regarding the property being modeled by model 205. In practice, however, it is contemplated that significantly more examples would be included in the training data 220.
- Once the single target activity model 205 performs the machine learning algorithm 210 to generate the learned model 215, it may be used to generate predictions for test molecules 230. Illustratively, test molecules 230 includes three test candidates 235. Like the training examples 225, each test candidate 235 may include a representation of a molecule stored in a vector. However, the vectors for the molecules represented by “mol_1,” “mol_2,” and “mol_3” do not include the second element, the activity score; this is the information predicted for each test candidate 235 using learned model 215. Accordingly, output predictions 240 include the three test candidates 245 (namely, “mol_1”, “mol_2”, and “mol_3”), with a completed vector representation that includes a prediction for the property modeled by single target activity model 205.
-
FIG. 3 is a block diagram 300 illustrating a meta-model 310 configured to learn from a hierarchy of single-target activity models 205, according to one embodiment of the invention. As shown, diagram 300 includes three single target activity models 205. In practice, however, it is contemplated that significantly more single target activity models 205 would be used to create meta-model 310. Like the view of a single target activity model 205 illustrated in FIG. 2, each of the three models 205 may be configured to process a set of training data 220 to train a learned model 215. For example, each single target activity model 205 may represent a different protein (or receptor on a protein), and the learned models 215 may be configured to predict the binding affinity between a test molecule and the particular protein represented by each respective model 205. The predictions generated by a single target activity model 205 may be stored in a data set of predictions 240.
- Also like single target activity model 205, meta-model 310 may be configured to perform a machine learning algorithm 315 using training data 220. The machine learning algorithm 315 may be selected from any one of the (or other) machine learning algorithms identified above in paragraph 39. In one embodiment, the training examples 225 used to train the meta-model 310 may comprise a composite of the training examples 225 used to train each of the single target activity models 205. For example, using the first training example 225 illustrated in FIG. 2, a composite representation 305 may be styled as a vector similar to the following:
<mol_A, Score_1, Score_2, Score_N, value_for_modeled_effect>
In this vector, “mol_A” provides a representation of the molecule appropriate for machine learning algorithm 315. The Score_1, Score_2, and Score_N components represent the value supplied to a machine learning algorithm 210 performed by each different single target activity model 205. Finally, the “value_for_modeled_effect” component may identify a value for the biological effect being modeled by meta-model 310.

Optionally, the composite representation 305 may include additional information obtained from additional models 355 or from biological assays or other experimental data 350. For example, in an alternative embodiment, training data 220 may include the output data generated from physical laboratory experiments. In such an embodiment, a meta-model 310 may be trained using the outputs from a plurality of biological assays or other laboratory experimentation performed using a particular molecule and a suite of different protein targets. Following this approach, each molecule in the training data is screened by performing a physical experiment, and the results of these experiments are used to generate the composite representation 305 for the learning algorithm 315. Additional models 355 may be used to provide additional information to include in composite representation 305. Other embodiments extend the methods illustrated in FIG. 3 by including other models or measurements in the composite representation 305. For example, in addition to predictions or measurements of activity against a protein target generated by the single target activity models 205, properties such as predictions or measurements of solubility, blood brain barrier permeability, or other physical, chemical or biological properties of a molecule may be predicted using the techniques disclosed herein.

Note, however, that a composite representation 305 used as a training example or a test candidate need not be “complete.” That is, for a given composite representation 305, there may be a predicted value for fewer than all of the single target activity models included in the meta-model hierarchy.

Once the machine learning algorithm 315 is used to generate learned model 320 from the training data 220, the meta-model 310 may be used to generate predictions regarding the biological effect of a test molecule 230. Illustratively, predictions 340 include a prediction generated for three candidate molecules 345 regarding the biological effect modeled by meta-model 310.

In one embodiment, a prediction for a test molecule 230 is generated by supplying a representation of the test molecule 230 to each respective single target activity model 205. The form of the representation may be configured to be appropriate for the particular learned model 215. Using the outputs of these models 205 (i.e., using predictions 240), a composite representation 305 may then be generated for the test molecule. The composite representation may include a representation of the test molecule in a form appropriate for the learned model 320, the prediction regarding the property of interest and the test molecule output by each single target activity model 205, and any additional information provided by additional models 355 or experimental data 350. Thereafter, the learned model 320 may be configured to output a prediction 345 for the test molecule 230 identified by composite representation 305. Depending on the configuration of learned model 320, the meta-model 310 may be configured to predict that a particular test molecule 230 has, does not have, causes, or does not cause the biological effect being modeled. Predictions 345 illustrate a generic “prediction” result generated by the meta-model 310. Alternatively (or additionally), the meta-model may be configured to predict a value for a biological effect selected from a range of continuous values, or from a set of discrete choices. In another alternative, the meta-model may be configured to predict a ranking of two or more test molecules relative to the biological effect modeled by the meta-model.
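As an illustration of the composite representation 305 described above, the following is a minimal Python sketch and not the implementation disclosed here. The class name `CompositeRepresentation`, the function `build_composite`, the use of a numeric descriptor list as the molecule representation, and the choice to fill an absent score with a default value are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence


@dataclass
class CompositeRepresentation:
    """Hypothetical container mirroring <mol_A, Score_1, Score_2, Score_N, value_for_modeled_effect>."""
    molecule: Sequence[float]           # molecule representation (assumed: numeric descriptors)
    scores: Dict[str, Optional[float]]  # one score per model or assay; None if absent
    effect: Optional[float] = None      # value for the modeled effect; None for a test molecule

    def to_vector(self, names: Sequence[str], missing: float = 0.0) -> List[float]:
        """Flatten to a fixed-length feature vector; absent scores take the `missing` value."""
        filled = []
        for name in names:
            value = self.scores.get(name)
            filled.append(missing if value is None else float(value))
        return [float(x) for x in self.molecule] + filled


def build_composite(
    molecule: Sequence[float],
    target_models: Dict[str, Callable[[Sequence[float]], Optional[float]]],
    extra_scores: Optional[Dict[str, float]] = None,
    effect: Optional[float] = None,
) -> CompositeRepresentation:
    """Query each single target activity model, then merge in assay or additional-model values."""
    scores: Dict[str, Optional[float]] = {
        name: model(molecule) for name, model in target_models.items()
    }
    scores.update(extra_scores or {})  # e.g. a measured solubility (hypothetical key)
    return CompositeRepresentation(molecule, scores, effect)


# Usage with two made-up target models; the second has no prediction for this molecule,
# so the composite is not "complete".
models = {
    "target_1": lambda mol: sum(mol) / len(mol),
    "target_2": lambda mol: None,
}
rep = build_composite([0.2, 0.8], models, extra_scores={"solubility": 1.5}, effect=1.0)
vector = rep.to_vector(["target_1", "target_2", "solubility"])
print(vector)
```

Filling an absent score with a constant is only one possible way to handle an incomplete composite; the text above does not specify how missing values are treated.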
FIG. 4 illustrates a method for constructing a meta-model 310 configured to predict a biological effect of a molecule based on the predictions of a plurality of single target activity models 205, according to one embodiment of the invention. The method 400 begins at step 405, where the set of single target activity models 205 is selected. In one embodiment, the single target activity models 205 may be selected based on the available training data, the measured (or predicted) accuracy of the models, and any other relevant criteria in a particular case. Generally, however, the more single target activity models 205 that are available, the more accurate the predictions generated by meta-model 310 regarding the high-level biological effect may become.

At step 410, a set of training data is selected to train the single target activity models 205. Although complete one-to-one correspondence is not required, a molecule selected to be included in the training data 220 is typically used to generate a training example 225 for each single target activity model 205. In one embodiment, the training examples 225 are represented as a vector that includes the representation of the molecule and a value for the property being modeled. At step 415, the training examples generated at step 410 are used to train the single target activity models 205. At this step, each single target activity model 205 performs the selected machine learning algorithm 210 to generate learned model 215.

At step 420, the training examples are used to generate training data for the meta-model 310. In one embodiment, each training example for the meta-model 310 may include a representation of the molecule represented by the example, the prediction for the molecule from each of the single target activity models, and a value for the property of interest. At step 450, the training examples are used by meta-model 310 to perform machine learning algorithm 315 and generate learned model 320. Thus, the training data may also be used to train the meta-model 310. In one embodiment, the training examples used to train the meta-model 310 may include a representation of the molecule and a value for the property of interest for each of the single target activity models 205 and a value for the biological effect being modeled by meta-model 310.
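The training flow of method 400 can be sketched as follows. This is a hedged illustration only: the nearest-neighbor learner stands in for whichever machine learning algorithm 210 or 315 is selected, and the names `NearestNeighborModel` and `train_meta_model`, along with the toy data, are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


class NearestNeighborModel:
    """Stand-in learner (assumption): predicts the label of the closest training example."""

    def fit(self, examples: Sequence[Vector], labels: Sequence[float]) -> "NearestNeighborModel":
        self.examples: List[List[float]] = [list(e) for e in examples]
        self.labels: List[float] = list(labels)
        return self

    def predict(self, x: Vector) -> float:
        def sq_dist(e: Vector) -> float:
            return sum((a - b) ** 2 for a, b in zip(e, x))

        best = min(range(len(self.examples)), key=lambda i: sq_dist(self.examples[i]))
        return self.labels[best]


def train_meta_model(
    molecules: Sequence[Vector],
    target_labels: Dict[str, Sequence[float]],
    effect_labels: Sequence[float],
) -> Tuple[Dict[str, NearestNeighborModel], NearestNeighborModel]:
    # Steps 405-415: one single target activity model per selected target,
    # each trained on (molecule representation, property value) examples.
    target_models = {
        name: NearestNeighborModel().fit(molecules, labels)
        for name, labels in target_labels.items()
    }
    # Step 420: each composite training example is the molecule representation
    # followed by the prediction from every single target activity model.
    names = sorted(target_models)
    composites = [
        list(mol) + [target_models[n].predict(mol) for n in names]
        for mol in molecules
    ]
    # Final step: train the meta-model on composites labelled with the biological effect.
    meta_model = NearestNeighborModel().fit(composites, effect_labels)
    return target_models, meta_model


# Toy data (made up): three molecules, two targets, one binary effect label.
mols = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
targets = {"target_1": [0.1, 0.9, 0.2], "target_2": [0.3, 0.4, 0.8]}
effects = [0.0, 1.0, 0.0]
target_models, meta_model = train_meta_model(mols, targets, effects)

query = [0.9, 0.1]
composite = query + [target_models[n].predict(query) for n in sorted(target_models)]
print(meta_model.predict(composite))
```

In this sketch the meta-model is trained on predictions the single target models make for their own training molecules. A common alternative in stacked models is to use held-out predictions instead; the text above does not say which is intended.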
FIG. 5 illustrates a method 500 for generating a prediction related to a high-level biological effect of a molecule using the learned model 320 of meta-model 310 generated using the method illustrated in FIG. 4, according to one embodiment of the invention.

The method 500 begins at step 505, where a set of test molecules 230 is selected. At step 510, a representation of the test molecules 230 is generated in a form appropriate for the learned model 215 used by each respective single target activity model 205. At step 515, the representation of the test molecule is supplied to each of the learned models 215, which, in response, may be configured to generate a prediction regarding the test molecule and the property modeled by each of the single target activity models 205. For example, in an embodiment where the single target activity models predict the binding affinity between the test molecule and the protein represented by a single target activity model 205, the prediction may comprise a value for binding affinity.

At step 520, the set of predictions generated at step 515 is included in a composite representation 305 that includes both the predictions output at step 515 and a representation of test molecule 230 in a form appropriate for the learned model 320 of meta-model 310. At step 525, the representation of the test molecule 230 is supplied to learned model 320, which, in response, may be configured to generate a prediction regarding the test molecule and the property modeled by meta-model 310.

While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof.
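Steps 515 through 525 of method 500, together with the ranking alternative mentioned earlier, can be sketched in Python as below. The functions `predict_effect` and `rank_molecules` are hypothetical names, the lambda "models" are made-up stand-ins for learned models 215 and 320, and the sketch assumes a single molecule representation shared by every model, whereas step 510 allows a different form for each.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
Predictor = Callable[[Vector], float]


def predict_effect(
    molecule: Vector,
    target_models: Dict[str, Predictor],
    meta_model: Predictor,
) -> float:
    # Step 515: each single target activity model scores the test molecule.
    names = sorted(target_models)
    scores = [target_models[name](molecule) for name in names]
    # Step 520: composite = molecule representation plus the per-target predictions.
    composite = list(molecule) + scores
    # Step 525: the learned meta-model predicts the biological effect.
    return meta_model(composite)


def rank_molecules(
    molecules: Dict[str, Vector],
    target_models: Dict[str, Predictor],
    meta_model: Predictor,
) -> List[Tuple[str, float]]:
    """Order test molecules from highest to lowest predicted effect."""
    scored = [
        (mol_id, predict_effect(mol, target_models, meta_model))
        for mol_id, mol in molecules.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up stand-ins for the learned single target models and the learned meta-model.
target_models = {
    "target_1": lambda mol: 2.0 * mol[0],
    "target_2": lambda mol: mol[1] + 1.0,
}
meta_model = lambda composite: composite[-2] - composite[-1]

test_molecules = {"mol_X": [1.0, 0.0], "mol_Y": [0.0, 1.0], "mol_Z": [2.0, 1.0]}
ranking = rank_molecules(test_molecules, target_models, meta_model)
print(ranking)
```

Sorting by a predicted continuous value is only one way to obtain a ranking; a meta-model could also be trained to output rankings directly.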
Claims (38)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/304,209 US7856321B2 (en) | 2004-12-16 | 2005-12-14 | Modeling biological effects of molecules using molecular property models |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US63664504P | 2004-12-16 | 2004-12-16 | |
US11/304,209 US7856321B2 (en) | 2004-12-16 | 2005-12-14 | Modeling biological effects of molecules using molecular property models |
Publications (2)
Publication Number | Publication Date |
---|---|
US20060161407A1 (en) | 2006-07-20 |
US7856321B2 (en) | 2010-12-21 |
Family
ID=36588523
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US11/304,209 Active US7856321B2 (en) | 2004-12-16 | 2005-12-14 | Modeling biological effects of molecules using molecular property models |
Country Status (3)
Country | Link |
---|---|
US (1) | US7856321B2 (en) |
EP (1) | EP1839227A4 (en) |
WO (1) | WO2006065950A2 (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8379830B1 (en) | 2006-05-22 | 2013-02-19 | Convergys Customer Management Delaware Llc | System and method for automated customer service with contingent live interaction |
US11721413B2 (en) | 2018-04-24 | 2023-08-08 | Samsung Electronics Co., Ltd. | Method and system for performing molecular design using machine learning algorithms |
US10515715B1 (en) | 2019-06-25 | 2019-12-24 | Colgate-Palmolive Company | Systems and methods for evaluating compositions |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030074141A1 (en) * | 2001-10-15 | 2003-04-17 | Camitro Corporation | Meta-models for predicting blood brain barrier penetration based on simple physicochemical descriptors |
US20040180322A1 (en) * | 2000-07-28 | 2004-09-16 | Grass George M. | Regional intestinal permeability model |
US20040249664A1 (en) * | 2003-06-05 | 2004-12-09 | Fasttrack Systems, Inc. | Design assistance for clinical trial protocols |
US20050119832A1 (en) * | 2001-12-07 | 2005-06-02 | Walter Schmitt | Computer system and method for calculating adme properties |
US20050209785A1 (en) * | 2004-02-27 | 2005-09-22 | Wells Martin D | Systems and methods for disease diagnosis |
US20050260663A1 (en) * | 2004-05-18 | 2005-11-24 | Neal Solomon | Functional proteomics modeling system |
-
2005
- 2005-12-14 EP EP05849722A patent/EP1839227A4/en not_active Withdrawn
- 2005-12-14 WO PCT/US2005/045344 patent/WO2006065950A2/en active Application Filing
- 2005-12-14 US US11/304,209 patent/US7856321B2/en active Active
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8452668B1 (en) | 2006-03-02 | 2013-05-28 | Convergys Customer Management Delaware Llc | System for closed loop decisionmaking in an automated care system |
US7809663B1 (en) | 2006-05-22 | 2010-10-05 | Convergys Cmg Utah, Inc. | System and method for supporting the utilization of machine language |
US20120304301A1 (en) * | 2010-02-02 | 2012-11-29 | Nec Corporation | Confidentiality analysis support system, method and program |
US10410114B2 (en) * | 2015-09-18 | 2019-09-10 | Samsung Electronics Co., Ltd. | Model training method and apparatus, and data recognizing method |
EP3365841A4 (en) * | 2015-09-30 | 2019-06-19 | Just, Inc. | Systems and methods for identifying entities that have a target property |
EP4009246A1 (en) * | 2015-09-30 | 2022-06-08 | Just, Inc. | Systems and methods for identifying entities that have a target property |
US11568287B2 (en) | 2015-09-30 | 2023-01-31 | Just, Inc. | Discovery systems for identifying entities that have a target property |
WO2019018780A1 (en) * | 2017-07-20 | 2019-01-24 | The University Of North Carolina At Chapel Hill | Methods, systems and non-transitory computer readable media for automated design of molecules with desired properties using artificial intelligence |
US20190108320A1 (en) * | 2017-09-07 | 2019-04-11 | Accutar Biotechnology Inc. | Neural network for predicting drug property |
US10923214B2 (en) * | 2017-09-07 | 2021-02-16 | Accutar Biotechnology Inc. | Neural network for predicting drug property |
Also Published As
Publication number | Publication date |
---|---|
US7856321B2 (en) | 2010-12-21 |
EP1839227A4 (en) | 2009-02-18 |
WO2006065950A3 (en) | 2007-07-05 |
EP1839227A2 (en) | 2007-10-03 |
WO2006065950A2 (en) | 2006-06-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment | Owner name: PHARMIX CORPORATION, CALIFORNIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: LANZA, GUIDO; DUFFY, NIGEL P.; BOARDMAN, PAUL; REEL/FRAME: 017395/0376. Effective date: 20060308 |
AS | Assignment | Owner name: NUMERATE, INC., CALIFORNIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: PHARMIX CORPORATION; REEL/FRAME: 020063/0499. Effective date: 20070928 |
STCF | Information on status: patent grant | Free format text: PATENTED CASE |
AS | Assignment | Owner name: LEADER VENTURES, LLC, AS AGENT, CALIFORNIA. Free format text: SECURITY AGREEMENT; ASSIGNOR: NUMERATE, INC.; REEL/FRAME: 029793/0056. Effective date: 20121224 |
FPAY | Fee payment | Year of fee payment: 4 |
MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YR, SMALL ENTITY (ORIGINAL EVENT CODE: M2552). Year of fee payment: 8 |
AS | Assignment | Owner name: NUMERATE, INC., CALIFORNIA. Free format text: RELEASE BY SECURED PARTY; ASSIGNOR: LEADER VENTURES, LLC; REEL/FRAME: 050417/0740. Effective date: 20190917 |
AS | Assignment | Owner name: INTEGRAL HEALTH, INC., MASSACHUSETTS. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: NUMERATE, INC.; REEL/FRAME: 052932/0861. Effective date: 20190924 |
AS | Assignment | Owner name: SILICON VALLEY BANK, CALIFORNIA. Free format text: SECURITY INTEREST; ASSIGNOR: INTEGRAL HEALTH, INC.; REEL/FRAME: 052943/0660. Effective date: 20200615 |
AS | Assignment | Owner name: VALO HEALTH, INC., MASSACHUSETTS. Free format text: CHANGE OF NAME; ASSIGNOR: INTEGRAL HEALTH, INC.; REEL/FRAME: 053787/0514. Effective date: 20200911 |
FEPP | Fee payment procedure | Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 12TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1553); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY. Year of fee payment: 12 |
AS | Assignment | Owner name: FIRST-CITIZENS BANK & TRUST COMPANY, AS AGENT, CALIFORNIA. Free format text: SECURITY INTEREST; ASSIGNORS: VALO HEALTH, LLC; VALO HEALTH, INC.; REEL/FRAME: 064207/0957. Effective date: 20230630 |