WO1998047086A2 - Autonomous intelligent agents for the annotation of genomic databases - Google Patents

Info

Publication number
WO1998047086A2
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
database
agent
data
invoking
Prior art date
Application number
PCT/US1998/007327
Other languages
French (fr)
Other versions
WO1998047086A3 (en)
Inventor
R. Mark Adams
Original Assignee
Alpha Gene, Inc.
Priority date
Filing date
Publication date
Application filed by Alpha Gene, Inc.
Priority to AU69677/98A
Publication of WO1998047086A2
Publication of WO1998047086A3

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99941 Database schema or data structure
    • Y10S707/99944 Object-oriented database structure
    • Y10S707/99945 Object-oriented database structure processing
    • Y10S707/99948 Application of database or data structure, e.g. distributed, multimedia, or image

Definitions

  • These computerized tools typically include database programs of various types.
  • a monolithic master program directs a series of algorithms to be performed.
  • Such algorithms may include simple comparisons, such as to determine if a particular gene sequence has been previously identified, or may include more sophisticated mathematical analyses, such as those used to determine if a gene sequence is likely to exhibit certain characteristics.
  • the algorithms may be run on either a database of existing full length sequences or on sequence segments prior to their entry into the database. Furthermore, researchers may wish to record annotations concerning the sequences as records in the database.
  • genetic analysis algorithms may be provided by academic, government, or commercial sources.
  • Gene sequence data may originate from in-house laboratory tests, from external commercially available databases, or from privately developed sources.
  • the problems associated with conventional systems include difficulties other than keeping the database of gene sequences synchronized with the latest versions of the analytical software. If, as in many cases, the software source code is organized to pre-annotate the data before it enters the database, as a first step in a pipeline of analytical algorithms or processes, it can be very difficult to add new tools or update existing ones. For example, if a change is made to the database while gene sequence data are already in the pipeline, then all new result records will reflect the update, but the old result records already in the database will still need to be updated to reflect the changes in the processing.
  • a genomic database system should permit the sequence data and algorithms to be changed without the need to take the database offline or to otherwise be concerned with data synchronization and data heterogeneity issues. It should thus be possible for the database of gene sequences to be updated or added to at a fairly high rate.
  • the set of software tools that are used to analyze or annotate the database should also be permitted to constantly evolve. In fact, it would be ideal if the users could not only change their analysis tools and have the changes in the tools automatically reflected in the analysis process, but also have the tools automatically rerun when changes are made to sequence information in the database. The results database should then also be automatically updated.
  • each software tool is implemented as an agent program which autonomously operates on sequence data read from the genomic database.
  • the software architecture of the agents includes a sensor process, an intelligence process, and an effector process.
  • the sensor process identifies sequence records that need processing.
  • the intelligence process, which is called after the sensor process, runs the software tools on the output of the sensor process to obtain results data.
  • These software tools may implement analytical algorithms or may be of other types such as process control tools.
  • the effector process is run. The effector process places the data output by the agent in the form of result records back into the database.
  • the result records created by the effector process each preferably contain a reference to the gene sequence data and the agent process from which they were created.
  • the result records also contain a flag to indicate that the particular record has been visited by a particular agent.
  • the agents may implement various software tools for operating on gene sequence data. These may include functions and algorithms such as expression data retrieval from both external databases and laboratory equipment, mathematical sequence modeling and comparison, and the control of laboratory processes such as sequence finishing.
  • the agents may be used to record annotations concerning the sequence data.
  • the autonomous agents are implemented as daemon processes in a multitasking operating system environment. The agent processes thus typically wait in a quiescent state most of the time, and are only periodically invoked to look for the presence of sequence records which have not been processed by the agent.
  • each agent performs only one specific role.
  • an agent is typically associated with each particular analytical or process control tool. Because the agents are run as processes in a multitasking environment, each agent may thus run independently of the time, mode, or appearance of new sequence records. As a result, complex software infrastructure for process synchronization and database heterogeneity is not needed. Instead, these issues are addressed by using the results record, and in particular, by using a flag that indicates that a particular sequence record has been processed by a particular agent, since the flag is toggled only after the effector process in each agent is completed.
  • a genomic database system therefore permits new sequence records to be added to the database without having to modify or interrupt algorithms which may be already running.
  • FIG. 1 is a symbolic diagram of an autonomous intelligent agent system for operating with a genomic database according to the invention
  • FIG. 2 is a symbolic diagram of various agent subclasses and the manner in which they interact with entities such as hardware devices and software processes both internal and external to the system
  • FIGS. 3A through 3C are flow diagrams for a specific comparison agent process that includes a sensor sub-process, intelligence sub-process, and effector sub-process, respectively;
  • FIG. 4 is a symbolic diagram of a set of sequence finishing agents.
  • FIG. 5 is a symbolic diagram of a set of annotation agents.
  • FIG. 1 is a symbolic diagram of a genomic database system 10 which makes use of autonomous agent processes according to the invention.
  • the system 10 includes a genomic sequence database 12 and a central processing unit 14.
  • the sequence database 12 includes gene sequence and related information in the form of database entries such as a sequence entry 16 and a results entry 18.
  • the central processing unit 14 has several software entities for performing different tasks. These software entities are referred to herein as agent processes 20-1, 20-2,..., 20-n (collectively, the agents 20).
  • the central processing unit 14 is of the type that supports a multitasking operating system that permits many agent processes 20 to be running simultaneously.
  • the agents 20, in a manner which will be understood in more detail shortly, obtain access to the sequence database 12 to retrieve data contained in the sequence entries 16. The agents 20 then process the retrieved data and create results entries 18 containing the output of such processing.
  • Specific agents 20 may be dedicated to a number of different tasks. These tasks may include simple jobs such as locating gene sequence data and adding an annotation to it, or more complex tasks such as comparing known gene sequences against other newly discovered sequences in order to identify their similarities, or may even control processes, such as processes that create full length gene sequences from gene fragments.
  • the agents 20 are typically autonomous in the sense that they are running all of the time, looking for data in the sequence database 12 to process. This can be accomplished, for example, by implementing the agents in a multitasking operating system as daemon processes that wait in a quiescent state most of the time, and only periodically activate themselves to search for data.
  • the agents 20 are also preferably implemented as object oriented programs. This permits each agent to be dedicated to a specific task while at the same time being preferably instantiated as a subclass of an agent core code base, as will also be understood shortly.
  • each such sequence entry 16 contains a number of records, including an identification record 16-1, a name record 16-2, a source record 16-3, and a sequence record 16-4.
  • the identification record 16-1 contains data such as an index number that provides for unique identification of the particular sequence entry 16.
  • the name record 16-2 contains name information for the gene sequence associated with sequence entry 16. This may include text data items such as a sequence name, a genetic symbol, an organization from which the gene sequence originated, or the tissue sample from which it was taken.
  • the source record 16-3 is used by the system 10 to keep track of where the sequence entry 16 originated, and may include information such as the assembly method, the identification of a well in a laboratory microtiter plate from which the gene sequence sample was taken, the sequence creation date, or other information.
  • the sequence data records 16-4 contain the actual gene sequence data.
  • the data may typically be in the form of a text record in "ATCG" format, in the case of DNA sequences, or in other formats for other types of genes.
  • the sequence data records 16-4 may actually contain records for several gene sequences of varying lengths.
  • the sequence data records 16-4 may thus include a single gene fragment, a group of fragments, or a complete full length gene sequence.
  • sequence data records 16-4 may originate from many sources such as laboratory equipment or from other databases, including databases that are external to the system 10, as well as gene sequence data created by the result of running algorithms on other sequence entries 16.
  • Other entries in the sequence database 12 include results entries 18.
  • Each result entry 18 is associated with a particular sequence entry 16 and agent 20; there are typically many different result entries 18 for each sequence entry 16.
  • result entries 18 are created as the agents 20 work on the data contained in the sequence entries 16.
  • Each result entry 18 includes a number of records such as an identification record 18-1 which uniquely identifies the result entry 18 to the system 10.
  • a sequence identification record 18-2 contains a reference to the sequence entry 16 from which the particular result entry 18 was generated.
  • An agent identification record 18-3 identifies the particular one of the agents 20 which created the result entry 18.
  • at least one result data record 18-4 is included in the result entry 18.
  • the result data record 18-4 may simply be a flag indicating that the particular agent 20 indicated by the agent identification record 18-3 has processed the particular sequence entry 16 indicated by the sequence identification record 18-2. In other instances, the result data record 18-4 contains more complex data records, such as an annotation output by a particular agent 20 after performing an analysis of the sequence data 16-4 in a particular sequence entry 16.
  • One feature of the system 10 is that all pieces of data associated with a particular sequence entry 16 are placed in results entries 18. Furthermore, the addition of new sequence entries 16 to the database 12 as well as all other modifications to the database 12 are made through agent processes 20. By so doing, process synchronization and heterogeneity of the database 12 are simple matters to maintain.
  • agents 20 each contain program code which can be divided into three categories of functionality, or sub-processes. These include a sensor sub-process 22, an intelligence sub-process 24, and an effector sub-process 26.
  • the sensor sub-process 22 is primarily responsible for identifying sequence entries 16 in need of processing by the particular agent 20-1.
  • the sensor process 22 typically performs a series of steps or states to accomplish this.
  • a first state includes a wait state 220 in which the agent process 20-1 remains dormant.
  • a check environment state 221 is then entered, in which the agent 20-1 determines the availability of needed resources in the central processing unit 14, and other resources that may be required for carrying out its assigned tasks. If the results of the check environment state 221 are positive, then the sensor process 22 continues to a state 222 to identify a candidate sequence entry 16 which has not yet been processed by the agent 20-1. This may be done by examining results data records 18-4.
  • a first state 240 reads the sequence data record 16-4 in the sequence entry 16.
  • in a next state 241, other software processes are invoked as necessary to perform the particular task to which the agent 20-1 has been assigned. The invocation of such other processes unique to the agent process 20-1 is carried out in an encapsulated manner such that all relevant outputs therefrom are returned to the intelligence sub-process 24 as results data.
  • a next state 242 is entered in which the encapsulated results data are read.
  • the effector sub-process 26 is begun. In this process, the results data provided by the intelligence sub-process 24 are written back into the sequence database 12 as a results entry 18. As the arrow indicates, the agent process 20-1 may then typically return to performing the sensor process 22 once again.
  • FIG. 2 is a symbolic diagram of the agent core 30 showing several particular types of agents that may be implemented as subclasses of the core 30.
  • the illustrated agents 20 include a gene expression agent 31, a laboratory expression agent 32, a mathematical modeling agent 33, a comparison agent 34, and a sequence finishing agent 35.
  • the agent core 30 is a software entity that incorporates agent functionality that is global or generic in nature, with the agents 20 being implemented as object oriented programs running in a multitasking operating system environment. Any particular agent such as the agent 20-2 can thus be instantiated as a subclass of an agent core 30.
  • the agent core 30 can be thought of as a software code base incorporating generic functionality for performing specific tasks common to all agents 20. Such may include code for retrieving sequence entries 16 from the database 12, or for writing the results of such processes in the results entries 18.
  • the agent core 30 thus contains model software processes for controlling all read and write access to the information in the sequence entries 16 and results entries 18.
  • the agent core 30 may also include code implementing at least the sensor 22, intelligence 24, and effector 26 processes in their most generic state.
  • Each of the agents shown in FIG. 2 also includes specific instantiations of the agent class. That is, each has code in the sensors 22, intelligence 24, and effectors 26 that is specific to its assigned task.
  • the expression database agent 31 contains code which serves the purpose of obtaining gene sequence data from external sources and writing it as sequence annotation entries in the sequence database 12.
  • the illustrated expression data agent 31 is in particular responsible for obtaining gene expression data located in a remotely located external computer system 36.
  • the external computer system 36 consists of a host processor 36-1 and an associated expression database 36-2.
  • the computer system 36 may, for example, be a publicly or commercially available gene sequence expression database 36-2 that is accessed through a remote host interface software process 37.
  • the remote host interface 37 thus consists of appropriate software for establishing a connection to the remote host 36, presenting a data query instruction, and then receiving the gene sequence data from the expression database 36-2.
  • the interface 39 over which the remote host interface 37 interacts with the computer system 36 may, for example, be a TCP/IP connection over a network.
  • the intelligence sub-process 24 (FIG. 2) thus encapsulates the remote host interface 37.
  • a laboratory expression agent 32 obtains gene sequence data from a laboratory system 40.
  • the laboratory system 40 is another computer system that includes a host 40-1 and an expression database 40-2.
  • the particular data of interest in the expression database 40-2 is obtained from a gene expression reader 40-3.
  • the gene expression reader 40-3 is a laboratory instrument which processes physical tissue samples to determine partial gene sequence information in the form of gene fragments, or gene expressions. These gene expressions are stored in the expression database 40-2.
  • the laboratory expression agent 32 uses a sensor process 22 that determines when a new expression record is added to the expression database 40-2. When such new data exists, the laboratory expression agent 32 invokes its intelligence process 24 to retrieve the data from the expression database 40-2. An effector process 26 then creates the appropriate sequence entry 16 and results entry 18 in the sequence database 12.
  • the laboratory system 40 operates autonomously from the gene sequence system 10, and gene sequence data resulting from tissue sample analysis may be placed in the expression database 40-2 at any time. It is only when the laboratory expression agent 32 is invoked as a process on the central processing unit 14, that the data is sought from the expression database 40-2 and then forwarded to the sequence database 12 in proper form. It is thus possible to have clones being subjected to analysis by the laboratory system 40, completely outside of other processes that are analyzing these same clones as implemented by the agents 20 running in the annotated database system 10. As described below in connection with FIG. 4, sequence finishing agents may be running at the same time that the laboratory system 40 is running.
  • a mathematical modeling agent 33 is typically responsible for reading sequence entries 16 and performing various mathematical algorithms on them. This may, for example, be a structure analysis algorithm that analyzes gene sequence data for the existence of secretory proteins. As shown, the code which implements the algorithm may typically reside as part of the intelligence process 24. Alternatively, the secondary structure code may exist and run on a separate piece of hardware, in which case the intelligence process 24 might typically include appropriate Application Programmer Interface (API) software for accessing such code. The secondary structure code typically returns a score based upon a mathematical analysis of the gene sequence under examination. This score then becomes the data recorded in the annotation entry 18 by the effector process 26.
  • a comparison agent 34 is responsible for comparing one or more gene sequence entries 16 against other gene sequence entries 16 to find a degree of match between them.
  • the particular comparison agent process 34 described here makes use of external super-computer hardware in order to perform such comparisons.
  • the super-computer may include a Fast Data Finder (FDF) that makes use of a Smith/Waterman algorithm such as is available from Paracel Corporation.
  • the comparison agent 34 communicates with the Fast Data Finder hardware 43, such as over a local area network (LAN) interface 42.
  • FIGS. 3A through 3C are a more detailed set of flow diagrams for one implementation of the comparison agent 34.
  • FIG. 3A is a state diagram of the sensor sub-process 22
  • FIG. 3B is a detailed state diagram of the intelligence sub-process
  • FIG. 3C is a state diagram of the effector sub-process 26.
  • the comparison agent 34 is in a wait state in which it remains quiescent.
  • the wait state 301 is an example, in an object oriented implementation of the comparison agent 34, of a component which may use the generic agent core 30.
  • the comparison agent process 34 periodically enters a state 302.
  • the sensor sub-process 22 determines whether sufficient processor resources such as memory, available processor time, local area network access ports, and other resources necessary for executing the remainder of the comparison agent process 34 are available. If this is not the case then the sensor process 22 returns to state 301. If, however, sufficient resources are available, then the sensor sub-process 22 continues to state 302.
  • State 302 is an example of a state which may take advantage of both the agent core 30 as well as code specific to the particular agent 20.
  • the code necessary for determining whether sufficient memory and central processing unit resources are available may be part of the agent core 30.
  • the portion of state 302 which determines whether sufficient local area network access ports are available may be implemented solely in the comparison agent 34.
  • the sensor sub-process 22 continues to a state 303.
  • in this state 303, the resources necessary for continuing the comparison agent process 34 are allocated. Processing then continues to a state 304 in which target sequence entries 16 not yet processed by the comparison agent 34 are identified.
  • in FIG. 3B, the intelligence sub-process 24 is shown in more detail.
  • in a first state 310, the sequence entries 16 identified to the intelligence sub-process 24 are placed in proper format for communication to the Fast Data Finder hardware.
  • the parsed data is stored in a data file.
  • a communication connection such as in the form of a TCP/IP socket is opened to the Fast Data Finder hardware.
  • Processing continues to a state 313 where the data file is sent over the TCP/IP connection, and the Fast Data Finder hardware is asked to initiate a comparison operation.
  • in state 314, the intelligence sub-process 24 waits for a result.
  • the intelligence sub-process 24 retrieves the results in the form of a data file from the Fast Data Finder hardware and closes the socket.
  • the comparison agent 34 then performs the effector sub-process 26 as shown in FIG. 3C.
  • the results file is parsed to retrieve match data of interest.
  • a results entry 18 is created for recording the results of the operation.
  • This results entry 18 contains its own identification record 18-1, the sequence identification record 18-2, the agent identification record 18-3 associated with the comparison process 34, and a result record 18-4.
  • the match data retrieved in state 320 is also written into the result data record 18-4.
  • the effector sub-process 26 also adds a record to the results entry 18 such as another result data record 18-4 that indicates that the comparison process 34 has visited the particular target sequence identified by sequence identification record 18-2. This is the final state of the comparison process 34, and at this point, processing returns to state 301 of FIG. 3A in which the comparison agent 34 is again available for processing other data.
  • the results entry 18 created in state 322 is preferably a different entry for each version of a sequence comparison process 34 that is running in the system 10. This serves two purposes. First, if the agent processes 20 fail to complete their tasks, the sequence database 12 is still intact. That is, because a process is not marked as having visited the target sequence until processing of the data is actually complete, the particular sequence entry 16 is again a candidate for processing when the comparison agent process 34 comes back online.
  • a set of agents may also be used to accomplish a series of gene-sequence-related tasks using the system 10.
  • a set of sequence finishing agents 35 implements a sequence finishing process on the gene expression entries 16.
  • the sequence finishing agent set 35 consists of a number of databases including an expressed sequence tag (EST) database 401, a finish database 402, and a full length sequence database 403.
  • the sequence finishing agent set 35 also consists of a number of agents including a chosen-for-finishing agent 410, a sequence walk agent 411, an assembly halted agent 414, and a sequence assembly agent 415.
  • the purpose of the sequence finishing agent set 35 is to assemble short gene fragments, such as may be available in the expressed sequence tag database 401, into complete full length gene sequences, and then to store the assembled full length sequences in the full length database 403.
  • the sequence finishing process is required in genetic research since most laboratory equipment may obtain only approximately 500 elements of a gene sequence during any one operation, whereas the average length of a gene sequence is typically in excess of 3,000 elements.
  • the various databases associated with the sequence finishing agent set 35 may each be implemented as portions of the sequence database 12.
  • the expressed sequence tag database 401 is actually a portion of the sequence database 12 consisting of sequence entries 16 that are in the form of expressed sequence tags.
  • the finish database 402 is an intermediate database that has results entries 18 that are used to keep track of the state of the sequence finishing process.
  • the full length database 403 contains sequence entries 16 that are the desired full length gene sequences.
  • Each of the sub-processes in the sequence finishing agent set 35 may be implemented as agent processes 20 as previously described above.
  • the chosen-for-finishing agent 410 is the first agent to operate on the expressed sequence tag database 401.
  • the chosen-for-finishing agent 410 searches through the expressed sequence tag database 401 for gene fragments of interest. Once a gene fragment of interest is found, then it is placed into the finish database 402 by the effector sub-process 26 in the chosen-for-finishing process 410.
  • a results record containing an annotation indicating that the chosen-for-finishing process 410 has operated on the sequence entry 16 is then created.
  • the sequence walk agent 411 has a sensor sub-process 22 that ensures that elements necessary in order to assemble a sequence from gene fragments are available.
  • This agent 411 thus has a sensor process 22 which looks for expressed sequence tags placed in the finish database 402 by the chosen-for-finishing process 410.
  • the intelligence sub-process 24 of the sequence walk agent 411 uses known mathematical and other processing techniques to produce the data required so that expressed sequence tags may be assembled together in the proper order; its effector process 26 then creates a results entry 18 in the finish database 402 that contains this data.
  • the assembly halted agent 414 may be used to search the finish database 402 for annotations in the results entries 18 that indicate sequence assembly has been halted.
  • the effector process 26 in this agent may then alert a human process technician, such as by sending an e-mail message, that action is needed to attend to the halted assembly.
  • the assembly halted agent 414 may perform other tasks, such as recording an annotation from a technician that the halt is now cleared, in order to create an annotation in the finish database 402 to that effect.
  • the sequence assembly agent 415 then completes the assembly of the gene fragments into a full length sequence.
  • the sequence assembly agent 415 thus has a sensor process 22 which obtains annotations relating to how to create complete sequences from the finish database 402, and places these completed sequences in the full length database 403.
  • FIG. 5 is an illustration of another example of the use of a set of agents 20 to accomplish a particular task.
  • the agents 20 are used to implement a set of annotation functions associated with the full length database 403.
  • these include a sequence locate agent 508, an inconsistency agent 510, a translation agent 511, a comparison agent 512, and a key word processing agent 513.
  • the purpose of the various agents illustrated in FIG. 5 is to perform a set of operations that create annotations in the form of result records 18 that contain various types of information with respect to the full length sequences stored in the database 403.
  • a sequence locate agent 508 may perform an initial search for gene sequences in need of annotation. This sequence locate agent 508 may, for example, seek out full length sequences in the database 403 generated by the sequence finishing agent set 35.
  • An inconsistency agent 510 may then obtain information from an external source such as an e-mail system which creates results records 18 in the full length database 403 indicating the annotations made by human technicians. For example, the particular inconsistency agent 510 receives e-mail messages from the technicians indicating that, for example, a particular sequence did not assemble properly.
  • a translation agent 511 may be used to convert DNA type sequence information contained in the full length database 403 to other forms such as a predicted protein sequence. The results of the conversion are then stored as results records 18 as for the other agents described above.
  • a sequence comparison agent 512 may be tasked with searching through the full length database 403 to find sequences related to the particular sequences located by the sequence locate agent 508. In particular, it may seek information as to sequence analogues that may be available in public sequence databases 540. The comparison agent 512 is thus a particular instance of the agent 34 previously described in connection with FIG. 1.
  • the key word processing agent 513 is one example of an agent which obtains annotation information for the full length sequences from various external databases.
  • such databases may, for example, include public resources such as the MedLine database 520, Internet search engines 521 such as Lycos or Alta Vista, a local information database 522 produced by the laboratory, or other information such as may be available from a public database such as ProSite.

Abstract

A genomic database system that makes use of autonomous agents for providing access to the database. The autonomous agents, which are preferably implemented in a multitasking environment, each seek target data to be processed and then call a program to process the target data. The results of the program are then placed in the database. The autonomous agent model permits the sequence data and processing programs to be changed without the need to be concerned with data synchronization or heterogeneity. The agents may be implemented as object oriented programs that permit the extraction of generic software code in an agent code base. The agents may implement annotation functions, analysis algorithms, or may control the assembly of gene sequences.

Description

AUTONOMOUS INTELLIGENT AGENTS FOR THE ANNOTATION OF GENOMIC DATABASES
BACKGROUND OF THE INVENTION
Historically, the analysis of genetic information has been carried out using chemical laboratory methods. While such chemical methods can provide adequate information for a limited number of gene sequences, computer-based research tools are increasingly being used for a variety of purposes related to genetic research. These computer-based tools include both hardware and software that perform high speed algorithms and other processes using gene sequence information stored as computer data.
These computerized tools typically include database programs of various types. In the conventional genomic database system, a monolithic master program directs a series of algorithms to be performed. Such algorithms may include simple comparisons, such as to determine if a particular gene sequence has been previously identified, or may include more sophisticated mathematical analyses, such as those used to determine if a gene sequence is likely to exhibit certain characteristics. The algorithms may be run on either a database of existing full length sequences or on sequence segments prior to their entry into the database. Furthermore, researchers may wish to record annotations concerning the sequences as records in the database.
This database-oriented approach is to some degree the result of the need to unify the efforts of many different individuals working under many different circumstances. For example, genetic analysis algorithms may be provided by academic, government, or commercial sources. Gene sequence data may originate from in-house laboratory tests, from external commercially available databases, or from privately developed sources.
These hardware and software devices have become powerful tools for the researcher in the field of genetics. In particular, once a comprehensive database of gene sequence information is available, many different algorithms may be run at high speed against all of the known gene sequences. The computer based methods thus provide results much more rapidly than if such analysis were carried out as chemical or laboratory experiments.
However, a number of problems occur when changes must be made to the sequence database or to the algorithms. This is especially the case in a real-time commercial environment, where new genetic sequence information and new algorithms are under continuous research and development.
One such problem occurs when a particular genetic sequence algorithm is replaced by a new version. When this happens, the old result records must be updated by running the new version of each algorithm against each of the genetic sequences. Not only may the process be time consuming, but also the database must typically be kept offline while it is being updated, in order to avoid losing track of which sequence records have yet to be updated. Other problems exist if the database is not static. In particular, the addition of new gene sequence information to the database must be routinely accommodated in commercial environments. New sequence data may become available on a daily or even hourly basis as gene sequences are continuously produced by automated sequencing equipment. When new data is added, each of the existing algorithms must be rerun with the new data.
The problems associated with conventional systems include difficulties other than keeping the database of gene sequences synchronized with the latest versions of the analytical software. If, as in many cases, the software source code is organized to pre-annotate the data before it enters the database, as a first step in a pipeline of analytical algorithms or processes, it can be very difficult to add new tools or update existing ones. For example, if a change is made to the database while gene sequence data are already in the pipeline, then all new result records will reflect the update, but the old result records already in the database will still need to be updated to reflect the changes in the processing.
The solution to this has in the past been typically thought to require halting the processing pipeline long enough to make the required changes, and to then write new software to update the database of old result records to reflect the new process changes. Completing these procedures is a complex task, and is made even more complex by the interruptions in the availability of the database for other purposes. These problems are exacerbated as the number of records increases.
SUMMARY OF THE INVENTION
Ideally, a genomic database system should permit the sequence data and algorithms to be changed without the need to take the database offline or to otherwise be concerned with data synchronization and data heterogeneity issues. It should thus be possible for the database of gene sequences to be updated or added to at a fairly high rate.
At the same time, the set of software tools that are used to analyze or annotate the database should also be permitted to constantly evolve. In fact, it would be ideal if the users could not only change their analysis tools and have the changes in the tools automatically reflected in the analysis process, but also have the tools automatically rerun when changes are made to sequence information in the database. The results database should then also be automatically updated.
The present invention accomplishes these objectives by implementing a genomic database and related software code base using an autonomous agent model. In particular, each software tool is implemented as an agent program which autonomously operates on sequence data read from the genomic database. The software architecture of the agents includes a sensor process, an intelligence process, and an effector process. The sensor process identifies sequence records that need processing. The intelligence process, which is called after the sensor process, runs the software tools on the output of the sensor process to obtain results data.
These software tools may implement analytical algorithms or may be of other types such as process control tools. After the intelligence process, the effector process is run. The effector process places the data output by the agent in the form of result records back into the database.
The result records created by the effector process each preferably contain a reference to the gene sequence data and the agent process from which they were created. Typically, the result records also contain a flag to indicate that the particular record has been visited by a particular agent.
The agents may implement various software tools for operating on gene sequence data. These may include functions and algorithms such as expression data retrieval from both external databases and laboratory equipment, mathematical sequence modeling and comparison, and the control of laboratory processes such as sequence finishing. The agents may be used to record annotations concerning the sequence data. The autonomous agents are implemented as daemon processes in a multitasking operating system environment. The agent processes thus typically wait in a quiescent state most of the time, and are only periodically invoked to look for the presence of sequence records which have not been processed by the agent.
Preferably, each agent performs only one specific role. For example, an agent is typically associated with each particular analytical or process control tool. Because the agents are run as processes in a multitasking environment, each agent may thus run independently of the time, mode, or appearance of new sequence records. As a result, complex software infrastructure for process synchronization and database heterogeneity is not needed. Instead, these issues are addressed by using the results record, and in particular, by using a flag that indicates that a particular sequence record has been processed by a particular agent, since the flag is toggled only after the effector process in each agent is completed. Thus, when a new version of a particular agent is created, the sensor process in the new version of the agent need only seek the target data it is programmed to process, as would normally be the case anyway, and then toggle its associated flag in its results record. A genomic database system according to the invention therefore permits new sequence records to be added to the database without having to modify or interrupt algorithms which may be already running.
Furthermore, existing agent processes need not be interrupted when new agent processes are added to the system, unless the new agents are expressly programmed to do so.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and further features of the invention include various novel details of construction and combination of components. These novel features will now be more particularly pointed out in the following claims, and their advantages will also become evident as they are described with reference to the accompanying drawings, in which:
FIG. 1 is a symbolic diagram of an autonomous intelligent agent system for operating with a genomic database according to the invention;
FIG. 2 is a symbolic diagram of various agent subclasses and the manner in which they interact with entities such as hardware devices and software processes both internal and external to the system;
FIGS. 3A through 3C are flow diagrams for a specific comparison agent process that includes a sensor sub-process, intelligence sub-process, and effector sub-process, respectively;
FIG. 4 is a symbolic diagram of a set of sequence finishing agents; and
FIG. 5 is a symbolic diagram of a set of annotation agents.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
Turning attention now to the drawings, FIG. 1 is a symbolic diagram of a genomic database system 10 which makes use of autonomous agent processes according to the invention. The system 10 includes a genomic sequence database 12 and a central processing unit 14. The sequence database 12 includes gene sequence and related information in the form of database entries such as a sequence entry 16 and a results entry 18.
The central processing unit 14 has several software entities for performing different tasks. These software entities are referred to herein as agent processes 20-1, 20-2,..., 20-n (collectively, the agents 20). The central processing unit 14 is of the type that supports a multitasking operating system that permits many agent processes 20 to be running simultaneously.
The agents 20, in a manner which will be understood in more detail shortly, obtain access to the sequence database 12 to retrieve data contained in the sequence entries 16. The agents 20 then process the retrieved data and create results entries 18 containing the output of such processing.
Specific agents 20 may be dedicated to a number of different tasks. These tasks may include simple jobs such as locating gene sequence data and adding an annotation to it, or more complex tasks such as comparing known gene sequences against other newly discovered sequences in order to identify their similarities, or may even control processes, such as processes that create full length gene sequences from gene fragments. The agents 20 are typically autonomous in the sense that they are running all of the time looking for data in the sequence database 12 to process. This can be accomplished, for example, by implementing the agents in a multitasking operating system as daemon processes that wait in a quiescent state most of the time, and only periodically activate themselves to search for data.
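The daemon behavior described above can be sketched as a simple polling loop. This is only a minimal illustration in Python, not the implementation disclosed here; the names `run_agent_daemon`, `find_work`, `process`, `poll_seconds`, and `max_cycles` are hypothetical, and the `max_cycles` bound exists only so the sketch can terminate.

```python
import time
from typing import Callable, List, Optional


def run_agent_daemon(
    find_work: Callable[[], List[object]],
    process: Callable[[object], None],
    poll_seconds: float = 60.0,
    max_cycles: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Wake periodically, handle pending items, then go quiescent again.

    Returns the number of items handled. All names are illustrative
    assumptions; a real daemon would normally run with max_cycles=None.
    """
    handled = 0
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        for item in find_work():   # search the database for unprocessed data
            process(item)          # run the agent's task on each item found
            handled += 1
        cycle += 1
        sleep(poll_seconds)        # quiescent state between activations
    return handled
```

Passing the `sleep` function as a parameter is a convenience of the sketch: it lets the loop be exercised without real delays.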
The agents 20 are also preferably implemented as object oriented programs. This permits each agent to be dedicated to a specific task while at the same time being preferably instantiated as a subclass of an agent core code base, as will also be understood shortly.
Turning attention now to the format of a sequence entry 16 more particularly, each such sequence entry 16 contains a number of records, including an identification record 16-1, a name record 16-2, a source record 16-3, and a sequence record 16-4. The identification record 16-1 contains data such as an index number that provides for unique identification of the particular sequence entry 16. The name record 16-2 contains name information for the gene sequence associated with sequence entry 16. This may include text data items such as a sequence name, a genetic symbol, an organization from which the gene sequence originated, or the tissue sample from which it was taken. The source record 16-3 is used by the system 10 to keep track of where the sequence entry 16 originated, and may include information such as the assembly method, the identification of a well in a laboratory microtiter plate from which the gene sequence sample was taken, the sequence creation date, or other information. Finally, the sequence data records 16-4 contain the actual gene sequence data. The data may typically be in the form of a text record in "ATCG" format, in the case of DNA sequences, or in other formats for other types of genes. The sequence data records 16-4 may actually contain records for several gene sequences of varying lengths. The sequence data records 16-4 may thus include a single gene fragment, a group of fragments, or a complete full length gene sequence.
As explained below, the sequence data records 16-4 may originate from many sources such as laboratory equipment or from other databases, including databases that are external to the system 10, as well as gene sequence data created by the result of running algorithms on other sequence entries 16.
Other entries in the sequence database 12 include results entries 18. Each result entry 18 is associated with a particular sequence entry 16 and agent 20; there are typically many different result entries 18 for each sequence entry 16. Specifically, result entries 18 are created as the agents 20 work on the data contained in the sequence entries 16.
Each result entry 18 includes a number of records such as an identification record 18-1 which uniquely identifies the result entry 18 to the system 10. A sequence identification record 18-2 contains a reference to the sequence entry 16 from which the particular result entry 18 was generated. An agent identification record 18-3 identifies the particular one of the agents 20 which created the result entry 18. Finally, at least one result data record 18-4 is included in the result entry 18.
In its minimal form, the result data record 18-4 may simply be a flag indicating that the particular agent 20 indicated by the agent identification record 18-3 has processed the particular sequence entry 16 indicated by the sequence identification record 18-2. In other instances, the result data record 18-4 contains more complex data records, such as an annotation output by a particular agent 20 after performing an analysis of the sequence data 16-4 in a particular sequence entry 16.
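The entry formats just described might be modeled as follows. This is a hedged sketch only: the class and field names (`SequenceEntry`, `ResultEntry`, `visited`, `data`, and so on) are assumptions chosen for illustration, and the comments map each field to the record it is meant to represent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SequenceEntry:
    entry_id: int          # identification record 16-1
    name: str              # name record 16-2
    source: str            # source record 16-3
    sequences: List[str]   # sequence data records 16-4 ("ATCG" text for DNA)


@dataclass
class ResultEntry:
    result_id: int               # identification record 18-1
    sequence_id: int             # sequence identification record 18-2
    agent_id: str                # agent identification record 18-3
    visited: bool = True         # minimal form of result data record 18-4
    data: Optional[str] = None   # optional richer result data, e.g. an annotation


def results_for(entry: SequenceEntry, results: List[ResultEntry]) -> List[ResultEntry]:
    """Return every result entry that refers back to the given sequence entry."""
    return [r for r in results if r.sequence_id == entry.entry_id]
```

Because each result entry carries both a sequence reference and an agent identifier, one sequence entry can accumulate many result entries, one or more per agent.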
One feature of the system 10 is that all pieces of data associated with a particular sequence entry 16 are placed in results entries 18. Furthermore, the addition of new sequence entries 16 to the database 12 as well as all other modifications to the database 12 are made through agent processes 20. By so doing, process synchronization and heterogeneity of the database 12 are simple matters to maintain.
To understand this further, consider that only the agents 20 access the data contained in the sequence entries 16, only the agents are permitted to act upon the sequence data, and that only the agents 20 are permitted to record results in the result entries 18.
A specific agent such as agent 20-1 is typically assigned a well-defined unique task. In general the agents 20 each contain program code which can be divided into three categories of functionality, or sub-processes. These include a sensor sub-process 22, an intelligence sub-process 24, and an effector sub-process 26.
As shown in FIG. 1 the sensor sub-process 22 is primarily responsible for identifying sequence entries 16 in need of processing by the particular agent 20-1. The sensor process 22 typically performs a series of steps or states to accomplish this.
A first state includes a wait state 220 in which the agent process 20-1 remains dormant. A check environment state 221 is then entered, in which the agent 20-1 determines the availability of needed resources in the central processing unit 14, and other resources that may be required for carrying out its assigned tasks. If the results of the check environment state 221 are positive, then the sensor process 22 continues to a state 222 to identify a candidate sequence entry 16 which has not yet been processed by the agent 20-1. This may be done by examining results data records 18-4.
Once a candidate sequence entry 16 is found, the agent process 20-1 then proceeds to the intelligence sub-process 24. In this intelligence sub-process 24, a first state 240 reads the sequence data record 16-4 in the sequence entry 16. In a next state 241, other software processes are invoked as necessary to perform the particular task to which the agent 20-1 has been assigned. The invocation of such other processes unique to the agent process 20-1 is carried out in an encapsulated manner such that all relevant outputs therefrom are returned to the intelligence sub-process 24 as results data. Once the particular processes in state 241 are completed, a next state 242 is entered in which the encapsulated results data are read. Following state 242 the effector sub-process 26 is begun. In this process, the results data provided by the intelligence sub-process 24 are written back into the sequence database 12 as a results entry 18. As the arrow indicates, the agent process 20-1 may then typically return to performing the sensor process 22 once again.
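One pass through the sensor, intelligence, and effector sub-processes might look like the sketch below. It assumes, purely for illustration, that the database is an in-memory dictionary with `"sequences"` and `"results"` keys and that the agent's task is any callable `tool`; none of these names come from the specification, and the wait and check-environment states 220 and 221 are omitted for brevity.

```python
def sensor(db, agent_id):
    """State 222: find one sequence entry this agent has not yet processed."""
    done = {r["sequence_id"] for r in db["results"] if r["agent_id"] == agent_id}
    for seq_id in sorted(db["sequences"]):
        if seq_id not in done:
            return seq_id
    return None


def intelligence(db, seq_id, tool):
    """States 240-242: read the sequence data, invoke the tool, collect its output."""
    sequence = db["sequences"][seq_id]
    return tool(sequence)


def effector(db, agent_id, seq_id, data):
    """Write the results data back into the database as a results entry."""
    db["results"].append({"sequence_id": seq_id, "agent_id": agent_id, "data": data})


def run_cycle(db, agent_id, tool):
    """Run one sensor -> intelligence -> effector pass; return False if idle."""
    seq_id = sensor(db, agent_id)
    if seq_id is None:
        return False
    effector(db, agent_id, seq_id, intelligence(db, seq_id, tool))
    return True
```

In this sketch the sensor decides what is unprocessed solely by examining existing results entries, which mirrors the role of the results data records 18-4 described above.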
FIG. 2 is a symbolic diagram of the agent core 30 showing several particular types of agents that may be implemented as subclasses of the core 30. The illustrated agents 20 include a gene expression agent 31, a laboratory expression agent 32, a mathematical modeling agent 33, a comparison agent 34, and a sequence finishing agent 35.
As previously described, the agent core 30 is a software entity that incorporates agent functionality that is global or generic in nature, with the agents 20 being implemented as object oriented programs running in a multitasking operating system environment. Any particular agent such as the agent 20-2 can thus be instantiated as a subclass of an agent core 30. The agent core 30 can be thought of as a software code base incorporating generic functionality for performing specific tasks common to all agents 20. Such may include code for retrieving sequence entries 16 from the database 12, or for writing the results of such processes in the results entries 18. The agent core 30 thus contains model software processes for controlling all read and write access to the information in the sequence entries 16 and results entries 18. The agent core 30 may also include code implementing at least the sensor 22, intelligence 24, and effector 26 processes in their most generic state.
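The relationship between the agent core and its subclasses could be sketched as an abstract base class. The following is an assumption-laden illustration rather than the disclosed code base: `AgentCore`, `GCContentAgent`, and the method names are hypothetical, and the GC-content calculation merely stands in for whatever tool a real agent would encapsulate.

```python
from abc import ABC, abstractmethod


class AgentCore(ABC):
    """Generic code shared by all agents: database reads, writes, and sequencing of steps."""

    agent_id = "core"

    def __init__(self, sequences, results):
        self.sequences = sequences   # dict: sequence id -> sequence text
        self.results = results       # list of result dictionaries

    def sense(self):
        """Generic sensor: ids of entries with no result from this agent."""
        done = {r["sequence_id"] for r in self.results if r["agent_id"] == self.agent_id}
        return [sid for sid in sorted(self.sequences) if sid not in done]

    @abstractmethod
    def analyze(self, sequence):
        """Task-specific intelligence step supplied by each subclass."""

    def effect(self, seq_id, data):
        """Generic effector: record the output as a result entry."""
        self.results.append({"sequence_id": seq_id, "agent_id": self.agent_id, "data": data})

    def run_once(self):
        targets = self.sense()
        for sid in targets:
            self.effect(sid, self.analyze(self.sequences[sid]))
        return len(targets)


class GCContentAgent(AgentCore):
    """Hypothetical subclass: only the analysis step is specific to the task."""

    agent_id = "gc-content"

    def analyze(self, sequence):
        if not sequence:
            return 0.0
        return (sequence.count("G") + sequence.count("C")) / len(sequence)
```

Keeping all database access in the base class is one way to realize the idea that the core controls read and write access while each subclass contributes only its task-specific code.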
Each of the agents shown in FIG. 2 is also a specific instantiation of the agent class. That is, each has code in its sensor 22, intelligence 24, and effector 26 processes that is specific to its assigned task. For example, the expression database agent 31 contains code which serves the purpose of obtaining gene sequence data from external sources and writing it as sequence annotation entries in the sequence database 12. The illustrated expression data agent 31 is in particular responsible for obtaining gene expression data located in a remotely located external computer system 36. The external computer system 36 consists of a host processor 36-1 and an associated expression database 36-2. The computer system 36 may, for example, be a publicly or commercially available gene sequence expression database 36-2 that is accessed through a remote host interface software process 37. The remote host interface 37 thus consists of appropriate software for establishing a connection to the remote host 36, presenting a data query instruction, and then receiving the gene sequence data from the expression database 36-2. The interface 39 over which the remote host interface 37 interacts with the computer system 36 may, for example, be a TCP/IP connection over a network. In the case of the expression database agent 31, the intelligence sub-process 24 (FIG. 2) thus encapsulates the remote host interface 37.
A laboratory expression agent 32 obtains gene sequence data from a laboratory system 40. The laboratory system 40 is another computer system that includes a host 40-1 and an expression database 40-2. However, the particular data of interest in the expression database 40-2 is obtained from a gene expression reader 40-3. The gene expression reader 40-3 is a laboratory instrument which processes physical tissue samples to determine partial gene sequence information in the form of gene fragments, or gene expressions. These gene expressions are stored in the expression database 40-2.
The laboratory expression agent 32 uses a sensor process 22 that determines when a new expression record is added to the expression database 40-2. When such new data exists, the laboratory expression agent 32 invokes its intelligence process 24 to retrieve the data from the expression database 40-2. An effector process 26 then creates the appropriate sequence entry 16 and results entry 18 in the sequence database 12.
The laboratory system 40 operates autonomously from the gene sequence system 10, and gene sequence data resulting from tissue sample analysis may be placed in the expression database 40-2 at any time. It is only when the laboratory expression agent 32 is invoked as a process on the central processing unit 14, that the data is sought from the expression database 40-2 and then forwarded to the sequence database 12 in proper form. It is thus possible to have clones being subjected to analysis by the laboratory system 40, completely outside of other processes that are analyzing these same clones as implemented by the agents 20 running in the annotated database system 10. As described below in connection with FIG. 4, sequence finishing agents may be running at the same time that the laboratory system 40 is running.
A mathematical modeling agent 33 is typically responsible for reading sequence entries 16 and performing various mathematical algorithms on them. This may, for example, be a structure analysis algorithm that analyzes gene sequence data for the existence of secretory proteins. As shown, the code which implements the algorithm may typically reside as part of the intelligence process 24. Alternatively, the secondary structure code may exist and run on a separate piece of hardware, in which case the intelligence process 24 might typically include an appropriate Application Programmer Interface (API) software for accessing such code. The secondary structure code typically returns a score based upon a mathematical analysis of the gene sequence under examination. This score then becomes the data recorded in the annotation entry 18 by the effector process 26.
A comparison agent 34 is responsible for comparing one or more gene sequence entries 16 against other gene sequence entries 16 to find a degree of match between them. The particular comparison agent process 34 described here makes use of external super-computer hardware in order to perform such comparisons. For example, the super-computer may include a Fast Data Finder (FDF) that makes use of a Smith/Waterman algorithm such as is available from Paracel Corporation. The comparison agent 34 communicates with the Fast Data Finder hardware 43 such as over a local area network (LAN) interface 42.
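To make the notion of a "degree of match" concrete, the following is a small software sketch of Smith-Waterman local alignment scoring. It is not the Fast Data Finder implementation, which runs on dedicated hardware; the function name and the scoring values (`match=2`, `mismatch=-1`, and a linear `gap=-2` penalty) are assumptions chosen only for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Uses the standard dynamic-programming recurrence with a linear gap
    penalty, keeping only the previous row of the score matrix in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best
```

With these assumed parameters, two identical four-base sequences score 8, and two sequences with no base in common score 0, since a local alignment score is never allowed to fall below zero.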
FIGS. 3A through 3C are a more detailed set of flow diagrams for one implementation of the comparison agent 34. FIG. 3A is a state diagram of the sensor sub-process 22, FIG. 3B is a detailed state diagram of the intelligence sub-process, and FIG. 3C is a state diagram of the effector sub-process 26.
As shown in an initial state 301, the comparison agent 34 is in a wait state in which it remains quiescent. The wait state 301 is an example, in an object oriented implementation of the comparison agent 34, of a component which may use the generic agent core 30.
From state 301, the comparison agent process 34 periodically enters a state 302. In this state 302, the sensor sub-process 22 determines whether sufficient processor resources such as memory, available processor time, local area network access ports, and other resources necessary for executing the remainder of the comparison agent process 34 are available. If this is not the case then the sensor process 22 returns to state 301. If, however, sufficient resources are available, then the sensor sub-process 22 continues to state 303.
State 302 is an example of a state which may take advantage of both the agent core 30 as well as code specific to the particular agent 20. In this case, the code necessary for determining whether sufficient memory and central processing unit resources are available may be part of the agent core 30. However, in the specific instantiation of sensor process 22 for the comparison agent 34, the portion of state 302 which determines whether sufficient local area network access ports are available may be implemented solely in the comparison agent 34.
From state 302, the sensor sub-process 22 continues to a state 303. In this state 303, the resources necessary for continuing the comparison agent process 34 are allocated. Processing then continues to a state 304 in which target sequence entries 16 not yet processed by the comparison agent 34 are identified.
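The sensor states 302 through 304 might be sketched as follows. The `Resources` class, the `comparison_sensor` function, and the 64 MB threshold are hypothetical, introduced only to show the check, allocate, and identify sequence; returning an empty list stands in for falling back to the wait state 301.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    free_memory_mb: int
    free_lan_ports: int


def comparison_sensor(resources, sequences, visited, min_memory_mb=64):
    """Return ids of target sequence entries, or [] if the agent must keep waiting."""
    # State 302: check that memory (generic) and a LAN port (agent-specific) are free.
    if resources.free_memory_mb < min_memory_mb or resources.free_lan_ports < 1:
        return []
    # State 303: allocate the resources needed for the rest of the process.
    resources.free_memory_mb -= min_memory_mb
    resources.free_lan_ports -= 1
    # State 304: identify sequence entries not yet processed by this agent.
    return [sid for sid in sorted(sequences) if sid not in visited]
```

A fuller version would release the allocated resources once the effector sub-process finishes; that step is left out of this sketch.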
In FIG. 3B the intelligence sub-process 24 is shown in more detail. In a first state 310, the sequence entries 16 identified to the intelligence sub-process 24 are placed in proper format for communication to the Fast Data Finder hardware. In a next state 311, the parsed data is stored in a data file. In a following state 312, a communication connection such as in the form of a TCP/IP socket is opened to the Fast Data Finder hardware.
Processing continues to a state 313 where the data file is sent over the TCP/IP connection, and the Fast Data Finder hardware is asked to initiate a comparison operation. In state 314 the intelligence sub-process 24 waits for a result. In a final state 315, the intelligence sub-process 24 retrieves the results in the form of a data file from the Fast Data Finder hardware and closes the socket.
The comparison agent 34 then performs the effector sub-process 26 as shown in FIG. 3C. In a first state 320, the results file is parsed to retrieve match data of interest. In a next state 321, a results entry 18 is created for recording the results of the operation. This results entry 18 contains its own identification record 18-1, the sequence identification record 18-2, the agent identification record 18-3 associated with the comparison process 34, and a result record 18-4. In this case, the match data retrieved in state 320 is also written into the result data record 18-4. In state 322, the effector sub-process 26 also adds a record to the results entry 18 such as another result data record 18-4 that indicates that the comparison process 34 has visited the particular target sequence identified by sequence identification record 18-2. This is the final state of the comparison process 34, and at this point, processing returns to state 301 of FIG. 3A in which the comparison agent 34 is again available for processing other data.
By creating the annotation entry in state 322, which specifically indicates that the comparison process 34 has visited a particular target sequence 16, a mechanism therefore exists for tracking which data has been operated on by which version of each particular process. That is, it should be understood that the results entry 18 created in state 322 is preferably a different entry for each version of a sequence comparison process 34 that is running in the system 10. This serves two purposes.
First, if the agent processes 20 fail to complete their tasks, the sequence database 12 is still intact. That is, by not indicating that a process has visited the target sequence until processing of the data is actually completed, the particular sequence record 16 is again a candidate for processing when the comparison agent process 34 comes back online.
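The effect of writing the visited indication only after processing completes, and of keeping a separate indication per agent version, can be sketched as below. The function `process_with_flag`, the dictionary layout, and the string version labels are all assumptions made for this illustration.

```python
def process_with_flag(sequences, results, agent_id, version, tool):
    """Process every entry not yet visited by this (agent, version) pair.

    A result entry, which doubles as the visited flag, is appended only
    after the tool succeeds, so a failed entry remains a candidate.
    """
    key = (agent_id, version)
    visited = {
        r["sequence_id"] for r in results if (r["agent_id"], r["version"]) == key
    }
    processed = []
    for seq_id in sorted(sequences):
        if seq_id in visited:
            continue
        try:
            data = tool(sequences[seq_id])
        except Exception:
            continue  # nothing is written, so the entry is retried on a later pass
        results.append(
            {"sequence_id": seq_id, "agent_id": agent_id, "version": version, "data": data}
        )
        processed.append(seq_id)
    return processed
```

Under these assumptions, introducing a new version label causes every sequence entry to become a candidate again for that version, without any separate update step and without disturbing the result entries written by the older version.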
As mentioned briefly in connection with FIG. 2, a set of agents may also be used to accomplish a series of gene-sequence-related tasks using the system 10. As shown in more detail in FIG. 4, a set of sequence finishing agents 35 implements a sequence finishing process on the gene expression entries 16. The sequence finishing agent set 35 consists of a number of databases including an expressed sequence tag (EST) database 401, a finish database 402, and a full length sequence database 403.
The sequence finishing agent set 35 also consists of a number of agents including a chosen-for-finishing agent 410, a sequence walk agent 411, an assembly halted agent 414, and a sequence assembly agent 415. The purpose of the sequence finishing agent set 35 is to assemble short gene fragments, such as may be available in the expressed sequence tag database 401, into complete full length gene sequences, and then to store the assembled full length sequences in the full length database 403. The sequence finishing process is required in genetic research since most laboratory equipment may obtain only approximately 500 elements of a gene sequence during any one operation, whereas the average length of a gene sequence is typically in excess of 3,000 elements.
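With reads of roughly 500 elements and genes exceeding 3,000 elements, at least six reads are needed per gene even before allowing for overlap. The sketch below shows one simple way fragments could be joined on overlapping ends. It is not the assembly method disclosed here: the function names and the `min_overlap` parameter are hypothetical, and the fragments are assumed to be supplied already in the correct order.

```python
def merge_overlapping(left, right, min_overlap=3):
    """Join two fragments on their longest suffix/prefix overlap.

    Returns None when the overlap is shorter than min_overlap.
    """
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return None


def assemble(fragments, min_overlap=3):
    """Assemble ordered fragments into one sequence, halting if a join fails."""
    assembled = fragments[0]
    for fragment in fragments[1:]:
        merged = merge_overlapping(assembled, fragment, min_overlap)
        if merged is None:
            raise ValueError("assembly halted: no sufficient overlap")
        assembled = merged
    return assembled
```

In this sketch a failed join raises an error, which loosely corresponds to the halted-assembly condition that calls for a technician's attention.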
The various databases associated with the sequence finishing agent set 35 may each be implemented as portions of the sequence database 12. In particular, the express sequence tag database 401 is actually a portion of the sequence database 12 consisting of sequence entries 16 that are in the form of expressed sequence tags . The finish database 402 is an intermediate database that has results entries 18 that are used to keep track of the state of the sequence finishing process. The full length database 403 contains sequence entries 16 that are the desired full length gene sequences .
Each of the sub-processes in the sequence finishing agent set 35 may be implemented as agent processes 20 as previously described. For example, the chosen-for-finishing agent 410 is the first agent to operate on the expressed sequence tag database 401. The chosen-for-finishing agent 410 searches through the expressed sequence tag database 401 for gene fragments of interest. Once a gene fragment of interest is found, it is placed into the finish database 402 by the effector sub-process 26 in the chosen-for-finishing agent 410. A results record containing an annotation indicating that the chosen-for-finishing agent 410 has operated on the sequence entry 16 is then created.
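As a rough illustration of how such an agent might be organized, the Python sketch below splits a chosen-for-finishing agent into sensor, intelligence, and effector steps. The class name, the dictionary-based databases, and the `is_of_interest` predicate are all hypothetical; the specification does not prescribe this structure or any selection criterion.

```python
class ChosenForFinishingAgent:
    """Sketch of an agent with sensor, intelligence, and effector steps."""

    name = "chosen-for-finishing"

    def __init__(self, est_db, finish_db, results, is_of_interest):
        self.est_db = est_db              # EST id -> fragment (assumed layout)
        self.finish_db = finish_db        # fragments selected for finishing
        self.results = results            # (EST id, agent name) -> annotation
        self.is_of_interest = is_of_interest  # hypothetical selection rule

    def sensor(self):
        """Yield ids of ESTs this agent has not yet annotated."""
        for est_id in self.est_db:
            if (est_id, self.name) not in self.results:
                yield est_id

    def intelligence(self, est_id):
        """Decide whether the fragment is of interest."""
        return self.is_of_interest(self.est_db[est_id])

    def effector(self, est_id, chosen):
        """Copy chosen fragments to the finish database and annotate."""
        if chosen:
            self.finish_db[est_id] = self.est_db[est_id]
        self.results[(est_id, self.name)] = {"chosen": chosen}

    def run_once(self):
        # Materialize the candidates first, since the effector edits results.
        for est_id in list(self.sensor()):
            self.effector(est_id, self.intelligence(est_id))
```

Running `run_once` a second time does nothing, because the annotations written by the effector are exactly what the sensor checks before selecting a candidate.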
The sequence walk agent 411 has a sensor sub-process 22 that ensures that the elements necessary to assemble a sequence from gene fragments are available. The sensor process 22 thus looks for expressed sequence tags placed in the finish database 402 by the chosen-for-finishing agent 410. The effector sub-process 24 of the sequence walk agent 411 uses known mathematical and other processing techniques to create a results entry 18 in the finish database 402, as part of its effector process 26, that contains the data required so that the expressed sequence tags may be assembled together in the proper order.
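The specification does not say which techniques the sequence walk agent uses. Purely as an illustration of what "data required so that expressed sequence tags may be assembled in the proper order" could look like, the Python sketch below orders fragments by greedy suffix-to-prefix overlap. The functions `overlap` and `walk_order`, and the minimum overlap of three elements, are assumptions, not the patented method.

```python
def overlap(left, right, min_len=3):
    """Length of the longest suffix of left that is also a prefix of right."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def walk_order(fragments, min_len=3):
    """Return [(fragment id, overlap with previous fragment), ...] or None.

    fragments: dict mapping fragment id -> fragment string (assumed layout).
    """
    ids = sorted(fragments)
    # The starting fragment is the one that no other fragment leads into.
    starts = [i for i in ids
              if not any(overlap(fragments[j], fragments[i], min_len)
                         for j in ids if j != i)]
    if len(starts) != 1:
        return None  # ambiguous or circular layout
    order = [(starts[0], 0)]
    remaining = [i for i in ids if i != starts[0]]
    while remaining:
        last = fragments[order[-1][0]]
        best = max(remaining, key=lambda i: overlap(last, fragments[i], min_len))
        size = overlap(last, fragments[best], min_len)
        if size == 0:
            return None  # gap: no fragment continues the walk
        order.append((best, size))
        remaining.remove(best)
    return order
```

A `None` result corresponds to the situation in which assembly cannot proceed automatically and would need attention from a technician.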
The assembly halted agent 414 may be used to search the finish database 402 for annotations in the results entries 18 that indicate sequence assembly has been halted. The effector process 26 in this agent may then alert a human process technician, such as by sending an e-mail message, that action is needed to attend to the halted assembly. The assembly halted agent 414 may perform other tasks, such as recording an annotation from a technician that the halt is now cleared, in order to create an annotation in the finish database 402 to that effect.
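The behavior of the assembly halted agent can be sketched in Python as follows. The `status`, `notified`, and `cleared_by` fields and the `notify` callback (which in practice might send an e-mail message) are hypothetical names chosen for this example.

```python
def find_halted(finish_results):
    """Sensor step: ids whose annotation says assembly halted, not yet alerted."""
    return [seq_id for seq_id, entry in finish_results.items()
            if entry.get("status") == "halted" and not entry.get("notified")]

def alert_technician(finish_results, notify):
    """Effector step: alert a technician once for each halted assembly."""
    alerted = []
    for seq_id in find_halted(finish_results):
        notify(f"Assembly halted for sequence {seq_id}: action needed")
        finish_results[seq_id]["notified"] = True
        alerted.append(seq_id)
    return alerted

def clear_halt(finish_results, seq_id, technician):
    """Record a technician's annotation that the halt is now cleared."""
    finish_results[seq_id].update(status="cleared", cleared_by=technician)
```

Marking an entry as notified keeps the agent from sending the same alert on every pass through the finish database.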
The sequence assembly agent 415 then completes the assembly of the gene fragments into a full length sequence. The sequence assembly agent 415 thus has a sensor process 22 which obtains annotations relating to how to create complete sequences from the finish database 402, and places these completed sequences in the full length database 403.
FIG. 5 is an illustration of another example of the use of a set of agents 20 to accomplish a particular task. In this illustration, the agents 20 are used to implement a set of annotation functions associated with the full length database 403. In particular, these include a sequence locate agent 508, an inconsistency agent 510, a translation agent 511, a comparison agent 512, and a key word processing agent 513. The purpose of the various agents illustrated in FIG. 5 is to perform a set of operations that create annotations in the form of results records 18 that contain various types of information with respect to the full length sequences stored in the database 403.
For example, a sequence locate agent 508 may perform an initial search for gene sequences in need of annotation. This sequence locate agent 508 may, for example, seek out full length sequences in the database 403 generated by the sequence finishing agent set 35.
An inconsistency agent 510 may then obtain information from an external source, such as an e-mail system, and create results records 18 in the full length database 403 indicating the annotations made by human technicians. For example, the inconsistency agent 510 may receive e-mail messages from the technicians indicating that a particular sequence did not assemble properly.
A translation agent 511 may be used to convert DNA type sequence information contained in the full length database 403 to other forms, such as a predicted protein sequence. The results of the conversion are then stored as results records 18, as for the other agents described above.
A sequence comparison agent 512 may be tasked with searching through the full length database 403 to find sequences related to the particular sequences identified by the sequence locate agent 508. In particular, the comparison agent 512 may obtain information as to sequence analogues that may be available in public sequence databases 540. The comparison agent 512 is thus a particular instance of the agent 34 previously described in connection with FIG. 1.
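The conversion performed by the translation agent 511 can be illustrated with a short Python sketch that translates a DNA string into a predicted protein sequence using the standard genetic code. The function name `translate`, the choice of the first reading frame, and stopping at the first stop codon are assumptions made for this example; the specification does not describe how the translation is carried out.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate the first reading frame of dna up to the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unknown codon
        if aa == "*":
            break  # stop codon ends the predicted protein
        protein.append(aa)
    return "".join(protein)
```

For instance, `translate("ATGGCCTAA")` returns `"MA"`: ATG codes for methionine, GCC for alanine, and TAA is a stop codon. An agent's effector could store such a string as a results record alongside the source sequence.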
Finally, the key word processing agent 513 is one example of an agent which obtains annotation information for the full length sequences from various external databases. These databases may, for example, include public resources such as the MedLine database 520, Internet search engines 521 such as Lycos or Alta Vista, a local information database 522 produced by the laboratory, or other information such as may be available from a public database such as ProSite.
While this invention has been particularly shown and described with references to the preferred embodiments thereof, it will be understood by those of skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

What is claimed is:
1. A genetic database system comprising: a database of gene sequence entries, the gene sequence entries containing gene sequence data; a plurality of programs for processing the gene sequence data; a database of results entries containing results records generated by the application of the programs to the gene sequence entries; a plurality of autonomous agent programs, each agent program comprising:
(A) a sensor process that seeks target gene sequence data to be processed by at least one of the programs;
(B) an intelligence process that invokes the at least one program to process the target gene sequence data; and
(C) an effector process for revising the results records in response to the intelligence process.
2. A database system as in claim 1 wherein each autonomous agent records annotations in the results records in accordance with output produced by at least one program.
3. A database system as in claim 1 wherein each autonomous agent includes a sensor process that determines the availability of system resources.
4. A database system as in claim 1 wherein the sensor process additionally invokes the intelligence process.
5. A database system as in claim 1 wherein the effector process calls a sensor process for processing other target gene sequence data.
6. A database system as in claim 1 wherein the agents run as self-invoking processes in a multitasking data processing environment.
7. A database system as in claim 1 wherein the agents include an effector process which revises the results entries with a gene sequence annotation record.
8. A database system as in claim 1 wherein the results records include a flag record indicating that a particular version of the program has processed at least one of the gene sequence entries.
9. A database system as in claim 1 wherein the results record contains output produced by at least one program.
10. A database system as in claim 1 wherein the agents are implemented as object oriented programs.
11. A database system as in claim 1 wherein the sensor process selects a candidate gene sequence entry for processing by the agent.
12. A database system as in claim 1 wherein the results records comprise a sequence identification record that uniquely identifies the program which created the results entry.
13. A database system as in claim 1 wherein at least one of the agents is an expression database agent having an intelligence process that obtains gene expression data from an expression database external to the database of gene sequence records.
14. A database system as in claim 1 wherein at least one of the agents is a laboratory expression agent having an intelligence process that obtains gene expression data from a laboratory analyzer.
15. A database system as in claim 1 wherein the intelligence process compares gene sequence records.
16. A database system as in claim 1 wherein at least one of the agents is a sequence assembly agent.
17. A database system as in claim 16 wherein a group of the sequence entries comprise an expressed tag sequence database of partial gene sequences, and the sequence assembly agent is a set of agents comprising a chosen-for-finishing agent having an intelligence process which identifies expressed sequence tags that are in need of assembly.
18. A database system as in claim 17 wherein a group of the sequence entries comprise a finish database.
19. A database system as in claim 18 wherein the chosen-for-finishing agent includes an effector process that adds a results entry to indicate the sequence entry in the finish database is in need of sequence assembly.
20. A database system as in claim 18 wherein the sequence finishing agent includes a sequence walk agent which adds a results entry to indicate instructions for finishing assembly of a sequence entry in the finish database.
21. A database system as in claim 18 wherein a group of sequence entries comprise a full length sequence database and the sequence finishing agent includes a sequence assembly agent that creates a sequence entry in the full length sequence database from data in the finish database.
22. A method of storing data in a genetic database system comprising the steps of: storing a database of gene sequence entries, the gene sequence entries containing gene sequence data; storing a plurality of programs using the gene sequence data as input; storing a database of results entries containing results data records generated by running the programs with the gene sequence entries as input; and executing a plurality of autonomous agent programs, each autonomous agent program comprising the steps of:
(A) invoking a sensor process that seeks target gene sequence data to be processed by at least one of the stored programs;
(B) invoking an intelligence process that applies at least one program to process the target data; and
(C) invoking an effector process for revising the results entries in response to invoking the intelligence process.
23. A method as in claim 22 wherein the step of invoking an effector process comprises a step of creating annotation entries in accordance with output produced by at least one program.
24. A database system as in claim 22 wherein the step of invoking a sensor process comprises a step of determining the availability of system resources.
25. A database system as in claim 22 wherein the step of invoking a sensor process comprises a step of autonomously invoking the intelligence process.
26. A database system as in claim 22 wherein the step of invoking an effector process comprises a step of autonomously invoking a sensor process for processing other target gene sequence data.
27. A method as in claim 22 wherein the step of executing the plurality of autonomous agent programs comprises a set of steps of executing self-invoking processes in a multitasking data processing environment.
28. A method as in claim 22 wherein the step of executing the plurality of autonomous agent programs comprises the step of invoking an effector process which revises the results records with annotation data.
29. A method as in claim 28 wherein the step of invoking an effector process comprises revising the results records with a flag record indicating that a particular version of the program has annotated at least one of the gene sequence entries.
30. A method as in claim 28 wherein the step of invoking an effector process comprises revising the results records with output data produced by the program.
31. A method as in claim 22 wherein the step of executing agent programs comprises executing agent programs that are implemented as object oriented programs.
32. A method as in claim 22 wherein the step of invoking a sensor process additionally includes the step of selecting a candidate gene sequence entry for processing by the autonomous agent program.
33. A method as in claim 22 wherein the step of invoking an effector process comprises the step of revising the results records to contain a sequence identification record that uniquely identifies the program which created the results entry.
34. A method as in claim 22 wherein at least one of the autonomous agents is an expression database agent and the step of invoking an intelligence process comprises the step of obtaining gene expression data from an expression database external to the database of gene sequence records.
35. A database system as in claim 22 wherein at least one of the agents is a laboratory expression agent and the step of invoking an intelligence process comprises obtaining gene expression data from a laboratory analyzer.
36. A method as in claim 22 wherein the step of invoking an intelligence process comprises the step of comparing gene sequence records.
37. A method as in claim 22 wherein the step of executing a plurality of autonomous agent programs executes a sequence assembly agent program.
38. A method as in claim 37 wherein a group of the sequence entries comprise an expressed tag sequence database of partial gene sequences, and the step of executing a sequence assembly agent includes a step of executing a chosen-for-finishing agent which invokes an intelligence process which identifies expressed sequence tags that are in need of assembly.
39. A method as in claim 38 wherein the step of storing a database of gene sequence entries comprises a step of storing gene sequence entries comprising a finish database.
40. A method as in claim 39 wherein the step of executing a chosen-for-finishing agent additionally comprises the step of invoking an effector process that adds an annotation entry to the result record to indicate the sequence entry in the finish database is in need of sequence assembly.
41. A method as in claim 39 wherein the step of executing a sequence assembly agent additionally comprises the step of invoking a sequence walk agent that adds a results entry to indicate instructions for finishing assembly of a sequence entry in the finish database.
42. A database system as in claim 39 wherein a group of sequence entries comprise a full length sequence database and the step of executing a sequence assembly agent comprises creating a sequence entry in the full length sequence database from data in the finish database.
PCT/US1998/007327 1997-04-15 1998-04-14 Autonomous intelligent agents for the annotation of genomic databases WO1998047086A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU69677/98A AU6967798A (en) 1997-04-15 1998-04-14 Autonomous intelligent agents for the annotation of genomic databases

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US08/837,963 US5966711A (en) 1997-04-15 1997-04-15 Autonomous intelligent agents for the annotation of genomic databases
US08/837,963 1997-04-15

Publications (2)

Publication Number Publication Date
WO1998047086A2 true WO1998047086A2 (en) 1998-10-22
WO1998047086A3 WO1998047086A3 (en) 1999-02-11

Family

ID=25275899

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1998/007327 WO1998047086A2 (en) 1997-04-15 1998-04-14 Autonomous intelligent agents for the annotation of genomic databases

Country Status (3)

Country Link
US (1) US5966711A (en)
AU (1) AU6967798A (en)
WO (1) WO1998047086A2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7246315B1 (en) 2000-05-10 2007-07-17 Realtime Drama, Inc. Interactive personal narrative agent system and method

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3985369B2 (en) * 1998-08-21 2007-10-03 株式会社セガ Game screen display control method, character movement control method, game machine, and recording medium recording program
US7058517B1 (en) 1999-06-25 2006-06-06 Genaissance Pharmaceuticals, Inc. Methods for obtaining and using haplotype data
DE1233365T1 (en) * 1999-06-25 2003-03-20 Genaissance Pharmaceuticals Method for producing and using haplotype data
US20030097222A1 (en) * 2000-01-25 2003-05-22 Craford David M. Method, system, and computer software for providing a genomic web portal
US6675166B2 (en) 2000-02-09 2004-01-06 The John Hopkins University Integrated multidimensional database
US6772026B2 (en) 2000-04-05 2004-08-03 Therics, Inc. System and method for rapidly customizing design, manufacture and/or selection of biomedical devices
US6931326B1 (en) 2000-06-26 2005-08-16 Genaissance Pharmaceuticals, Inc. Methods for obtaining and using haplotype data
US6681186B1 (en) 2000-09-08 2004-01-20 Paracel, Inc. System and method for improving the accuracy of DNA sequencing and error probability estimation through application of a mathematical model to the analysis of electropherograms
US20020095585A1 (en) * 2000-10-18 2002-07-18 Genomic Health, Inc. Genomic profile information systems and methods
US8010295B1 (en) * 2000-11-06 2011-08-30 IB Security Holders LLC System and method for selectively classifying a population
US20020129342A1 (en) * 2001-03-07 2002-09-12 David Kil Data mining apparatus and method with user interface based ground-truth tool and user algorithms
CA2377213A1 (en) * 2001-03-20 2002-09-20 Ortho-Clinical Diagnostics, Inc. Method for providing clinical diagnostic services
US7251568B2 (en) * 2001-04-18 2007-07-31 Wyeth Methods and compositions for regulating bone and cartilage formation
AU2002305193A1 (en) * 2001-04-18 2002-11-05 Wyeth Methods and reagents for regulating bone and cartilage formation
US6996477B2 (en) 2001-04-19 2006-02-07 Dana Farber Cancer Institute, Inc. Computational subtraction method
US7155453B2 (en) * 2002-05-22 2006-12-26 Agilent Technologies, Inc. Biotechnology information naming system
US20020194201A1 (en) * 2001-06-05 2002-12-19 Wilbanks John Thompson Systems, methods and computer program products for integrating biological/chemical databases to create an ontology network
US20020194154A1 (en) * 2001-06-05 2002-12-19 Levy Joshua Lerner Systems, methods and computer program products for integrating biological/chemical databases using aliases
US20040267458A1 (en) * 2001-12-21 2004-12-30 Judson Richard S. Methods for obtaining and using haplotype data
MXPA06011259A (en) * 2004-04-01 2007-01-26 Neomedia Tech Inc System and method of using dna for linking to network resources.
US7937250B2 (en) * 2007-04-27 2011-05-03 International Business Machines Corporation Method and system for addressing non-functional concerns
US8176074B2 (en) * 2009-10-28 2012-05-08 Sap Ag Methods and systems for querying a tag database
WO2011137368A2 (en) 2010-04-30 2011-11-03 Life Technologies Corporation Systems and methods for analyzing nucleic acid sequences
US9268903B2 (en) 2010-07-06 2016-02-23 Life Technologies Corporation Systems and methods for sequence data alignment quality assessment
US9317861B2 (en) * 2011-03-30 2016-04-19 Information Resources, Inc. View-independent annotation of commercial data

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0646883A1 (en) * 1993-09-27 1995-04-05 Hitachi Device Engineering Co., Ltd. Gene database retrieval system

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5404295A (en) * 1990-08-16 1995-04-04 Katz; Boris Method and apparatus for utilizing annotations to facilitate computer retrieval of database material
US5379420A (en) * 1991-12-26 1995-01-03 Trw Inc. High-speed data searching apparatus and method capable of operation in retrospective and dissemination modes
US6114114A (en) * 1992-07-17 2000-09-05 Incyte Pharmaceuticals, Inc. Comparative gene transcript analysis
US5546577A (en) * 1994-11-04 1996-08-13 International Business Machines Corporation Utilizing instrumented components to obtain data in a desktop management interface system
BR9606931A (en) * 1995-01-23 1997-11-11 British Telecomm Information access system and process for monitoring the insertion of information into a data store
JPH11501741A (en) * 1995-01-27 1999-02-09 インサイト ファーマシューティカルズ インク. Computer system for storing and analyzing microbiological data
US5618672A (en) * 1995-06-02 1997-04-08 Smithkline Beecham Corporation Method for analyzing partial gene sequences
US5793964A (en) * 1995-06-07 1998-08-11 International Business Machines Corporation Web browser system
US5745754A (en) * 1995-06-07 1998-04-28 International Business Machines Corporation Sub-agent for fulfilling requests of a web browser using an intelligent agent and providing a report

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LI P ET AL: "An efficient delivery of historical information for the Mendelian Inheritance in Man database" NINETEENTH ANNUAL SYMPOSIUM ON COMPUTER APPLICATIONS IN MEDICAL CARE. TOWARD COST-EFFECTIVE CLINICAL COMPUTING. PROCEEDINGS, PROCEEDINGS OF NINETEENTH ANNUAL SYMPOSIUM ON COMPUTER APPLICATIONS IN MEDICAL CARE, NEW ORLEANS, LA, USA, 28 OCT.-1 NOV. 199, pages 127-131, XP002080067 ISBN 1-56053-123-1, 1995, Philadelphia, PA, USA, Hanley & Belfus, USA *
MARKOWITZ V M ET AL: "Data management tools for genomic applications: a progress report" DATABASE AND EXPERT SYSTEMS APPLICATIONS. 4TH INTERNATIONAL CONFERENCE, DEXA '93 PROCEEDINGS, PROCEEDINGS OF 4TH INTERNATIONAL CONFERENCE ON DATABASE AND EXPERT SYSTEMS APPLICATIONS, PRAGUE, CZECHOSLOVAKIA, 6-8 SEPT. 1993, pages 529-540, XP002080066 ISBN 3-540-57234-1, 1993, Berlin, Germany, Springer-Verlag, Germany *

Also Published As

Publication number Publication date
WO1998047086A3 (en) 1999-02-11
AU6967798A (en) 1998-11-11
US5966711A (en) 1999-10-12

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE GH GM GW HU ID IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG UZ VN YU ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW SD SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
AK Designated states

Kind code of ref document: A3

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE GH GM GW HU ID IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG UZ VN YU ZW

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): GH GM KE LS MW SD SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

NENP Non-entry into the national phase

Ref country code: JP

Ref document number: 1998544134

Format of ref document f/p: F

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: CA