WO2002025564A1 - A system, method and interface for building biological databases using templates - Google Patents

A system, method and interface for building biological databases using templates

Info

Publication number
WO2002025564A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
templates
database
bioinformatics
databases
Prior art date
Application number
PCT/SG2000/000155
Other languages
French (fr)
Inventor
Vladimir Brusic
Christian Schonbach
Lie Yong Judice Koh
Original Assignee
Kent Ridge Digital Labs
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kent Ridge Digital Labs
Priority to PCT/SG2000/000155
Priority to GB0306836A (GB2383452B)
Publication of WO2002025564A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B45/00 ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks

Definitions

  • This invention relates to a system, method and interface for building biological databases using templates and particularly, but not exclusively, to such systems, methods and interfaces for immunological databases or databases for MHC molecules.
  • Genbank, SWISS-PROT, and other general-purpose databases are the main source of biological information, but they have data entries from various species.
  • Bioinformatics is a field dealing with biomolecular, related structural, related functional, related clinical, and related biochemical data.
  • a standard practice for reusing databases is creation of a subset of an existing bioinformatics database (for example, creating a subset of Genbank that contains only swine sequences).
  • Maslyn et al., US Patent 5,953,727 relates to a relational bioinformatics database suitable for cataloguing and searching sequences according to association with one or more projects.
  • the present invention provides a system for building and integrating of databases based on combining template modules using multiple dimensions, or views, to data and templates for database tools.
  • Eilbeck K. et al., ISMB 1999, 87-105 is an object oriented database created to provide scientists with a resource for examining protein-protein interactions and inferring possible interactions from the data stored.
  • the present invention is a system for building databases using templates. Some of the dimensions (views) can use, but are not restricted to, an object-oriented data design.
  • Nowacki et al., Nucleic Acids Res 1998 Jan 1; 26(1): 220-5 describes a database of nucleotide variation using a relational data model.
  • the present invention is a system for building of databases using templates.
  • Tzolkin system facilitates the expression of clinical queries to reduce the manual data processing that users must undertake to decipher the answers to their queries. This approach is general, facilitates software reuse, and thus decreases the cost of building new software systems that require this functionality. Tzolkin facilitates software reuse for generating clinical queries, while the present invention facilitates software reuse for building bioinformatics-related databases.
  • Kleisli system enables complex queries across multiple databases and data integration. Kleisli does not specify the system for the user-end data integration using templates.
  • the present invention can, but is not limited to, use Kleisli for accessing heterogeneous data sources and data extraction.
  • the present invention is a system for building databases using templates.
  • the present invention provides a general system, method, and interfaces for building and integrating databases based on combining template modules using multiple dimensions, or views, to data and templates for database tools.
  • the method may be applicable to domains characterised by complex data where multiple different views to data need to be combined for extraction of information.
  • An example application is from bioinformatics, where multiple databases using this system and method may be built.
  • the information extraction and data management is based on using templates, each of which is designed for a specific purpose.
  • This invention may also provide a system and method for a relatively efficient creation of bioinformatics databases that concentrate on a particular subject within the bioinformatics field, and may reuse templates related to various views of the subject-related data.
  • the breadth and the depth of the coverage of the resulting database may depend on user specifications.
  • FIG 1 is an illustration of the general arrangement of the present invention
  • Figure 2 is a representation of an example of the main page of an interface
  • Figure 3 is a representation of a second example of the main page of an interface
  • Figure 4 is a representation of the views and links of an exemplary database
  • Figure 5 is a flow chart for the building of a database
  • Figure 6 is an illustration of the design structure for the definition and implementation of a template
  • Figure 7 is a flow-chart for the building of a template
  • Figure 8 is a representation of a graphical user interface for template selection
  • Figure 9 is an illustration of a graphical user interface for template building.
  • Figure 10 is an illustration of a graphical user interface for the input of parameters for the main template.
  • the present invention system uses templates for building a database by loading data into a structure defined by the templates.
  • Data sources can be external (e.g. "Genbank", or other databases) or local. Data can be acquired by a variety of means: manually or by automatic data extraction - for example using the techniques disclosed in (Davidson et al., referred to above). Templates given in Figure 1 include:
  • Database design can be further divided into more specific steps: a) decision on the database content; b) selection of the main template and definition of the main interface page layout; c) selection of the view, tool, and report templates; d) design of new templates, if required; and e) template linking. The data loading steps are: f) data acquisition; and g) data loading into files.
  • the template for the main page (Figures 2 and 3) for each was reused.
  • the SLAD modules 'Retrieve Allele Info', 'Retrieve Epitope Info', and 'Search References' and the FIMM modules 'Diseases', 'Antigens', 'Search FIMM', and 'References' all use the same family of templates that enable keyword searching.
  • variations of these templates include optional selection of data dimensions, selection of the number of output entries, or other possible search limiting criteria.
  • a template preferably consists of an interface page, file formats for data storing, and a set of programs that allow data storing and data retrieval.
  • the interface page may take a standard form such as, for example, the BLAST interface, widely used for the Internet BLAST services; or may be novel such as 'Phylogenetic analysis' of SLAD, used for inter-species sequence comparison.
  • the format of files for data storing is flexible - it depends on the bioinformatics problem related to the question asked.
  • some of the files may be flat files, containing record fields, labels and delimiters, or bin-hexed files suitable for BLAST searches.
  • the list of possible templates is given in Table 1.
  • the present invention is not limited to these templates and other templates for other purposes may be developed.
  • the present invention allows users to build their own databases by selecting the appropriate templates, maintain the databases, and annotate new entries. It also allows users to combine sequence search and analysis tools within the database. It also allows database access and tools to be packaged in a single interface, and brings together the capacity for a user to build the databases and integrate sequence analysis tools. Integrated sequence analysis tools were previously available through packages like GCG (Genetics Computer Group, Wisconsin, USA) but these packages do not enable a user to build databases; they only enable a user to create individual sequence entries as separate files and access them through lists of file names.
  • the present invention allows the building of bioinformatic data warehouses.
  • a data warehouse is a database structured to facilitate analytical tasks, rather than operational purposes.
  • the present invention provides the framework for building bioinformatics warehouses by combining and integrating various data views and analysis tools.
  • Data warehouses are commonly used for performing Knowledge Discovery from Databases (KDD).
  • KDD is defined as the non-trivial process of identifying valid, novel, potentially useful, and understandable patterns in data. Data warehousing has not previously been described in bioinformatics.
  • FIMM and SLAD utilise dimensional modelling, which enables users to form multidimensional views of the relevant facts which are stored in a 'flat' (non-structured), easy-to-comprehend and easy-to-access database. Relational modelling appears too rigid to provide efficient extraction of data for analytical processing needs. An alternative approach, object-oriented modelling, can deal with complex data structures, but is difficult to build and has highly structured data. At the core of the dimensional modelling are fact tables that contain the non-discrete, additive data.
  • Database building using the present invention may be a multi-step process. It preferably consists of template selection, template storage, template building, refinement (if necessary), and integration of the templates into the database.
  • Each database may have at least three dimensions selected from the list consisting of, but not limited to, sequence structural data, sequence functional data, gene expression data, protein expression data, relevant pathology associations, evolutionary data, data on biologically active sites within biomolecular sequences, data on biochemically active sites within biomolecular sequences, pharmacological data, and sequence patterns and motifs.
  • Each of the templates may contain the full set of sub-specifications, or a partial set (i.e. not all templates will have data input-output).
  • the user interface for the template is usually a HTML page which collects the input parameters from the users.
  • An example is given in Figure 10 for the main template.
  • the source and the format of the input data for the template are then specified.
  • the data output format refers to the format of the stored data records.
  • the format of the record to be displayed by the template after processing is then specified.
  • the default format is based on the data output.
  • a set of tools serves as a master copy for the tool specification. As a result, the system then generates the integration logic of the template with other templates based on the specifications.
  • Figure 8 shows a graphical user interface for the template selection.
  • the graphical user interface may have a first polygonal area and a plurality of contained polygonal areas and/or textual links within the first polygonal area; the contained polygonal areas including at least one second polygonal area to enable available templates to be listed, and a third polygonal area to enable the selected templates to be listed.
  • the second polygonal area can display titles of selected templates, and the third polygonal area can specify the data to be entered on the database.
  • a fourth polygonal area may be provided for specifying additional data to be entered on the database.
  • the second polygonal area may be in a plurality of segments, with there being one segment for each template title.
  • the first, second, third and fourth polygonal areas are preferably rectangular, as are each of the plurality of segments.
  • the template shown in Figure 10 shows a preferred form of a graphical user interface for the specification of the input parameters for the main template. It has a first polygonal area which contains a second polygonal area for displaying a list of selected templates, and a third polygonal area for specifying the data to be entered on the database.
  • a contained fourth polygonal area for specifying additional data to be entered on the database may also be provided.
  • the second polygonal area is preferably in a plurality of segments, there being one segment for each template title. It is preferred that the first, second, third and fourth polygonal areas are rectangular, as are the segments. However, other shapes may be used, if desired.
  • Figure 9 shows the GUI for template building. Like the GUIs of Figures 8 and 10, it has a first polygonal area, preferably rectangular, and two contained polygonal areas, which are also preferably rectangular. The first contained area is used to select the template, and the second contained area is used to list the sub-specifications of the template selected.

Abstract

With the above and other objects in mind, the present invention provides a general system, method, and interfaces for building and integrating databases based on combining template modules using multiple dimensions, or views, to data and templates for database tools. The method may be applicable to domains characterised by complex data where multiple different views to data need to be combined for extraction of information. An example application is from bioinformatics, where multiple databases using this system and method may be built. The information extraction and data management is based on using templates, each of which is designed for a specific purpose.

Description

A SYSTEM, METHOD AND INTERFACE FOR
BUILDING BIOLOGICAL DATABASES USING TEMPLATES
Field of the Invention
This invention relates to a system, method and interface for building biological databases using templates and particularly, but not exclusively, to such systems, methods and interfaces for immunological databases or databases for MHC molecules.
Background to the invention
A considerable amount of biological data is available from public and other databases. Biological databases are characterised by various degrees of heterogeneity in that they:
• encode different views of the biological domain;
• utilise different data formats;
• utilise various database management systems;
• utilise different data manipulation languages;
• encode data of various levels of complexity;
• are constantly evolving, and are geographically scattered.
Genbank, SWISS-PROT, and other general-purpose databases, are the main source of biological information, but they have data entries from various species.
There is an increasing need for specialist databases that store more detailed subject-oriented data, compared to general-purpose databases. There is also an increasing need to enable database users (for example researchers) to create their own databases. A common purpose for such a database is to combine one's own data with publicly available data for creating special-purpose and subject-oriented databases suitable for data mining and knowledge discovery. Bioinformatics is a field dealing with biomolecular, related structural, related functional, related clinical, and related biochemical data. A standard practice for reusing databases is the creation of a subset of an existing bioinformatics database (for example, creating a subset of Genbank that contains only swine sequences). Databases that encode different views of biological data exist, but they are heterogeneous and standards for data integration into a unified database are lacking (Markowitz and Ritter, 1995; Brusic and Zeleznikow, 1999). Attempts to build a unified bioinformatics database for storing biological data have failed to date.
Finally, there is a need to create specialist databases for the same subject across different species.
Consideration of the prior art
Bioinformatics databases:
a) Maslyn et al., US Patent 5,953,727 relates to a relational bioinformatics database suitable for cataloguing and searching sequences according to association with one or more projects. The present invention provides a system for building and integrating databases based on combining template modules using multiple dimensions, or views, to data and templates for database tools.
b) Eilbeck K. et al., ISMB 1999, 87-105 is an object oriented database created to provide scientists with a resource for examining protein-protein interactions and inferring possible interactions from the data stored. The present invention is a system for building databases using templates. Some of the dimensions (views) can use, but are not restricted to, an object-oriented data design.
c) Nowacki et al., Nucleic Acids Res 1998 Jan 1; 26(1): 220-5 describes a database of nucleotide variation using a relational data model. The present invention is a system for building databases using templates.
Use of templates
d) Cruz I.F. and Lucas W.T., 1998. Automatic generation of user-defined virtual documents using query and layout templates. Theory and Practice of Object Systems 4(4), 245-260. An authoring, querying, and visualisation framework for multimedia information retrieved from distributed repositories. Users compose virtual documents by visually specifying templates that contain both layout information and query specification. The present invention is used for building bioinformatics databases.
e) Thalhammer-Reyero, July 27, 1999. US patent 5930154. Computer-based system and methods for information storage, modelling and simulation of complex systems organised in discrete compartments in time and space. An integrated computer-based system, methods, and graphical interfaces, providing an environment for development of visual models of complex systems organised in discrete time and space compartments, used for graphic information storage and retrieval, visual modelling and dynamic simulations of said complex systems. The present invention is used for building bioinformatics databases.
f) Barsalou T., 1989. An object-based architecture for biomedical expert database systems. Computer Methods and Programs in Biomedicine 30(2-3):157-168. This discloses an object-oriented system for database structuring and manipulation for expert systems. The present invention can use, but is not restricted to using, an object-oriented design. The present invention is used for building and use of bioinformatics databases.
Reusable databases and systems:
g) Kojima T., Nakata H., Kawagishi M., Uehara T., 1998. A framework for constructing databases for supervisory control systems. Electrical Engineering in Japan 123(1), 32-42. The proposed framework utilises a generation-based approach and object-oriented framework libraries. The present invention is used for building bioinformatics databases.
h) Nguyen J.H., Shahar Y., Tu S.W., Das A.K. and Musen M.A., 1999. Integration of temporal reasoning and temporal-data maintenance into a reusable database mediator to answer abstract, time-oriented queries: The Tzolkin system. Journal of Intelligent Information Systems 13(1-2), 121-145. The Tzolkin system facilitates the expression of clinical queries to reduce the manual data processing that users must undertake to decipher the answers to their queries. This approach is general, facilitates software reuse, and thus decreases the cost of building new software systems that require this functionality. Tzolkin facilitates software reuse for generating clinical queries, while the present invention facilitates software reuse for building bioinformatics-related databases.
i) Gennari J.H., Cheng H.N., Altman R.B. and Musen M.A., 1998. Reuse, CORBA, and knowledge-based systems. International Journal of Human-Computer Studies 49(4), 523-546. They developed a CORBA-based architecture for a library of platform-independent, sharable problem-solving methods and knowledge bases. The aim of this library is to allow developers to reuse these components across different tasks and domains. The present invention can, but is not limited to, use CORBA for extraction of data. The present invention does not necessarily utilise CORBA standards.
Integration of heterogeneous data:
j) Davidson, S.B., Overton, C., Tannen, V., Wong, L., 1997. BioKleisli: a digital library for biomedical researchers. International Journal of Digital Libraries 1(1), 36-53. The Kleisli system enables complex queries across multiple databases and data integration. Kleisli does not specify the system for the user-end data integration using templates. The present invention can, but is not limited to, use Kleisli for accessing heterogeneous data sources and data extraction.
k) Macauley J., Wang H. and Goodman N., 1998. A model system for studying the integration of molecular biology databases. Bioinformatics 14(7):575-582. They tried to build a gene data warehouse by automatic extraction of entries from public databases and discovered numerous errors (up to 20% of entries were determined erroneous by a single criterion). The present invention allows use of templates for expert annotation and is not limited to automatic data acquisition.
l) Chen I.M., Kosky A.S., Markowitz V.M., Szeto E. and Topaloglou T., 1998. Advanced query mechanisms for biological databases. ISMB, 6, 43-51. This describes a system for integrating tools for exploring multiple heterogeneous databases using Object-Protocol-Model. The present invention allows integration of tools based on templates defined for each tool.
Data warehousing
m) Wu O.P., Seow K.T., Wong L., Chung S.Y. and Subbiah S. 1998. From sequence to structure to literature: the protocol approach to bioinformation. Pacific Symposium of Biocomputing, 747-758. They have described a system for data integration and building data warehouses by extracting information from heterogeneous sources. The present invention describes a system for building a bioinformatic database or a data warehouse by using a set of templates and integration with a set of tools for use of this database.
n) Eckman B.A., Aaronson J.S., Borkowski J.A., Bailey W.J., Elliston K.O., Williamson A.R., Blevins R.A., 1998. Bioinformatics. 1998;14(1):2-13 describe a database for storage and use of the expressed sequence tag (EST) data. The present invention is a system for building databases using templates.
o) Sorace J.M. and Canfield K., 1998. Collaborative bioinformatics: data warehouses for targeted experimental results. Journal of Interferon and Cytokine Research 18(9), 799-802. They describe a data warehouse that stores heterogeneous data on measurements of in vitro cellular functions using a single data model. The present invention is a general model for building bioinformatics databases using templates.
Dimensional data model
p) Bunardzic A., 1995. Dimensional modelling: beyond data processing constraints. Medinfo, 8 Pt 1, 520. This describes the dimensional model, which focuses on knowledge of the relevant facts that reflect the business operations and are the real basis for decision support and business analysis. The present invention focuses on the bioinformatics domain.
Knowledge discovery from databases
q) Kolchanov N.A., Ponomarenko M.P., Frolov A.S., Ananko E.A., Kolpakov F.A., Ignatieva E.N., Podkolodnaya O.A., Goryachkovskaya T.N., Stepanenko I.L., Merkulova T.I., Babenko V.V., Ponomarenko Y.N., Kochetov A.N., Podkolodny N.L., Vorobiev D.V., Lavryushev S.N., Grigorovich D.A., Kondrakhin Y.N., Milanesi L., Wingender E., Solovyev N. and Overton G.C. 1999. Integrated databases and computer systems for studying eukaryotic gene expression. Bioinformatics 1999 Jul;15(7):669-686. They describe an integrated database for integration of informational and software resources on the regulation of gene expression, navigation through them and discovery of related knowledge. The present invention is the general system for building and using bioinformatics databases based on template use, suitable for knowledge discovery.
Brusic V. and Zeleznikow J., 1999. Knowledge Discovery and Data Mining in Biological Databases. Knowledge Engineering Review 14(3).
Markowitz V.M. and Ritter O., 1995. Characterising heterogeneous molecular biology database systems. Journal of Computational Biology 2(4), 547-556.
Further references
Altschul S.F. and Gish W., 1996. Methods Enzymol. 266, 460-480.
Bairoch, A., Apweiler, R., 1999. The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999. Nucleic Acids Res. 27, 49-54.
Ballard C., Herreman D., Schau D., Bell R., Kim E. and Valencic A., 1998. Data Modeling Techniques for Data Warehousing. IBM Corporation, International Technical Support Organization, San Jose, California.
Benson, D.A., Boguski, M.S., Lipman, D.J., Ostell, J., Ouellette, B.F., Rapp, B.A., Wheeler, D.L., 1999. GenBank. Nucleic Acids Res. 27, 12-17.
Brusic V. and Zeleznikow J., 1999. Knowledge Discovery and Data Mining in Biological Databases. Knowledge Engineering Review 14(3).
Bunardzic A., 1995. Dimensional Modeling: beyond data processing constraints. Medinfo 8 Pt 1, 520.
Davidson, S.B., Overton, C., Tannen, V., Wong, L., 1997. BioKleisli: a digital library for biomedical researchers. International Journal of Digital Libraries 1(1), 36-53.
Fayyad U., Piatetsky-Shapiro G. and Smyth P., 1996. From data mining to knowledge discovery. AI Magazine 17(3), 37-54.
Fisher L., 1996. Along the infobahn. Strategy & Business, Third Quarter, 1996. Booz-Allen & Hamilton, Inc. <http://www.strategy-business.com/technology/96308>.
Objects of the invention
It is therefore the principal object of the present invention to provide a system which has a framework for relatively fast and relatively efficient:
a) building of specialist bioinformatics databases for different species using the same, or a similar, database structure; and/or
b) building of specialist bioinformatics databases for different molecular forms using the same, or a similar, database structure; and/or
c) building of families of related bioinformatics databases allowing an arbitrary level of complexity; and/or
d) selective combination of private data with public data into new bioinformatics databases; and/or
e) building of data warehouses for data mining and knowledge discovery in life sciences; and/or
f) enabling of the building of data warehouses that integrate data with database search and analysis tools, without a requirement for a significant bioinformatics background.
Summary of the Invention
With the above and other objects in mind, the present invention provides a general system, method, and interfaces for building and integrating databases based on combining template modules using multiple dimensions, or views, to data and templates for database tools. The method may be applicable to domains characterised by complex data where multiple different views to data need to be combined for extraction of information. An example application is from bioinformatics, where multiple databases using this system and method may be built. The information extraction and data management is based on using templates, each of which is designed for a specific purpose. This invention may also provide a system and method for a relatively efficient creation of bioinformatics databases that concentrate on a particular subject within the bioinformatics field, and may reuse templates related to various views of the subject-related data. The breadth and the depth of the coverage of the resulting database may depend on user specifications.
Description of the drawings
In order that the invention may be readily understood and put into practical effect, there shall now be described by way of non-limitative example only a preferred embodiment of the present invention, the description being with reference to the accompanying illustrative drawings in which:
Figure 1 is an illustration of the general arrangement of the present invention;
Figure 2 is a representation of an example of the main page of an interface;
Figure 3 is a representation of a second example of the main page of an interface;
Figure 4 is a representation of the views and links of an exemplary database;
Figure 5 is a flow chart for the building of a database;
Figure 6 is an illustration of the design structure for the definition and implementation of a template;
Figure 7 is a flow-chart for the building of a template;
Figure 8 is a representation of a graphical user interface for template selection;
Figure 9 is an illustration of a graphical user interface for template building; and
Figure 10 is an illustration of a graphical user interface for the input of parameters for the main template.
Description of preferred embodiment.
The present invention system uses templates for building a database by loading data into a structure defined by the templates.
The general operation of the present invention is given in Figure 1. Data sources can be external (e.g. "Genbank", or other databases) or local. Data can be acquired by a variety of means: manually or by automatic data extraction - for example using the techniques disclosed in (Davidson et al., referred to above). Templates given in Figure 1 include:
• a main template that defines data dimensions and relevant tools for the target database;
• templates for various data dimensions;
• templates for specific tools to be used for searching and analysis of the target database; and
• templates for search and analysis reports.
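As a rough illustration of the four kinds of template listed above, the following Python sketch models a main template that records the dimension, tool, and report templates chosen for one target database. The class and field names (`TemplateKind`, `Template`, `MainTemplate`, `add`) are hypothetical and are not taken from the specification; the sketch only shows one way the relationship could be represented.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class TemplateKind(Enum):
    DIMENSION = "dimension"  # a view onto the data
    TOOL = "tool"            # a search or analysis tool
    REPORT = "report"        # a search or analysis report


@dataclass
class Template:
    name: str
    kind: TemplateKind


@dataclass
class MainTemplate:
    """Hypothetical main template: names the dimensions and tools of one target database."""
    database_name: str
    dimensions: List[Template] = field(default_factory=list)
    tools: List[Template] = field(default_factory=list)
    reports: List[Template] = field(default_factory=list)

    def add(self, template: Template) -> None:
        # File each template under the slot that matches its kind.
        slots = {
            TemplateKind.DIMENSION: self.dimensions,
            TemplateKind.TOOL: self.tools,
            TemplateKind.REPORT: self.reports,
        }
        slots[template.kind].append(template)
```

Keeping the selected templates on the main template mirrors the statement that the main template defines the data dimensions and relevant tools for the target database.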
Templates for additional functions are to be defined and created as required.
Database building using the present invention has two essential elements: database design, and data loading. Database design can be further divided into more specific steps:
a) decision on the database content;
b) selection of the main template and definition of the main interface page layout;
c) selection of the view, tool, and report templates;
d) design of new templates, if required; and
e) template linking.
The data loading steps are:
f) data acquisition; and
g) data loading into files.
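The design steps a) to e) and the loading steps f) and g) can be sketched as two small functions. This is a minimal illustration under stated assumptions: the function names, the dictionary layout, and the use of a JSON file as the storage format are all hypothetical, since the specification leaves the file formats open.

```python
import json
from pathlib import Path
from typing import Dict, List


def design_database(content: str, main_template: str, templates: List[str]) -> Dict:
    """Steps a) to e): record the design decisions and link the templates."""
    return {
        "content": content,              # a) decision on the database content
        "main_template": main_template,  # b) main template and page layout
        "templates": list(templates),    # c), d) selected or newly designed templates
        # e) template linking: here every template is linked from the main page
        "links": {main_template: list(templates)},
    }


def load_data(design: Dict, records: List[Dict], target_dir: Path) -> Path:
    """Steps f) and g): take already-acquired records and load them into a file."""
    target_dir.mkdir(parents=True, exist_ok=True)
    data_file = target_dir / "records.json"  # assumed file name and format
    data_file.write_text(json.dumps({"design": design, "records": records}, indent=2))
    return data_file
```

Separating design from loading reflects the two essential elements named in the text, so the same design could be reused with data acquired from a different source.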
These steps will be explained using an example that involves an exemplary swine leukocyte antigen (SLA) database, SLAD (Figure 2), which was used as the template for a database of functional immunology, FIMM (Figure 3).
The contents of the two databases differ, but several templates are used in both. The template for the main page (Figures 2 and 3) was reused for each. The SLAD modules 'Retrieve Allele Info', 'Retrieve Epitope Info', and 'Search References' and the FIMM modules 'Diseases', 'Antigens', 'Search FIMM', and 'References' all use the same family of templates that enable keyword searching.
The variations of these templates include optional selection of data dimensions, selection of the number of output entries, or other possible search limiting criteria.
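A keyword-search template with these variations could look roughly like the sketch below. It assumes records are held as Python dictionaries; the function name `keyword_search`, its parameters and the sample records are hypothetical placeholders, not data or code from SLAD or FIMM.

```python
def keyword_search(records, keyword, dimensions=None, max_entries=None):
    """Return the records that contain `keyword` in any searched field.

    records:     list of dicts mapping field name to text
    dimensions:  optional field names to restrict the search to
    max_entries: optional cap on the number of records returned
    """
    needle = keyword.lower()
    hits = []
    for record in records:
        if max_entries is not None and len(hits) >= max_entries:
            break
        fields = record.keys() if dimensions is None else dimensions
        if any(needle in str(record.get(f, "")).lower() for f in fields):
            hits.append(record)
    return hits


# Placeholder records, invented for illustration.
RECORDS = [
    {"id": "rec-1", "allele": "allele A", "notes": "placeholder epitope note"},
    {"id": "rec-2", "allele": "allele B", "notes": "placeholder reference note"},
    {"id": "rec-3", "allele": "epitope-linked allele C", "notes": "none"},
]

all_hits = keyword_search(RECORDS, "epitope")
allele_hits = keyword_search(RECORDS, "epitope", dimensions=["allele"])
first_hit = keyword_search(RECORDS, "epitope", max_entries=1)
```

The same function serves several modules simply by changing which fields are searched and how many entries are returned, which is the sense in which one template family is reused.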
The BLAST search (Altschul and Gish, 1996) templates were used in the modules 'BLAST MHC Databases' (SLA) as well as in 'Blast Antigens' and 'Blast HLA' (FIMM). Other modules used in SLAD and FIMM include modules providing physical maps of genes, sequence alignments for proteins and DNA, phylogenetic analysis, finding and analysis of peptide binding sites, display of 3-D structure of molecules, internet links, and motif searching. Templates for other analysis and search queries related to bioinformatics problems can be added and integrated into the template library.
A template preferably consists of an interface page, file formats for data storing, and a set of programs that allow data storing and data retrieval. The interface page may take a standard form such as, for example, the BLAST interface, widely used for the Internet BLAST services; or may be novel, such as 'Phylogenetic analysis' of SLAD, used for inter-species sequence comparison. The format of files for data storing is flexible; it depends on the bioinformatics problem related to the question asked. For SLAD and FIMM, some of the files may be flat files, containing record fields, labels and delimiters, or bin-hexed files suitable for BLAST searches.
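The flat-file storage mentioned above can be illustrated with a small parser. The layout assumed here, one `LABEL: value` pair per line with `//` closing each record, is an assumption made for the example; the actual SLAD and FIMM file layouts are not specified in this description.

```python
def parse_flat_file(text, field_sep=":", record_sep="//"):
    """Parse labelled, delimited flat-file text into a list of dicts."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line == record_sep:            # delimiter closes the current record
            if current:
                records.append(current)
                current = {}
            continue
        label, _, value = line.partition(field_sep)
        current[label.strip()] = value.strip()
    if current:                           # final record without a trailing delimiter
        records.append(current)
    return records


# Placeholder content, invented for illustration.
SAMPLE = """\
ID: entry-1
SEQUENCE: ACGT
NOTE: placeholder record
//
ID: entry-2
SEQUENCE: TTGA
//
"""

parsed = parse_flat_file(SAMPLE)
```

A storing program would write the same layout back out, so that the retrieval programs of a template only ever see one file format.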
The list of possible templates is given in Table 1. The present invention is not limited to these templates and other templates for other purposes may be developed. The present invention allows users to build their own databases by selecting the appropriate templates, maintain the databases, and annotate new entries. It also allows users to combine sequence search and analysis tools within the database. It also allows database access and tools to be packaged in a single interface, and brings together the capacity for a user to build the databases and integrate sequence analysis tools. Integrated sequence analysis tools were previously available through packages like GCG (Genetics Computer Group, Wisconsin, USA) but these packages do not enable a user to build databases; they only enable a user to create individual sequence entries as separate files and access them through lists of file names.
The present invention allows the building of bioinformatic data warehouses. A data warehouse is a database structured to facilitate analytical tasks, rather than operational purposes. The present invention provides the framework for building bioinformatics warehouses by combining and integrating various data views and analysis tools. Data warehouses are commonly used for performing Knowledge Discovery from Databases (KDD). KDD is defined as the non-trivial process of identifying valid, novel, potentially useful, and understandable patterns in data. Data warehousing has not previously been described in bioinformatics.
Table 1
[Table 1 is reproduced as images in the original publication: imgf000014_0001 and imgf000015_0001.]
FIMM and SLAD utilise dimensional modelling, which enables users to form multidimensional views of the relevant facts, which are stored in a 'flat' (non-structured), easy-to-comprehend and easy-to-access database. Relational modelling appears too rigid to provide efficient extraction of data for analytical processing needs. An alternative approach, object-oriented modelling, can deal with complex data structures, but is difficult to build and has highly structured data. At the core of dimensional modelling are fact tables that contain the non-discrete, additive data.
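To make the idea of a fact table with dimensional views concrete, the sketch below aggregates one additive measure along different dimensions. The field names, the measure `epitope_count` and all values are invented for illustration and are not taken from FIMM or SLAD.

```python
# Fact table: each row holds keys into the dimensions plus an additive measure.
FACTS = [
    {"antigen_id": "ag1", "disease_id": "d1", "epitope_count": 3},
    {"antigen_id": "ag1", "disease_id": "d2", "epitope_count": 2},
    {"antigen_id": "ag2", "disease_id": "d1", "epitope_count": 5},
]

# Dimension tables: descriptive attributes for each key.
DIMENSIONS = {
    "antigen": {"ag1": {"name": "antigen one"}, "ag2": {"name": "antigen two"}},
    "disease": {"d1": {"name": "disease one"}, "d2": {"name": "disease two"}},
}


def view(facts, dimensions, by, measure="epitope_count"):
    """Sum the additive measure along the dimension named `by`."""
    totals = {}
    for row in facts:
        label = dimensions[by][row[f"{by}_id"]]["name"]
        totals[label] = totals.get(label, 0) + row[measure]
    return totals


by_antigen = view(FACTS, DIMENSIONS, "antigen")
by_disease = view(FACTS, DIMENSIONS, "disease")
```

Because the measure is additive, the same flat fact table yields a different view for each dimension without restructuring the stored data.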
The multidimensional views of the FIMM database and their links are shown in Table 2. Data from various views in FIMM are linked, providing the ability to produce series related reports. The links of the FIMM database are given in Figure 4.
Table 2
Database building using the present invention may be a multi-step process. It preferably consists of template selection, template storage, template building, refinement (if necessary), and integration of the templates into the database.
The process of database building is given in Figure 5. The general template design structure is given in Figure 6. The process of building individual templates is given in Figure 7. Examples of various graphical user interfaces are given in Figures 8, 9 and 10.
Each database may have at least three dimensions selected from the list consisting of, but not limited to, sequence structural data, sequence functional data, gene expression data, protein expression data, relevant pathology associations, evolutionary data, data on biologically active sites within biomolecular sequences, data on biochemically active sites within biomolecular sequences, pharmacological data, and sequence patterns and motifs.
To now refer to Figure 5, the steps are to:
• select a set of templates from a master list of templates to be integrated into the database. Refer to Figure 8 for the GUI;
• store the set of selected templates for the database;
• complete the specifications for each template. Each template consists of a set of sub-specifications (refer to the Figure 6). Further details of this step are given in Figure 7. Refer to the Figure 9 for the GUI;
• store the specifications and other required information;
• confirm the integration and correct the specifications if necessary; then
• conduct integration and building of the database based on the information collected from the storing step described above.
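The steps listed above can be sketched as a single function. This is a simplified illustration under stated assumptions: storage is an in-memory dictionary rather than files, and `build_database`, `complete_spec` and `confirm` are hypothetical names for callbacks that a GUI such as those of Figures 8 and 9 would supply.

```python
def build_database(master_list, chosen, complete_spec, confirm):
    """Walk the steps of Figure 5 with in-memory stand-ins for storage."""
    # Select templates from the master list.
    unknown = [name for name in chosen if name not in master_list]
    if unknown:
        raise KeyError(f"not in master list: {unknown}")
    # Store the set of selected templates.
    store = {"selected": list(chosen), "specs": {}}
    # Complete and store the specifications for each template.
    for name in store["selected"]:
        store["specs"][name] = complete_spec(name)
    # Confirm integration; the callback may return a corrected specification.
    for name in store["selected"]:
        store["specs"][name] = confirm(name, store["specs"][name])
    # Integrate: the built database is described by its templates and specs.
    return {"templates": store["selected"], "specs": store["specs"]}


# Placeholder master list and callbacks, invented for illustration.
master = {"main page", "keyword search", "BLAST search", "references"}
db = build_database(
    master,
    ["main page", "keyword search"],
    complete_spec=lambda name: {"interface": f"{name}.html"},
    confirm=lambda name, spec: {**spec, "confirmed": True},
)
```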
In Figure 6 there is shown the design structure for the definition and implementation of a template. Each of the templates may contain the full set of sub-specifications, or a partial set (i.e. not all templates will have data input-output).
To refer now to Figure 7, the user interface for the template is usually an HTML page which collects the input parameters from the users. An example is given in Figure 10 for the main template.
The source and the format of the input data for the template are then specified.
Since the input data will be in heterogeneous formats, they may need to be reformatted before storing into the database. The data output format refers to the format of the stored data records. The format of the record to be displayed by the template after processing is then specified. The default format is based on the data output.
The tools and the procedures to be used to process the data are then specified. A set of tools serves as a master copy for the tool specification. As a result, the system then generates the integration logic of the template with other templates based on the specifications.
The specifications are finally confirmed.
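The sub-specifications collected in this process might be held in a structure like the one below. This is a sketch only; `TemplateSpec` and its field names are hypothetical, and every field is optional because a template may carry only a partial set of sub-specifications.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TemplateSpec:
    name: str
    interface_page: Optional[str] = None   # usually an HTML page
    input_source: Optional[str] = None     # where the input data come from
    input_format: Optional[str] = None     # format of the input data
    output_format: Optional[str] = None    # format of the stored records
    display_format: Optional[str] = None   # format shown after processing
    tools: list = field(default_factory=list)
    confirmed: bool = False

    def __post_init__(self):
        # The default display format is based on the data output format.
        if self.display_format is None:
            self.display_format = self.output_format

    def needs_reformatting(self):
        """True when input data must be converted before being stored."""
        return (self.input_format is not None
                and self.output_format is not None
                and self.input_format != self.output_format)


# Placeholder specifications, invented for illustration.
allele = TemplateSpec("allele info", input_format="external", output_format="flat")
links = TemplateSpec("internet links")   # partial set: no data input-output
```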
Figure 8 shows a graphical user interface for the template selection. The graphical user interface may have a first polygonal area and a plurality of contained polygonal areas and/or textual links within the first polygonal area; the contained polygonal areas including at least one second polygonal area to enable available templates to be listed, and a third polygonal area to enable the selected templates to be listed. The second polygonal area can display titles of selected templates, and the third polygonal area can specify the data to be entered on the database.
A fourth polygonal area may be provided for specifying additional data to be entered on the database. The second polygonal area may be in a plurality of segments, with there being one segment for each template title. The first, second, third and fourth polygonal areas are preferably rectangular, as are each of the plurality of segments.
Figure 10 shows a preferred form of a graphical user interface for the specification of the input parameters for the main template. It has a first polygonal area which contains a second polygonal area for displaying a list of selected templates, and a third polygonal area for specifying the data to be entered on the database.
A contained fourth polygonal area for specifying additional data to be entered on the database may also be provided. The second polygonal area is preferably in a plurality of segments, there being one segment for each template title. It is preferred that the first, second, third and fourth polygonal areas are rectangular, as are the segments. However, other shapes may be used, if desired.
Figure 9 shows the GUI for template building. Like the GUIs of Figures 8 and 10, it has a first polygonal area, preferably rectangular, and two contained polygonal areas, which are also preferably rectangular. The first contained area is used to select the template, and the second contained area is used to list the sub-specifications of the template selected.
Whilst there has been described in the foregoing description preferred embodiments of the present invention, it will be understood by those skilled in the technology that many variations or modifications in the specific details may be made without departing from the present invention.

Claims

1) A computer system for creation of at least one bioinformatics database, other than creating subsets of an existing database, wherein: a) the bioinformatics database has records that comprise sequence records using a dimensional model identifying at least one view to data, b) a user interface allowing the extraction of information and analysis of data in the bioinformatics database, and, c) a library of re-usable templates for establishing structure for the bioinformatics database.
2) The computer system of claim 1, wherein a new structure for the bioinformatics database can be produced by combining templates.
3) The computer system of claim 1, wherein new entries are added to the at least one bioinformatics database with new entries being linked by using update templates.
4) A computer system as claimed in claim 1, wherein a new structure for the bioinformatics database can be created by combining templates; and new entries are added to the at least one bioinformatics database with new entries being linked by using update templates.
5) The computer system of claim 1, wherein the system is used to produce bioinformatics data warehouses.
6) The computer system of claim 1, wherein the at least one bioinformatics database is used for the purposes selected from the list comprising one or more of knowledge discovery and data mining.
7) A computer system as claimed in claim 1, wherein the bioinformatics database is selected from the list comprising immunological databases and MHC-molecules-related databases.
8) A computer system as claimed in claim 1, wherein there are at least three different views to data.
9) A computer system as claimed in claim 8, wherein there is a first view to data which is a nucleotide or protein sequence with basic annotation.
10) A computer system as claimed in claim 9, wherein there are second and subsequent views to data each of which contains at least one view in relation to the sequence of the first view, the second and subsequent views being selected from the list comprising structural data, functional data, peptide data, references, disease association, gene expression, relevant pathology associations, evolutionary data, MHC data, active sites data, pharmacological data, and biological pathways.
11) A computer system as claimed in claim 1, wherein each bioinformatics database has at least three dimensions selected from the list consisting of sequence structural data, sequence functional data, gene expression data, protein expression data, relevant pathology associations, evolutionary data, data on biologically active sites within biomolecular sequences, data on biochemically active sites within biomolecular sequences, pharmacological data, sequence patterns and motifs, and biological pathways.
12) A method for creating multiple related bioinformatics databases, other than creating subsets of existing databases, including: a) selecting a main template; b) defining a main interface page outlay; c) establishing a library of re-usable templates to enable a structure for the bioinformatics database to be established; and d) linking the templates.
13) The method of claim 12, wherein a new structure for the bioinformatics database is produced by combining templates.
14) The method of claim 12, wherein new entries are added to the at least one bioinformatics database with new entries being linked by using updated templates.
15) The method of claim 12, wherein a new structure for the bioinformatics database is created by combining templates; and new entries are added to the at least one bioinformatics database with new entries being linked by using update templates.
16) The method of claim 12, wherein the method is used to produce bioinformatics data warehouses.
17) The method of claim 12, wherein the at least one bioinformatics database is used for the purposes selected from the list comprising one or more of knowledge discovery and data mining.
18) The method of claim 12, wherein the bioinformatics database is selected from the list comprising immunological databases and MHC-molecules-related databases.
19) The method of claim 12, wherein there are at least three different views to data.
20) The method of claim 19, wherein there is a first view to data which is a nucleotide or protein sequence with basic annotation.
21) The method of claim 20, wherein there are second and subsequent views to data each of which contains at least one view in relation to the sequence of the first view, the second and subsequent views being selected from the list comprising structural data, functional data, peptide data, references, disease association, gene expression, relevant pathology associations, evolutionary data, MHC data, active sites data, pharmacological data, and biological pathways.
22) The method of claim 12, wherein each bioinformatics database has at least three dimensions selected from the list consisting of sequence structural data, sequence functional data, gene expression data, protein expression data, relevant pathology associations, evolutionary data, data on biologically active sites within biomolecular sequences, data on biochemically active sites within biomolecular sequences, pharmacological data, sequence patterns and motifs, and biological pathways.
23) A graphical user interface for use in creating multiple related bioinformatics databases, the graphical user interface having a first polygonal area and a plurality of contained polygonal areas within the first polygonal area; the contained polygonal areas including at least one second polygonal area to enable available templates to be listed, and a third polygonal area to enable the selected templates to be listed.
24) A graphical user interface for use in creating multiple related bioinformatics databases, the interface having a first polygonal area and a plurality of contained polygonal areas within the first polygonal area; the contained polygonal areas including a second polygonal area for displaying titles of selected templates, and a third polygonal area for specifying the data to be entered on the database.
25) A graphical user interface as claimed in claim 25, wherein the contained polygonal areas include a fourth polygonal area for specifying additional data to be entered on the database.
26) A graphical user interface as claimed in claim 25, wherein the second polygonal area is in a plurality of segments, there being one segment for each template title.
27) A graphical user interface as claimed in claim 24, wherein the first, second and third polygonal areas are rectangular.
28) A graphical user interface as claimed in claim 26, wherein the fourth polygonal area is rectangular.
29) A graphical user interface as claimed in claim 27, wherein each of the plurality of segments is rectangular.
PCT/SG2000/000155 2000-09-25 2000-09-25 A system, method and interface for building biological databases using templates WO2002025564A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/SG2000/000155 WO2002025564A1 (en) 2000-09-25 2000-09-25 A system, method and interface for building biological databases using templates
GB0306836A GB2383452B (en) 2000-09-25 2000-09-25 A system,method and interface for building biological databases using templates

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/SG2000/000155 WO2002025564A1 (en) 2000-09-25 2000-09-25 A system, method and interface for building biological databases using templates

Publications (1)

Publication Number Publication Date
WO2002025564A1 true WO2002025564A1 (en) 2002-03-28

Family

ID=20428867

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2000/000155 WO2002025564A1 (en) 2000-09-25 2000-09-25 A system, method and interface for building biological databases using templates

Country Status (2)

Country Link
GB (1) GB2383452B (en)
WO (1) WO2002025564A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006136055A1 (en) * 2005-06-22 2006-12-28 Zte Corporation A text data mining method
WO2007001195A1 (en) * 2005-06-27 2007-01-04 Biomatters Limited Methods for the maintenance and analysis of biological data
US9070106B2 (en) * 2008-07-14 2015-06-30 International Business Machines Corporation System and method for dynamic structuring of process annotations

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2409916A (en) * 2003-07-04 2005-07-13 Intellidos Ltd Joining query templates to query collated data

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5692107A (en) * 1994-03-15 1997-11-25 Lockheed Missiles & Space Company, Inc. Method for generating predictive models in a computer system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5734883A (en) * 1995-04-27 1998-03-31 Michael Umen & Co., Inc. Drug document production system

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
CHEN ET AL: "Constructing and maintaining scientific database views in the framework of the object-protocol model", SCIENTIFIC AND STATISTICAL DATABASE MANAGEMENT, 1997. PROCEEDINGS., NINTH INTERNATIONAL CONFERENCE ON OLYMPIA, WA, USA 11-13 AUG. 1997, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 11 August 1997 (1997-08-11), pages 237 - 248, XP010245177, ISBN: 0-8186-7952-2 *
CHEN ET AL: "Developing and accessing scientific databases with the OPM data management tools", DATA ENGINEERING, 1997. PROCEEDINGS. 13TH INTERNATIONAL CONFERENCE ON BIRMINGHAM, UK 7-11 APRIL 1997, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 7 April 1997 (1997-04-07), pages 580, XP010218583, ISBN: 0-8186-7807-0 *
CHUNG S Y ET AL: "Kleisli: a new tool for data integration in biology", TRENDS IN BIOTECHNOLOGY, ELSEVIER, AMSTERDAM, NL, vol. 17, no. 9, 1 September 1999 (1999-09-01), pages 351 - 355, XP004179984, ISSN: 0167-7799 *
DAVIDSON S B ET AL: "BioKleisli: a digital library for biomedical researchers", INTERNATIONAL JOURNAL ON DIGITAL LIBRARIES, HEIDELBERG, DE, vol. 1, no. 1, April 1997 (1997-04-01), pages 36 - 53, XP002167355, ISSN: 1432-5012 *
JASON WANG ET AL: "Pattern Discovery in Biomolecular Data", PATTERN DISCOVERY IN BIOMOLECULAR DATA: TOOLS, TECHNIQUES, AND APPLICATIONS, NEW YORK: OXFORD UNIVERSITY PRESS, US, 1999, pages 161,165,172,175, XP002168720, ISBN: 0-19-511940-1 *
PATON ET AL: "Query processing in the TAMBIS bioinformatics source integration system", SCIENTIFIC AND STATISTICAL DATABASE MANAGEMENT, 1999. ELEVENTH INTERNATIONAL CONFERENCE ON CLEVELAND, OH, USA 28-30 JULY 1999, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 28 July 1999 (1999-07-28), pages 138 - 147, XP010348735, ISBN: 0-7695-0046-3 *
RIECHE ET AL: "A federated DBMS-based integrated environment for molecular biology", SCIENTIFIC AND STATISTICAL DATABASE MANAGEMENT, 1994. PROCEEDINGS., SEVENTH INTERNATIONAL WORKING CONFERENCE ON CHARLOTTESVILLE, VA, USA 28-30 SEPT. 1994, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, 28 September 1994 (1994-09-28), pages 118 - 127, XP010100536, ISBN: 0-8186-6610-2 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006136055A1 (en) * 2005-06-22 2006-12-28 Zte Corporation A text data mining method
CN101151843B (en) * 2005-06-22 2010-05-12 中兴通讯股份有限公司 Text data digging method
WO2007001195A1 (en) * 2005-06-27 2007-01-04 Biomatters Limited Methods for the maintenance and analysis of biological data
US9070106B2 (en) * 2008-07-14 2015-06-30 International Business Machines Corporation System and method for dynamic structuring of process annotations

Also Published As

Publication number Publication date
GB0306836D0 (en) 2003-04-30
GB2383452A (en) 2003-06-25
GB2383452B (en) 2005-03-30

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): GB SG US

ENP Entry into the national phase

Ref document number: 0306836

Country of ref document: GB

Kind code of ref document: A

Free format text: PCT FILING DATE = 20000925

Format of ref document f/p: F

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)