WO2005062202A2

WO2005062202A2 - Knowledge management system with ontology based methods for knowledge extraction and knowledge search

Info

Publication number: WO2005062202A2
Application number: PCT/DK2004/000856
Authority: WO
Inventors: Thomas Eskebaek
Original assignee: Thomas Eskebaek
Priority date: 2003-12-23
Filing date: 2004-12-10
Publication date: 2005-07-07
Also published as: DK200301926A; WO2005062202A3

Abstract

A method of performing knowledge extraction from natural language text documents including the steps of reading an input, using semantic based means for extracting concepts and their interrelations from said input text, transforming said input text into a machine understandable knowledge representation so as to provide knowledge libraries from said documents and optionally storing said libraries. The method uses a defined ontology to specify possible semantic relations. Furthermore, the method provides knowledge structures consisting of an arbitrary number of concepts and their interrelations. The method utilises a predetermined mapping between words in the input and concepts and relations in the ontology. These features make a very precise knowledge extraction possible. Thereby, very specific searches can be carried out using the generated knowledge base or library. Additionally, the method is not dependent on the grammatical structures of the sentences in the input text. In that sense, the method for extracting knowledge is language independent.

Description

Knowledge Management System with Ontology Based Methods for Knowledge Extraction and Knowledge Search

The present invention relates to a method of knowledge extraction and search from natural language text documents including the steps of reading an input text, using semantic based means for extracting concepts and their interrelations from said input text, transforming said input text into a machine understandable knowledge representation so as to provide knowledge libraries from said documents, and optionally storing said libraries.

The amount of available information has become immense during the past many years, especially with the development and ever expanding use of the Internet. Even the simplest attempt to find information often results in information overload. Methods for refining a search are therefore needed in order to find the desired information.

A similar situation exists for a business organisation or an institution, where knowledge in natural language documents represents a substantial resource for that business or institution. However, that resource is only as valuable as the ability to extract knowledge from these documents.

Considerable effort has been expended in the development of software for extraction of information or knowledge from natural language documents. Such software is generally in the field of natural language processing or computational linguistics and uses such techniques as parsing and classifying.

Allowing computers to relate to knowledge has endless usages. One of the best examples of such usage is the matter of searching for knowledge. Presently, searching for knowledge is achieved either by searching for specific words (such as for example searching with Google and other online search engines) or by using a categorised index (an example of this could be the categories under which patents are classified). Enabling computers to relate to knowledge on our terms, i.e. in the form of natural language, makes it possible to search for conceptual meanings in natural language instead of specific words or categorised indices. Searching for the complex concept lack of vitamins will, for instance, in the known systems return a list of all documents containing the word lack and the word vitamins (for instance in a Google search). Similarly, by using a categorised index, a search (depending on the categorisation used) will result in a list of all vitamins in that category or a list of different types of lack.

Enormous efforts have been put into attempts to create systems for extracting knowledge from natural language texts. Such a system is described in international patent application WO 02/10980 concerning a concept-based search and retrieval system. This system parses the text in documents and creates a library of predicate structures describing the different concepts in the text. Each of these predicate structures consists of a predicate, which is either a verb or a preposition, and a set of arguments, which can be any part-of-speech. The system utilises a linguistic parser to generate a linguistic parse tree, i.e. structures representing grammatical, syntactic relations between the concepts in the text based on syntactic tagging of concepts. In the above patent application, ontology means a data structure providing a context for the lexical meaning of concepts. Finally, the system uses a grammatical system in which rule probabilities and conflicting ontological descriptions are used to resolve some ambiguities in the possible syntactic parses of sentences.

WO 02/10980 also describes a method of knowledge search. The method consists of searching for predicate structures, i.e. knowledge structures consisting of a single predicate (or relation) and its arguments.

The linguistic method of creating predicate structures based on grammatical relations between words, i.e., part-of-speech, has the disadvantage that it cannot compensate for grammatical errors in the input text, nor for the incompleteness of current linguistic grammars. Furthermore, since the predicate structures are based on all parts of the text that contain a predicate and an argument, the built library of predicate structures can be very large and in the same process the probability of ambiguities is increased. Finally, the search method lacks the ability to refine the search with respect to specialisation of specific concepts. For instance, the search for the complex concept lack of vitamins will not necessarily include the complex concept lack of ascorbic acid (lack of vitamin C).

Another system for knowledge extraction is described in US patent no. 6,263,335 concerning a system for extracting concept-relation-concept triples. This system uses linguistic knowledge for extracting knowledge and considers proper nouns for the concepts. This invention has the disadvantage that it only extracts concept-relation-concept triples and is therefore limited in how precise a knowledge extraction and search can be performed. Furthermore, the knowledge extraction is also limited in the sense that only nouns are considered for the extraction of concepts.

US patent no. 6,006,221 describes a multilingual document and retrieval system and method of using semantic vector matching. In short, the described system generates a vector describing the conceptual contents for every document in the system. When a user performs a query, this query is also transformed into a vector. The system subsequently compares the query vector with the vectors of the documents and returns the documents whose vectors match the query vector above a certain threshold. The system, however, only conceptualises proper nouns and does not conceptualise the relations between the concepts. Furthermore, it uses a fixed-size vector for each document and query, which limits the precision of the knowledge extraction and later search.

US patent no. 6,038,560 relates to a system for document categorisation. The system uses a knowledge base for storing categories, wherein said categories depict subject matter used for classification of information. I.e. the system uses a hierarchy of themes. A number of terms are associated with each theme. This is related to the conversion of query terms into query themes, whereby the query can be extended to comprise a theme instead of the specific query term. Thus, the system according to the said US patent is a categorisation mechanism allowing queries in categories.

It is the object of the invention to provide a method of extracting knowledge from documents (and optionally creating a knowledge base) enabling a computer to relate to the knowledge in the documents. A search for lack of vitamins will for instance result in a list of all the documents in which the complex concept lack of vitamins occurs. However, this list must also include documents discussing lack of specific vitamins and documents discussing what is caused by lack of vitamins. Using the present invention, it becomes possible to search for complex concepts instead of only individual or sets of terms.

The above object and numerous other objects, features and advantages, which will be evident from the following description, are obtained by using a defined ontology to specify possible semantic relations, providing knowledge structures consisting of an arbitrary number of concepts and their interrelations, and using a predetermined mapping between words in the input text and concepts and relations in the ontology.

These features enable a building of knowledge structures consisting of concepts and their interrelations and representing knowledge extracted from the input text. As a result a very precise knowledge extraction is possible, allowing specific searches to be carried out using the generated knowledge base or library. Additionally, the predetermined mapping between words in the input text and possible concepts and relations in the ontology reduces the number of ambiguities in the optionally generated knowledge library. The method according to the invention does not need statistical, neural network, or other machine learning methods for knowledge extraction or knowledge search. Consequently, it does not require any "training", which is normally the case for machine learning based methods. As a result, the deployment of the method according to the invention is faster than common methods or systems. Furthermore, the invention provides the building of knowledge structures consisting of an arbitrary number of concepts and their interrelations, which allows for a far more precise knowledge extraction and later search for this knowledge. Finally, the method is not dependent on the grammatical structures of the sentences in the input text. In that sense, the method for extracting knowledge is language independent.

In a preferred embodiment of the method according to the invention, the applied ontology ensures that the knowledge search also includes concepts that are specialisations or generalisations of the search.

In another preferred embodiment of the method, the predetermined mapping of words to relations and concepts allows for skipping irrelevant words or equivalently relations and concepts, thereby reducing the complexity and size of the knowledge base and of the extraction process itself.

In yet another preferred embodiment, the knowledge extraction method utilises non-word relations. Thereby it is, for example, possible to extract a semantic relation between two concepts, e.g., an adjective and a noun.

Other preferred embodiments have been stated in the dependent claims.

The invention also relates to a method for iteratively searching the generated knowledge structures with use of the defined ontology according to claim 22.

The invention is now to be described in greater detail with reference to the drawings, in which

Fig. 1 is a flow chart illustrating the knowledge extraction method according to the invention,

Fig. 2 is a flow chart illustrating the function of a text preparator used in the knowledge extraction method,

Fig. 3 is a flow chart illustrating a knowledge base inserter used in the knowledge extraction method,

Fig. 4 is a flow chart illustrating a search method according to the invention,

Fig. 5 is a flow chart illustrating a search process in a knowledge base in connection with the search method,

Fig. 6 is a visual representation of an example skeleton ontology used in connection with a specific knowledge extraction or a search process, and

Fig. 7 is a visual representation of an example populated ontology (or knowledge base) used in the description of the knowledge search method.

In the following detailed discussion of the present invention, numerous terms, specific to the subject matter, are used. In order to provide complete understanding of the present invention, the meaning of these terms is set forth below as follows:

Knowledge: The term knowledge is difficult to define due to its many usages; however, for the present document the term knowledge will be used for the semantic contents of two or more interrelated concepts.

Concept: The term concept is used to describe the semantic (knowledge) content of a single word or a small group of words. For example, the concept for lack is instantiated by words such as lack, lacking, and being without.

Complex Concept: Given the above definition of a concept, the term complex concept will be used to describe two or more concepts that are combined via specified relations. A complex concept will typically represent the semantic contents of more than one word, although this is not a requirement. A complex concept is a kind of concept and thus, for the remainder of this document, whenever the term concept is used, it may be a complex concept unless specifically stated otherwise. An example of a complex concept is lack of vitamin, which is the concept for lack related to the concept for vitamins by the relation of.

Concept Instantiation: The term concept instantiation covers the situation where a word can semantically represent a given concept. If this is the case, this word instantiates the concept. A given word can instantiate more than one concept and, in rare cases, instantiate no concepts.

Sentence Element: A sentence element holds all the concepts that a given word instantiates. During the knowledge extraction from an input text, a sentence element is created for every word (or set of words) that instantiates one or more concepts.

Relations: These may also be called semantic relations. This term describes a concept with the special feature that it has the ability to relate to one or more non-relation concepts. In the example of the complex concept lack of vitamin, the concept of is a relation (concept).

Knowledge Structure: A knowledge structure is a computer understandable representation of a, possibly complex, concept. A knowledge structure will typically exist in the form of a data structure used by the system incorporating the invention.

Descriptor: The term descriptor is used to describe a knowledge structure that additionally has the property that it accurately represents the semantic knowledge of a given piece of text (typically a sentence). Thus, it is a computer understandable representation of a complex concept that describes the semantic contents of a given piece of input text.

Concept Specialisation: The term concept specialisation describes the situation in which a given concept is a specific type of another concept. In this case, the first concept specialises the other concept. An example of such concept specialisation may be found by considering the concepts vitamin and vitamin-C. Here, the concept vitamin describes all the individual vitamins as well as vitamins in general; similarly, the concept vitamin-C covers all the individual C-vitamins and C-vitamins in general. In this example, the concept vitamin-C specialises the concept vitamin as it describes a set of more specific concepts, where the set is a subset of what is described by the concept vitamin.

Concept Generalisation: For the exact opposite of concept specialisation, the term concept generalisation is used. Thus, the concept vitamin is a generalisation of the concept vitamin-C.

Subsumption: The term subsumption is used when comparing complex concepts (and thus also when comparing knowledge structures or descriptors). The term describes the situation where a given complex concept is more general than another complex concept, i.e. it covers everything the other concept covers and more; in this case, the first concept subsumes the other concept. Thus, subsumption is a form of concept generalisation between complex concepts.
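As a rough illustration of subsumption between complex concepts, the following Python sketch compares two descriptors held as nested tuples. The tuple layout, the `PARENT` table and the function names are assumptions made for this illustration only. The rule applied (every relation stated by the more general descriptor must be matched by a relation of the more specific one) is one plausible reading of the definition, not a statement of how the method itself performs the comparison.

```python
# Hypothetical concept hierarchy (child -> parent), for illustration only.
PARENT = {"vitamin-C": "vitamin", "vitamin": "substance", "lack": "state"}


def generalises(general, specific):
    """True if `general` equals `specific` or lies above it in the hierarchy."""
    while specific is not None:
        if specific == general:
            return True
        specific = PARENT.get(specific)
    return False


def subsumes(a, b):
    """True if descriptor `a` is at least as general as descriptor `b`.

    A descriptor is a tuple (concept, relations), where relations is a
    tuple of (relation_name, descriptor) pairs.
    """
    concept_a, rels_a = a
    concept_b, rels_b = b
    if not generalises(concept_a, concept_b):
        return False
    # Every constraint stated by `a` must be satisfied somewhere in `b`.
    return all(
        any(name_a == name_b and subsumes(arg_a, arg_b)
            for name_b, arg_b in rels_b)
        for name_a, arg_a in rels_a
    )


lack_of_vitamin = ("lack", (("of", ("vitamin", ())),))
lack_of_vitamin_c = ("lack", (("of", ("vitamin-C", ())),))
print(subsumes(lack_of_vitamin, lack_of_vitamin_c))
print(subsumes(lack_of_vitamin_c, lack_of_vitamin))
```

Under these assumptions, lack of vitamin subsumes lack of vitamin-C but not the other way round, which is the behaviour a search for lack of vitamins would rely on.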

Knowledge Base: This term is a generic term for any kind of information storing structure used for storing knowledge. In this document, the term will be used to describe the system that stores the knowledge extracted using the method described in this document. Furthermore, the knowledge base is subject to searches by the search method also described herein.

Ontology: The term ontology has been used to describe a number of features within philosophy and computational linguistics. For the present document, the term ontology is used to describe a special kind of knowledge base consisting of two primary parts. The first part is a concept hierarchy (also called a taxonomy), which describes how concepts are related to one another in terms of which concepts specialise which concepts and which concepts are generalisations of which concepts. Such a taxonomy typically has an upside down tree structure; however, as we are dealing with concepts, it may occur that a node in such a tree has more than one parent. This is due to the fact that a given concept may in fact be a specialisation of two (or more) other concepts, possibly the merging of these two. An example of such a concept is amphibious vehicle, as this is both a specialisation of the concept car and the concept boat. This part of the knowledge base (or ontology) is called the skeleton ontology (see below). The other part of the ontology consists of complex concepts that have been instantiated to represent some of the knowledge stored in the knowledge base. Thus, the second part of the ontology adds new nodes to the hierarchy tree below the existing nodes; each of the new nodes represents some combination of the concepts in the hierarchy bound together by specific interrelations. When knowledge is added to an ontology it becomes a populated ontology (see below).
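The concept hierarchy with multiple parents can be pictured as a directed acyclic graph. The sketch below is a minimal Python illustration built on the amphibious vehicle example; the intermediate concept `vehicle`, the dictionary layout and the function names are assumptions introduced here and do not come from the description.

```python
# Hypothetical fragment of a concept hierarchy: each concept lists its
# direct generalisations. A concept may have more than one parent.
PARENTS = {
    "anything": [],
    "vehicle": ["anything"],          # assumed intermediate concept
    "car": ["vehicle"],
    "boat": ["vehicle"],
    "amphibious vehicle": ["car", "boat"],
}


def generalisations(concept):
    """Return every concept that generalises `concept` (excluding itself)."""
    found = set()
    pending = list(PARENTS[concept])
    while pending:
        parent = pending.pop()
        if parent not in found:
            found.add(parent)
            pending.extend(PARENTS[parent])
    return found


def specialises(specific, general):
    """True if `specific` is `general` or one of its specialisations."""
    return specific == general or general in generalisations(specific)


print(sorted(generalisations("amphibious vehicle")))
print(specialises("amphibious vehicle", "boat"))
```

Because the traversal follows every parent, a concept with two parents is found to specialise both branches, whereas sibling concepts such as car and boat do not specialise each other.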

Skeleton Ontology: As mentioned above, the term skeleton ontology is used to describe a hierarchy of concepts. Furthermore, a skeleton ontology also specifies what relations are possible between the concepts, beyond the specialisation and generalisation relations, i.e. a predefined set of concepts and their possible interrelations. With the complex concept lack of vitamin discussed above, the skeleton ontology may specify that the concept lack can be related to the concept vitamin via the of relation. The specification and generalisation of these possible relations between concepts is also included in the ontology.
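A skeleton ontology of this kind can be sketched as a concept hierarchy plus a set of permitted relations. The Python fragment below is illustrative only: the concept names beyond lack and vitamin, the `ALLOWED` set and the function names are assumptions, and the sketch assumes that a permitted relation also applies to specialisations of the concepts it names, as in the worked example later in the description.

```python
# Hypothetical hierarchy (child -> parent) around the lack/vitamin example.
PARENT = {
    "vitamin-C": "vitamin",
    "vitamin": "substance",
    "substance": "anything",
    "lack": "state",
    "state": "anything",
}

# Permitted non-hierarchic relations: (source concept, relation, target concept).
ALLOWED = {("lack", "of", "vitamin")}


def is_a(concept, general):
    """True if `concept` equals `general` or specialises it."""
    while concept is not None:
        if concept == general:
            return True
        concept = PARENT.get(concept)
    return False


def relation_allowed(source, relation, target):
    """Check whether the skeleton ontology permits source * [relation: target]."""
    return any(
        rel == relation and is_a(source, src) and is_a(target, dst)
        for src, rel, dst in ALLOWED
    )


print(relation_allowed("lack", "of", "vitamin-C"))
print(relation_allowed("vitamin", "of", "lack"))
```

The direction of the relation matters in this sketch: lack of vitamin-C is accepted, while vitamin of lack is rejected.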

Populated Ontology: A populated ontology describes an ontology that contains knowledge beyond what is in the skeleton ontology, i.e., it contains actual knowledge in the form of instantiated complex concepts. As mentioned, an ontology becomes populated when some knowledge is added (or inserted) into it; this typically happens when a descriptor has been created for a given piece of text and one wishes for the knowledge in this text to be part of the knowledge base. The descriptor is then added to the ontology, thus populating it. The ontology afterwards includes the knowledge in the descriptor or equivalently the knowledge represented by the input text.

As previously stated, the invention relates to a method of extracting knowledge from natural language text and searching this extracted knowledge.

The knowledge extraction method takes natural language text and produces descriptors representing the knowledge contained in the input text. A flow chart of the knowledge extraction method is shown in Fig. 1. This figure illustrates the input text, which is sent to a text preparator 200. The output of the text preparator 200 is sent to an iterator 110. The output from the iterator 110 is handled by a combination method 150, which uses a skeleton ontology 10. The output from the combination method is sent to a stack 140, which finally builds descriptors. All of the elements in the figure are described in detail in the following. The knowledge extraction method utilises a skeleton ontology containing information about the hierarchy of concepts and the possible interrelations of these concepts.

The method is best explained by showing an example. In the following, the knowledge extraction method uses the skeleton ontology shown in Fig. 6 to extract knowledge from the sentence: lack of nicotinamide causes pellagra.

As defined by the skeleton ontology (cf. Fig. 6), nicotinamide is a substance (signified by the concept <SUBSTANCE>), which in turn is a concretion, represented by the concept <CONCRETION>. Similarly, we see that pellagra is an illness (<ILLNESS>) and that lack is a state (<STATE>), both of which are a special type of occurrence (<OCCURRENCE>). At the top of the concept hierarchy is the all-encompassing <ANYTHING> concept.

Beyond defining the hierarchical relations between the concepts, the skeleton ontology also defines a number of possible non-hierarchic interrelations between the concepts. These are represented by diamonds in the figure. The words and letters inside the diamonds specify the type of possible relations between the concepts, which are related via the diamond. Furthermore, the arrow at one end of the relations specifies in which direction the relation may occur. Although any combinations of letters or names can be used for the relations (as well as for the concepts), the semantic meanings of a number of relations, including those used in the example skeleton ontology, are defined according to those proposed in [Jørgen Fischer Nilsson, A Logico-Algebraic Framework for Ontologies, Proceedings of First International OntoQuery Workshop, Department of Business Communication and Information Science, University of Southern Denmark, Kolding, 2001], said paper also including a listing of other proposed relations and their meanings. The semantic meanings of certain relations are listed in Tab. 1. The present invention can also use other kinds of semantic relations.

Table 1

Though not explicitly stated in the graphical representation, the skeleton ontology also contains information about the words that are able to instantiate the relations.

This correspondence between words and relations is given in addition to the skeleton ontology via the predetermined mapping of concepts and relations and the words that instantiate them. For the example skeleton ontology, the relations and some of the words are given in Tab. 2.

Table 2

Some relations are special in the sense that they do not need a word in order to be instantiated. Such relations are called non-word relations. An example of the use of a non-word relation can be found by considering the text a red car. The descriptor for this text would be car * [CHR : red]. In this descriptor, the CHR relation was instantiated without a word in the text corresponding thereto. Thus, in this usage, the CHR relation is a non-word relation. The skeleton ontology specifies which relations can occur as non-word relations.

When constructing the output descriptors, the method uses a stack, which is updated continuously. The extraction method iterates over the words in the input text one by one. In each iteration, the conceptual knowledge of the current word is combined with the conceptual knowledge of all the previous words, i.e., the descriptors representing the knowledge contained in the previous words. As a result, the stack always contains the descriptors representing the knowledge contained in the input text up to the current word. At the beginning, the stack is empty. However, when the method iterates over the first word of the input text, the conceptual representation of this word is added to the stack.

For all iterations over the words in the input text, the first step of the method according to the invention is to generate the conceptual representation of the current word. The conceptual representation of a word is generated by the Text Preparator 200 shown in Fig. 1. This conceptual representation is contained in a so-called sentence element. Such a sentence element includes the set of all the possible concepts and relations the word in question can instantiate. In the present example, the word lack instantiates the concept <STATE> and the word of instantiates the relation WRT. Some words may instantiate more than one concept or relation; they may even instantiate both concepts and relations. When a word instantiates a relation, both the relations and the concepts that it can relate are included in the sentence element of the word. Thus, when the word of instantiates the relation WRT, it is in fact all the different possible usages, as specified in the skeleton ontology, of the WRT relation it instantiates. This is shown in Tab. 3, where the sentence elements of each of the words in the input text of the given example are shown.

Table 3

It appears from Tab. 3 that the two words that instantiate relations have some relatively complex sentence elements. For the word of, the sentence element defines that it instantiates the relation WRT, which can occur only between a <STATE> concept and an <ANYTHING> concept or specialisations thereof. The word causes, in turn, instantiates a concept, which can act as a relation between two pairs of concepts, namely between a <CONCRETION> and a <CONCRETION> or between an <OCCURRENCE> and another <OCCURRENCE>. This is all specified by the skeleton ontology in Fig. 6. Also note the syntax used to describe the relation between two concepts. This syntax is part of ONTOLOG, the logico-algebraic framework proposed by Jørgen Fischer Nilsson (cf. the previously cited reference). A multitude of other syntaxes could be used to represent this information; however, the ONTOLOG syntax is very easy to read for humans and computers alike. In each of the sentence elements in Tab. 3, one concept or relation is marked with an exclamation mark ( ! ). This marked concept or relation is the concept or relation that has been instantiated by the word. This marking is very important to the method, which will be described in the following.
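The sentence elements discussed above can be sketched as a small data structure. The Python fragment below is built only from what the surrounding text states about the five words of the example sentence; it is not a reproduction of Tab. 3. The class names, the `MAPPING` dictionary and the `prepare` function are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Usage:
    """One permitted use of a relation: <source> * [relation ! : <target>]."""
    source: str
    relation: str
    target: str

    def render(self):
        # The exclamation mark flags the part instantiated by the word.
        return f"<{self.source}> * [{self.relation} ! : <{self.target}>]"


@dataclass
class SentenceElement:
    word: str
    concepts: list = field(default_factory=list)  # concepts the word instantiates
    usages: list = field(default_factory=list)    # relation usages it instantiates


# Assumed predetermined mapping for the example sentence.
MAPPING = {
    "lack": SentenceElement("lack", concepts=["STATE"]),
    "of": SentenceElement("of", usages=[Usage("STATE", "WRT", "ANYTHING")]),
    "nicotinamide": SentenceElement("nicotinamide", concepts=["SUBSTANCE"]),
    "causes": SentenceElement("causes", usages=[
        Usage("CONCRETION", "CAU", "CONCRETION"),
        Usage("OCCURRENCE", "CAU", "OCCURRENCE"),
    ]),
    "pellagra": SentenceElement("pellagra", concepts=["ILLNESS"]),
}


def prepare(text):
    """Return sentence elements for the mapped words of `text`, in word order.

    Words without an entry in the mapping are skipped.
    """
    return [MAPPING[w] for w in text.lower().split() if w in MAPPING]


elements = prepare("Lack of nicotinamide causes pellagra")
for element in elements:
    print(element.word, element.concepts, [u.render() for u in element.usages])
```

Keeping every permitted usage of a relation word in its sentence element means the later combination step can try each usage against the descriptors on the stack without consulting the mapping again.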

The discussion above concerns the preparations needed for the method, particularly the preparation of the input text. The following discussion concerns the iterations performed by the method when constructing the descriptors, i.e. the output of the knowledge extraction method. The process described in the following is depicted in Fig. 1. Each of the iterations described in detail below is initiated by the iterator 110. Every iteration uses the combination method 150 (described in detail below), which in turn uses the skeleton ontology 10 and the stack 140. The final result of the process is a set of descriptors. The iteration may be sequential or nonsequential (i.e. use the word order or not); in the following example, only the sequential iteration is described.

As mentioned above, the workspace of the method is a stack. After the first iteration, the stack contains the conceptual representation of the semantic content of the first word of the input text, i.e. the word lack. Thus, the stack contains a single element, namely the concept <STATE>. This iteration is straightforward, as the method merely adds all concepts and relations instantiated in the sentence element of the word to the stack. After the first iteration, the stack looks as follows:

<STATE> !

The second iteration is more interesting. In this iteration, there are unfinished descriptors on the stack (in the given example, only a single descriptor). Thus, the method must combine the sentence element of the current word with the descriptors already on the stack. The current word is of and the sentence element for this word contains <STATE> * [WRT ! : <ANYTHING>]. Given that the descriptor on the stack currently is only <STATE> !, it is very simple to combine the current sentence element with the descriptor on the stack. This combination results in the following descriptor, which is now on the stack:

<STATE> ! * [WRT ! : <ANYTHING>]

This is where the marking of concepts and relations is interesting. For the present descriptor, the markings tell us that the concept <STATE> and the relation WRT are both instantiated by a word in the input text and thus are immutable. Conversely, the concept <ANYTHING> is not instantiated by words in the input; therefore, it is mutable in the following iterations.

In the third iteration, the method combines the descriptor on the stack with the sentence element for the word nicotinamide, which contains the concept <SUBSTANCE> !. The descriptor in the stack currently looks like <STATE> ! * [WRT ! : <ANYTHING>], so this is not an immediate match. Therefore, the method uses the skeleton ontology to find the concepts subsuming the concept <SUBSTANCE>. Since the skeleton ontology is a taxonomy (concept hierarchy), it is implicit that a given concept is a special type of any concept that subsumes it. In the present example, the concept <SUBSTANCE> is subsumed by the concept <CONCRETION>. Thus, a <SUBSTANCE> is a specialisation of <CONCRETION>. Similarly, in the given exemplified skeleton ontology, a <CONCRETION> is a special kind of <ANYTHING>, and thus we are able to combine the sentence element with the descriptor on the stack. As a result, the stack will contain the following descriptor:

<STATE> ! * [WRT ! : <ANYTHING> ! ]

The fourth iteration brings new challenges. In this iteration, the word from the input text has instantiated a relation with two possible usages. These must be combined with the descriptor in the stack. The first usage of the instantiated relation is <CONCRETION> * [CAU ! : <CONCRETION>]. Though the descriptor on the stack looks like this: <STATE> ! * [WRT ! : <ANYTHING> ! ], the word that instantiated <ANYTHING> was in fact a <SUBSTANCE>, which in turn is a <CONCRETION>. Thus, we have a combination leading to the following descriptor, which is added to the stack:

<STATE> ! * [WRT ! : <ANYTHING> ! * [CAU ! : <CONCRETION>] ]

The second usage of the instantiated relation is <OCCURRENCE> * [CAU ! : <OCCURRENCE>]. This usage cannot be combined with the last part of the descriptor, i.e. the <ANYTHING> stemming from a <SUBSTANCE>. It can, however, be combined with the first part of the descriptor, namely the <STATE>, because a <STATE> is a special kind of an <OCCURRENCE>. Combining the relation with the first part of the descriptor thus gives a new descriptor:

[<STATE> ! * [WRT ! : <ANYTHING> ! ] ] * [CAU ! : <OCCURRENCE>]

Going to the fifth iteration, the stack now has two elements, namely:

<STATE> ! * [WRT ! : <ANYTHING> ! * [CAU ! : <CONCRETION>] ]

[<STATE> ! * [WRT ! : <ANYTHING> ! ] ] * [CAU ! : <OCCURRENCE>]

In the fifth iteration, the sentence element for the word pellagra is combined with the descriptors in the stack. The concept instantiated by pellagra is <ILLNESS>, which in turn is an <OCCURRENCE>. Combining this with the first descriptor on the stack is impossible, as this descriptor is "looking for" a <CONCRETION>. Thus, the first descriptor is removed from the stack. Combining with the second descriptor results in a new descriptor, namely:

[<STATE> ! * [WRT ! : <ANYTHING> ! ] ] * [CAU ! : <OCCURRENCE> ! ]

At this point, there is no more input and thus no more sentence elements to be combined with the stack. This means that the knowledge extraction process is complete and the method results in the descriptors left on the stack. In the given example, only one descriptor is left on the stack. A rewriting gives a knowledge structure containing the following:

[lack* [WRT : nicotinamide] ] * [CAU : pellagra]

This accurately expresses the knowledge in the original input sentence, namely that lack of the substance nicotinamide can cause pellagra. Returning this descriptor completes the knowledge extraction process for the given input text and skeleton ontology.
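The five iterations walked through above can be replayed with a short Python sketch. It is a simplified illustration under stated assumptions, not the combination method 150 itself: the hierarchy is taken from the textual description of Fig. 6, the word mapping covers only the example sentence, a relation is tried first on the most recently added node and then on the descriptor as a whole, and descriptors that cannot be combined are simply dropped. All function and variable names are hypothetical.

```python
import copy

# Concept hierarchy as described for Fig. 6 (child -> parent).
PARENT = {"CONCRETION": "ANYTHING", "OCCURRENCE": "ANYTHING",
          "SUBSTANCE": "CONCRETION", "STATE": "OCCURRENCE",
          "ILLNESS": "OCCURRENCE"}

# Assumed predetermined mapping from words to concepts and relation usages.
CONCEPT_WORDS = {"lack": "STATE", "nicotinamide": "SUBSTANCE",
                 "pellagra": "ILLNESS"}
RELATION_WORDS = {
    "of": [("STATE", "WRT", "ANYTHING")],
    "causes": [("CONCRETION", "CAU", "CONCRETION"),
               ("OCCURRENCE", "CAU", "OCCURRENCE")],
}


def is_a(concept, general):
    """True if `concept` equals `general` or specialises it."""
    while concept is not None:
        if concept == general:
            return True
        concept = PARENT.get(concept)
    return False


def new_node(slot, actual=None, word=None):
    # slot: concept required by the ontology; actual/word: what filled it.
    return {"slot": slot, "actual": actual, "word": word, "rels": []}


def last_node(node):
    """Follow the most recently added relations down to the newest node."""
    while node["rels"]:
        node = node["rels"][-1][1]
    return node


def combine_concept(stack, word, concept):
    if not stack:
        return [new_node(concept, concept, word)]
    result = []
    for descriptor in stack:
        candidate = copy.deepcopy(descriptor)
        slot = last_node(candidate)
        # Only an unfilled (mutable) slot of a compatible type takes the word.
        if slot["word"] is None and is_a(concept, slot["slot"]):
            slot["actual"], slot["word"] = concept, word
            result.append(candidate)
    return result


def combine_relation(stack, usages):
    result = []
    for descriptor in stack:
        for source, relation, target in usages:
            # Try the newest node first, then the descriptor as a whole (root).
            for pick in (last_node, lambda node: node):
                candidate = copy.deepcopy(descriptor)
                anchor = pick(candidate)
                if anchor["word"] and is_a(anchor["actual"], source):
                    anchor["rels"].append((relation, new_node(target)))
                    result.append(candidate)
                    break
    return result


def extract(text):
    stack = []
    for word in text.split():
        if word in CONCEPT_WORDS:
            stack = combine_concept(stack, word, CONCEPT_WORDS[word])
        elif word in RELATION_WORDS:
            stack = combine_relation(stack, RELATION_WORDS[word])
        # Words without a mapping are skipped.
    return stack


def render(node):
    head = node["word"] or "<" + node["slot"] + ">"
    return head + "".join(f"*[{rel}: {render(child)}]"
                          for rel, child in node["rels"])


descriptors = extract("lack of nicotinamide causes pellagra")
print([render(d) for d in descriptors])
```

With these assumptions the sketch ends with a single descriptor in which both WRT and CAU hang off lack, which corresponds to the structure [lack* [WRT : nicotinamide] ] * [CAU : pellagra] given above. Stopping after the word causes leaves two candidate descriptors on the stack, mirroring the state before the fifth iteration.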

The very simple example above shows the basic idea of the knowledge extraction method. In short, the method uses the skeleton ontology to find the possible concept instantiations for all words in the input text. The method then combines these, thus constructing a descriptor. The method used for the combination of sentence elements and descriptors in the stack is an essential part of the knowledge extraction method and has great influence on the efficiency of the method. The combination method is described in detail below. Another very important influence on efficiency is the skeleton ontology.

When the knowledge extraction method is used in, for example, a document management system, the extracted knowledge must be inserted into a knowledge base 300. Figure 3 shows the process of inserting descriptors and document information into a knowledge base. This process of insertion includes the deconstruction of the descriptors, handled by a descriptor deconstructor 310, and the insertion of the deconstructed descriptors, handled by an insert KB structure 320.

The deconstructor 310 is the process that splits up a descriptor and creates a representation, which is suitable for insertion into a chosen storage mechanism 20. The insertion of knowledge base structures 320 is the process that takes this new representation and inserts it into the storage mechanism 20.

The combination method has two primary modes of operation. The first mode is when it combines a concept with a descriptor and the second mode is when it combines a relation with a descriptor. These two modes are due to the fact that concepts and relations must be handled very differently to ensure performance. The environment in which the combination method 150 functions is shown in Fig. 1.

For both operating modes of the combination method, the primary condition for a combination is that the skeleton ontology specifies that such a combination is possible. Checking for such compatibility between concepts and relations is performed via the skeleton ontology. To recognise this, it is important to remember that the skeleton ontology provides a hierarchy of concepts, where every child is a specialisation of its parent, while also being an instance of the parent. Another aspect that is common for both operating modes is that whenever a combination fails (for whatever reason), the current descriptor is considered finished and the stack is reset with the contents of the sentence element of the current word.
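By way of illustration only, the compatibility check against the concept hierarchy may be sketched as follows. The dictionary PARENT and the function subsumes are hypothetical names, and the hierarchy contains only the concepts named in the example above; the placement of <CONCRETION> and <OCCURRENCE> directly below <ANYTHING> is an assumption, as the actual skeleton ontology of Fig. 6 may contain further levels.

```python
# Hypothetical fragment of a skeleton ontology: child -> parent (ISA) links.
# Only concepts mentioned in the example are included; Fig. 6 may differ.
PARENT = {
    "CONCRETION": "ANYTHING",
    "OCCURRENCE": "ANYTHING",
    "SUBSTANCE": "CONCRETION",
    "STATE": "OCCURRENCE",
    "ILLNESS": "OCCURRENCE",
}


def subsumes(general, specific):
    """Return True if `specific` is `general` itself or a specialisation of it."""
    node = specific
    while node is not None:
        if node == general:
            return True
        node = PARENT.get(node)  # climb one ISA link; None above the root
    return False


# An <ILLNESS> may fill a slot that requires an <OCCURRENCE> ...
print(subsumes("OCCURRENCE", "ILLNESS"))
# ... but not a slot that requires a <CONCRETION>.
print(subsumes("CONCRETION", "ILLNESS"))
```

In this sketch a combination is permitted exactly when the concept required by the descriptor subsumes the concept instantiated by the current word.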

When combining a concept with the descriptors in a stack, the combination method first looks for the place in the descriptor, where the sentence element for the word immediately before the current one is attached. If the word before the current word was a relation, it checks whether the relation has a target, i.e., whether the relation is "relating to anything". In most cases (see the following discussions), the relation will not have a target as nothing has been combined with the descriptor between the previous word and the current word. If the relation is empty, i.e., if it does not have a target, the combination method attempts to match the current concept as the target of the relation. If a match is obtained, the concept is attached, and the method is finished with the current word and can proceed to the next word.

If a match could not be obtained, an attempt to combine the concept with the descriptor using one of the non-word relations is performed. As these non-word relations can be instantiated without a corresponding word in the input text, the concept and the descriptor can be combined in the instances where a match was not possible (provided the skeleton ontology allows this). In this case, the combination method attempts to 'attach' a non-word relation to the concept in the descriptor whose instantiating word is closest to the current word in the input text. It then tries the other concepts in order of the distance of the instantiating word to the current word in the input text. If the current concept cannot be attached using a non-word relation, the current descriptor is considered finished and the stack is reset with the concepts in the sentence element of the current word.

In the second mode of the combination method, a relation is instantiated by the current word and must be combined with the descriptors in the stack. A relation is combined differently from a concept in the sense that a relation can be attached to any of the concepts in the descriptor. Thus, it is not limited to attaching to the concept representing the previous word. This allows the method to extract knowledge more precisely. An example of an attachment to a concept, which is not the one immediately before, is seen in the above-mentioned example. The relation CAU is attached to the concept <LACK> in the second usage of the relation. This allows the method to specify that it is not nicotinamide (via the concept <SUBSTANCE>) that causes pellagra, but rather, it is the complex concept lack * [WRT : nicotinamide] that causes pellagra.
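The second mode may be illustrated by the following minimal sketch, which lists every concept of a descriptor to which a given usage of a relation could attach. The names PARENT, subsumes and attachment_points are hypothetical, the descriptor is reduced to a flat list of concept types, and the hierarchy is an assumed fragment containing only the concepts named in the example.

```python
# Hypothetical ISA links (child -> parent); an assumed fragment of Fig. 6.
PARENT = {
    "CONCRETION": "ANYTHING",
    "OCCURRENCE": "ANYTHING",
    "SUBSTANCE": "CONCRETION",
    "STATE": "OCCURRENCE",
}


def subsumes(general, specific):
    """True if `specific` equals `general` or lies below it in the hierarchy."""
    node = specific
    while node is not None:
        if node == general:
            return True
        node = PARENT.get(node)
    return False


def attachment_points(descriptor_concepts, usage_source):
    """Indices of all concepts to which a relation usage with the given
    source concept can attach; every concept is tried, not only the last."""
    return [
        index
        for index, concept in enumerate(descriptor_concepts)
        if subsumes(usage_source, concept)
    ]


# Descriptor built from "lack of nicotinamide": a <STATE> and a <SUBSTANCE>.
descriptor = ["STATE", "SUBSTANCE"]
print(attachment_points(descriptor, "CONCRETION"))  # first usage of CAU
print(attachment_points(descriptor, "OCCURRENCE"))  # second usage of CAU
```

Under these assumptions the first usage attaches to the concept instantiated by nicotinamide and the second to the concept instantiated by lack, giving the two descriptors placed on the stack in the fourth iteration.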

The combination method used for knowledge extraction has great impact on the efficiency of the extraction method. The above describes the basic functionality of the combination method; however, many expansions are possible and some of these are discussed in the following. Also, the following presents a number of expansions to the knowledge extraction method itself. Each of the expansions to the basic description of the knowledge extraction method makes the extraction more accurate by enabling the method to extract more complex knowledge structures. All of the expansions below can be used in any combination to provide a more accurate knowledge extraction.

The first expansion to the knowledge extraction method is to allow several descriptors for a given input text. This enables the method to operate concurrently on several knowledge structures (in the form of descriptors) for a single input text. In the given example, this is shown in the fourth iteration, where the stack is expanded to contain two elements, both representing completely different knowledge. This feature of the knowledge extraction method is used for many other knowledge constructions and allows for extraction of knowledge from ambiguous pieces of text.

A second expansion is to enable the method to break off a descriptor in the middle of the input text. The method then starts up a new descriptor where it left off on the first descriptor. As a result, the method becomes more robust as it is now able to extract all the knowledge it can get from a given text, while skipping the gaps for which it cannot extract knowledge. When this feature is used, several descriptors are returned for a given input text.

The second expansion leads to the third expansion. This expansion enables the method to connect two descriptors from consecutive parts of the same text. Thus, if the method extracts two descriptors for a given input text, this expansion enables it to combine these two descriptors into a single, larger, and therefore more precise, descriptor representing the knowledge of both the previous descriptors.

The following expansion is related primarily to the combination method. The skeleton ontology may specify that certain concepts or relations can be skipped. This expansion allows the combination method to make use of this by enabling it to skip a sentence element, thus not combining it with the descriptors in the stack. The method will do this if combining the sentence element with the stack resulted in an empty stack. In such a case, the method may choose to skip the current sentence element entirely.

Similarly to the previous expansion, this expansion is related to the combination method. However, instead of skipping a sentence element, this expansion delays a sentence element and thereby also the forming of a descriptor. It is therefore possible for the following sentence element to combine with the descriptor on the stack. Thus, if a sentence element is marked by the skeleton ontology as being a possible delay element, the combination method will run two parallel tracks. In the first track, the delayed sentence element is combined with the stack as usual. In the second track, the sentence element is delayed, while the following sentence element is combined with the stack. Once the following element has been combined with the stack, the delayed sentence element is also combined with the descriptors on the stack that result from the combination with the following element. If several delayed elements follow each other, the method can delay them all until a non-delayed element is combined with the stack.

Depending on the precision of the skeleton ontology and on the expansions used, the knowledge extraction method may result in more than one descriptor. Although all of the descriptors represent specific interpretations of the knowledge content of the input text, some descriptors are better than others in the sense that they are closer to the intuitive human interpretation of the knowledge content. The semantic distance expansion and the following two expansions are designed to support the method in order to find the best descriptors for a given input text. All three expansions do this by assigning a score to the possible descriptors.

The semantic distance expansion returns a score for a descriptor based on how precise the descriptor is, i.e., how "far into" the skeleton ontology the concepts and relations of the descriptor are placed. For example, a descriptor consisting of the concept <ANYTHING> can represent any text input; thus, such a descriptor is not very precise (and therefore placed very high in the skeleton ontology). The more precise a descriptor is, the more accurately it represents the knowledge in the text.

The precision of a given descriptor is measured by counting the number of generalisations, which are necessary in order to construct the descriptor. In the example above, a generalisation of the concept <ILLNESS> to the concept <OCCURRENCE> was necessary in order to match with the relation CAU. In general, the more precise a descriptor is, the fewer generalisations are necessary in order to construct it. Thus, we want the semantic distance score of a descriptor to be as low as possible.

Similar to the semantic distance expansion above, the text or sentence distance expansion supplies a score for a descriptor. This expansion, however, counts the number of words the relations in the descriptor had to "move" in order to attach to a concept. In short, the expansion is a measure of how much moving around of the words in the input text was necessary in order to construct the descriptor.
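The counting of generalisations for the semantic distance score may be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: PARENT, generalisation_steps and semantic_distance are hypothetical names, and the hierarchy is an assumed fragment in which <ILLNESS> and <STATE> lie directly below <OCCURRENCE>.

```python
# Hypothetical ISA links (child -> parent); an assumed fragment of Fig. 6.
PARENT = {
    "CONCRETION": "ANYTHING",
    "OCCURRENCE": "ANYTHING",
    "SUBSTANCE": "CONCRETION",
    "STATE": "OCCURRENCE",
    "ILLNESS": "OCCURRENCE",
}


def generalisation_steps(specific, general):
    """Number of ISA links climbed from `specific` to reach `general`,
    or None if `general` does not subsume `specific`."""
    steps = 0
    node = specific
    while node is not None:
        if node == general:
            return steps
        node = PARENT.get(node)
        steps += 1
    return None


def semantic_distance(matches):
    """Sum of generalisations over (instantiated, required) concept pairs."""
    total = 0
    for specific, general in matches:
        steps = generalisation_steps(specific, general)
        if steps is None:
            raise ValueError(f"{general} does not subsume {specific}")
        total += steps
    return total


# Attaching CAU in the example: <STATE> and <ILLNESS> are each generalised
# to <OCCURRENCE>.
print(semantic_distance([("STATE", "OCCURRENCE"), ("ILLNESS", "OCCURRENCE")]))
```

A lower total indicates a more precise descriptor; a sentence distance score could be accumulated in the same manner from word positions instead of ISA links.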

While the linguistic distance expansion also provides a score for a descriptor, it is much more difficult to calculate. Briefly explained, this expansion is a measure of the linguistic correctness of a descriptor. The scores are calculated by looking up the possible parts-of-speech for the words in the input text. The structure defined by the descriptor is then used as a parse tree for a linguistic grammar. At every place where two elements are specified to be connected by the structure of the descriptor, the rules of the linguistic grammar are applied to see whether the parts-of-speech of these two elements can be combined to generate a third part-of-speech. If not, the linguistic distance score is increased. Again, the linguistic distance score should be as low as possible.

Another expansion to the knowledge extraction method enables the invention to extract several words as a single concept. Such concepts are called multi-word concepts. Using this expansion, the knowledge extraction engine attempts to put several adjacent words of the input text together to form a single object. If this object is found in the ontology, the object takes the place of the words from which it was formed. Correspondingly, the sentence element is generated from the object instead of from the individual words.
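One conceivable way of forming multi-word concepts is sketched below. The function merge_multiword, the parameter max_len and the set known_phrases are hypothetical, and the longest-match-first strategy is an assumption made for the purpose of illustration; the description above does not prescribe a particular strategy.

```python
def merge_multiword(words, known_phrases, max_len=3):
    """Replace runs of adjacent words by a single object whenever the joined
    phrase is known to the ontology, trying the longest run first."""
    result = []
    i = 0
    while i < len(words):
        # Try runs of max_len words down to 2 words starting at position i.
        for n in range(min(max_len, len(words) - i), 1, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in known_phrases:
                result.append(phrase)
                i += n
                break
        else:
            # No multi-word concept starts here; keep the single word.
            result.append(words[i])
            i += 1
    return result


# Hypothetical ontology entry for a two-word concept.
known = {"ascorbic acid"}
print(merge_multiword(["lack", "of", "ascorbic", "acid"], known))
```

The sentence element would then be generated for the merged object rather than for each of its constituent words.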

Although the knowledge extraction method expects all words in the input text to be linked to concepts in the skeleton ontology via the mapping of words to relations and concepts, it may come across a word which is not. The expansion suggested here enables the knowledge extraction method to expand its knowledge of words with the help of a lexicon. An addition of a new word includes an addition of a new concept for the word as well. The method achieves this by extracting the knowledge from the lexicon entry of the word looking for the special ISA relation. It then uses the knowledge about what the word is (as specified by the ISA relation) and subsequently defines the place of the word and its corresponding concept in the skeleton ontology.

Additionally, there exist certain words that effectively divide the text semantically into the part before and the part after the point where they occur. An example of this is the word is. The semantic effect of this word is that what occurs before it is defined in terms of what occurs after it. Semantically, a concept or relation occurring after such a special word cannot attach to something that occurred before the special word.

The concepts and relations instantiated by these special words are called pivot concepts. The present expansion to the knowledge extraction method implements the handling of these pivot concepts. It does so by defining a pivot point for a descriptor whenever a pivot concept is combined with the descriptor. This pivot point then serves as a boundary, ensuring that no concepts or relations can cross it when they are combined with the descriptor.
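The effect of a pivot point as a boundary may be illustrated by the following minimal sketch. The function attachable_positions is hypothetical, and the representation of concepts by the index of their instantiating word in the input text is an assumption.

```python
def attachable_positions(concept_positions, pivot_point):
    """Return the word positions of the concepts to which a new element may
    attach. Concepts instantiated before the pivot point are excluded."""
    if pivot_point is None:
        # No pivot concept has been combined with the descriptor yet.
        return list(concept_positions)
    return [p for p in concept_positions if p > pivot_point]


# Concepts instantiated by words 0, 3 and 5; a pivot word (e.g. "is") at 1.
print(attachable_positions([0, 3, 5], pivot_point=1))
print(attachable_positions([0, 3, 5], pivot_point=None))
```

In this sketch a concept or relation occurring after the pivot word is only offered attachment candidates that also occur after it.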

The following expansion to the knowledge extraction method includes provisions for adding a neural network (or other machine learning method) to the knowledge extraction method. This neural network is trained for a given skeleton ontology to specify the usage of all possible relations between two concepts. Once trained, the neural network can be used to score a descriptor based on how common the use of the relations (and their contexts) in the descriptor is.

Another expansion to the knowledge extraction method includes provisions for adding a linguistic (or grammatical) parsing mechanism (or other natural language processing method) to the knowledge extraction method. The parsing mechanism aids the knowledge extraction method to extract correct descriptors, by providing a linguistically based means of comparing the knowledge described in the descriptors with the knowledge described in the text. This expansion can be viewed as an expansion to the "linguistic score" expansion discussed above.

In the above is described the part of the invention that comprises a method for knowledge extraction. In the following, the other part of the invention, namely the method for knowledge search, is described in detail. This example of implementation of a knowledge base serves as a basis for the description of the search method, better illustrating its working principles.

The search method described in the following is capable of searching for complex concepts in a knowledge base containing knowledge structures in a machine understandable representation. The representation of the extracted knowledge using the above knowledge extraction method is one such searchable knowledge representation.

For a user to perform a search, the user needs some way of constructing a complex concept to serve as the search query. The present invention accepts queries in many forms; however, it provides two main ways for the user to construct such a search concept, namely stating the query in natural language or constructing a complex concept using a specially designed user interface.

The first way uses the knowledge extraction method described above. As such, the natural language query from the user is subject to the knowledge extraction method and the resulting descriptors for the query are used as search concepts for the search method. In the second way, the user constructs the actual search concept directly, either using a graphical interface or by writing the descriptor corresponding to a complex concept. Independent of how the search query is defined, the search method uses the complex concept(s) to perform its search.

The following example provides an abstract description of the functionality of the search method in the present invention. In the example, the skeleton ontology (Fig. 6) from the knowledge extraction method example is reused. The knowledge extraction example discussed above showed the extraction of a descriptor for the sentence lack of nicotinamide causes pellagra. The resulting descriptor, shown below, is added to the knowledge base (making it a populated ontology). The entire populated ontology (or knowledge base) is shown in Fig. 7.

[lack* [WRT: nicotinamide] ] * [CAU: pellagra]

With the examples of a knowledge base and skeleton ontology described, consider a search for the complex concept lack of substances using the present invention's knowledge search method. The descriptor for the search query is:

lack * [WRT : substances]

The search method starts by using the ontology to find the two concepts of the query in the skeleton ontology. From these, the search method moves downward through the populated ontology (see Fig. 7).

When moving through the populated ontology, the search method follows the hierarchical relations of the skeleton ontology and the arrows of the populated ontology (i.e. of the actual knowledge structures in the populated ontology). These are all shown in Fig. 7. Following a hierarchical relation corresponds to selecting a specialisation of the current concept. Following a blank arrow corresponds to moving into a knowledge structure (or complex concept) in which the current concept is the "primary concept" (i.e. the one related to, or specialised by relations to, other concepts). Semantically, this can also be viewed as a specialisation of the current concept, as it is being specialised by being related to another concept.

When moving through the populated ontology (or knowledge base) in search of a concept which is being related to in the query complex concept (e.g. the concept substances in the current example), the search method will follow the hierarchical relations and blank arrows just as before. However, it will also follow any arrows tagged with the same relation as the one that points to the concept in the search query. This is exemplified in the following.

Having found the two concepts of the search query complex concept, the search method moves down through the populated ontology. As can be seen in Fig. 7, the search method follows the "blank" arrow (i.e. the arrow without a tag) from the concept lack. This leads it into a knowledge structure with lack as the "primary" concept. Simultaneously, the search method follows the hierarchical (ISA) relation of the skeleton ontology from the concept substance to the concept nicotinamide. From here, it then follows the WRT tagged arrow, as, in the query complex concept, the concept substance is being related to by a WRT relation.

The two "branches" of the search method have now met in a knowledge structure, meaning that the present knowledge structure represents a piece of knowledge equivalent to (or subsumed by) the search query complex concept. Thus, the search method has found a result to the query. However, before returning the result, the search method follows any blank arrows downwards. Here it finds an even more complex concept, which, beyond representing knowledge equivalent to the query, also represents more knowledge. Semantically, this more complex knowledge structure represents a specific context for the queried knowledge, which the search method can then (also) return to the user.

The above example illustrates the basic idea of the knowledge search method of the present invention; the actual implementation depends on several factors, among these the implementation and storage mechanism of the knowledge base (or populated ontology). An example of an actual implementation is given below.

The knowledge base into which the extracted knowledge is inserted can be implemented in a number of ways using a number of strategies. What is important is that it supplies a way for storing the knowledge structures (descriptors) that allows for searching through them. Furthermore, depending on the usage of the entire invention, the knowledge base should also provide storage for other information related to the knowledge structures. For example, if the invention described in this document was to be used as a web page indexer (like www.google.com), the knowledge base would have to provide storage of information about where a given piece of knowledge was extracted from (the URL), the title of the document, etc.

In an example implementation, a relational database has been used as the underlying data storage mechanism for the knowledge base. On top of the relational database, a number of wrapping methods are defined, which serve as the interface of the knowledge base. The wrapper methods handle the translating of knowledge structures to and from a representation suitable for insertion into the relational database.

As, in the present example, the knowledge base is implemented using a relational database, the knowledge structures are stored in a number of relations (database tables). In the present implementation, the concepts, their relations, and the attachments of the relations in a given knowledge structure are all stored in separate relations: the concepts table, the relations table, and the attachments table, respectively. While the first two tables contain concepts and relations, respectively, the attachments table contains information about how the records in the first two tables are connected.

When inserting descriptors into the knowledge base, the wrapping methods deconstruct the descriptors into concepts and relations while maintaining the knowledge of how they are interrelated. The concepts and the relations are then stored in their respective tables, after which the structure of the descriptor is stored as a number of records in the attachments table, where each record specifies a given interrelation between a pair of concepts and the relation binding them together. Thus, the knowledge represented in the descriptor is converted into a representation suitable for storage in a relational database, after which it is inserted into the database.
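The deconstruction and insertion may, purely as an illustration, be sketched with an in-memory SQLite database. The table and column names, the nested-tuple representation of a descriptor and the functions create_schema and insert_descriptor are all assumptions; the description above names the three tables but does not specify their columns.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical columns for the three tables named in the description.
    conn.executescript("""
        CREATE TABLE concepts (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE relations (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE attachments (
            source_id INTEGER, relation_id INTEGER, target_id INTEGER);
    """)


def insert_descriptor(conn, node):
    """Store node = (concept_name, [(relation_name, child_node), ...]) and
    return the id of the concept record created for the node."""
    name, edges = node
    source_id = conn.execute(
        "INSERT INTO concepts (name) VALUES (?)", (name,)).lastrowid
    for relation_name, child in edges:
        target_id = insert_descriptor(conn, child)
        relation_id = conn.execute(
            "INSERT INTO relations (name) VALUES (?)",
            (relation_name,)).lastrowid
        # One attachment record per concept pair and binding relation.
        conn.execute("INSERT INTO attachments VALUES (?, ?, ?)",
                     (source_id, relation_id, target_id))
    return source_id


conn = sqlite3.connect(":memory:")
create_schema(conn)
# [lack * [WRT: nicotinamide]] * [CAU: pellagra], with both relations
# attached to the primary concept lack.
descriptor = ("lack", [("WRT", ("nicotinamide", [])),
                       ("CAU", ("pellagra", []))])
insert_descriptor(conn, descriptor)
rows = conn.execute("""
    SELECT s.name, r.name, t.name
    FROM attachments a
    JOIN concepts s ON s.id = a.source_id
    JOIN relations r ON r.id = a.relation_id
    JOIN concepts t ON t.id = a.target_id
    ORDER BY r.name
""").fetchall()
print(rows)
```

The final query joins the three tables to recover the interrelations, which indicates that the structure of the descriptor survives the deconstruction.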

With the brief description of the knowledge base above, the following describes the basic method for knowledge search (also called conceptual search). The goal of the knowledge search method is to locate in the knowledge base the instances of a given complex concept. Depending on the purpose of the knowledge base, the instance contains various information about the extracted knowledge. In the example of a document management system, the instance may contain information about which document or documents the given piece of knowledge was extracted from, where the referenced document can be found, etc.

A search method 400 is depicted in Fig. 4. As the figure indicates, the search input can be either a search concept or a natural language search query. In the latter case, a knowledge extractor 100 is used to create a descriptor for the query, which then serves as the search concept. Along with a skeleton ontology 10, the search concept is used by a knowledge base search 500 to perform the actual search in an underlying storage mechanism 20.

The knowledge base search process 500 is shown in Fig. 5. The search descriptor is passed on to a search specialiser 510. The search specialiser 510 uses the skeleton ontology 10 to implement all the expansions to the knowledge search method described below. For example, if specified by the searcher, the search concept is modified by using the skeleton ontology 10 to further specialise or generalise the search concept (or to implement any of the other expansions to the search method). The possibly modified search descriptor is then passed to a descriptor deconstructor 310, which splits up the search concept and creates a suitable representation for search in the underlying storage mechanism 20. This representation is then passed to a search iterator 520, which iteratively searches the underlying storage mechanism 20. The details of the processes depicted in Fig. 4 and 5 are described in the following.

One way to perform such searching is to use the algebraic properties of knowledge represented using the ONTOLOG framework (see previous reference by Nilsson). The chosen method, however, is based on searching the relational database underlying the knowledge base. Specifically, the search method uses the relational join operator. This operator merges two tables based on a specification of interrelations between records in the two tables. Thus, the basic search method is based on well known relational database operations.

In order to give an example to describe the search for a complex concept, a description of the search for a simple concept is given first. Searching for a simple concept is very easy, as it consists of only one operation. This operation simply performs a scan of the concepts table looking for the given simple concept. In a document management system, if the concept is found, a list of the documents in which the given concept occurs is returned. The method is similar for a search for a relation-concept; however, instead of searching the concepts table, the method searches the relations table.

Finding a complex concept consists of several iterations of finding a simple concept. For each such simple search, however, the returned list of occurrences of the given concept is merged with the list of occurrences from the previous searches. From a set theoretical point of view, this results in the intersection of all the sets of returned occurrences. This intersection is the set containing the occurrences which were common for all the concepts and relations of the sought complex concept. The attachments table is subsequently used to sort out the documents in which the complex concept occurs.
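The iteration over simple searches may be sketched as follows, keeping only the occurrences common to every concept and relation of the query and then consulting the attachments. The dictionaries occurrences and attachments, the document identifiers and the function search_complex are hypothetical stand-ins for the database tables; in the example implementation described above the merging is performed with relational joins instead.

```python
# Hypothetical index: concept or relation name -> documents where it occurs.
occurrences = {
    "lack": {"doc1", "doc2"},
    "WRT": {"doc1", "doc2"},
    "nicotinamide": {"doc1", "doc2"},
    "iron": {"doc2"},
    "intake": {"doc2"},
}
# Hypothetical attachments: document -> (source, relation, target) triples.
attachments = {
    "doc1": {("lack", "WRT", "nicotinamide")},
    "doc2": {("lack", "WRT", "iron"), ("intake", "WRT", "nicotinamide")},
}


def search_complex(triples):
    """Return the documents in which every queried triple is attached."""
    candidates = None
    for source, relation, target in triples:
        for element in (source, relation, target):
            found = occurrences.get(element, set())
            # Keep only the documents common to all simple searches so far.
            candidates = found if candidates is None else candidates & found
    candidates = candidates or set()
    # The attachments decide whether the elements are actually connected.
    return {doc for doc in candidates
            if set(triples) <= attachments.get(doc, set())}


print(search_complex([("lack", "WRT", "nicotinamide")]))
```

In this sketch both documents contain all three elements of the query, but only the first contains them attached to each other, so only the first is returned.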

Above is given an example of how the search method may find a specific simple and complex concept. Specific functionalities, which can be useful in different situations, are described in the following.

This expansion and the following expansion both make use of the skeleton ontology when performing a search. A concept specialisation enables the search method to return the occurrences of a given concept and all the concepts it subsumes. This search expansion is useful when all the specialisations of a concept are desired. For example, when searching for vitamins, this search expansion enables the search method to return all the occurrences of not only the concept <VITAMIN> but also all its sub-concepts such as nicotinamide, ascorbic acid or vitamin-c. The concept specialisation functionality of the search method could be user controlled. For example, the user could specify the number of specialisations he or she wants, i.e. the number of branches in the skeleton ontology the search should follow while compiling the result.

Concept generalisation is the opposite of concept specialisation. This expansion enables the search method to generalise concepts in the search query. Thus, it can search for concepts which subsume the concepts in the query. Again, this functionality can be user controlled by allowing the user to specify the number of generalisations to be performed during search. This expansion is useful for the case where the user knows a very specific concept for the search topic, but wishes more general results. In this case, the search method can generalise the user query when searching.

In the example of the implementation of the knowledge base, the concept specialisation and concept generalisation expansions are both implemented by adding to the number of simple concept searches performed while searching. For all of the concepts in the query, the set of occurrences of the individual concepts and all their subsumed concepts are used instead of the set of occurrences of the individual concepts alone.
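A user-controlled concept specialisation may be sketched as follows. The dictionary CHILDREN and the function specialisations are hypothetical, the hierarchy is an assumed fragment, and the interpretation of the user-specified number as a number of levels to descend in the skeleton ontology is an assumption.

```python
# Hypothetical parent -> children links of a skeleton ontology fragment.
CHILDREN = {
    "SUBSTANCE": ["VITAMIN"],
    "VITAMIN": ["nicotinamide", "ascorbic acid"],
}


def specialisations(concept, max_depth):
    """Return the concept together with all concepts it subsumes down to
    `max_depth` levels below it."""
    result = {concept}
    frontier = [concept]
    for _ in range(max_depth):
        # Move one level down from every concept on the current level.
        frontier = [child
                    for parent in frontier
                    for child in CHILDREN.get(parent, [])]
        result.update(frontier)
    return result


# Each returned concept would be looked up by an additional simple search.
print(sorted(specialisations("SUBSTANCE", 1)))
print(sorted(specialisations("SUBSTANCE", 2)))
```

Concept generalisation could be sketched in the same manner by following the links from child to parent instead.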

Another expansion to the search method enables the user to specify a search concept along with the word instantiating that concept. The search method then only returns the occurrences of the search concept that were instantiated by the given word. This is useful, for example, when the user is looking for a specific word for a specific concept in a complex search concept. As the knowledge base stores information about which words instantiated the concepts, this expansion is easily implemented.

Another expansion to the search method is based on the fact that certain relations have corresponding inverted relations. For example, the relation CAU, with the semantic understanding of something causing something else, is the inverse of the relation CBY, which signifies something caused by something else. When a user searches for a causality between two concepts, all the occurrences of causality between those two concepts should be returned. Thus, the search method must be able to invert certain relations.
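The inversion of relations may be illustrated by the following minimal sketch, which expands one queried interrelation into all equivalent forms to be searched for. The table INVERSE and the function equivalent_triples are hypothetical names; only the pair CAU and CBY is taken from the description above.

```python
# Relations with a known inverse; only CAU/CBY is named in the description.
INVERSE = {"CAU": "CBY", "CBY": "CAU"}


def equivalent_triples(source, relation, target):
    """Return the queried triple plus its inverted form, if one exists."""
    forms = [(source, relation, target)]
    if relation in INVERSE:
        # Swap source and target and replace the relation by its inverse.
        forms.append((target, INVERSE[relation], source))
    return forms


# Both forms would be searched, and the results combined.
print(equivalent_triples("lack", "CAU", "pellagra"))
print(equivalent_triples("lack", "WRT", "nicotinamide"))
```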

Some relations can themselves be expressed as concepts with relations. In order to precisely search these, the search method must be able to translate one form to another. For example, the relation CAU signifies a causal relationship between two concepts. However, this may also be expressed via the concept <CAUSALITY> with relations to the two concepts specifying which is the agent and which is the result. Thus when searching for the relation CAU, the search method must also search for the <CAUSALITY> concept (with the proper relations), and vice versa. The present expansion enables the search method to make this translation and extend its search to include such cases of semantic equivalence of concepts and relations.

With the descriptions of the knowledge extraction method and the knowledge search method above, it is believed that modifications, variations, and changes will be suggested by others skilled within the field of discourse. It is therefore understood, that all such variations, modifications and changes are believed to fall within the scope of the invention as defined in the appended claims. Furthermore, other usages of the described methods than the ones mentioned in this document are also believed to fall within the scope of the present invention.

Claims

1. A method of performing knowledge extraction from natural language text documents including the steps of: reading an input text; transforming said input text into a machine understandable knowledge representation so as to provide knowledge libraries from said documents; and optionally storing said libraries using a defined ontology to specify possible semantic relations;

characterised by using semantic based means for extracting concepts and their interrelations from said input text;

- providing knowledge structures consisting of an arbitrary number of concepts and their interrelations; and using a predetermined mapping between words in the input text and concepts and relations in the ontology.

2. A method according to claim 1 characterised by using said ontology to control the knowledge extraction process.

3. A method according to claim 1 or 2 characterised in that the used ontology being adapted to determine if a concept is a generalisation or a specialisation of another concept.

4. A method according to any of claims 1 to 3 characterised in that the used ontology being adapted to specify non-hierarchic relations between concepts.

5. A method according to any of the preceding claims characterised by storing of stacks during said knowledge extraction.

6. A method according to claim 5 characterised by connecting said stacks to each other.

7. A method according to any of the preceding claims characterised by using descriptors to represent the semantic knowledge in pieces of said input text.

8. A method according to claim 7 characterised in that a number of descriptors are allowed for given pieces of input text.

9. A method according to claim 7 or 8 characterised in that the building of at least one of the descriptors is broken off at the middle of said input text and a new descriptor is created for the remainder of the input text.

10. A method according to any of claims 7 to 9 characterised by connecting descriptors from consecutive parts of said input text.

11. A method according to any of claims 7 to 10 characterised by defining a pivot point for a descriptor whenever a pivot concept is combined with said descriptor.

12. A method according to any of claims 7 to 11 characterised by said descriptors being graded according to a semantic distance scoring.

13. A method according to any of claims 7 to 12 characterised by said descriptors being graded according to a sentence distance scoring.

14. A method according to any of claims 7 to 13 characterised in that said descriptors are graded according to a linguistic grammar scoring.

15. A method according to any of the preceding claims characterised by extracting several adjacent words as a single concept.

16. A method according to any of the preceding claims characterised by words and their conceptual meaning being added to the ontology.

17. A method according to claim 16 characterised by words and their conceptual meaning automatically being added to the ontology by means of a lexical inquiry.

18. A method according to any of the preceding claims characterised by said predetermined mapping allowing for skipping of irrelevant words.

19. A method according to any of the preceding claims characterised by using non-word relations.

20. A method according to any of the preceding claims characterised by skipping at least one sentence element.

21. A method according to any of the preceding claims characterised by delaying a sentence element in order to firstly combine with a following sentence element.

22. A method according to any of the preceding claims characterised by adding a neural network to the knowledge extraction method.

23. A method according to any of the preceding claims characterised by adding a grammatical parsing mechanism to the knowledge extraction method.

24. A method for iteratively searching knowledge structures characterised by performing a conceptual searching by use of said defined ontology.

25. A method according to claim 24 characterised in that a knowledge search returns occurrences of a given concept and concepts that it subsumes.

26. A method according to claim 24 or 25 characterised in that a knowledge search returns occurrences of a given concept and concepts by which it is subsumed.

27. A method according to any of claims 24 to 26 characterised in that a knowledge search for a concept is performed in accordance with, preferably constrained by, a word that instantiates the concept.

28. A method according to any of claims 24 to 26 characterised in that a knowledge search includes inverted relations.

29. A method according to any of claims 24 to 28 characterised in that a search includes the semantic equivalences of concepts and relations.
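
By way of a non-limiting illustration only, the conceptual search of claims 24 to 26 may be sketched as follows. The sketch assumes a toy ontology held as a child-to-parent mapping and a knowledge library held as a mapping from document identifiers to sets of extracted concepts. The names `IS_A`, `LIBRARY`, `ancestors`, `descendants` and `search` are hypothetical and do not appear in the description; the sketch is not the implementation of the invention.

```python
from collections import defaultdict

# Hypothetical toy ontology: child concept -> parent concept (is-a links).
IS_A = {
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
    "bird": "animal",
}

# Hypothetical knowledge library: document id -> concepts extracted from it.
LIBRARY = {
    "doc1": {"dog"},
    "doc2": {"bird"},
    "doc3": {"animal"},
}


def ancestors(concept):
    """Return the concepts by which `concept` is subsumed (generalisations)."""
    result = set()
    while concept in IS_A:
        concept = IS_A[concept]
        result.add(concept)
    return result


def descendants(concept):
    """Return the concepts that `concept` subsumes (specialisations)."""
    children = defaultdict(set)
    for child, parent in IS_A.items():
        children[parent].add(child)
    result, stack = set(), [concept]
    while stack:
        for child in children[stack.pop()]:
            if child not in result:
                result.add(child)
                stack.append(child)
    return result


def search(library, concept, direction="down"):
    """Return sorted ids of documents containing `concept` or related concepts.

    direction="down" also matches concepts that `concept` subsumes;
    direction="up" also matches concepts by which `concept` is subsumed.
    """
    related = descendants(concept) if direction == "down" else ancestors(concept)
    wanted = {concept} | related
    return sorted(doc for doc, concepts in library.items() if concepts & wanted)
```

In this sketch, `search(LIBRARY, "mammal")` matches the document mentioning "dog", because "mammal" subsumes "dog", whereas `search(LIBRARY, "mammal", direction="up")` matches the document mentioning "animal", by which "mammal" is subsumed.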