WO2005073908A1

WO2005073908A1 - Ontological knowledge base and information retrieval method for a natural language request

Info

Publication number: WO2005073908A1
Application number: PCT/FR2005/000063
Authority: WO
Inventors: Louis Chevallier; Anahide Tchertchian
Original assignee: Thomson Licensing
Priority date: 2004-01-12
Filing date: 2005-01-11
Publication date: 2005-08-11
Also published as: FR2865055A1

Abstract

The invention relates to a knowledge base for a predetermined domain of knowledge, comprising at least one ontology base (12) consisting of formalised concepts and roles subject to a set of semantic constraints formulated in accordance with a predetermined description logic, and an instance base (14) relating to the concepts. The knowledge base further comprises a knowledge base (11) relating to the domain of knowledge, comprising at least one base of lexical units of the interrogation constructor type, consisting of keywords representative of types of questions and of syntactic patterns, the keywords representative of types of questions being associated with a predetermined set of classes of interrogation syntactic structures and with a predetermined set of concepts and roles that are objects of interrogation. The invention can be used with any natural language.

Description

Ontological knowledge base and method of extracting information from a query in natural language.

The present invention relates to a knowledge base relating to a predetermined domain and to a method for extracting data from it when it is interrogated by a query in natural language. More particularly, the present invention relates to a knowledge base relating to a predetermined domain of knowledge, this knowledge base comprising at least one ontology base consisting of formalized concepts and roles subject to a set of semantic constraints, and a base of instances of the concepts. The invention also relates, more particularly, to a method for extracting data from a knowledge base comprising at least one ontology base made up of formalized concepts and roles subject to a set of constraints, a base of instances relating to the concepts and satisfying the constraints, and a base of keywords relating to the domain and representative of types of questions from among a set of types of questions and/or syntactic structures.

Methods for interrogating textual databases exist in the state of the art. In the case of a textual database made up of documents shared on the Internet, the interrogation methods generally estimate the relevance of a document from the number of words shared between the query formulated by the user and the document. The problem with this type of method is that it treats words as objects that are substantially independent of one another, and documents shared on the Internet as mere sequences of words. As a result, the number of documents returned is generally very large, and the user typically hopes to find the information of interest by sorting through the returned responses personally.
Other conventional methods, such as natural language query analysis methods, carry out as exact a syntactic analysis as possible in order to remove ambiguities in the text of the query, for example homonymy and/or synonymy, and/or to extract relations between the words of the query so as to eliminate irrelevant documents. However, these prior art methods analyze the query from a purely syntactic point of view and do not access its semantic meaning. The list of returned responses is usually very large and includes many off-topic documents, and these methods may fail to return important documents simply because they are not presented in a suitable form.

When the knowledge domain to which the data relate is closed, for example when it covers a finite set of data relating to wines, museums, a sport, etc., it is possible to construct a semantic model of the domain, that is, to define a finite set of concepts, a set of semantic relationships between them, or "ontology" of the domain, and a finite set of instances of the concepts.

Documents US-A-5,555,408 and US-A-5,995,955 disclose a knowledge base modeling a domain of knowledge. This knowledge base consists solely of a data structure of the "concept network" type. This type of modeling does not allow formal calculations to be applied directly to the data of the knowledge base, so that additional algorithmic modeling is necessary in order to exploit it.

There are methods of interrogating such knowledge bases, hereinafter referred to as "ontological knowledge bases", which rely on a preliminary semantic analysis of the elements of the question as a function of the semantic relationships of the knowledge base.
However, this type of method conventionally begins with a precise syntactic analysis and rejects queries that are not satisfactorily formulated from the point of view of syntax, even though they are perfectly valid from the semantic point of view adopted in the ontological knowledge base. Thus, a method that first filters natural language queries by syntactic analysis lacks flexibility and in effect forces the user to formulate the query in the right form, which restricts a priori the range of possible responses.

Other known methods correlate the query with questions previously recorded in the knowledge base and associated with predetermined answers stored in it. If a semantically valid query cannot be correlated with one of these questions, these methods return no response.

The object of the present invention is to solve the above-mentioned problems. It relates to a method for extracting data from an ontological knowledge base which, in particular, determines from the natural language query a complete set of relationships that are structurally viable in the ontology, and which determines the valid responses to the query by eliminating the queries not semantically supported by the ontological knowledge base. The probability of failing to provide a relevant response and the probability of providing an irrelevant response are thus both low, since the relevance of a response is assessed from the point of view of the formalized ontology of the knowledge base and does not depend in practice on the form in which the query is ultimately formulated according to the criteria of the natural language used.
In particular, another object of the present invention is to implement a method of extracting data, and the information carried by these data, from a knowledge base specific to any domain, by means of a mechanism simulating a logical reasoning process of search and decision that is totally independent of the domain considered and of the information sought. Another object of the present invention is to implement a method of extracting data from a knowledge base that is substantially independent of the natural language used to formulate the query, for natural languages of equivalent syntax.

To this end, the subject of the present invention is a method of extracting data from a knowledge base relating to a domain and interrogated by a query in natural language, the knowledge base comprising at least one ontology base consisting of formalized concepts and roles subject to a set of semantic constraints formulated in accordance with a predetermined description logic, a base of instances relating to the concepts, and a base of keywords relating to the domain and representative of types of questions from among a set of types of questions and/or syntactic structures. The method is remarkable in that it comprises at least the steps of:
- lexical analysis of the query in natural language, consisting in identifying the signifying lexical units of the query and in labeling each of the lexical units with at least one concept, role, instance or keyword of the knowledge base, in order to generate at least one labeled query made up of labeled lexical units;
- syntactic analysis of each labeled query, comprising the steps of:
  o creation of elementary semantic units made up of at least two labeled lexical units, each of these lexical units being labeled by a concept, a role or an instance, the concepts, roles and instances associated with each of these elementary semantic units together satisfying a tuple configuration from a predetermined set of tuple configurations; and
  o target identification, consisting in identifying at least one syntactic relation between the lexical units labeled by a concept, a role or an instance and the lexical units labeled by a keyword representative of a type of question, in order to determine at least one target interrogation constraint satisfying a question from among the set of question types;
- semantic analysis of each labeled query, comprising at least the steps of:
  o validation of each of the elementary semantic units according to the constraints of the knowledge base, in order to obtain a set of validated elementary semantic units;
  o validation of the target interrogation constraints as a function of the associated validated elementary semantic units and/or of the constraints of the knowledge base, in order to obtain a set of validated interrogation targets; and
- data extraction, consisting in extracting from the knowledge base the instances of the instance base that satisfy the validated elementary semantic units via the validated target constraints.
The subject of the invention is also a knowledge base relating to a predetermined domain of knowledge, this knowledge base comprising at least one ontology base consisting of formalized concepts and roles subject to a set of semantic constraints formulated in accordance with a predetermined description logic, and a base of instances relating to the concepts, characterized in that it further comprises a knowledge base relating to the domain of knowledge, comprising at least one base of lexical units of the interrogation constructor type consisting of keywords representative of types of questions and of syntactic patterns, the keywords representative of types of questions being associated with a predetermined set of classes of interrogation syntactic structures and with a predetermined set of concepts and roles that are objects of interrogation.

The present invention will be better understood on reading the description which follows, given solely by way of example, with reference to the appended drawings, in which:
- FIG. 1 is a schematic diagram of the structure of a knowledge base according to the invention associated with interrogation means; and
- FIG. 2 is a flow diagram of the steps of the method according to the invention.

The structure of a knowledge base according to the invention will first be described schematically with reference to FIG. 1. The knowledge base 8 according to the invention comprises a conventional knowledge base relating to a predetermined domain of knowledge, referenced 10, and a knowledge base relating to the domain of interrogation, referenced 11.
The knowledge base 10 relating to the domain of knowledge is a semantic model of that domain, built on a predetermined description logic supporting at least definition (denoted ≡), negation (denoted ¬), subsumption (denoted ⊑), disjunction (denoted ⊔), conjunction (denoted ⊓), universal quantification (denoted ∀) and existential quantification (denoted ∃). This ontological knowledge base 10 conventionally comprises an ontology base 12, or "T-box", and an instance base 14, or "A-box". The ontology base 12 includes a concept database 16 and a role database 18, and the instance base 14 includes an instance database 20 relating to the concepts of the concept database 16. Each concept, role and instance is referenced uniquely in the database, for example by a number, and is associated uniquely, for the purpose of formalization in a predetermined natural language, with at least one predetermined lexical unit of a database of lexical units 21.

Conventionally, concepts and roles are subject to a predetermined set of semantic constraints formulated in accordance with the description logic, which is implemented by a logical core 22 whose function is in particular to guarantee the integrity of the knowledge base 10 with regard to the description logic. These semantic constraints on the concepts and roles of the concept database 16 and the role database 18 are stored, for example, in a database of ontological constraints 23. They consist at least of constraints defining concepts in terms of atomic concepts, constraints defining roles in terms of atomic roles, subsumption constraints between concepts and subsumption constraints between roles, the term atomic designating the elementary concepts and roles used to define the other concepts and roles of the knowledge base 10.
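As an illustration only, the T-box / A-box organisation described above can be sketched with a few Python containers. The class name `KnowledgeBase`, its fields, and the domain and range chosen for the role "a_Gagné" are assumptions made for this sketch; the concept, role and instance names are borrowed from the tennis example given later in the description. The `assert_role` check stands in, very loosely, for the integrity function attributed to the logical core 22.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeBase:
    # T-box: direct subsumption edges, child concept -> parent concepts
    # (assumed acyclic in this sketch).
    parents: dict = field(default_factory=dict)
    # T-box: role -> (domain concept, range concept).
    roles: dict = field(default_factory=dict)
    # A-box: concept assertions, instance -> concept.
    concept_of: dict = field(default_factory=dict)
    # A-box: role assertions stored as (subject, role, object) triples.
    facts: set = field(default_factory=set)

    def subsumes(self, general, specific):
        """Return True if `general` equals or is an ancestor of `specific`."""
        if general == specific:
            return True
        return any(self.subsumes(general, parent)
                   for parent in self.parents.get(specific, ()))

    def assert_role(self, subject, role, obj):
        """Add a role assertion only if it respects the role's domain and range."""
        domain, range_ = self.roles[role]
        if not (self.subsumes(domain, self.concept_of[subject])
                and self.subsumes(range_, self.concept_of[obj])):
            raise ValueError(f"{subject} {role} {obj} violates domain/range")
        self.facts.add((subject, role, obj))


kb = KnowledgeBase(
    parents={"Joueur_de_Tennis_Homme": {"Joueur_de_Tennis"},
             "Joueur_de_Tennis": {"Person"}},
    # Hypothetical domain and range for the role "a_Gagné".
    roles={"a_Gagné": ("Joueur_de_Tennis", "Tournoi")},
    concept_of={"Agassi": "Joueur_de_Tennis_Homme",
                "Rolland_Garros_1999": "Tournoi",
                "Paris": "Place"},
)
kb.assert_role("Agassi", "a_Gagné", "Rolland_Garros_1999")
```

In this sketch an assertion such as `kb.assert_role("Paris", "a_Gagné", "Rolland_Garros_1999")` is refused, because "Place" is not subsumed by the assumed domain of the role.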
An additional type of semantic constraint relates to both concepts and roles. Conventionally, a role is a binary semantic relationship between a starting domain, called "domain", and an arrival domain, called "range" in the technical field of knowledge base construction. The domain and the range are formalized by logical expressions, supported by the description logic, over the concepts of the concept database.

Instances of the instance database 20 are also subject to a predetermined set of constraints, stored for example in an assertion constraint database 26 of the instance base 14. These constraints include, in particular, assertions on concepts, that is, the membership of an instance in a concept, and assertions on roles, which link instances of the instance database to one another.

The knowledge base 10 also includes a database 24 of synonyms connected to the lexical unit database 21. The database 24 consists of a predetermined set of synonyms of the lexical units used to formalize the concepts, roles and instances of databases 16, 18 and 20.

Advantageously, the knowledge base 10 relating to the predetermined knowledge domain is connected to the knowledge base 11, hereinafter referred to as the "interrogative" knowledge base, which models the domain of interrogation on the basis of the predetermined description logic. The interrogative knowledge base 11 comprises a database 30 of key lexical units relating to interrogation, hereinafter called the lexicon. These key lexical units consist of a predetermined set of constructors and markers.

Constructors consist of keywords and syntactic patterns representative of types of questions. Typically, for a language such as French, the keywords are the interrogative pronouns ("qui", "que", "quoi", etc., that is, "who", "what", "which", etc.), the interrogative adverbs ("when", "where", "how much", etc.) and the interrogative phrases ("against whom", "with what", etc.). The syntactic patterns are expressions specific to interrogation, such as "est-ce que" ("is it that"), "is there", "is it", etc., and are used to identify the type of question submitted in a natural language query by a user of the knowledge base 8, as explained in more detail below.

Markers consist of keywords that are associated with syntactic relationships and carry semantic meaning. Typically, marker keywords are prepositions such as "in front of", "behind", "in", etc., and prepositional phrases such as "above", "long after", etc.

Conventionally, constructors and markers are used only to reveal the syntactic structure of an interrogative query and to assign a syntactic role to each of its words, in order to remove homonymic or synonymic ambiguities or to identify a syntactic relation, analogous to that of the query, in a text belonging to a textual database.

In a first embodiment of the interrogative knowledge base 11, each constructor keyword and marker keyword of the lexicon 30 is associated with at least one "universal" atomic concept and/or role, in a concept database 32 and a role database 34 respectively, the term universal meaning that the concept or role is almost necessarily used in modeling any domain of knowledge. Universal concepts include place, date, person, object and event, the last designating an object associated with any one of the four preceding concepts. Typically, the constructor keywords are associated with atomic concepts that are objects of interrogation.
Thus the interrogative pronoun "who" is associated with the atomic concept "Person", "what" with "Thing", "where" with "Place", "when" with "Date", and "how much" with "Quantity". Certain keywords are associated with several atomic concepts, for example the marker keyword "à" ("at", "in"), which is associated with the atomic concepts "Place" and "Date". Certain keywords can also be associated with atomic roles, for example the marker keyword "à" or the constructor keyword "when", which are associated with the atomic role "a_eu_Lieu_à", designating the occurrence of a concept and/or an instance in a place or on a date.

The concepts and roles of the knowledge base 11 are subject to a predetermined set of semantic constraints, stored for example in a constraint database 36. These constraints relate in particular to the subsumption of concepts and roles of the knowledge base 10 relating to the domain by concepts and roles of the interrogative knowledge base 11, where the concepts and roles of the knowledge base relating to the knowledge domain have not been defined in relation to the above-mentioned universal atomic concepts. It is recalled that, in a hierarchical classification of the structured information of a knowledge base, subsumption is the logical operation of placing an item of information classified in a given category under a more general category.

Another embodiment of the interrogative knowledge base 11 consists in associating the keywords of the lexicon 30 directly with the concepts, roles and instances of the knowledge base 10, without using the universal atomic concepts and roles, which dedicates the interrogative knowledge base specifically to the knowledge base relating to the knowledge domain.
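The first embodiment of the lexicon 30 can be pictured as a lookup table from keywords to universal concepts and roles, combined with the subsumption constraints of database 36. The sketch below is only one possible reading: the dictionary layout, the helper names `universal_ancestors` and `constructors_for`, and the content of `SUBSUMED_BY` are assumptions, while the keyword associations are those listed in the text.

```python
# Interrogative lexicon 30 (sketch):
# keyword -> (kind, universal concepts, universal roles).
LEXICON = {
    "who":      ("constructor", {"Person"},        set()),
    "what":     ("constructor", {"Thing"},         set()),
    "where":    ("constructor", {"Place"},         set()),
    "when":     ("constructor", {"Date"},          {"a_eu_Lieu_à"}),
    "how much": ("constructor", {"Quantity"},      set()),
    "à":        ("marker",      {"Place", "Date"}, {"a_eu_Lieu_à"}),
}

# Constraint database 36 (assumed content): domain concept -> subsuming concept.
SUBSUMED_BY = {
    "Joueur_de_Tennis_Homme": "Joueur_de_Tennis",
    "Joueur_de_Tennis": "Person",
}


def universal_ancestors(concept):
    """Return the concept followed by every concept that subsumes it."""
    chain = [concept]
    while chain[-1] in SUBSUMED_BY:
        chain.append(SUBSUMED_BY[chain[-1]])
    return chain


def constructors_for(concept):
    """Constructor keywords whose universal concept subsumes `concept`."""
    ancestors = set(universal_ancestors(concept))
    return {keyword for keyword, (kind, concepts, _roles) in LEXICON.items()
            if kind == "constructor" and concepts & ancestors}
```

With these tables, `constructors_for("Joueur_de_Tennis_Homme")` yields only "who", since the concept is subsumed by "Person" through "Joueur_de_Tennis".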
This embodiment has the advantage of speeding up the extraction of data from the knowledge base 10.

Conventionally, knowledge bases rely on the universal atomic concepts described above to model the knowledge domain, so that it is not necessary to define subsumption constraints between the concepts and roles of the knowledge base 10 and those of the interrogative knowledge base 11. Advantageously, the interrogative knowledge base 11 is then independent of the knowledge base relating to the domain and suited to any knowledge base relating to a specific domain modeled with universal atomic concepts and roles.

Furthermore, each of the constructor keywords, which are representative of types of questions, is associated with at least one class of interrogation syntactic structures from a predetermined set of such classes to which a natural language query may belong. This predetermined set, stored for example in a database 38 of interrogation syntactic structures, comprises at least the classes of interrogation syntactic structures with a response of type:
- "binary", that is, an interrogation structure conjecturing the existence of a semantic relationship contained in the natural language query. These are typically structures with a qualitative response of the "yes" or "no" type, such as that of the query "Did Agassi play Rolland Garros?", and structures with a quantitative response, such as that of the query "How many games has Agassi played at Rolland Garros?", for which the associated response extraction process returns the number of times the semantic relationship between Agassi, played and Rolland Garros is verified;
- "enumerative", that is, an interrogation structure conjecturing a response made up of at least one instance of a concept that is the object of the interrogation, involved and identified in a semantic relationship with a role and a concept or an instance of the natural language query; and
- "relational", that is, an interrogation structure conjecturing a response made up of at least one instance of a concept satisfying a semantic constraint between a concept or an instance and a role whose starting domain subsumes this concept or this instance and whose arrival domain subsumes the instances of the response.

Typically, the keywords "who", "what", "when" and "where" are associated with the "enumerative" and "relational" classes, and the keyword "how much" with the "binary" class.

Advantageously, the database 38 of interrogation syntactic structures further comprises, for each class, a predetermined set of syntactically equivalent interrogation structures. These sets are used, for example, during a step of identifying the class of interrogation syntactic structures to which the natural language query belongs.

Specifically, the lexical units of the lexicon 30 are formalized in a predetermined natural language for the purposes of querying the knowledge base 10 relating to the knowledge domain. However, the structure and content of knowledge bases 10 and 11, apart from the lexical unit database 21, the synonym database 24 and the lexicon 30, as well as the extraction method described below, are independent of the natural language used.
Indeed, all the concepts, roles and instances are referenced by a universal referent, an arbitrary number for example, and are logically linked according to the rules of the description logic of the knowledge base, independently of any natural language. Advantageously, the lexical unit database 21, the synonym database 24 and the lexicon 30 are removable and interchangeable with lexical unit and synonym databases and a lexicon formulated in another natural language, so that the knowledge base 10 relating to the knowledge domain can be queried in another natural language without modifying the structure, the arrangement of data or the content of the other elements of knowledge bases 10 and 11, or indeed the method which is the subject of the invention.

Finally, the knowledge base 10 relating to the knowledge domain and the interrogative knowledge base 11 are connected to an interrogation module 40 capable of interrogating the knowledge base 10 by implementing the method which is the subject of the invention.

It will of course be understood that the number and definition of the concepts, roles, instances, keywords, constraints and syntactic structures of knowledge bases 10 and 11 depend on the desired granularity in modeling the knowledge and interrogation domains, so that the size and complexity of each of the bases 10 and 11 are a function of that granularity.

Purely by way of illustration, a knowledge base relating to the field of tennis is described. Of course, the structure of the knowledge base and the data extraction method according to the invention are completely independent not only of the type of data processed but also of the nature of the information carried by these data. It is recalled that the aforementioned lexical units can be chosen arbitrarily, provided that each has, for the user, a one-to-one and semantically significant value in natural language.
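The interchangeability of the language-dependent databases can be illustrated by keeping the referents numeric and swapping only the word tables. This is a sketch under assumptions: the referent numbers, the table names `REFERENTS` and `LEXICAL_UNITS`, and the French and English word lists are invented for the example.

```python
# Language-independent referents: arbitrary numbers mapped to the kind of
# element they designate (hypothetical numbering).
REFERENTS = {1: "concept", 2: "role", 3: "instance"}

# One lexical unit table per natural language; only these tables change when
# the knowledge base is queried in another language.
LEXICAL_UNITS = {
    "fr": {"joueur de tennis": 1, "a battu": 2, "agassi": 3},
    "en": {"tennis player": 1, "beat": 2, "agassi": 3},
}


def resolve(word, language):
    """Map a word of the given language to (referent, kind), or None."""
    referent = LEXICAL_UNITS[language].get(word.lower())
    if referent is None:
        return None
    return referent, REFERENTS[referent]
```

Here `resolve("Joueur de tennis", "fr")` and `resolve("tennis player", "en")` designate the same referent, so everything downstream of the lookup can remain unchanged.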
It is understood, for example, that the concept "Joueur_de_Tennis" can be replaced by any equivalent different value, for example "Joueur / de / Tennis" or "Joueur de Tennis".

The knowledge base given as an example, relating to the field of tennis, is based on the universal atomic concepts "Person", "Date", "Place", "Object" and "Event". For the tennis field, it is also possible to define the concepts "Man", "Joueur_de_Tennis", "Joueur_de_Tennis_Homme", "Paire_Joueurs_de_Tennis", "Tournois", "Match", "Vainqueur", "∃ a_Gagné.Tournoi", etc. Possible roles are "a_eu_Lieu_à", "a_eu_Lieu_le", "a_Battu", "a_Gagné", "a_pour_Joueur", "a_Joué_à", etc. Possible instances are "Agassi", "Rolland_Garros", "Paris" and "Rolland_Garros_2003".

For example, a semantic definition constraint is the definition of the concept "Vainqueur" by the relation: "Vainqueur" ≡ "Joueur_de_Tennis" ⊓ "∃ a_Gagné.Tournoi". A semantic concept subsumption constraint is, for example, the subsumption "Joueur_de_Tennis_Homme" ⊑ "Joueur_de_Tennis" ⊑ "Person". A semantic constraint defining the starting and arrival domains of a role is, for example, the following constraint on the role "a_Joué_à": "a_Joué_à" ≡ (AND "Person" (OR "Joueur_de_Tennis" "Paire_Joueurs_de_Tennis")), where "AND" and "OR" represent the logical operators AND and OR respectively. A semantic assertion constraint on a concept is, for example, the membership of the instance "Agassi" in the concept "Joueur_de_Tennis_Homme". An assertion on a role is, for example, the relation between the instance "Agassi" and the instance "Rolland_Garros_1999", linked by the role "a_Gagné".

The method which is the subject of the invention is now described with reference to FIG. 2. The method first carries out a lexical analysis of a natural language query formulated by a user in order to identify lexical units associated with concepts, roles, instances and keywords of the knowledge base. To this end, in a step 52, the method identifies and eliminates the non-signifying words belonging to a predetermined set of words, such as definite and indefinite articles, subordinating conjunctions, etc. In a step 54, the method then tests whether the remaining words of the query, that is, the words carrying meaning, are supported by the knowledge bases 10 and 11, that is, whether they exist in them. If so, the natural language query is, by definition, said to be consistent with the knowledge base. If the result of this test is negative, the user's query is rejected. If it is positive, a step 56 of identifying the concepts, roles, instances and keywords contained in the natural language query is triggered.

In step 56, the method determines the set of possible combinations of concepts, roles, instances and keywords of the knowledge base contained in the query, for example by implementing a decision tree search algorithm which traverses the knowledge base 8 in search of lexical units of the query that are associated with concepts, roles, instances and keywords of the knowledge base 8. The method thus generates a set of labeled queries made up of lexical units corresponding to the meaning-carrying words of the natural language query, each lexical unit being labeled by a concept, a role, an instance or a keyword of the knowledge base.
In what follows, a concept, a role, an instance or a keyword associated with a lexical unit is called the "label" of the lexical unit. For example, for the natural language query "How many left-handed players did Agassi beat in Paris?", several labeled queries made up of labeled lexical units are possible for the signifying words of the query, as illustrated by Table 1. The first line of Table 1 lists the signifying words of the natural language query, the word "of" not being signifying. The rest of Table 1 lists the possible lexical units deduced by the decision tree search algorithm and classifies them into concepts, roles, instances and associated keywords.

Lexical unit | how many | players | left-handed | Agassi | beaten | Paris

Table 1
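The lexical analysis of steps 52 to 56 can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of the invention: a plain Cartesian product replaces the decision tree search mentioned in the text, the query is assumed to be already split into lexical units, and the stop-word list, the `LEXICAL_UNITS` table and the concept "Gaucher" are hypothetical.

```python
from itertools import product

# Step 52 (sketch): words regarded as non-signifying (assumed list).
STOP_WORDS = {"did", "the", "a", "of"}

# Lexical unit -> possible labels as (kind, name); assumed content.
LEXICAL_UNITS = {
    "how many":    [("keyword", "how much")],
    "left-handed": [("concept", "Gaucher")],  # hypothetical concept
    "players":     [("concept", "Joueur_de_Tennis")],
    "agassi":      [("instance", "Agassi")],
    "beat":        [("role", "a_Battu")],
    "in":          [("keyword", "à")],
    "paris":       [("instance", "Paris"), ("instance", "Paris_Roger")],
}


def label_query(units):
    """Return every labeled query for `units`, or None if the query is rejected."""
    # Step 52: drop non-signifying words.
    meaningful = [u for u in units if u not in STOP_WORDS]
    # Step 54: reject the query if a remaining word is unknown to the bases.
    if any(u not in LEXICAL_UNITS for u in meaningful):
        return None
    # Step 56: enumerate every combination of labels, one label per unit.
    choices = [[(u,) + label for label in LEXICAL_UNITS[u]] for u in meaningful]
    return [list(combination) for combination in product(*choices)]


QUERY = ["how many", "left-handed", "players", "did", "agassi", "beat", "in", "paris"]
labeled_queries = label_query(QUERY)
```

Because only "paris" has two candidate labels in this table, the sketch produces two labeled queries, one per reading of "Paris".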

Other labelings are of course possible. The example developed above illustrates that, for one natural language query, several combinations of labeled lexical units, or labeled queries, are possible.

In a first embodiment of the method according to the invention, certain combinations can be eliminated on the basis of the marker keywords. For example, the marker keyword "à" is associated with, and followed by, a concept of place or date, so that the labeled queries in which the lexical unit "Paris" is labeled by the instance "Paris_Roger" of the concept "Joueur_de_Tennis_Homme", corresponding to the tennis player named Roger Paris, can be eliminated, because the concept "Joueur_de_Tennis_Homme" is neither the concept "Date" nor the concept "Place" and is not subsumed by either of them. The number of labeled queries used in the following steps of the method is thus reduced, which in turn reduces the computation time of the method according to the invention.

In a second embodiment, all of the labeled queries are kept, which is particularly advantageous when, for example, the user has made a syntax error in the natural language query. In general, the method according to the invention is particularly tolerant of this type of error. This allows, for example, a foreign user whose native language is not the language used to formalize the knowledge base to interrogate it while making certain errors, without detracting from the relevance of the responses established by the method according to the invention, as will be described below. Indeed, the rejection or acceptance of a query is based solely on the ontology of the knowledge base 10 relating to the knowledge domain.

To carry out the syntactic analysis, the method which is the subject of the invention then sorts the labeled queries in a step 58.
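The marker-based elimination of the first embodiment can be sketched as a filter over labeled queries. The table contents and the function names `is_subsumed_by_any` and `respects_markers` are assumptions made for illustration; only the rule itself (the unit following "à" must denote a place or a date) comes from the text.

```python
# Assumed subsumption edges and concept assertions for the example.
PARENTS = {"Joueur_de_Tennis_Homme": "Joueur_de_Tennis",
           "Joueur_de_Tennis": "Person"}
CONCEPT_OF = {"Paris": "Place", "Paris_Roger": "Joueur_de_Tennis_Homme"}

# Marker keyword -> concepts allowed for the lexical unit that follows it.
MARKER_CONCEPTS = {"à": {"Place", "Date"}}


def is_subsumed_by_any(concept, targets):
    """True if `concept` or one of its ancestors belongs to `targets`."""
    while concept is not None:
        if concept in targets:
            return True
        concept = PARENTS.get(concept)
    return False


def respects_markers(labeled_query):
    """Check each marker keyword against the label of the unit that follows it."""
    for (_, kind, name), (_, next_kind, next_name) in zip(labeled_query,
                                                          labeled_query[1:]):
        if kind == "keyword" and name in MARKER_CONCEPTS:
            if next_kind == "instance":
                concept = CONCEPT_OF.get(next_name)
            elif next_kind == "concept":
                concept = next_name
            else:
                return False
            if not is_subsumed_by_any(concept, MARKER_CONCEPTS[name]):
                return False
    return True


query_place = [("agassi", "instance", "Agassi"), ("in", "keyword", "à"),
               ("paris", "instance", "Paris")]
query_player = [("agassi", "instance", "Agassi"), ("in", "keyword", "à"),
                ("paris", "instance", "Paris_Roger")]
kept = [q for q in (query_place, query_player) if respects_markers(q)]
```

Only the reading in which "Paris" is a place survives the filter; the second embodiment described above would simply skip this filtering step.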
Different predetermined forms of labeled queries are recognized by the method which is the subject of the invention, which identifies the structure of each labeled query and eliminates those whose structure is not supported by the description logic used by the knowledge base. Typically, a labeled query comprising two adjacent lexical units both labeled by a role is eliminated because it does not conform to the description logic.

The next step 60 shown in FIG. 2 consists, for each labeled query, in generating a set of elementary semantic units made up of at least two labeled lexical units satisfying a triplet configuration from a predetermined set of triplet configurations. Step 60 determines a set of syntactic relations between the labeled lexical units, that is, a set of relations that are formally correct from the point of view of the description logic used in the knowledge base, without yet being judged on their semantic meaning, that is, ultimately, on their existence as a semantic constraint coded in the knowledge base 10.

More particularly, the method determines, for each labeled query, a set of elementary semantic units of at least two lexical units. A first form of elementary semantic unit is a triplet of distinct lexical units consisting of two lexical units labeled by a concept and/or an instance and a third lexical unit labeled by a role which links the labels of the first two, that is, a triplet of lexical units whose labels together satisfy a configuration of the type {concept, role, concept}, {concept, role, instance}, {instance, role, concept} or {instance, role, instance}.
A second form of elementary semantic unit is a couple of distinct lexical units, each labeled by a concept or an instance, which are likely to be linked by an unidentified role N, the labels of the lexical units and the unidentified role N together verifying a triplet configuration of the type {concept, role, concept}, {concept, role, instance}, {instance, role, concept} or {instance, role, instance}. It is particularly advantageous to consider such couples of lexical units. Indeed, a first lexical unit labeled by a concept or an instance is likely to be linked to a second lexical unit labeled by a concept or an instance by a role that is implicit, and therefore unidentified, in the natural language request. For example, in the request "who played in the final against Agassi in Paris?", the role "a_Joué_contre" is explicitly apparent from the lexical unit "played against". However, the request also contains an implicit role, namely the role "a_Lieu_à", between the lexical unit "final", labeled by the instance "Final" of the concept "Match", and the lexical unit "Paris", labeled by the instance "Paris" of the concept "Place". The search for a semantic relationship implicitly included in the natural language request, through the indeterminism associated with the role not yet identified, constitutes a specific type of extraction of semantic meaning from the request. This process is particularly advantageous because, in general, the user formulating the query takes semantic shortcuts which must be identified in order to extract the real object of the request. In the example described above, a semantically well-formulated request is in fact "who played at least one final match of the Paris Tournament against Agassi?", which is generally unconsciously shortened to "who played in the final against Agassi in Paris?".
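The two forms of elementary semantic unit described above can be sketched as follows. This is only an illustration under stated assumptions: the tuple layout (text, kind, label) and the function name are hypothetical, and the sketch keeps the units in the order in which they appear in the request, which the text does not specify.

```python
from itertools import combinations


def elementary_units(labeled_units):
    """Build candidate triplets and couples from one labeled request.

    `labeled_units` is a list of (text, kind, label) tuples, with kind in
    {"concept", "instance", "role"} (an assumed representation).
    No semantic validation is done here: the units are only formally
    well formed with respect to the triplet configurations.
    """
    entities = [u for u in labeled_units if u[1] in ("concept", "instance")]
    roles = [u for u in labeled_units if u[1] == "role"]
    # Second form: couples possibly linked by a not yet identified role.
    couples = list(combinations(entities, 2))
    # First form: two concept/instance units linked by an explicit role.
    triplets = [(first, role, second)
                for first, second in couples
                for role in roles]
    return triplets, couples


# "who played in the final against Agassi in Paris?"
request = [
    ("final", "instance", "Final"),
    ("played against", "role", "a_Joué_contre"),
    ("Agassi", "instance", "Agassi"),
    ("Paris", "instance", "Paris"),
]
triplets, couples = elementary_units(request)
```

The couple formed by "final" and "Paris" is the one for which the implicit role is later sought during semantic validation.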
The following step 62 of the method consists, for each labeled request, in identifying at least one syntactic relationship between the lexical units labeled by a concept, a role or an instance and the lexical units labeled by a keyword representative of a type of question. The process 62 first identifies the class of interrogation syntactic structures to which the natural language request, and therefore also the labeled requests, belongs. This identification is carried out as a function of the constructor keywords and constructor syntactic patterns of the lexicon 30 contained in the natural language request and of the equivalent interrogation syntactic structures of the database of syntactic structures 38. Next, the process 64 identifies at least one syntactic relationship between the lexical units labeled by a concept, a role or an instance and the lexical units labeled by a keyword representative of a type of question. More particularly, when the labeled request belongs to the class of interrogation syntactic structures with a response that is:
- binary: the interrogative logical constraint is a constraint on the existence of elementary semantic units and, additionally, when the expected response is quantitative, a constraint on the number of times the existence of the elementary semantic units is verified in the knowledge base relating to knowledge domain 10;
- enumerative: the logical constraint relates to at least one target concept, which is selected as being the concept associated with the lexical unit labeled by the constructor keyword, or one of the concepts of the knowledge base relating to domain 10 subsumed by it. For example, the lexical unit labeled by a constructor keyword in the request "who won Roland Garros in 1990?" is the word "who", which is associated with the concept "Person". The concept "Person" therefore constitutes a target concept.
In addition, the concept "tennis-player" is subsumed by the concept "Person" and therefore also constitutes a possible target concept.
- relational: the method determines at least one target constraint triplet of the type {CI, R, C_ind}, where CI designates a concept C or an instance I, R a role and C_ind a target concept of interrogation equal to the concept associated with the constructor keyword or to a concept subsumed by it. The concept C, or the instance I of a concept C, is a concept or an instance labeling a lexical unit of the query, or a concept, or an instance of a concept, subsuming the label of a lexical unit of the query. The role R is a role labeling a lexical unit or subsuming the label of a lexical unit.
Following the syntactic analysis performed by the method of the invention, a set of relations, that is to say triplets, couples and target constraints, has been generated for each labeled request. The method according to the invention then consists in semantically analyzing each labeled request, in steps 66 and 68 represented in FIG. 2. The next step 66 of the method is a step of validation, for each labeled request, of each of the elementary semantic units as a function of the constraints of the knowledge base, in order to obtain a set of elementary semantic units validated in the knowledge base. The process of step 66 validates an elementary semantic unit differently depending on whether it is a triplet or a couple. When the semantic unit is a triplet, for example (PULCI, ULR, SULCI), where PULCI and SULCI respectively designate the first and second lexical units labeled by a concept or an instance, and ULR designates the lexical unit labeled by a role, the triplet is validated if each of the pairs (PULCI, ULR) and (ULR, SULCI) is valid in knowledge base 10. More particularly, R designating the label of ULR:
- if PULCI is labeled by a concept C, the pair (PULCI, ULR) is validated if concept C is subsumed by the starting domain of R;
- if PULCI is labeled by an instance I, the pair (PULCI, ULR) is validated if at least one concept C of the knowledge base, of which I is an instance, is subsumed by the starting domain of R;
- if SULCI is labeled by a concept C, the pair (ULR, SULCI) is validated if concept C is subsumed by the arrival domain of R; and
- if SULCI is labeled by an instance I, the pair (ULR, SULCI) is validated if at least one concept C of the knowledge base, of which I is an instance, is subsumed by the arrival domain of R.
If neither of the pairs (PULCI, ULR) and (ULR, SULCI) is valid, the corresponding elementary semantic unit (PULCI, ULR, SULCI) is eliminated. If the pair (ULR, SULCI) is valid and the pair (PULCI, ULR) is invalid, process 66:
- generates and validates a triplet (DDULR, ULR, SULCI), where DDULR is the starting domain of the role R; and
- determines whether there is a role R1 of the knowledge base such that the triplet (PULCI, R1, DDULR) is valid, and validates such a triplet if the role R1 exists.
If the pair (PULCI, ULR) is valid and the pair (ULR, SULCI) is invalid, the process:
- generates and validates a triplet (PULCI, ULR, DAULR), where DAULR is the arrival domain of the role R; and
- determines whether there is a role R2 of the knowledge base such that the triplet (DAULR, R2, SULCI) is valid, and validates such a triplet if the role R2 exists.
When the elementary semantic unit is a couple, for example (PULCI, SULCI), where PULCI and SULCI respectively designate the first and the second lexical unit labeled by a concept or an instance, process 66 validates the couple if there exists a role R of the knowledge base such that the triplet (PULCI, R, SULCI) is valid. More specifically:
- if PULCI and SULCI are labeled by concepts C and C' respectively, the process 66 scans the role database a first time and selects, if it exists, a role R whose starting domain is the concept C and whose arrival domain is the concept C'. Process 66 then replaces the couple (PULCI, SULCI) with the triplet (PULCI, R, SULCI) and validates it. If such a role R does not exist, the process 66 scans the role database a second time and selects, if it exists, a role R' whose starting domain subsumes the concept C and whose arrival domain subsumes the concept C'. Process 66 then replaces the couple (PULCI, SULCI) with the triplet (PULCI, R', SULCI) and validates it. Finally, if such a role R' does not exist, the couple (PULCI, SULCI) is eliminated.
- if PULCI or SULCI is labeled by an instance I, the method repeats the process described above, considering, instead of the lexical unit of label I, the most specific concept C of which I is an instance. If a role R as described above exists, the method replaces the couple (PULCI, SULCI) with the triplet (PULCI, R, SULCI) and validates this triplet. If such a role R does not exist and a role R' as described above exists, the method replaces the couple (PULCI, SULCI) with the triplet (PULCI, R', SULCI) and validates this triplet. It eliminates the couple (PULCI, SULCI) if such a role R' does not exist.
The formation of triplets not explicitly contained in the natural language request constitutes a type of extraction of semantic meaning from the request and makes it possible, through the introduction of this semantic indeterminism, to identify the semantic shortcuts taken by the user.
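The validation rules of step 66 described above can be sketched as follows. The ontology, the role table and the function names are hypothetical, and for brevity the sketch takes concepts as input (an instance would first be replaced by a concept of which it is an instance). It is a simplified illustration of the domain checks, not the patented implementation.

```python
# Hypothetical toy ontology: concept -> direct parent concept.
PARENT = {"Tennis_Player": "Person", "Final": "Match", "City": "Place"}

# Hypothetical role table: role -> (starting domain, arrival domain).
ROLES = {
    "a_Joué_contre": ("Person", "Person"),
    "a_Lieu_à": ("Match", "Place"),
}


def is_subsumed_by(concept, ancestor):
    """Return True if `concept` is `ancestor` or one of its descendants."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False


def find_role(first, second):
    """Resolve a couple: exact domains first, then subsuming domains."""
    for role, (start, arrival) in ROLES.items():
        if start == first and arrival == second:
            return role
    for role, (start, arrival) in ROLES.items():
        if is_subsumed_by(first, start) and is_subsumed_by(second, arrival):
            return role
    return None  # the couple is eliminated


def validate_triplet(first, role, second):
    """Return the list of validated triplets derived from one triplet."""
    start, arrival = ROLES[role]
    left_ok = is_subsumed_by(first, start)      # pair (PULCI, ULR)
    right_ok = is_subsumed_by(second, arrival)  # pair (ULR, SULCI)
    if left_ok and right_ok:
        return [(first, role, second)]
    if not left_ok and not right_ok:
        return []  # neither pair is valid: the unit is eliminated
    if right_ok:
        # Substitute the starting domain, then look for a linking role R1.
        result = [(start, role, second)]
        linking = find_role(first, start)
        if linking is not None:
            result.append((first, linking, start))
        return result
    # Symmetric case: substitute the arrival domain, look for a role R2.
    result = [(first, role, arrival)]
    linking = find_role(arrival, second)
    if linking is not None:
        result.append((arrival, linking, second))
    return result
```

With this toy data, `find_role("Final", "City")` recovers the implicit role "a_Lieu_à" of the example request, because "Final" is subsumed by "Match" and "City" by "Place".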
An additional embodiment of the method according to the invention also consists in constructing a sequence of valid triplets from an invalid triplet. For example, considering the generated triplet (PULCI, ULR, DAULR) described above, a triplet (DAULR, R', C) is generated and validated, where R' denotes a role of the ontology base whose starting domain is DAULR and whose arrival domain is C. It is then possible to repeat the process for C. Preferably, this iteration is limited to two successive stages of triplet generation. In a similar way, a sequence symmetric to that described above is produced for the generated triplet (DDULR, ULR, SULCI) described above, by generating and validating a new triplet (C, R', DDULR) from the starting domain DDULR of the role R. In another embodiment, the identification of the interrogation targets, in particular for the interrogation syntactic structures with enumerative and relational responses, is carried out at the same time as the validation of the triplets. The target constraints associated with the syntactic structure with relational response are selected from among the validated triplets which contain a concept associated with the constructor keyword or a concept subsumed by this concept. In this embodiment, the syntactic and semantic analyses are performed simultaneously, assuming the existence of at least one semantic relationship implicitly contained in the natural language request. If no elementary semantic unit of a labeled request has been validated, process 66 rejects this labeled request because it does not comply with the knowledge base. If no labeled request has a validated elementary semantic unit, the process 66 rejects the request formulated by the user because it does not comply with the knowledge base.
The next step 68 of the semantic analysis process is a step of validation, for each labeled request, of the target interrogation constraints as a function of the validated elementary semantic units and/or of the knowledge base constraints, in order to obtain a set of validated interrogation targets. The process 68 validates the target constraints according to their type:
- if a target constraint is associated with an interrogation syntactic structure with binary response, it is automatically validated because it relates to the existence of triplets;
- if a target constraint is associated with an interrogation syntactic structure with enumerative response, it is validated if the concept that it brings into play is present in the validated semantic units; otherwise it is eliminated; and
- if a target constraint is associated with an interrogation syntactic structure with relational response, it is validated if the relation which it brings into play is valid and its elements are present in the validated semantic units; otherwise the target constraint is eliminated.
If no target constraint has been validated for a labeled request, process 68 rejects this labeled request because it does not comply with the knowledge base. If no labeled request has a validated target constraint, the process 68 rejects the request formulated by the user because it does not comply with the knowledge base. When the elementary semantic units and the target constraints have been validated, the method which is the subject of the invention then proceeds to extract the data carrying the information sought from the knowledge base 10. In a step 70, the process extracts, for each labeled request, the instances conforming to the validated elementary semantic units and forms a list of extracted instances, which is initially empty.
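The three validation rules of step 68 described above can be sketched as follows. The dictionary layout of a constraint, the set of known roles and the function name are assumptions made for illustration. In particular, "the relation is valid" is approximated here by membership of the role in a set of known roles, which is a simplification of a real ontology check.

```python
# Hypothetical set of roles known to the ontology base.
KNOWN_ROLES = {"a_Joué_contre", "a_Lieu_à"}


def validate_target(constraint, validated_units):
    """Validate one target constraint against the validated units.

    `constraint` is a dict with a "kind" key (assumed layout);
    `validated_units` is a list of (first, role, second) label triplets.
    """
    labels = {label for unit in validated_units for label in unit}
    kind = constraint["kind"]
    if kind == "binary":
        # Relates only to the existence of triplets: always validated.
        return True
    if kind == "enumerative":
        # The target concept must appear in the validated units.
        return constraint["concept"] in labels
    if kind == "relational":
        first, role, target = constraint["triplet"]
        return role in KNOWN_ROLES and first in labels and target in labels
    raise ValueError(f"unknown constraint kind: {kind}")


units = [("Person", "a_Joué_contre", "Agassi")]
checks = [
    validate_target({"kind": "binary"}, units),
    validate_target({"kind": "enumerative", "concept": "Person"}, units),
    validate_target({"kind": "enumerative", "concept": "Match"}, units),
]
```

A labeled request for which every constraint returns False would be rejected, as stated in the text.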
More particularly, designating the labels of the validated elementary semantic units by C and C' for two concepts, R for a role, and I and I' for two instances, the process 70 extracts the conforming instances by successively considering:
- the validated elementary semantic units of labels (I, R, I') of the type {instance, role, instance}: the instances I and I' are added to the end of the list of extracted instances;
- the validated elementary semantic units of labels (I, R, C) of the type {instance, role, concept}: the instances of C are added to the end of the list of extracted instances;
- the validated elementary semantic units of labels (C, R, I) of the type {concept, role, instance}: the instances of C are added to the end of the list of extracted instances; and
- the validated elementary semantic units of labels (C, R, C') of the type {concept, role, concept}: the instances of C and C' are added to the end of the list of extracted instances, any instance common to C and C' being added only once.
As can be seen, an instance can appear several times in the list of extracted instances. The following step 72 of the method performs, for each labeled request, a first filtering and generates a list of validated extracted instances. Any instance I of a concept C which does not appear in the list of extracted instances at least as many times as the concept C appears in the validated elementary semantic units is considered incorrect and is eliminated; otherwise it is added to the list of validated extracted instances. The process then extracts, in a step 74, the response to the request formulated by the user. Process 74 returns as an answer the instances of the list of validated instances which satisfy the target constraints. If the validated triplets did not make it possible to extract any instance, or if the target constraints do not return any instance, this means that the response to the request made by the user is not present in the knowledge base.
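Steps 70 and 72 described above can be sketched as follows. The instance base, the fact base and all names other than "Agassi" and "a_Joué_contre" are hypothetical. The sketch also assumes that "conforming instances" of a concept are those actually linked by the role in the instance base, which is one possible reading of the text, and it applies the filtering rule in the form stated in claim 15 (an instance is eliminated when extracted fewer times than its concept appears).

```python
from collections import Counter

# Hypothetical instance base and relations between instances.
INSTANCES = {"Tennis_Player": ["Player_A", "Player_B", "Agassi"]}
CONCEPT_OF = {name: "Tennis_Player" for name in INSTANCES["Tennis_Player"]}
FACTS = {
    ("Player_A", "a_Gagné", "Tournament_X"),
    ("Player_A", "a_Joué_contre", "Agassi"),
    ("Player_B", "a_Joué_contre", "Agassi"),
}


def extract(units):
    """Step 70: build the list of extracted instances (with repeats).

    Each unit is ((kind, label), role, (kind, label)), with kind being
    "concept" or "instance" (an assumed representation).
    """
    extracted = []
    for (kind1, a), role, (kind2, b) in units:
        if kind1 == "instance" and kind2 == "instance":
            extracted += [a, b]
        elif kind1 == "instance":  # {instance, role, concept}
            extracted += [y for y in INSTANCES[b] if (a, role, y) in FACTS]
        elif kind2 == "instance":  # {concept, role, instance}
            extracted += [x for x in INSTANCES[a] if (x, role, b) in FACTS]
        else:  # {concept, role, concept}: common instances added once
            found = []
            for x in INSTANCES[a]:
                for y in INSTANCES[b]:
                    if (x, role, y) in FACTS:
                        for v in (x, y):
                            if v not in found:
                                found.append(v)
            extracted += found
    return extracted


def filter_extracted(extracted, units):
    """Step 72: keep instances extracted at least as often as their
    concept appears in the validated units."""
    counts = Counter(extracted)
    concept_uses = Counter(label for first, _, second in units
                           for kind, label in (first, second)
                           if kind == "concept")
    return [inst for inst in dict.fromkeys(extracted)
            if counts[inst] >= concept_uses[CONCEPT_OF.get(inst)]]


units = [
    (("concept", "Tennis_Player"), "a_Gagné", ("instance", "Tournament_X")),
    (("concept", "Tennis_Player"), "a_Joué_contre", ("instance", "Agassi")),
]
extracted = extract(units)
validated = filter_extracted(extracted, units)
```

Here "Player_B" satisfies only one of the two units in which the concept "Tennis_Player" appears, so it is extracted once and eliminated, while "Player_A" is extracted twice and kept.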
Indeed, since the target constraints and the elementary semantic units have been validated, the request formulated by the user has a meaning in the knowledge base. Such a situation may, for example, correspond to an erroneous request: a natural language question asking whether a female tennis player has won the men's tournament is lexically and syntactically correct, but cannot have a semantically conforming answer unless a competition open to all gender categories combined exists. Finally, in a step 76, the method orders the instances of the list of validated instances. Typically, the process sorts the instances chronologically or alphabetically. In another embodiment, the method returns as a response a predetermined number of validated instances, for example the ten most recent. In another embodiment, the method returns the number of validated instances. We have thus described a method and a system for extracting data carrying information, based on the creation of triplets or couples of lexical units from a query. It is also possible to create semantic units of higher dimension, to take into account for example semantic relations involving more than three elements. The associated steps of the process are then deduced simply from those described above.
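The alternative forms of final response of step 76 described above (sorted list, limited selection, count) can be sketched as follows. The function name, the mode names and the (name, year) representation of a validated instance are assumptions made for illustration only.

```python
def final_response(validated, mode="chronological", limit=None):
    """Build the final response from validated instances.

    `validated` is a list of (instance_name, year) pairs, an assumed
    representation; `mode` selects one of the embodiments.
    """
    if mode == "count":
        return len(validated)  # number of validated instances
    if mode == "alphabetical":
        ordered = sorted(validated, key=lambda item: item[0])
    elif mode == "most_recent":
        ordered = sorted(validated, key=lambda item: item[1], reverse=True)
    else:  # chronological, oldest first
        ordered = sorted(validated, key=lambda item: item[1])
    names = [name for name, _ in ordered]
    return names if limit is None else names[:limit]


# Hypothetical validated instances with an associated year.
matches = [("Match_B", 1992), ("Match_A", 1995), ("Match_C", 1990)]
latest_two = final_response(matches, mode="most_recent", limit=2)
```

Calling `final_response(matches, mode="most_recent", limit=10)` would correspond to the "ten most recent" embodiment mentioned in the text.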

Claims

1. Knowledge base (8) relating to a predetermined domain of knowledge, this knowledge base comprising at least one ontology base (12) made up of formalized concepts and roles subject to a set of semantic constraints formulated in accordance with a predetermined description logic and a base of instances (14) relating to concepts, characterized in that it further comprises a knowledge base (11) relating to the field of knowledge comprising at least one base of lexical units (90) of the interrogation constructor type consisting of keywords representative of types of questions and of syntactic patterns, the keywords representative of types of question being associated with a predetermined set of classes of syntactic structures of interrogation and with a predetermined set of concepts and roles interrogation objects.

2. Knowledge base according to claim 1, characterized in that the concepts and roles interrogation objects are concepts and roles of the knowledge base relating to the knowledge domain.

3. Knowledge base according to claim 1, characterized in that the concepts and the roles objects of interrogation are universal concepts and roles subsuming a predetermined set of concepts and roles of the knowledge base relating to the domain.

4. Knowledge base according to any one of the preceding claims, characterized in that the lexical unit base (30) further comprises a predetermined set of lexical units of the syntax marker type, the lexical units of the syntax marker type being associated with a predetermined set of universal concepts and roles subsuming concepts and roles of the knowledge base relating to the domain of knowledge.

5. Knowledge base according to any one of the preceding claims, characterized in that the lexical unit base (30) further comprises a predetermined set of lexical units of the syntax marker type, the lexical units of the syntax marker type being associated with concepts and roles of the knowledge base relating to the knowledge domain.

6. Method for extracting data from a knowledge base relating to a domain, interrogated by a query in natural language, the knowledge base comprising at least one ontology base made up of formalized concepts and roles subjected to a set of semantic constraints formulated in accordance with a predetermined description logic, a base of instances relating to concepts, and a base of keywords relating to the domain and representative of types of questions among a set of types of questions and/or syntactic structures, characterized in that it comprises at least the steps of:
- lexical analysis (52, 54, 56) of the request in natural language, consisting in identifying the significant lexical units of the request and in labeling each of the lexical units by at least a concept, a role, an instance or a keyword of the knowledge base in order to generate at least one labeled request made up of labeled lexical units;
- syntactic analysis (58, 60, 62, 64) of each of the at least one labeled request, comprising the steps of:
  - creation (60) of elementary semantic units made up of at least two labeled lexical units, each of these lexical units being labeled by a concept, a role or an instance, the concepts, roles and instances associated with each of these elementary semantic units together verifying an n-tuplet configuration from a predetermined set of n-tuplet configurations; and
  - identification (62) of target constraints, consisting in identifying at least one syntactic relation between the lexical units labeled by a concept, a role or an instance and the lexical units labeled by a keyword representative of a type of question, in order to determine at least one target interrogation constraint verifying a question among the set of question types;
- semantic analysis (66, 68) of each labeled request, comprising at least the steps of:
  - validation (66) of each of the elementary semantic units according to the constraints of the knowledge base, in order to obtain a set of validated elementary semantic units;
  - validation (68) of the target interrogation constraints as a function of the associated validated elementary semantic units and/or of the knowledge base constraints, in order to obtain a set of validated interrogation targets; and
- data extraction (70, 72, 74, 76) consisting in extracting from the knowledge base the instances of the base of instances verifying the validated elementary semantic units by means of the validated interrogation target constraints.

7. Method according to claim 6, characterized in that the step of syntactic analysis and the step of semantic analysis are carried out simultaneously on the basis of the existence of a semantic relationship implicitly contained in the request in natural language.

8. Method according to claim 6, characterized in that the predetermined set of n-tuplet configurations is a predetermined set of triplet configurations.

9. Method according to claim 8, characterized in that the step (60) of creation of elementary semantic units consists in creating a set of syntactically valid elementary semantic units of two or three distinct labeled lexical units, the semantic units with two labeled lexical units consisting of two distinct lexical units each labeled by a concept or an instance, and the semantic units with three labeled lexical units consisting of two distinct lexical units each labeled by a concept or an instance and a lexical unit labeled by a role, and in that each elementary semantic unit created verifies any one of the triplet configurations among the set of triplet configurations {concept, role, concept}, {concept, role, instance}, {instance, role, concept}, {instance, role, instance}.

10. Method according to claim 9, characterized in that the step (66) of validation of each of the elementary semantic units of the step of semantic analysis consists:
- in validating an elementary semantic unit (PULCI, ULR, SULCI) with three labeled lexical units, where PULCI and SULCI respectively designate the first and second lexical units labeled by a concept or an instance of the elementary semantic unit, and ULR designates the lexical unit labeled by a role of the elementary semantic unit, if the first pair of lexical units (PULCI, ULR) and the second pair of lexical units (ULR, SULCI) of the elementary semantic unit each verify a constraint of the knowledge base;
- in validating a reduced elementary semantic unit (PULCI, SULCI) with two labeled lexical units, where PULCI and SULCI respectively designate the first and second lexical units labeled by a concept or an instance of the elementary semantic unit, if there exists at least one role R of the knowledge base such that the pairs (PULCI, R) and (R, SULCI) each verify a constraint of the knowledge base; and
- in replacing the reduced elementary semantic unit, if it is validated, by a reconstructed elementary semantic unit with three units (PULCI, R_min, SULCI), where R_min denotes a minimal role of the knowledge base for the reduced elementary semantic unit (PULCI, SULCI).

11. Method according to claim 10, characterized in that the step (66) of validation of each of the elementary semantic units also consists:
- when only the first pair (PULCI, ULR) of the elementary semantic unit with three labeled lexical units verifies no constraint of the knowledge base:
  - in determining and validating a reconstructed elementary semantic unit with three units (DDULR, ULR, SULCI) formed by the starting domain DDULR of the lexical unit labeled by a role, the lexical unit ULR labeled by a role and the second lexical unit SULCI labeled by a concept or an instance of the elementary semantic unit, and
  - in determining and validating, if it exists, an elementary semantic unit (PULCI, R1, DDULR), where R1 designates a role of the knowledge base such that the pairs (PULCI, R1) and (R1, DDULR) each verify a constraint of the knowledge base; and
- when only the second pair (ULR, SULCI) of the elementary semantic unit with three labeled lexical units verifies no constraint of the knowledge base:
  - in determining and validating an elementary semantic unit (PULCI, ULR, DAULR) formed by the arrival domain DAULR of the lexical unit labeled by a role, the lexical unit ULR labeled by a role and the first lexical unit PULCI labeled by a concept or an instance of the elementary semantic unit, and
  - in determining and validating, if it exists, an elementary semantic unit (DAULR, R2, SULCI), where R2 designates a role of the knowledge base such that the pairs (DAULR, R2) and (R2, SULCI) each verify a constraint of the knowledge base.

12. Method according to claim 11, characterized in that the validation step (66) of each of the elementary semantic units also consists in constructing a sequence of at least one valid triplet from the triplet (PULCI, ULR, DAULR) and a sequence of at least one valid triplet from the triplet (DDULR, ULR, SULCI).

13. Method according to claim 6, characterized in that the step (62) of identification of target constraints of the syntactic analysis step comprises the steps of: - identification of an interrogation syntactic structure of the labeled request from among a predetermined set of interrogation syntactic structures; and - identification of at least one target logical interrogation constraint to which the identified interrogation syntactic structure is subject, as a function of the lexical units labeled by a keyword representative of the type of question.

14. Method according to claim 13, characterized in that the step (68) of validation of the target constraints of the semantic analysis step consists in validating a target interrogation constraint when it exists in the knowledge base and when the concepts and/or instances it brings into play are present in validated elementary semantic units, a validated target constraint then defining a constraint that must be verified by any valid response to the labeled request.

15. Method according to any one of the preceding claims, characterized in that the extraction step (70, 72, 74, 76) further consists in eliminating the instances extracted a number of times less than the number of times that their associated concept is present in the associated validated semantic units, and in selecting, from among the instances not eliminated, the instances verifying at least one of the target interrogation constraints associated with the validated elementary semantic units from which they are extracted.

16. Method according to any one of the preceding claims, characterized in that the data extraction step (70, 72, 74, 76) further consists in returning, as a final response to the request in natural language, the result of a predetermined count and/or sort and/or a predetermined selection of specific instances.

17. System (40) for extracting data from a knowledge base relating to a domain interrogated by a query in natural language, the knowledge base comprising at least one ontology base made up of formalized concepts and roles subject to a set of constraints, a base of instances relating to concepts, and a base of keywords relating to the domain and representative of types of questions among a set of types of questions and / or syntactic structures, characterized in that it is suitable for implementing the method according to any one of claims 6 to 16.