WO1998025217A1 - Method and apparatus for natural language querying and semantic searching of an information database - Google Patents
- Publication number
- WO1998025217A1 (PCT/US1997/022943)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sentence
- thematic
- question
- verb
- candidate
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2452—Query translation
- G06F16/24522—Translation of natural language queries to structured queries
Definitions
- This invention relates generally to the computerized searching of information databases and, more specifically, to searching the World Wide Web for answers to a natural language question.
- information is contained in this multitude of electronic documents, how is the user to find the information that is desired?
- the simplest keyword-based method is keyword indexing.
- the method begins by representing a document as a collection of words or strings of symbols rather than as an ordered sequence of meaningful propositions. Sophisticated techniques are then employed to match a query for information, represented as a set of strings of letters, to documents, again represented as sets of strings of letters.
- words in a document collection are indexed to the documents that contain them.
- users present a system with some collection of keywords and are returned references or pointers to documents that contain those keywords.
- More sophisticated variants of this technique allow users to specify further constraints on relevant documents beyond containing at least one of the keywords listed. For example, the user may specify keywords that must not appear in returned documents, the proximity of keywords within a document, the presence of multi-keyword phrases, and other Boolean conditions on the words that are contained in a relevant document.
- Boolean keyword searching techniques allow the user to make sophisticated constraints on documents that are to be considered relevant to the query. These techniques increase the precision of the returned document set by allowing the user to request a more-focused set of documents from the information retrieval system.
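The keyword indexing and Boolean constraints described above can be sketched in a few lines of Python. This is an illustration only: the function names `build_index` and `boolean_search` and the sample documents are assumptions, not part of the described system.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word.strip('.,;:!?"')].add(doc_id)
    return index

def boolean_search(index, must, must_not=()):
    """Return ids of documents with every `must` word and no `must_not` word."""
    if not must:
        return set()
    hits = set.intersection(*(index.get(w, set()) for w in must))
    for w in must_not:
        hits -= index.get(w, set())
    return hits

# Hypothetical three-document collection.
docs = {
    1: "Pluto was discovered by Clyde Tombaugh.",
    2: "Pluto is a cartoon dog.",
    3: "Tombaugh worked at Lowell Observatory.",
}
index = build_index(docs)
print(boolean_search(index, ["pluto"], must_not=["dog"]))
```

Excluding a keyword narrows the result set, which is the precision gain the passage describes; the index still knows nothing about word order or meaning.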
- a different class of keyword-based techniques automatically expands the user's query for documents to include equivalent variants of the user's query in order to increase the number of documents returned. This results in an increase in the information retrieval system's recall.
- Examples of these techniques include "stemming" query terms to a root form so that all of the morphological variants of a keyword are matched in a document. For example, stemming "computation" and "computer" to the same root will return documents containing either term when queried with one of them.
- Another technique involves adding synonyms of query terms to the query so that, for example, a query on "3M" automatically returns documents that contain the string
- Keyword-based techniques represent documents only as collections of words rather than as meaningful expressions arranged into text for some communicative purpose.
- a sentence is more than a set of words; the structure of the sentence does most of the work in determining the meaning of the sentence.
- Both of the previous classes of techniques fundamentally represent documents as collections of alphanumeric characters, i.e. combinations of letters, and use a combination of the user's input and the system's design to return relevant documents on the basis of the words they contain. Unless precise information about word order is specified, they will, therefore, fail to distinguish between "the man bit the dog" and "the man was bitten by the dog.”
- An alternative class of information retrieval techniques addresses this issue by representing the meaning of a document rather than merely the words it contains.
- These techniques involve marking up stored text (or even non-textual data) with a representation of the meaning or content of the document in some formalism or other.
- an implementation of such techniques would be to provide a set of photographs with a set of keywords representing what the photographs depict. This would have to be done manually.
- an implementation of these techniques might involve marking up documents pertaining to financial transactions, say, with some representation of who is buying, who is selling, what is sold, and for how much. Documents could then be retrieved on the basis of their marked up annotations alone (e.g., "What did Company X buy?") or by means of their annotations as well as by keyword indexing.
- the content markup approach has the advantage of pointing the user towards documents, or sections of documents, on the basis of the semantic or propositional content of the document or its sections, rather than on the basis of the words that the document contains. This is certainly an improvement over the keyword approach.
- On-line information retrieval relies on a word-level representation of libraries of text to locate the information users want. There is no equivalent of a card catalog or book abstract available for on-line documents.
- Content markup approaches can index a collection of information on the basis of the propositions that the text expresses or that characterize the text rather than the words that the text contains.
- specific sorts of "metadata" (data about data)
- semantic querying accepts a question submitted to the information retrieval system in a natural language format.
- a semantic analysis of the question is then used to translate the question into a specialized language or otherwise reformat it to aid in information retrieval.
- Some applications of this type may translate queries submitted in English into Boolean logic, or into specialized database query languages such as SQL.
- Others parse the question into a semantic representation that can be matched against marked up content. Others analyze the question and produce a set of synonymous or near-synonymous queries, hoping to increase the recall of the system.
- the invention is a software application designed to find answers to user-submitted queries posed in English from on-line documents.
- the invention consists of three major components (application programs) that will be described in detail below: a user interface, a parser, and a sentence evaluator that determines the extent to which a given sentence answers a submitted question.
- the user begins the process of retrieving information by submitting a natural language question (e.g., in English) to the user interface. For example, the user might ask, "When was Pluto discovered?" The submitted question should be a direct request for the information, rather than a request to find the information. That is, although users tend to anthropomorphize systems, they should not make an indirect request for the information, asking, for example, "Can you find me the date of Pluto's discovery?" The question is then parsed into a form that will be useful for comparison with the returned answers. In addition, a minimum score that returned answers must meet or exceed is determined.
- the user interface accesses a database to identify a set of relevant documents to process for answers.
- This candidate set of documents might consist of the entire collection of a user's email messages, for example.
- the submitted question can be mapped to a keyword index query in order to narrow the range of candidate documents that may contain an answer to the submitted question.
- the invention processes each document, sentence by sentence, looking for answers to the submitted question. For each sentence, a judgment is made whether or not to parse the sentence into its thematic representation. This decision is based on the presence or absence of keywords from the query in the sentence. Sentences with no keywords from the query are discarded, and those with keywords are kept for further processing.
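The keyword pre-filter described above might look like the following minimal sketch. The stopword list and the names `keywords` and `worth_parsing` are assumptions made for illustration; the source only says that sentences sharing no keyword with the query are discarded before parsing.

```python
import re

# Assumed list of words too common to count as query keywords.
STOPWORDS = {"a", "an", "the", "was", "is", "did", "does",
             "when", "who", "what", "where", "why", "how", "which"}

def keywords(question):
    """Extract the content words of a question as a set of lowercase strings."""
    words = re.findall(r"[a-z0-9']+", question.lower())
    return {w for w in words if w not in STOPWORDS}

def worth_parsing(sentence, query_keywords):
    """Keep a sentence for full parsing only if it shares a keyword with the query."""
    words = set(re.findall(r"[a-z0-9']+", sentence.lower()))
    return bool(words & query_keywords)

kw = keywords("When was Pluto discovered?")
print(worth_parsing("Pluto was discovered in 1930.", kw))
print(worth_parsing("The weather was fine.", kw))
```

Filtering first keeps the comparatively expensive thematic parse off sentences that cannot match the question literally.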
- a parse (a semantic as well as a syntactic representation) of the sentence is created.
- Each sentence is taken to represent an event or state, with each phrase within the sentence representing some role in that event or state.
- one phrase may represent the agents of an event, another the theme of the event (that which is acted upon), another the instrument, and so on.
- the result is a structure consisting of an event of a specified type, plus a series of relationships specifying exactly how the participants in the event participate in the event. For example, "Brutus stabbed Caesar" would be represented as expressing the existence of a stabbing event, with Brutus as the agent of the event, and Caesar as the "theme" (undergoer) of the event.
- Answer Me! relies on detailed knowledge of the syntax and semantics of verbs (events), in contrast to other Artificial Intelligence systems that have been based on a detailed representation of the relationships of nouns (objects), indicating, for example, that a foot is a part of a leg, and so on.
- the class of English verbs is much smaller than the class of nouns, and they have a smaller range of meanings. Therefore, a detailed representation of the nature of objects requires much more storage space and is much more complex than a detailed representation of the nature of events.
- the similarity of the semantic representation of the candidate sentence to the similarly parsed question is evaluated. To what extent do they represent similar events? If there is a close enough match, the sentence is returned as an answer to the submitted question.
- the above steps are repeated for all sentences in the document, and all documents in the set of candidate documents.
- a metric reflecting the scoring of each document is presented to the user and can be used to order the answers.
- the metric is derived with the aid of a data structure that represents the relationship of various events on the basis of the verbs that express them.
- the data structure divides thousands of English verbs into various semantic classes and subclasses, representing relationships of synonymy and near synonymy between verbs.
- a metric of the closeness of meaning between any two verbs can be determined on the basis of their relative relationship within the data structure.
- a hypertext link to the documents from which the answers came can also be provided.
- the invention contains a knowledge base of event types, and so, can recognize that a hunting event, for example, is semantically close to a seeking event, while recognizing the distinct syntactic characteristics of the verbs "hunt” and "seek.” Lastly, and most importantly, by rapidly processing all of the sentences of a document semantically -- not just the submitted question -- the invention radically speeds up the process of finding answers to specific questions with a high degree of both precision and recall.
- Figure 1 is a high level block diagram of a computer adapted to perform the method of the invention.
- Figures 2A - 2B show a flow diagram of the major functions that combine to enable the method of the invention.
- Figures 3A - 3C show a flow diagram indicating the functioning of the parser that lies at the heart of the invention.
- Figures 4A - 4D show a representative response, to a user question, in the form of an automatically generated HTML page.
- Answer Me! locates text relevant to a user-specified question posed in a natural language format; analyzes that text for sentences that may provide answers to the question; identifies those sentences that are deemed to answer the question; and evaluates the degree to which those sentences actually constitute answers to the question.
- FIG. 1 shows a block diagram of a computer system adapted to perform the method of the invention.
- Via a bus 105, the central processing unit ("CPU") 110 is connected to a random access memory 115, a user interface terminal 120, and a local storage 125.
- application programs (conceptually, a user interface, parser and evaluator, described more fully below) are downloaded from local storage 125 to RAM 115 for execution by CPU 110.
- Local storage 125 may be a magnetic, optical, magneto-optical, electronic, or any other device capable of storing the necessary application programs.
- the term “electronically accessible” shall be understood to encompass all such storage devices.
- the local storage 125 also stores the documents which are to be searched for the answer to the user's question.
- documents could actually originate from a remote site, and the search resources used may be remote search resources as well.
- local storage 125 need not be a traditional mass storage device, but need only be a memory device of sufficient capacity to receive and buffer the desired information.
- the local storage 125 or the remote site could serve as a document depository.
- the documents to be searched reside on the World Wide Web ("WWW") and may be accessed via an Internet browser (e.g., Netscape); currently, a stand-alone user interface allows the user to input questions and view the progress of the processing, but user interaction may be accomplished in a variety of ways.
- distributed computing technologies (e.g., Sun's Java applets, NeXT's WebObjects or Eolas' Weblets)
- the great operational flexibility afforded by a networked computing environment allows the system to be deployed for client, server, or hybrid operation depending on the local or remote provisioning of the application programs and/or documents.
- Answer Me! is implemented as follows:
- Hardware: 486 or higher-speed Intel processor.
- Operating system: Windows 95 or Windows NT Workstation or Server v.3.5.1.
- Search resources: Digital Equipment Corporation's Alta Vista search engine and InfoSeek's FAQ search service.
- Answer Me! was written in Visual C++ and compiled using Microsoft's Visual C++ Developer Studio v. 4.2.
- the above-mentioned application programs are implemented in a binary executable file (answerer.exe, ~1.5 MB) and five associated data files. These include a part of speech data file (lexicon.dat, ~250 KB), a verb classes index (evca93.txt, ~75 KB), a thematic grid index (theta.txt, ~15 KB), and two initialization files (ansrme.ini, 1 KB, and wlres.ini, 2 KB).
- the method of the invention will now be described with respect to the above example of a WWW searcher.
- the invention is usable for electronic document searching generally.
- the user inputs the question and, at step 210, the question is parsed into a form useful for comparison with the returned answers.
- This form, which is known to those skilled in the art as neo-Davidsonian logical form, is also used during parsing of the returned answers, and will be described in detail in the following section entitled "Sentence Parsing."
- a minimum score requirement is calculated for the question, which any answer returned to the user must exceed.
- an appropriate search-engine query (e.g., the traditional keyword-based, content markup or query analysis technique described in the BACKGROUND OF THE INVENTION) is constructed on the basis of the question.
- the search engine query is submitted to the search engine or engines (e.g., WWW sites such as Alta Vista or InfoSeek) with which the invention has been enabled to communicate. These search engines return a set of pointers, to text documents, consisting of the documents' Uniform Resource Locators ("URLs"), i.e., their addresses on the WWW.
- each search result (a document at its returned URL) is downloaded to local storage 125 for linguistic processing.
- the document is tokenized to yield a collection of sentences.
- a document in a computer-readable format e.g., a web page in HTML
- the tokenizer includes a routine for detecting abbreviations so that sentences are not ended prematurely.
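A minimal sketch of a sentence tokenizer with abbreviation detection follows. The abbreviation list and the name `split_sentences` are assumptions; the source does not say how its routine recognizes abbreviations.

```python
# Assumed abbreviation list; a real system would need a much larger one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc.", "inc.", "u.s."}

def split_sentences(text):
    """Split text at sentence-final punctuation, except after known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends = token.endswith((".", "?", "!"))
        if ends and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences

for s in split_sentences("Dr. Tombaugh discovered Pluto in 1930. He worked at Lowell Observatory."):
    print(s)
```

Without the abbreviation check, "Dr." would end the first sentence after a single token.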
- each sentence is parsed as will be described in more detail with respect to Figures 3A-3D (collectively referred to as "Figure 3") below.
- the parsing of each sentence of Answer Me! comprises the following three steps:
- Steps (a) and (b) may be thought of as grammatical (or syntactical) steps, while step (c) may be thought of as a semantic operation. Together, the three steps constitute what shall be referred to herein as semantic analysis, the term "semantic" as a matter of convenience being used throughout this document to include both semantic and syntactic operations.
- the outcome of the parser is a mapping of each sentence (whatever its mood: indicative, interrogative, or imperative) into a representation of its logical or semantic form.
- the semantic formalism used is a member of the family known to those skilled in the art as neo-Davidsonian logical form.
- Neo-Davidsonian analyses of the logical form of sentences go one step further than Davidsonian analyses in treating arguments and adjuncts equally as conjuncts existentially bound to the same event variable.
- Arguments are analyzed as bearing a particular thematic (or "theta") role within an event.
- the argument is sometimes said to bear a thematic relation to an event.
- the sentence "John buttered the toast" would be assigned the logical form:
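Under the usual neo-Davidsonian conventions (an existentially quantified event variable, the verb as a one-place predicate of events, and each argument attached by a thematic role), a form of this kind can be written as follows. The exact predicate and role spellings are assumptions, and tense is left out:

```latex
\exists e \, \bigl[ \mathrm{buttering}(e) \wedge \mathrm{Agent}(e, \mathrm{John}) \wedge \mathrm{Theme}(e, \mathrm{the\ toast}) \bigr]
```

Each conjunct can be dropped independently, which is why such a form entails, for instance, that there was a buttering and that John was the agent of something.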
- Predicate (pred): John is a barber; John appointed Mary president
- Agent (ag): John gave the book to Mary
- Source: John took the book from Mary
- Goal (goal): John gave the book to Mary
- Path (path): The planet circles around the sun
- Manner (manner): The bread cuts easily; John cooked the spaghetti by boiling it
- Purpose/Reason: John skipped school to see the game.
- Measure (measure): It snowed a foot, John weighed
- Possessor (poss): John has the flu
- Possessed (posd): John has the flu
- Thematic roles are to be distinguished from grammatical roles such as subject and direct object, which refer to a phrase's position, rather than the way in which its referent participates in an event. Grammatical roles distinguish the syntactic position of a phrase in a sentence in relation to the sentence's verb or some other phrase-heading element. Two phrases may have different grammatical roles, but bear the same thematic relation as in the sentences:
- sentence (3) logically entails sentence (4) , and vice versa.
- Davidsonian analyses cannot account for these entailments directly: in order to explain an inference from (3) to (4) or from (4) to (3), additional inference rules would be necessary in a Davidsonian theory. This is a clear advantage for neo-Davidsonian accounts since most verbs are like "give" in that they allow their arguments to embody a variety of grammatical and thematic roles.
- the essential features of a neo-Davidsonian account of logical form are (i) quantification over implicit event variables, (ii) thematic role analysis of the semantics of argument positions, and (iii) the treatment of verbs as one- place predicates of events.
- the invention thus embodies a neo-Davidsonian analysis of sentences.
- the analysis differs from discussions of neo-Davidsonian logical form in the literature only in how these analyses are encoded.
- the invention maps parsed sentences into C++ data structures (called "objects") stored in computer memory.
- the analysis could equally well map sentences into other computational data structures, such as assertions in Prolog or Java classes.
- each C++ object mapped to a sentence represents an event; the type of event and the thematic roles identifying the participants in the event are given as member variables within that object.
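The event object described above can be sketched as follows. The source implements it in C++; Python is used here only for brevity, and the class name `Event` and its fields are assumptions rather than the actual member variables.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    """One parsed sentence: an event type plus its thematic role assignments."""
    verb: str                                   # stemmed verb naming the event type
    roles: dict = field(default_factory=dict)   # thematic role -> phrase

    def assign(self, role, phrase):
        """Record that `phrase` bears the thematic role `role` in this event."""
        self.roles[role] = phrase

# "Brutus stabbed Caesar": a stabbing event with an agent and a theme.
stabbing = Event(verb="stab")
stabbing.assign("agent", "Brutus")
stabbing.assign("theme", "Caesar")
print(stabbing)
```

Keeping the roles in a mapping keyed by role name makes the later question-to-sentence comparison a role-by-role lookup.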
- Neo-Davidsonian logical form is described in more detail in the following references: Terence Parsons, Events in the Semantics of English: A Study in Subatomic Semantics, MIT Press (1990); Gabriel Segal and Richard Larson, Knowledge of Meaning: An Introduction to Semantic Theory, MIT Press (1995); James Higginbotham, "On Semantics," Linguistic Inquiry 16:547-593 (1983). Further references on linking theory include: Edwin Williams, Thematic Structure in Syntax, MIT Press (1994); David Pesetsky, Zero Syntax: Experiencers and Cascades, MIT Press (1995).
- the invention relies on and develops recent theoretical work on the relationship of grammatical roles and thematic roles in the literature under the rubrics of neo-Davidsonian logical form and linking theory, as described in Brian Edward Ulicny, Issues in the Philosophical Foundations of Lexical Semantics, Chapter 3, Doctoral Dissertation, Department of Linguistics and Philosophy, Massachusetts Institute of Technology (accepted May, 1993) and Douglas A. Jones, Robert C. Berwick, Franklin Cho, Zeeshan R. Khan,
- the AI Lab Memo describes a crude parser that implemented a function from sentences to a grammaticality judgment of good or bad. A sentence was deemed good (grammatical) if it had at least one acceptable parse and was bad otherwise. Thus, the AI Lab Memo parser did not attempt to find a unique parse for a sentence, and its resulting logical form analysis would effectively reflect several thematic assignments.
- the parser of the present invention represents a significant improvement over the AI Lab Memo parser's deficiencies in each of the above-mentioned semantical and operational respects.
- in Figure 3, the present invention's process of parsing a sentence, summarized as step 250 in Figure 2, is explained in greater detail.
- the sentence in the buffer is passed, one word at a time, to a part of speech tagger for morphological analysis. Morphological variants of words result, for example, from situations such as prefix/suffix addition, inflection for past tense, etc.
- the part of speech tagger assigns a part of speech tag to each word of the sentence based upon the word's context and occurrence.
- the part of speech tag is selected from a total of 48 predetermined choices, such as: noun, verb, preposition, adjective, etc.
- the actual tag assigned depends at least in part on the last tag assigned, which reflects the selection properties of the preceding word. For example, if a word can be both a noun and a verb (e.g., "book"), the tagger will return NOUN if the preceding word was a determiner (as in "the book") or VERB if the preceding word was an auxiliary (as in "might book the hotel room").
- Reference to the context of a word in tagging is especially crucial for English, which has an inordinate number of verbs that are homonymous with nouns and adjectives that are homonymous with verbs; morphology won't distinguish the part-of-speech in these cases.
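The dependence of a tag on the previous tag can be illustrated with a toy tagger. The lexicon, the tag names, and the `PREFERRED_AFTER` table are assumptions; the actual tagger chooses among 48 tags and its rules are not given in the text.

```python
# Hypothetical lexicon: each word lists its possible tags, most likely first.
LEXICON = {
    "the": ["DET"], "a": ["DET"],
    "might": ["AUX"], "will": ["AUX"],
    "book": ["NOUN", "VERB"],
    "hotel": ["NOUN"], "room": ["NOUN"],
}
# Tag that the previous tag selects for, when the word allows it.
PREFERRED_AFTER = {"DET": "NOUN", "AUX": "VERB"}

def tag(words):
    """Tag words left to right, letting the previous tag break ambiguities."""
    tags, previous = [], None
    for word in words:
        options = LEXICON.get(word.lower(), ["NOUN"])  # unknown words default to NOUN
        wanted = PREFERRED_AFTER.get(previous)
        choice = wanted if wanted in options else options[0]
        tags.append(choice)
        previous = choice
    return tags

print(tag("the book".split()))
print(tag("might book the hotel room".split()))
```

The same word "book" receives NOUN after a determiner and VERB after an auxiliary, which morphology alone could not decide.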
- Case Theory, which is a subtheory of the school of syntactic analysis known as Government and Binding Theory, asserts that every phrase of a sentence must be paired one-to-one with a case assigner.
- Parts of speech that are case assigners include: TENSE (INFLECTION), VERBS and PREPOSITIONS.
- a noun phrase is assigned the (default) feature N.
- a prepositional phrase's head preposition is assigned as its phrase feature; the properties of this preposition will determine the role it plays in the sentence.
- features are assigned to a phrase on the basis of linguistic rules. For example, in the sentence "John cooked the spaghetti by boiling it", the phrase "by boiling it” would be assigned the feature MANNER because it describes how the event described by the main verb was accomplished. Verbs are inserted into the sentence's phrasal constituents and assigned the feature V.
- the invention contains a data structure that associates each verb with the set of "thematic grids" it can select as arguments.
- a thematic grid is a vector of thematic roles. Since verbs may assume several forms, based on their inflection for tense and agreement, the index is based on a stemmed form of the verb.
- the stemmed form of the verb is derived by means of an algorithm known to those skilled in the art as the Porter stemming algorithm, although other well-known stemming techniques would work equally well.
- certain specialized conditions on thematic roles may be used by the linking algorithm for greater specificity.
- One such specialized thematic role involving a slash i.e., A/B
- This is useful for verbs that select particular prepositions in certain contexts, or for thematic roles which are usually headed by prepositions but sometimes appear as plain noun phrases, as with "N/goal" above.
- the verb "spray" is associated with thematic grid <agent, N/location, with/theme>, among others.
- a phrase headed with the preposition "with" must be the second argument to the right of the verb when it is linked. It will be assigned the thematic role Theme.
- Another specialized thematic role "V/theta" (for some thematic role theta) is used to indicate the presence of verbs that incorporate nouns playing a thematic role.
- Identification of candidate thematic grids involves the use of two indices.
- the first index is a classification of all possible verbs by verb classes.
- the second index is a listing of all possible thematic roles selected by each verb class of the first index.
- Every verb in the verb class index is assigned to some class (or classes) based on the verb's meaning and the syntactic behavior of the verb's arguments.
- the verbs dig, jab, pierce, poke, prick, and stick. Their behavior contrasts with those of the Touch class, for instance, which does not allow through or into phrases as arguments. That is, the sentence "Carrie touched the stick through/into the cat" is ungrammatical. This would seem to indicate that proper usage of the Poke verbs necessarily involves some sort of directed motion, which can be expressed by a through or into phrase, whereas the Touch verbs do not.
- the Touch verbs simply express a relationship between two objects, while the Poke verbs specify something about the relationship of the instrument used in the poking to the material substance of the thing poked.
- the second index maps a verb class to the thematic grids associated with the various alternations in which the verbs in that verb class may participate. For example, four of the thematic grids that clear (and other verbs of its class) project are:
- each verb in the sentence is looked up in the first index to yield one or more verb classes. Then, for each verb class, all its possible thematic grids are determined from the second index. In this way, all possible candidate thematic grids for a given verb are determined.
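The two-index lookup can be sketched as below. All table contents are illustrative assumptions except the `<agent, N/location, with/theme>` grid given above for "spray"; the class names are hypothetical, and `stem` is a toy suffix stripper standing in for the Porter algorithm.

```python
# First index (hypothetical contents): stemmed verb -> verb classes.
VERB_CLASSES = {"spray": ["spray_load_verbs"], "touch": ["touch_verbs"]}

# Second index (hypothetical contents): verb class -> thematic grids.
CLASS_GRIDS = {
    "spray_load_verbs": [
        ("agent", "N/location", "with/theme"),   # grid quoted in the text
        ("agent", "N/theme", "location"),        # assumed alternation
    ],
    "touch_verbs": [("agent", "theme")],          # assumed grid
}

def stem(verb):
    """Toy stemmer: strips a few inflectional endings (not the Porter algorithm)."""
    for suffix in ("ing", "ed", "s"):
        if verb.endswith(suffix) and len(verb) > len(suffix) + 2:
            return verb[: -len(suffix)]
    return verb

def candidate_grids(verb):
    """Collect every thematic grid licensed by any class the verb belongs to."""
    grids = []
    for verb_class in VERB_CLASSES.get(stem(verb), []):
        grids.extend(CLASS_GRIDS.get(verb_class, []))
    return grids

print(candidate_grids("sprayed"))
```

Indexing grids by class rather than by verb means a new verb only needs a class assignment to inherit all of that class's alternations.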
- Passivized verbs do not deploy the thematic role of their subject (except, optionally, in a by-phrase); thus, a thematic role for the non-passivized verb's subject should be assigned only if there is a by-phrase (e.g., "Brutus was stabbed by Caesar.").
- the linker steps through the thematic grids one at a time and, at step 360, attempts to assign each verb and its arguments to a best thematic grid based on the verb's semantics. That is, the linker tries assigning thematic roles to phrases, starting with the leftmost verb in the sentence. If the thematic roles in the verb's thematic grid (e.g., <Agent, Theme>) are compatible with the features of the phrases immediately to the left of the verb and following the verb, then the phrases are assigned to those roles. For example, if a verb selects an Agent as its subject, or external argument, the linker will consider a phrase immediately to the left of the verb to be compatible if it is a noun phrase (has feature N) or has some other feature compatible with the Agent role.
- One particular situation involving Agent roles deserves special consideration. Referential dependencies between pronouns and overt noun phrases within a sentence or due to ellipses, both of which might result in a multiplicity of phrases assigned to certain thematic roles, are handled by allowing only one set of thematic roles to be assigned.
- the second agent is appended to the previously assigned Agent phrase.
- This allows the system to recognize some anaphoric relations that might otherwise be missed. For example, the sentence "If a man owns a donkey, he beats it," contains two Agents ("a man," "he") and two Themes ("a donkey," "it").
- the system will be able to recognize the donkey sentence as an answer to the question "Does a man beat a donkey?"
- the invention's treatment of multiple verbs can be thought of as an endorsement of the axiom that for every group of basic level events, there is an event (a super-event) that consists of the occurrence of all those events.
- the agents of those constituent events are the agents of the super event; the themes of the constituent events are the themes of the super event, and so on.
- at step 365, if all of the phrases are assigned a thematic role and all of the thematic grid's thematic roles have been assigned, we say that the linking has converged; otherwise it has crashed.
- at step 370, the first assignment of thematic roles that converges, if one occurs, is retained. Otherwise, at step 375, the assignment of thematic roles to phrases that comes closest to converging is retained. That is, the assignment that crashes the least badly is retained if none has converged. Phrases that don't get assigned a thematic role are assigned a special "adjunct" role. If convergence has not occurred, the process of steps 360-370 is repeated for the remaining candidate thematic grids, in an attempt to find a candidate grid that actually converges. Finally, at step 380, the best thematic grid is outputted for score evaluation (steps 255-260 of Figure 2).
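The converge-or-crash loop can be sketched under strong simplifying assumptions: phrases arrive as (text, feature) pairs with the subject first, grid slots are matched to phrases strictly in order, and "closeness to converging" is measured as the number of unmatched slots or phrases. The helper names are hypothetical, and the real linker's compatibility rules are richer than this.

```python
def compatible(spec, feature):
    """A plain role accepts a noun phrase (feature N); 'A/B' demands feature A."""
    required = spec.split("/")[0] if "/" in spec else "N"
    return feature == required

def role_name(spec):
    """'with/theme' -> 'theme'; 'agent' -> 'agent'."""
    return spec.split("/")[-1]

def link(grid, phrases):
    """Pair grid slots with phrases in order; return (assignment, misses)."""
    assignment, misses = {}, 0
    for i in range(max(len(grid), len(phrases))):
        if i < len(grid) and i < len(phrases) and compatible(grid[i], phrases[i][1]):
            assignment[role_name(grid[i])] = phrases[i][0]
        else:
            misses += 1
            if i < len(phrases):  # unlinked phrases become adjuncts
                assignment.setdefault("adjunct", []).append(phrases[i][0])
    return assignment, misses

def best_link(grids, phrases):
    """Return the first converging linking, else the one that crashes least badly."""
    best = None
    for grid in grids:
        result = link(grid, phrases)
        if result[1] == 0:
            return result
        if best is None or result[1] < best[1]:
            best = result
    return best

# "John sprayed the wall with paint"
phrases = [("John", "N"), ("the wall", "N"), ("with paint", "with")]
grids = [("agent", "N/theme", "location"), ("agent", "N/location", "with/theme")]
print(best_link(grids, phrases))
```

Here the first grid crashes because the "with" phrase cannot fill a plain noun-phrase slot, while the second grid converges and is returned.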
- at step 255, once the best parse has been found for a sentence, it is evaluated as to the degree to which it answers the submitted question. Answerhood is measured by a graded relation between a sentence and a question.
- a sentence may either be a full answer, a partial answer, or a non-answer to a submitted question. Partial answers with scores greater than a predetermined minimum score for the question are returned as well as full answers.
- at step 210 of Figure 2, after a submitted question has been parsed (in an identical manner to that described for sentences), it is assigned a minimum score that must be met or exceeded by a sentence, if that sentence is to be deemed an answer to the question. This score is given by the number of thematic roles assigned to the question. Thus, the questions "Who discovered Pluto?" and "Did Tombaugh discover Pluto?" would both be assigned a minimum score of 2, because they both contain an Agent ("Who" and "Tombaugh", respectively) and a Theme ("Pluto").
- a sentence is evaluated with respect to the submitted question as follows. First, a comparator determines, for each thematic role in the sentence, whether the phrase assigned to that thematic role in the question (e.g., the question's Agent) literally (exactly) matches a substring of the phrase assigned to that thematic role in the sentence. For each such match, a scorer increases the score by one. A sentence having a score of zero is discarded.
- The comparator and scorer determine whether any of the question's thematic roles are occupied by wh-phrases (e.g., interrogatives such as "who," "which," "why," "how," "when," "where," and "what") and, if so, whether those roles are occupied in the sentence; if they are, the score is increased by one per occupancy.
- Verb-based comparisons are made between the question and each sentence. Any such match, either between verbs or between verb classes, also increases the score by one per match.
- The semantic distance between the sentence's and the question's verb classes could be quantified as the numerical distance between them in the tree.
- Such a numerical distance could be used as an inverse measure of answerhood, with shorter distances causing larger score increases.
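The comparisons described above (literal role matching, wh-phrase occupancy, and verb matching) can be combined into one scoring routine. The following is a hedged sketch, not the comparator and scorer of the application: the `Parse` type, `is_wh_phrase`, and `score_sentence` are hypothetical names, verbs are assumed to be stored in base form, verb-class matching is omitted, and the wh-phrase rule follows one reading of the text (a role held by a wh-phrase in the question scores when the same role is occupied in the sentence).

```python
from dataclasses import dataclass
from typing import Dict

WH_WORDS = {"who", "which", "why", "how", "when", "where", "what"}


@dataclass
class Parse:
    verb: str                # base form of the main verb (assumed already lemmatized)
    roles: Dict[str, str]    # thematic role -> phrase


def is_wh_phrase(phrase: str) -> bool:
    """Treat a phrase as a wh-phrase if its first word is an interrogative."""
    words = phrase.lower().split()
    return bool(words) and words[0] in WH_WORDS


def score_sentence(question: Parse, sentence: Parse) -> int:
    score = 0
    # Literal matching: the question's phrase must be a substring of the
    # sentence's phrase for the same thematic role.
    for role, q_phrase in question.roles.items():
        s_phrase = sentence.roles.get(role)
        if s_phrase is not None and not is_wh_phrase(q_phrase) and q_phrase in s_phrase:
            score += 1
    if score == 0:
        return 0             # a sentence with a score of zero is discarded
    # Wh-phrase occupancy: one point per wh-role that the sentence fills.
    for role, q_phrase in question.roles.items():
        if is_wh_phrase(q_phrase) and role in sentence.roles:
            score += 1
    # Verb comparison: one point for a matching verb.
    if question.verb == sentence.verb:
        score += 1
    return score


question = Parse("discover", {"Agent": "Who", "Theme": "Pluto"})
sentence = Parse("discover", {"Agent": "Clyde Tombaugh", "Theme": "the planet Pluto"})
print(score_sentence(question, sentence))  # prints 3
```

In this example the Theme matches literally, the Agent role asked about by "Who" is filled by "Clyde Tombaugh", and the verbs agree, so the sentence scores 3 against a minimum score of 2 for the question.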
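One way to picture the tree-distance idea is to count the edges on the path between two verb classes through their nearest common ancestor. The sketch below does that for a tiny, entirely hypothetical tree: the class names in `PARENT` and the `1 / (1 + distance)` score increase are assumptions chosen for illustration, since the text says only that shorter distances should give larger increases.

```python
from typing import Dict, List, Optional

# Hypothetical verb-class tree, stored as child -> parent (the root has no parent).
PARENT: Dict[str, Optional[str]] = {
    "event": None,
    "perception": "event",
    "motion": "event",
    "discover": "perception",
    "find": "perception",
    "orbit": "motion",
}


def path_to_root(node: str) -> List[str]:
    """List the nodes from `node` up to the root, inclusive."""
    path: List[str] = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return path


def tree_distance(a: str, b: str) -> int:
    """Number of edges between a and b via their nearest common ancestor."""
    depth_from_a = {node: depth for depth, node in enumerate(path_to_root(a))}
    for depth_from_b, node in enumerate(path_to_root(b)):
        if node in depth_from_a:
            return depth_from_a[node] + depth_from_b
    raise ValueError("nodes do not share a tree")


def score_increase(a: str, b: str) -> float:
    """Assumed inverse measure: identical classes give 1.0, distant ones less."""
    return 1.0 / (1 + tree_distance(a, b))


print(tree_distance("discover", "find"))   # prints 2
print(tree_distance("discover", "orbit"))  # prints 4
```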
- sentences that score higher than the minimum score associated with the question are presented to the user, or otherwise stored, as answers to the submitted question.
- an HTML page with the answers, their scores, and a hypertext link to the page from which they were extracted is constructed and automatically updated for the user, who accesses it through a Web browser. Processing continues until all sentences (step 245) of all the documents returned by the search query (step 270) have been evaluated.
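The answer page could be assembled along the following lines. This is a minimal sketch using only the Python standard library; the function name `render_answers`, the tuple layout, the page markup, and the choice to list answers by descending score are all assumptions rather than details taken from the application.

```python
from html import escape
from typing import List, Tuple


def render_answers(question: str, answers: List[Tuple[str, int, str]]) -> str:
    """Build an HTML page from (sentence, score, source_url) tuples."""
    items = []
    # Listing the highest-scoring answers first is a presentation choice made here.
    for sentence, score, url in sorted(answers, key=lambda a: a[1], reverse=True):
        items.append(
            f"<li>{escape(sentence)} (score {score}) "
            f'<a href="{escape(url, quote=True)}">source</a></li>'
        )
    return (
        "<html><head><title>Answers</title></head><body>"
        f"<h1>{escape(question)}</h1><ol>{''.join(items)}</ol></body></html>"
    )


page = render_answers(
    "When was Pluto discovered?",
    [("Pluto was discovered in 1930.", 3, "http://example.com/pluto.html")],
)
print(page)
```

Regenerating the page each time a new answer is found would approximate the automatic updating described in the text; escaping the sentence and URL keeps markup in the source documents from breaking the page.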
- Figures 4A-4D show a representative response to a user question of the form "When was Pluto discovered?" in the form of an automatically generated HTML page.
- processing of the documents is done on an as-needed basis.
- the parses are not stored for future use. It would present no technical difficulty to store the data structures that result from the parsing in a database so that question-answering could directly access the stored parses, rather than parsing the sentences as needed.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP97950947A EP1016003A1 (en) | 1996-12-04 | 1997-12-04 | Method and apparatus for natural language querying and semantic searching of an information database |
CA002273902A CA2273902A1 (en) | 1996-12-04 | 1997-12-04 | Method and apparatus for natural language querying and semantic searching of an information database |
AU53816/98A AU5381698A (en) | 1996-12-04 | 1997-12-04 | Method and apparatus for natural language querying and semantic searching of an information database |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US76069196A | 1996-12-04 | 1996-12-04 | |
US08/760,691 | 1996-12-04 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO1998025217A1 true WO1998025217A1 (en) | 1998-06-11 |
Family
ID=25059885
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US1997/022943 WO1998025217A1 (en) | 1996-12-04 | 1997-12-04 | Method and apparatus for natural language querying and semantic searching of an information database |
Country Status (4)
Country | Link |
---|---|
EP (1) | EP1016003A1 (en) |
AU (1) | AU5381698A (en) |
CA (1) | CA2273902A1 (en) |
WO (1) | WO1998025217A1 (en) |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2000075807A1 (en) * | 1999-06-08 | 2000-12-14 | Albert-Inc. S.A. | System and method for enhancing e-commerce using natural language interface for searching database |
WO2000075806A1 (en) * | 1999-06-08 | 2000-12-14 | Albert-Inc. S.A. | System and method for enhancing online support services using natural language interface for searching database |
WO2000079437A2 (en) * | 1999-06-18 | 2000-12-28 | Microsoft Corporation | System for identifying the relations of constituents in information retrieval-type tasks |
WO2001069455A2 (en) * | 2000-03-16 | 2001-09-20 | Poly Vista, Inc. | A system and method for analyzing a query and generating results and related questions |
WO2001082114A2 (en) * | 2000-04-26 | 2001-11-01 | Global Information Research And Technologies Llc | System for fulfilling an information need |
WO2001084376A2 (en) * | 2000-04-28 | 2001-11-08 | Global Information Research And Technologies Llc | System for answering natural language questions |
WO2002041171A1 (en) * | 2000-11-17 | 2002-05-23 | Supportus Aps | Method for finding answers in a database to natural language questions |
WO2002046970A2 (en) * | 2000-12-05 | 2002-06-13 | Global Information Research And Technologies Llc | System for fulfilling an information need using extended matching techniques |
WO2002080036A1 (en) * | 2001-03-30 | 2002-10-10 | Hapax Ltd | Method of finding answers to questions |
WO2002091237A1 (en) * | 2001-05-07 | 2002-11-14 | Global Information Research And Technologies Llc | System for answering natural language questions |
US6598039B1 (en) | 1999-06-08 | 2003-07-22 | Albert-Inc. S.A. | Natural language interface for searching database |
WO2011051970A2 (en) * | 2009-10-28 | 2011-05-05 | Tata Consultancy Services Ltd. | Method and system for obtaining semantically valid chunks for natural language applications |
US9396235B1 (en) | 2013-12-13 | 2016-07-19 | Google Inc. | Search ranking based on natural language query patterns |
US9471559B2 (en) | 2012-12-10 | 2016-10-18 | International Business Machines Corporation | Deep analysis of natural language questions for question answering system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0610760A2 (en) * | 1993-01-28 | 1994-08-17 | Kabushiki Kaisha Toshiba | Document detection system with improved document detection efficiency |
EP0631244A2 (en) * | 1993-06-24 | 1994-12-28 | Xerox Corporation | A method and system of information retrieval |
1997
- 1997-12-04 AU AU53816/98A patent/AU5381698A/en not_active Abandoned
- 1997-12-04 WO PCT/US1997/022943 patent/WO1998025217A1/en not_active Application Discontinuation
- 1997-12-04 EP EP97950947A patent/EP1016003A1/en not_active Withdrawn
- 1997-12-04 CA CA002273902A patent/CA2273902A1/en not_active Abandoned
Non-Patent Citations (3)
Title |
---|
BALDAZO R: "NAVIGATING WITH A WEB COMPASS", BYTE, vol. 21, no. 3, 1 March 1996 (1996-03-01), pages 97/98, XP000600179 * |
QUINTANA Y ET AL: "Graph-based retrieval of information in hypertext systems", SIGDOC '92. THE 10TH ANNUAL INTERNATIONAL CONFERENCE. CONFERENCE PROCEEDINGS. GOING ONLINE. THE NEW WORLD OF MULTIMEDIA DOCUMENTATION, PROCEEDINGS OF SIGDOC '92: 10TH ANNUAL ACM CONFERENCE ON SYSTEMS DOCUMENTATION, OTTAWA, ONT., CANADA, 13-16 OCT. 19, ISBN 0-89791-532-1, 1992, NEW YORK, NY, USA, ACM, USA, pages 157 - 168, XP002062197 * |
RAYNER M ET AL: "Temporal relations and logic grammars", ECAI '86. 7TH EUROPEAN CONFERENCE ON ARTIFICIAL INTELLIGENCE. PROCEEDINGS, BRIGHTON, UK, 21-25 JULY 1986, 1986, LONDON, UK, CONFERENCE SERVICES, UK, pages 9 - 14 vol.2, XP002062196 * |
Cited By (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6446064B1 (en) | 1999-06-08 | 2002-09-03 | Albert Holding Sa | System and method for enhancing e-commerce using natural language interface for searching database |
WO2000075806A1 (en) * | 1999-06-08 | 2000-12-14 | Albert-Inc. S.A. | System and method for enhancing online support services using natural language interface for searching database |
WO2000075807A1 (en) * | 1999-06-08 | 2000-12-14 | Albert-Inc. S.A. | System and method for enhancing e-commerce using natural language interface for searching database |
US6598039B1 (en) | 1999-06-08 | 2003-07-22 | Albert-Inc. S.A. | Natural language interface for searching database |
US6594657B1 (en) | 1999-06-08 | 2003-07-15 | Albert-Inc. Sa | System and method for enhancing online support services using natural language interface for searching database |
US7290004B2 (en) | 1999-06-18 | 2007-10-30 | Microsoft Corporation | System for improving the performance of information retrieval-type tasks by identifying the relations of constituents |
WO2000079437A3 (en) * | 1999-06-18 | 2003-12-18 | Microsoft Corp | System for identifying the relations of constituents in information retrieval-type tasks |
US7536397B2 (en) | 1999-06-18 | 2009-05-19 | Microsoft Corporation | System for improving the performance of information retrieval-type tasks by identifying the relations of constituents |
US7299238B2 (en) | 1999-06-18 | 2007-11-20 | Microsoft Corporation | System for improving the performance of information retrieval-type tasks by identifying the relations of constituents |
WO2000079437A2 (en) * | 1999-06-18 | 2000-12-28 | Microsoft Corporation | System for identifying the relations of constituents in information retrieval-type tasks |
US7290005B2 (en) | 1999-06-18 | 2007-10-30 | Microsoft Corporation | System for improving the performance of information retrieval-type tasks by identifying the relations of constituents |
US7269594B2 (en) | 1999-06-18 | 2007-09-11 | Microsoft Corporation | System for improving the performance of information retrieval-type tasks by identifying the relations of constituents |
US7206787B2 (en) | 1999-06-18 | 2007-04-17 | Microsoft Corporation | System for improving the performance of information retrieval-type tasks by identifying the relations of constituents |
US6901402B1 (en) | 1999-06-18 | 2005-05-31 | Microsoft Corporation | System for improving the performance of information retrieval-type tasks by identifying the relations of constituents |
WO2001069455A2 (en) * | 2000-03-16 | 2001-09-20 | Poly Vista, Inc. | A system and method for analyzing a query and generating results and related questions |
WO2001069455A3 (en) * | 2000-03-16 | 2003-11-20 | Poly Vista Inc | A system and method for analyzing a query and generating results and related questions |
WO2001082114A2 (en) * | 2000-04-26 | 2001-11-01 | Global Information Research And Technologies Llc | System for fulfilling an information need |
US6859800B1 (en) | 2000-04-26 | 2005-02-22 | Global Information Research And Technologies Llc | System for fulfilling an information need |
WO2001082114A3 (en) * | 2000-04-26 | 2003-09-12 | Global Information Res And Tec | System for fulfilling an information need |
WO2001084376A2 (en) * | 2000-04-28 | 2001-11-08 | Global Information Research And Technologies Llc | System for answering natural language questions |
WO2001084376A3 (en) * | 2000-04-28 | 2002-07-25 | Global Information Res And Tec | System for answering natural language questions |
WO2002041171A1 (en) * | 2000-11-17 | 2002-05-23 | Supportus Aps | Method for finding answers in a database to natural language questions |
WO2002046970A3 (en) * | 2000-12-05 | 2004-02-26 | Global Information Res And Tec | System for fulfilling an information need using extended matching techniques |
WO2002046970A2 (en) * | 2000-12-05 | 2002-06-13 | Global Information Research And Technologies Llc | System for fulfilling an information need using extended matching techniques |
US7058564B2 (en) | 2001-03-30 | 2006-06-06 | Hapax Limited | Method of finding answers to questions |
WO2002080036A1 (en) * | 2001-03-30 | 2002-10-10 | Hapax Ltd | Method of finding answers to questions |
US7707023B2 (en) | 2001-03-30 | 2010-04-27 | Hapax Limited | Method of finding answers to questions |
WO2002091237A1 (en) * | 2001-05-07 | 2002-11-14 | Global Information Research And Technologies Llc | System for answering natural language questions |
WO2011051970A2 (en) * | 2009-10-28 | 2011-05-05 | Tata Consultancy Services Ltd. | Method and system for obtaining semantically valid chunks for natural language applications |
WO2011051970A3 (en) * | 2009-10-28 | 2011-07-07 | Tata Consultancy Services Ltd. | Method and system for obtaining semantically valid chunks for natural language applications |
US9471559B2 (en) | 2012-12-10 | 2016-10-18 | International Business Machines Corporation | Deep analysis of natural language questions for question answering system |
US9396235B1 (en) | 2013-12-13 | 2016-07-19 | Google Inc. | Search ranking based on natural language query patterns |
Also Published As
Publication number | Publication date |
---|---|
EP1016003A1 (en) | 2000-07-05 |
CA2273902A1 (en) | 1998-06-11 |
AU5381698A (en) | 1998-06-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP4892130B2 (en) | Text input processing system using natural language processing technique | |
US6859800B1 (en) | System for fulfilling an information need | |
US6678677B2 (en) | Apparatus and method for information retrieval using self-appending semantic lattice | |
US6665666B1 (en) | System, method and program product for answering questions using a search engine | |
US6678694B1 (en) | Indexed, extensible, interactive document retrieval system | |
US20040117352A1 (en) | System for answering natural language questions | |
Bernardini et al. | A WaCky introduction | |
CA2754006A1 (en) | Systems, methods, and software for hyperlinking names | |
EP1016003A1 (en) | Method and apparatus for natural language querying and semantic searching of an information database | |
Bradshaw et al. | Guiding people to information: providing an interface to a digital library using reference as a basis for indexing | |
Hung et al. | Applying word sense disambiguation to question answering system for e-learning | |
Radev et al. | Evaluation of text summarization in a cross-lingual information retrieval framework | |
Smeaton et al. | User-chosen phrases in interactive query formulation for information retrieval | |
Katz et al. | Answering Multiple Questions on a Topic From Heterogeneous Resources. | |
Billerbeck | Efficient query expansion | |
KR20030006201A (en) | Integrated Natural Language Question-Answering System for Automatic Retrieving of Homepage | |
Litkowski | Question Answering Using XML-Tagged Documents. | |
Anick | The automatic construction of faceted terminological feedback for interactive document retrieval | |
Milić-Frayling | Text processing and information retrieval | |
Salton | Abstracts of Articles in the Information Retrieval Area Selected by Gerard Salton | |
Lancaster | Mechanized document control: A review of some recent research | |
Beaulieu et al. | Concept-based Interactive Query Expansion Support Tool (CIQUEST) | |
WO2002046970A2 (en) | System for fulfilling an information need using extended matching techniques | |
Kelledy | Query space reduction in information retrieval | |
Ramanand et al. | Data Engineering |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | AK | Designated states | Kind code of ref document: A1; Designated state(s): AT AU BG BR CA CH CN CZ DE DK ES FI GB HU IL IS JP KR LC LU MK MX NO NZ PL PT RO RU SE SG SI SK VN YU |
| | AL | Designated countries for regional patents | Kind code of ref document: A1; Designated state(s): AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
| | ENP | Entry into the national phase | Ref document number: 2273902; Country of ref document: CA |
| | WWE | Wipo information: entry into national phase | Ref document number: 1997950947; Country of ref document: EP |
| | REG | Reference to national code | Ref country code: DE; Ref legal event code: 8642 |
| | WWP | Wipo information: published in national office | Ref document number: 1997950947; Country of ref document: EP |
| | WWW | Wipo information: withdrawn in national office | Ref document number: 1997950947; Country of ref document: EP |