US20040073874A1 - Device for retrieving data from a knowledge-based text - Google Patents

Publication number
US20040073874A1
Authority
US
United States
Prior art keywords
information extraction
module
text
selection
learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US10/467,937
Inventor
Thierry Poibeau
Célestin Sedogbo
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Thales SA
Original Assignee
Thales SA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Thales SA
Assigned to THALES. Assignment of assignors' interest (see document for details). Assignors: POIBEAU, THIERRY; SEDOGBO, CELESTIN
Publication of US20040073874A1
Legal status: Abandoned

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing

Definitions

  • Consuela Washington, a longtime House staffer and an expert in securities laws, is a leading candidate to be chairwoman of the Securities and Exchange Commission in the Clinton administration.
  • Consuela Washington represents a person.
  • the first instance of the word Washington is more of a problem in that the only information allowing a choice to be made in the sentence is world knowledge, viz. it is generally a person who runs an organization.
  • the dynamic typing process is limited, in the event of conflict (that is to say, if a word has received a label which is in conflict with a label recorded beforehand for this word in the dictionary; this is the case for the word Washington in the above example), to the text being analyzed and not to the corpus as a whole.
  • the system will label all isolated instances of Washington as person name in the above text, but in the next text, if an isolated instance of the word Washington appears, the system will label it as location name, according to the dictionary.
  • an arbitrary choice is then made.
  • FIG. 3 illustrates the flowchart for conflict resolution in the typing of entities.
  • the extraction module ( 20 ) includes a third program (INT_EXT, 213 ) to identify the relations between the entities for which the relevant instances have been selected by the program ( 212 ).
  • the FACT_DB window in FIG. 5 shows the relations which have been established between the entities of the TAG_TEXT window.
  • This module includes three main sub-modules, the flowchart of which is represented in FIG. 5.
  • Step ( 1310 ) (1st identification of relevant relations between entities) is automatic.
  • Step ( 1320 ) (2nd identification of relevant relations between entities—Addition/Subtraction of relevant/non-relevant relations) is semi-automatic and assumes a step ( 1330 ) of interaction with the user.
  • Step ( 1400 ) is for feeding the database (FACT_DB, 80 ) with the selected entities and the identified relations.
  • the entity and relation field names are managed automatically and the fields of the database are then filled with their instances.
  • the database ( 80 ) can in fact be operated by users who are not information processing specialists but who require structured information.
  • the device according to the invention also includes a learning module (LEARN_MOD, 30 ) which cooperates with the extraction module ( 20 ).
  • This module receives at the input, in an asynchronous manner with the operation of the module ( 20 ), a collection of texts belonging to a given domain (DOM_TEXT, 50 ).
  • This asynchronous mode of operation allows the knowledge base KB2 (412) to be built, containing the domain-specific dictionary, together with the knowledge base KB3 (413) and the grammar rules specific to the same domain. It also enables relations that are characteristic of the domain, and which are stored in a database KB5 (415), to be formulated.
  • the module ( 30 ) cooperates with the module ( 20 ) to enrich the knowledge bases (KB 2 , KB 3 , KB 5 ) as illustrated generically in FIG. 8 and on a specific example in FIG. 9.
  • This module includes three main sub-modules for which the sequencing flowchart is represented in FIG. 5: morphosyntactic analysis sub-module, sub-module for the linguistic analysis of elements in the form and form-filling sub-module. These sub-modules are sequenced together as a cascade: the analysis supplied at one given level is retrieved and extended to the next level.
  • the morphosyntactic analysis is made up of a tokenizer, a sentence splitter, an analyzer and a morphological labeler.
  • the annotations are presented in transducer form.
  • the identification of elements of the form by linguistic analysis can be broken down into two steps: the first, generic, step is for analyzing named entities, and the second step, specific to a given corpus, is for typing the entities recognized previously and identifying other elements needed to fill the form.
  • Named entities are linked by means of more specific extraction schemes which are written by means of a set of transducers for assigning a label to a sequence of lexical items. These rules exploit the morphosyntactic analysis which took place beforehand.
  • An example transducer is given in FIG. 7.
  • the last step involves simply retrieving within the document the relevant information in order to insert it into an extraction form.
  • the partial results are merged into one single form per document.
  • The algorithms for selecting relevant entities are enhanced during step (1120) through interaction with the user (1130), who selects the relevant and non-relevant contexts of the instances of the entities.
  • The new parameters of the algorithms are generated during step (2100) and then stored during step (2200).
  • The algorithms for identifying relevant relations are enhanced during step (1320) through interaction with the user (1330), who identifies the relevant and non-relevant relations.
  • The new parameters of the algorithms are generated during step (2300) and then stored during step (2400).
  • steps ( 1120 ) and ( 1130 ) are illustrated by an example in FIG. 5.
  • Window (3100): the user supplies a semantic class to the system, for example the speech verbs “affirmer” (to affirm), “déclarer” (to declare), “dire” (to say), etc.
  • Window (3200): this semantic class is projected onto the corpus (DOM_TEXT, 50) in order to gather all the contexts in which a given expression appears. In the speech-verb example, this step ends with the formation of a list of all the contexts in which the verbs “affirmer” (to affirm), “déclarer” (to declare), “dire” (to say), etc. appear.
  • Window (3400): the system uses the list of examples marked positive and negative to generate, from a set of knowledge for the domain (essentially linguistic rules), a state machine covering most of the contexts marked positive while excluding those marked negative.
  • a transducer describes a linguistic expression and is generally read from left to right. Each box describes a linguistic item and is linked to the next element by a line.
  • A linguistic item can be a character string (que, de), a lemma (<avoir> denotes the form “a” as well as other inflected forms such as “aurons”), a syntactic category (<V> denotes any verb), or a syntactic category accompanied by semantic features (<N+ProperName> denotes, within nouns, only proper names).
  • The grayed elements (à_obj) denote a call to a complex structure described in another transducer (recursion). The elements that are searched for are enclosed between the tags <key> and </key>, which are introduced for later processing.
  • Window (3500): the user outputs the resulting state machine and, if necessary, makes slight alterations.
  • The learning corpus is first preprocessed to eliminate non-essential complements. This is done by projecting onto the text (TEXT, 10), in delete mode, the fixed-adverb dictionaries and the grammars designed to identify adjunct elements. Running a state machine in delete mode yields a text from which the sequences recognized by the state machine have been deleted.
  • the knowledge base state machines are then, in their turn, projected onto the database of examples.
  • Two state machines ( 3510 , 3520 ) emerge from the linguistic knowledge database.
  • the states of the state machine ( 3511 , 3521 ) call on sub-graphs using indications supplied by the functional labeling, for the recognition of indirect objects introduced by the preposition “à” ( 3511 ) and inverted subjects ( 3521 ).
  • This strategy enables coverage of new positive contexts illustrated in the window ( 3600 ).
  • the state machine leads to the structure represented in the window ( 3700 ).
  • This master state machine is inferred from the examples database for the recognition of speech verbs.
  • the inferred state machine is complex. It covers the examples database and will feed the extraction system.
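  • The learning loop described for windows (3100) to (3400) can be sketched roughly as follows. This is only an illustration under stated assumptions: regular expressions stand in for the state machines, and the names `CORPUS`, `SPEECH_VERBS`, `CANDIDATES`, `gather_contexts` and `select_patterns` are hypothetical and do not come from the patent. The user's positive and negative marks are passed in as plain lists.

```python
import re

# Hypothetical mini-corpus and semantic class (inflected speech verbs).
CORPUS = [
    "Le ministre déclare que la production débute",
    "Le porte-parole affirme que le capital augmente",
    "On dit souvent cela",
    "La production débute en janvier",
]
SPEECH_VERBS = ["déclare", "affirme", "dit"]

# Hypothetical candidate patterns standing in for knowledge-base state machines.
CANDIDATES = [
    r"\b(?:déclare|affirme|dit) que\b",
    r"\b(?:déclare|affirme|dit)\b",
]

def gather_contexts(corpus, semantic_class):
    """Project a semantic class onto the corpus: keep sentences that use it."""
    return [s for s in corpus if any(w in s.split() for w in semantic_class)]

def select_patterns(candidates, positives, negatives):
    """Keep patterns that match some positive context and no negative one,
    then report which positive contexts the kept patterns cover."""
    kept = [
        p for p in candidates
        if any(re.search(p, pos) for pos in positives)
        and not any(re.search(p, neg) for neg in negatives)
    ]
    covered = [pos for pos in positives if any(re.search(p, pos) for p in kept)]
    return kept, covered
```

  • In this sketch, a user who marks the first two gathered contexts as positive and the third as negative ends up with only the more specific pattern, because the general one also matches the negative context.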

Abstract

The invention relates to a device and a method for extracting information from an unstructured text, said information including relevant instances of classes/entities searched for by the user and relations between these classes/entities. The device and method improve in a semi-automatic manner on a given domain. The transition from one domain to a new domain is also highly facilitated by the device and method of the invention.

Description

  • The present invention is in the field of extraction of information from unstructured texts. More specifically, it enables the formation and enrichment of a database of knowledge specific to a domain, improving the effectiveness of the extraction. [0001]
  • Information extraction (IE) is distinct from information retrieval (IR). Information retrieval involves finding texts containing a combination of words that are the object of the search or, where necessary, a combination close to the original, the degree of closeness being used to arrange the collection of texts containing said combination in order of relevance. Information retrieval is used especially in document searches and, increasingly, by the general public (use of search engines on the Internet). [0002]
  • Information extraction involves searching through a collection of unstructured texts for all the information (and only that information) having an attribute (for example all proper names, company heads, heads of state, etc.) and arranging all instances of the attribute in a database so as to then process them. Information extraction is used especially in business intelligence and in civilian or military intelligence. [0003]
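  • The contrast between retrieval and extraction can be illustrated with a small sketch. It is not taken from the patent: the documents, the overlap-based ranking and the title-based person pattern are all assumptions chosen for brevity, and `retrieve` and `extract_person_names` are hypothetical names.

```python
import re

# Hypothetical three-document collection.
DOCS = {
    "d1": "Mr Kassianov met Mr Primakov in Moscow.",
    "d2": "The meeting in Moscow was postponed.",
    "d3": "Golf clubs are produced in Japan.",
}

def retrieve(query, docs):
    """Information retrieval: rank whole documents by query-term overlap."""
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        words = set(re.findall(r"\w+", text.lower()))
        score = len(terms & words)
        if score:
            scored.append((score, doc_id))
    # Highest overlap first; ties broken by document id for determinism.
    return [doc_id for score, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]

def extract_person_names(docs):
    """Information extraction: collect every instance of one attribute
    as database-like rows (here, names introduced by the title 'Mr')."""
    rows = []
    for doc_id, text in docs.items():
        for match in re.finditer(r"\bMr ([A-Z]\w+)", text):
            rows.append({"doc": doc_id, "PERSON_NAME": match.group(1)})
    return rows
```

  • Retrieval returns an ordered list of texts, whereas extraction returns the instances themselves, ready to be stored in a database.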
  • The prior art in information extraction is well represented by the work and papers presented at the Message Understanding Conferences which take place every two years in the USA (references: Proceedings of the 5th, 6th and 7th Message Understanding Conference (MUC-5, MUC-6, MUC-7), Morgan Kaufmann, San Mateo, Calif., USA). The selection algorithms have, for a long time now, implemented finite state machines (FSMs) or finite state transducers (FSTs). See in particular U.S. Pat. Nos. 5,610,812 and 5,625,554. [0004]
  • The relevance of the results of these algorithms is however highly dependent on the semantic proximity of the texts which are processed. If semantic proximity is no longer assured, as in the case of a change of domain, the algorithms must be completely reprogrammed, which is a long and costly process. [0005]
  • U.S. Pat. Nos. 5,796,926 and 5,841,895 disclose the use of certain learning processes for programming the finite state machine algorithms in a semi-automatic manner. The processes of this prior art are limited to learning syntactic relations within the context of a sentence, so extensive manual programming is still required. [0006]
  • The present invention solves this problem by enabling the learning of other types of relations and by extending the field of the learning to the whole of a collection of texts of a domain. [0007]
  • To these ends, the invention proposes a device for extracting information from a text including an extraction module and a learning module cooperating with each other and comprising means for automatically selecting in the text the contexts of instance of classes/entities of information to be extracted, for automatically selecting from these contexts those which are relevant for a domain and for enabling the user to modify this latter selection such that the learning module will improve the next output of the extraction module, characterized in that the extraction module additionally includes means for identifying relations existing in the text between the relevant entities at the output of the means. [0008]
  • The invention also proposes a method for extracting information from a text including a learning process and a selection process, the selection process including a step of automatic selection in the text of contexts of instance classes/entities of the information to be extracted, a step of automatic selection from these contexts of those which are relevant for a domain and a step of modification by the user of outputs of the previous step, the modified outputs being taken into account in a learning process to improve the next result of the selection process, characterized in that the selection process additionally includes steps to identify the relations existing in the text between the relevant entities at the output of the steps of the selection process.[0009]
  • The invention will be better understood and its various features and advantages will become apparent from the description that follows of an example embodiment and from its accompanying figures, of which: [0010]
  • FIG. 1 discloses a hardware embodiment of the device; [0011]
  • FIG. 2 shows the architecture of the device according to the invention; [0012]
  • FIG. 3 shows the flowchart for conflict resolution according to the context; [0013]
  • FIG. 4 shows the sequencing of the steps of the method according to the invention; [0014]
  • FIG. 5 shows the flowchart of the relations between the entities; [0015]
  • FIG. 6 shows an example morphosyntactic analysis; [0016]
  • FIG. 7 illustrates an example of transduction; [0017]
  • FIG. 8 illustrates the sequencing of selection steps on an example; [0018]
  • FIG. 9 illustrates the sequencing of learning steps on another example. [0019]
  • The accompanying drawings include a number of elements, in particular textual elements, of a definite character. Consequently, the drawings may serve not only to illustrate the description but also, if necessary, to contribute to the definition of the invention. [0020]
  • To be more comprehensible, the detailed description deals with the file elements in natural language. For example, REUTERS will be used as the agency name (SOURCE). However, in computer science terms REUTERS is a character string represented by corresponding bytes. The same is true for the other information-processing-related objects: in particular dates, numerical values. Tagging is also an established operation which, purely by way of nonlimiting example, is illustrated by the language XML. [0021]
  • As shown in FIG. 1, the device may include a central processing unit and its associated memory (CPU/RAM) with a keyboard and monitor. The central processing unit will be advantageously connected to a local area network, itself possibly connected to a public or private wide area network (DISPLAY), if necessary by secured links. The collections of texts to be processed will be available in several types of alphanumeric format (processing and text, HTML or XML) on storage means (ST_1, ST_2) which will for example be redundant disks connected to the local area network. [0022]
  • These storage means will also include texts that have undergone processing according to the invention (TAG_TEXT) and various corpora of texts by domain (DOM_TEXT) with the appropriate indexes. Also stored on these disks will be the database(s) (FACT_DB) fed by the information extraction. The database will advantageously be of the relational type or object type. The data structure will be defined in a manner known to those skilled in the art according to the application specification or generated by the application (see for example the FACT_DB window in FIG. 4). [0023]
  • The texts to be processed (TEXT) can be imported to the storage means (ST_1, ST_2) by diskette or any other removable storage means or they can come from the wide area network, directly in a format compatible with the PREPROC_MOD sub-module (FIG. 2). [0024]
  • They can also be captured on one of the networks connected to the device according to the invention by capture devices. [0025]
  • This could include alphanumeric messages from for example a messaging system “text capture”, from scanned documents or faxes “fax capture” or from voice messages “voice capture”. The computer peripheral equipment enabling this capture and the software used to convert them to text format (image recognition and speech recognition) are commercially available. In the case of intelligence applications, it may be useful to carry out an interception and a real-time processing of documents exchanged over wired or wireless communication networks. In this case, the specific listening devices will be integrated in the system upstream of the capture peripheral equipment. [0026]
  • The device according to the invention, such as the one shown in block-diagram form in FIG. 2, includes an extraction module (20) or “EXT_MOD” to which the text to be processed (“TEXT”, 10) is presented. [0027]
  • Said extraction module (20) includes a first preprocessing program (“PREPROC_MOD”, 211) which recognizes the structure of the document in order to extract information from it. Structured documents enable simple extraction, without linguistic analysis, since they have headers or characteristic structures (electronic mail headers, agency dispatch block). Thus, in the example of FIG. 4, the agency dispatch block in the STR_TEXT window includes: [0028]
  • the agency name (SOURCE=“REUTERS”), [0029]
  • the date of dispatch (DATE_SOURCE=27-04-1987), [0030]
  • the rubric title (SECTION=“Financial news”). [0031]
  • To recognize specific entities, it is sufficient to recognize the document type (agency dispatch) from the presence of a characteristic block. The three entities are then taken from their position determined in the block. [0032]
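  • Position-based extraction from a dispatch block can be sketched as follows. The block layout used here (three header lines in fixed order followed by a blank line) is purely an assumption for illustration, since the description does not specify the layout; `parse_dispatch` and `DATE_LINE` are hypothetical names.

```python
import re

# Assumed (hypothetical) layout of an agency dispatch block: three header
# lines in fixed order (agency, date, rubric), a blank line, then the body.
DATE_LINE = re.compile(r"^\d{2}-\d{2}-\d{4}$")

def parse_dispatch(text):
    """Return the header fields if the text starts with a dispatch block,
    otherwise None (the document is not recognized as a dispatch)."""
    lines = text.splitlines()
    if len(lines) < 4 or lines[3].strip() != "":
        return None  # no characteristic block
    source, date, section = (line.strip() for line in lines[:3])
    if not source.isupper() or not DATE_LINE.match(date):
        return None
    # The three entities are taken purely from their position in the block.
    return {"SOURCE": source, "DATE_SOURCE": date, "SECTION": section}

SAMPLE = "REUTERS\n27-04-1987\nFinancial news\n\nBody of the dispatch."
```

  • No linguistic analysis is involved: recognizing the document type is enough to know where each field sits.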
  • The extraction module (20) also includes a second program to extract the entities (“ENT_EXT”, 212), that is to say to recognize the names of persons, companies and locations, and the expressions specified in the domain considered. [0033]
  • The block of the TAG_TEXT window of FIG. 4 shows the entities/expressions with the class that has been attributed to them by tags: [0034]
    “Bridgestone Sports” -> COMPANY
    “vendredi” -> DATE
    “Taiwan” -> LOCATION
    “une entreprise locale” -> COMPANY
    “clubs de golf” -> PRODUCT
    “Japon” -> LOCATION
    “Bridgestone Sports Taiwan” -> COMPANY
    “20 millions de nouveaux dollars taiwanais” -> CAPITAL
    “janvier 1990” -> DATE
    “clubs en acier et en bois-metal” -> PRODUCT
  • The recognition of entities/expressions will call upon the dictionary (KB3, 413) which itself is fed by general knowledge (KB1, 411) and learned knowledge (KB2, 412). [0035]
  • For example “Taïwan” and “Japon” are location names (LOCATION) appearing in the dictionary KB1. [0036]
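  • Dictionary-based recognition can be sketched as a lookup followed by XML-style tagging. This is a minimal illustration, assuming that learned entries simply overlay general ones and that longest matches win; neither policy is stated in the description. The learned entry and the names `build_dictionary` and `tag_text` are hypothetical.

```python
import re

KB1_GENERAL = {"Taïwan": "LOCATION", "Japon": "LOCATION"}
KB2_LEARNED = {"clubs de golf": "PRODUCT"}  # hypothetical learned entry

def build_dictionary(general, learned):
    """Sketch of KB3: general entries overlaid with learned, domain-specific
    ones (the override order is an assumption)."""
    merged = dict(general)
    merged.update(learned)
    return merged

def tag_text(text, dictionary):
    """Wrap each dictionary entry found in the text in an XML-style tag."""
    # Longest entries first so multi-word terms win over their sub-strings.
    entries = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(e) for e in entries) + r")\b"
    )

    def wrap(match):
        label = dictionary[match.group(0)]
        return f"<{label}>{match.group(0)}</{label}>"

    return pattern.sub(wrap, text)

KB3 = build_dictionary(KB1_GENERAL, KB2_LEARNED)
```

  • With these entries, `tag_text("des clubs de golf au Japon", KB3)` tags one PRODUCT and one LOCATION, while a word such as “Japonais” is left untouched because of the word-boundary check.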
  • The recognition will also use a grammar (KB4, 414), which itself is fed by general knowledge (KB1, 411) and learned knowledge (KB2, 412). For example, “Bridgestone Sports” and “Bridgestone Sports Taïwan” are recognized as instances of the entity COMPANY since they appear in the structure of two sentences as qualifiers of the word “compagnie” (meaning “company”). Likewise, “clubs de golf” and “clubs en acier et en bois-metal” are recognized as instances of the entity PRODUCT since they are respectively direct objects of the verb “produire” (“to produce”) and adjuncts of the verb “débuter” having the subject “production”. [0037]
  • Dictionary and grammar must be able to be combined to remove ambiguities. For example, the three words “Bridgestone Sports Taïwan” are recognized as belonging to the same instance of COMPANY, although “Bridgestone Sports” has already been recognized as an instance of COMPANY and “Taïwan” as an instance of LOCATION, both therefore belonging to the dictionary (KB2, 413). This is because no punctuation or preposition separates the two groups in the sentence. It follows that the sequence is a new term made up of the two previously recognized groups. [0038]
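  • The merging rule just described can be sketched over character spans. This is an illustration only: the span representation, the function name `merge_adjacent` and the choice that the merged entity keeps the label of its first group (as in the COMPANY example) are assumptions.

```python
def merge_adjacent(text, spans):
    """Merge consecutive entity spans separated only by whitespace.

    spans: list of (start, end, label) tuples sorted by start offset.
    The merged span keeps the label of its first group (an assumption
    matching the COMPANY + LOCATION -> COMPANY example).
    """
    merged = []
    for start, end, label in spans:
        if merged:
            prev_start, prev_end, prev_label = merged[-1]
            gap = text[prev_end:start]
            # A whitespace-only gap means no punctuation or preposition
            # separates the two groups.
            if gap and gap.strip() == "":
                merged[-1] = (prev_start, end, prev_label)
                continue
        merged.append((start, end, label))
    return merged

TEXT_A = "Bridgestone Sports Taïwan produira des clubs"
SPANS_A = [(0, 18, "COMPANY"), (19, 25, "LOCATION")]
TEXT_B = "Bridgestone Sports à Taïwan"
SPANS_B = [(0, 18, "COMPANY"), (21, 27, "LOCATION")]
```

  • In the first text the two groups collapse into one COMPANY span; in the second, the preposition “à” keeps them apart.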
  • Several types of algorithms will be used at this stage. These algorithms are implemented in the selection step (1000) represented in FIG. 3, more particularly at steps (1100) (“Selection of all instances and contexts of entities in TEXT”) and (1110) (“1st selection of relevant instances”). These steps, implemented automatically by the computer, that is without user intervention, are followed by a semi-automatic step (1120) (“2nd selection of relevant instances—Addition/Subtraction of relevant/non-relevant instances”) at which the user intervenes, in a step (1130), by selecting the instances/contexts of the entity which appear relevant. This step is displayed in the window (3300) of FIG. 5. By way of example, mention is made of: [0039]
  • the reuse of partial rules; the method described uses the elements already found and the grammar rules for recognizing proper names in order to extend the coverage of the initial system. Therefore this amounts to a case of explanation-based learning. The mechanism is based on grammar rules with the involvement of unknown words. For example, the grammar can recognize Mr Kassianov as being a name of a person even if Kassianov is an unknown word. The isolated instances of the word can henceforth be labeled as person name. The learning is in this case used as an inductive mechanism using knowledge from the system (the grammar rules) and the entities found beforehand (the set of positive examples) to improve performance; [0040]
  • the use of discourse structures; discourse structures are another source for acquiring knowledge, like enumerations, easily identifiable for example by the presence of a certain number of person names, separated by connectors (commas, subordination conjunction “and” or “or” etc.). For example, in the following sequence: <PERSON_NAME> Kassianov </PERSON_NAME>, <UNKNOWN> Kostine </UNKNOWN> and <PERSON_NAME> Primakov (/PERSON_NAME), Kostine is labeled as an unknown word. The system infers from the context (the word Kostine appears in an enumeration of person names) that the word Kostine refers to a person name, even though in this case it is an isolated person name which cannot be typed from the dictionary or from other instances in the text. [0041]
  • the management of conflicts between labeling strategies; these learning strategies lead to type conflicts, particularly when the dynamic typing has led to the assignment of a label to a word, which label contradicts the label contained in the dictionary or identified by another dynamic strategy. This is the case, for example, when a word recorded as a location name in the dictionary appears as a person name in an unambiguous instance of the text. Let us consider the following sequence: [0042]
  • @ Washington, an Exchange Ally, Seems
  • @ To Be Strong Candidate to Head SEC
  • @ . . .
  • <SO> WALL STREET JOURNAL (J), PAGE A2 </SO>
  • <DATELINE> WASHINGTON </DATELINE>
  • <TXT>
  • <p>
  • Consuela Washington, a longtime House staffer and an expert in securities laws, is a leading candidate to be chairwoman of the Securities and Exchange Commission in the Clinton administration.
  • </p>
  • It is clear that in this text Consuela Washington denotes a person. The first instance of the word Washington is more problematic, in that the only information allowing a choice to be made within the sentence is world knowledge, namely that it is generally a person who runs an organization.
  • To limit the scope of this type of problem and avoid the propagation of errors, the dynamic typing process is restricted, in the event of conflict, to the text being analyzed and does not extend to the corpus as a whole. A conflict arises when a word has received a label that contradicts a label recorded beforehand for this word in the dictionary, as is the case for the word Washington in the above example. For example, the system will label all isolated instances of Washington as person names in the above text, but if an isolated instance of the word Washington appears in the next text, the system will label it as a location name, in accordance with the dictionary. When more than one label has been found dynamically in the same text, an arbitrary choice is made.
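The first two learning strategies listed above (reuse of partial rules and use of discourse structures) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the `TITLES` set, the function names, and the requirement that an unknown word have a known person name on both sides of the enumeration are all hypothetical choices.

```python
TITLES = {"Mr", "Mrs", "Ms", "Dr"}   # hypothetical trigger words of the grammar rule
CONNECTORS = {",", "and", "or"}      # connectors that mark an enumeration


def learn_from_titles(tokens, known):
    """Partial rule: title + capitalised word is a person name, even if unknown."""
    learned = dict(known)
    for prev, tok in zip(tokens, tokens[1:]):
        if prev in TITLES and tok[:1].isupper() and tok not in learned:
            learned[tok] = "PERSON_NAME"
    return learned


def learn_from_enumeration(tokens, known):
    """Discourse structure: an unknown capitalised word enumerated with person names."""
    learned = dict(known)
    for i, tok in enumerate(tokens):
        if tok in learned or not tok[:1].isupper():
            continue
        left = (i >= 2 and tokens[i - 1] in CONNECTORS
                and learned.get(tokens[i - 2]) == "PERSON_NAME")
        right = (i + 2 < len(tokens) and tokens[i + 1] in CONNECTORS
                 and learned.get(tokens[i + 2]) == "PERSON_NAME")
        # Assumption: require person names on both sides to stay conservative.
        if left and right:
            learned[tok] = "PERSON_NAME"
    return learned


by_title = learn_from_titles(["Mr", "Kassianov", "spoke", "."], {})
by_enum = learn_from_enumeration(
    ["Kassianov", ",", "Kostine", "and", "Primakov"],
    {"Kassianov": "PERSON_NAME", "Primakov": "PERSON_NAME"},
)
```

Once a word such as Kassianov has been learned in this way, its isolated instances elsewhere in the same text can be looked up in the returned mapping.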
  • FIG. 3 illustrates the flowchart for conflict resolution in the typing of entities.
  • An example pseudocode implementing this function is given in Appendix 1.
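Independently of the Appendix 1 pseudocode, which is not reproduced here, the scoping rule for conflicts can be sketched as follows. The sketch assumes that dynamic labels are collected per document as lists and that the "arbitrary choice" is simply the first label found; `resolve_labels` and its parameters are hypothetical names.

```python
def resolve_labels(words, dictionary, dynamic):
    """Resolve labels for ONE document.

    dictionary -- corpus-wide labels; never modified here
    dynamic    -- word -> list of labels inferred inside this document only
    """
    resolved = {}
    for word in words:
        candidates = dynamic.get(word, [])
        if candidates:
            # A dynamic label wins inside this document; if several were
            # found, the choice is arbitrary (here: the first one).
            resolved[word] = candidates[0]
        else:
            resolved[word] = dictionary.get(word, "UNKNOWN")
    return resolved


dictionary = {"Washington": "LOCATION"}
# First text: "Consuela Washington" gave an unambiguous person reading.
first_text = resolve_labels(["Washington"], dictionary,
                            {"Washington": ["PERSON_NAME"]})
# Next text: nothing was learned dynamically, so the dictionary applies again.
next_text = resolve_labels(["Washington"], dictionary, {})
```

Because the dictionary is only read and never updated, a label learned in one text cannot propagate to the rest of the corpus.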
  • The extraction module (20) includes a third program (INT_EXT, 213) to identify the relations between the entities for which the relevant instances have been selected by the program (212). The FACT_DB window in FIG. 5 shows the relations which have been established between the entities of the TAG_TEXT window.
  • This module includes three main sub-modules, the flowchart of which is represented in FIG. 5.
  • In the selection step (1000) of the method as represented in FIG. 8, the identification of the relations between the entities is processed during steps (1310), (1320), (1330) and (1400). Step (1310) (1st identification of relevant relations between entities) is automatic. Step (1320) (2nd identification of relevant relations between entities—Addition/Subtraction of relevant/non-relevant relations) is semi-automatic and assumes a step (1330) of interaction with the user. Step (1400) feeds the database (FACT_DB, 80) with the selected entities and the identified relations. The entity and relation field names are managed automatically and the fields of the database are then filled with their instances. The database (80) can thus be operated by users who are not information processing specialists but who require structured information.
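Step (1400) can be illustrated with a minimal sketch that stores selected entities and identified relations in a relational database. The table layout, the column names and the use of SQLite are assumptions made for illustration; the application does not specify a schema for FACT_DB.

```python
import sqlite3


def feed_fact_db(conn, entities, relations):
    """Store (type, instance) entities and (name, arg1, arg2) relations."""
    conn.execute("CREATE TABLE IF NOT EXISTS entity "
                 "(id INTEGER PRIMARY KEY, type TEXT, instance TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS relation "
                 "(name TEXT, arg1 TEXT, arg2 TEXT)")
    conn.executemany("INSERT INTO entity (type, instance) VALUES (?, ?)", entities)
    conn.executemany("INSERT INTO relation (name, arg1, arg2) VALUES (?, ?, ?)",
                     relations)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory stand-in for FACT_DB (80)
feed_fact_db(
    conn,
    [("COMPANY", "Bridgestone Sports"), ("COMPANY", "une entreprise locale")],
    [("Association", "Bridgestone Sports", "une entreprise locale")],
)
relation_rows = conn.execute("SELECT name, arg1, arg2 FROM relation").fetchall()
entity_count = conn.execute("SELECT COUNT(*) FROM entity").fetchone()[0]
```

A user who is not an information processing specialist could then query such tables with ordinary database tools.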
  • The device according to the invention also includes a learning module (LEARN_MOD, 30) which cooperates with the extraction module (20). This module receives as input, asynchronously with respect to the operation of the module (20), a collection of texts belonging to a given domain (DOM_TEXT, 50). This asynchronous mode of operation allows the knowledge base KB2 (412), containing the domain-specific dictionary, and the knowledge base KB3 (413), containing the grammar rules specific to the same domain, to be built. It also enables relations that are characteristic of the domain, which are stored in a database KB5 (415), to be formulated.
  • The module (30) cooperates with the module (20) to enrich the knowledge bases (KB2, KB3, KB5), as illustrated generically in FIG. 8 and on a specific example in FIG. 9.
  • This module includes three main sub-modules, the sequencing flowchart of which is represented in FIG. 5: a morphosyntactic analysis sub-module, a sub-module for the linguistic analysis of elements of the form, and a form-filling sub-module. These sub-modules are sequenced as a cascade: the analysis supplied at one level is picked up and extended at the next level.
  • Morphosyntactic Analysis Sub-Module
  • The morphosyntactic analysis is made up of a tokenizer, a sentence splitter, an analyzer and a morphological labeler. In the example of FIG. 6, the annotations are presented in transducer form.
  • These modules are not specific to the extraction. They can be used in any other application requiring a conventional morphosyntactic analysis.
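The first stages of such a cascade can be sketched as follows. The regular expressions and the three coarse labels are deliberately naive stand-ins chosen for illustration; a real morphological analyzer and labeler would rely on a lexicon, which is not modeled here.

```python
import re


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Words are runs of word characters; every other symbol is its own token."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def label(token):
    """Toy stand-in for a morphological labeler (assumption, not a real tagger)."""
    if not token[0].isalnum():
        return "PUNCT"
    return "CAP" if token[0].isupper() else "LOW"


def analyze(text):
    """Cascade: each level consumes and extends the output of the previous one."""
    return [[(tok, label(tok)) for tok in tokenize(sent)]
            for sent in split_sentences(text)]


sentences = split_sentences("Mr Kassianov spoke on Friday. The company agreed.")
analysis = analyze("Mr Kassianov spoke on Friday. The company agreed.")
```

Each level only adds information to what it receives, which is what allows the same front end to serve applications other than extraction.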
  • Sub-Module for Local Linguistic Analysis for Identifying Information
  • The identification of elements of the form by linguistic analysis can be broken down into two steps: the first step, which is generic, analyzes named entities; the second step, which is specific to a given corpus, types the entities recognized previously and identifies the other elements needed to fill the form.
  • Named entities are linked by means of more specific extraction schemes, written as a set of transducers that assign a label to a sequence of lexical items. These rules exploit the morphosyntactic analysis carried out beforehand. An example transducer is given in FIG. 7.
  • From a sentence such as:
  • “La compagnie Bridgestone Sports a déclaré vendredi qu'elle avait créé une filiale commune à Taïwan avec une entreprise locale et une maison de commerce japonaise pour produire des clubs de golf à destination du Japon.” (“The company Bridgestone Sports stated on Friday that it had set up a joint subsidiary in Taiwan with a local firm and a Japanese trading house to produce golf clubs for the Japanese market.”)
  • this rule is used to infer the following relation:
  • Association(Bridgestone Sports, une entreprise locale).
  • The analysis, which is generic at the start, gradually focuses on certain characteristic elements of the text and transforms it into logical form.
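The inference of the Association relation from the example sentence can be sketched with a single linear pattern. The regular expression below is a hypothetical stand-in for one path through a transducer such as that of FIG. 7, which is not reproduced here; it works on raw text, whereas the rules of the application exploit the morphosyntactic analysis.

```python
import re

# Hypothetical pattern: <ORG> a déclaré ... créé une filiale commune ... avec <PARTNER>
JOINT_VENTURE = re.compile(
    r"(?P<org>(?:[A-Z]\w+\s)+)a déclaré\b.*?\bcréé une filiale commune\b.*?"
    r"\bavec (?P<partner>une \w+ \w+)"
)


def infer_association(sentence):
    """Return an ('Association', org, partner) triple, or None if no match."""
    match = JOINT_VENTURE.search(sentence)
    if match is None:
        return None
    return ("Association", match.group("org").strip(), match.group("partner"))


SENTENCE = ("La compagnie Bridgestone Sports a déclaré vendredi qu'elle avait créé "
            "une filiale commune à Taïwan avec une entreprise locale et une maison "
            "de commerce japonaise pour produire des clubs de golf à destination "
            "du Japon.")
relation = infer_association(SENTENCE)
```

The named groups play the role of the labels that a transducer would assign to the matched sequences of lexical items.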
  • Extraction-Form-Filling Sub-Module
  • The last step simply retrieves the relevant information from the document and inserts it into an extraction form. The partial results are merged into a single form per document.
  • An example pseudocode implementing these functions is given in Appendix 2.
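Independently of the Appendix 2 pseudocode, which is not reproduced here, the merging of partial results into one form per document can be sketched as follows. The slot names and the choice to keep each slot as a duplicate-free list are assumptions made for illustration.

```python
def merge_into_form(doc_id, partial_results, slots):
    """Merge partial extraction results into a single form for one document."""
    form = {"doc_id": doc_id}
    form.update({slot: [] for slot in slots})
    for partial in partial_results:
        for slot, value in partial.items():
            # Ignore slots the form does not declare; skip repeated values.
            if slot in slots and value not in form[slot]:
                form[slot].append(value)
    return form


form = merge_into_form(
    "doc-1",
    [{"COMPANY": "Bridgestone Sports"},
     {"LOCATION": "Taïwan"},
     {"COMPANY": "Bridgestone Sports"}],  # found again later in the same text
    slots=("COMPANY", "LOCATION"),
)
```

Here the company found twice in the same document appears only once in the merged form.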
  • The algorithms for selecting relevant entities are enhanced during step (1120) through interaction with the user (1130), who selects the relevant and the non-relevant contexts of the instances of the entities. The new parameters of the algorithms are generated during step (2100), then stored during step (2200).
  • The algorithms for identifying relevant relations are enhanced during step (1320) through interaction with the user (1330), who identifies the relevant and the non-relevant relations. The new parameters of the algorithms are generated during step (2300), then stored during step (2400).
  • The mechanisms of steps (1120) and (1130) are illustrated by an example in FIG. 5.
  • 1. Window (3100): the user supplies a semantic class to the system, for example speech verbs: “affirmer” (to affirm), “déclarer” (to declare), “dire” (to say), etc.
  • 2. Window (3200): this semantic class is projected onto the corpus (DOM_TEXT, 50) in order to gather all the contexts in which a given expression appears. In the speech verb example, this step ends with the formation of a list of all the contexts in which the verbs “affirmer” (to affirm), “déclarer” (to declare), “dire” (to say), etc. appear.
  • 3. Window (3300): from the proposed contexts, the user distinguishes those which are relevant from those which are not (such as the third item of the list).
  • 4. Window (3400): the system uses the list of examples marked positive and negative to generate, from a body of domain knowledge (essentially linguistic rules), a state machine covering most of the contexts marked positively while excluding those marked negatively.
  • A transducer describes a linguistic expression and is generally read from left to right. Each box describes a linguistic item and is linked to the next element by a line. A linguistic item can be a character string (que, de), a lemma (<avoir> may equally well denote the form a, the form avait or the form aurons), a syntactic category (<V> denotes any verb), or a syntactic category accompanied by semantic features (<N+ProperName> denotes, within nouns, only proper names). The grayed elements (à_obj) denote a call to a complex structure described in another transducer (recursion). The elements that are searched for are enclosed between the tags <key> and </key>, which are introduced for later processing.
  • 5. Window (3500): the user outputs the resulting state machine and, if necessary, makes slight alterations. The learning corpus first undergoes preprocessing aimed at eliminating non-essential complements. This step is performed by projecting onto the text (TEXT, 10), in delete mode, the fixed adverb dictionaries and the grammars designed to identify adjunct elements (projecting a state machine in delete mode yields a text from which the sequences recognized by the state machine have been deleted). The knowledge base state machines are then, in turn, projected onto the database of examples. Two state machines (3510, 3520) emerge from the linguistic knowledge database. The states (3511, 3521) of these state machines call on sub-graphs, using indications supplied by the functional labeling, to recognize indirect objects introduced by the preposition “à” (3511) and inverted subjects (3521).
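The projection in delete mode used for preprocessing can be illustrated by a minimal sketch. It assumes that the fixed adverb dictionary is a plain list of expressions and replaces the state machine by a regular expression; the `ADVERBIALS` list and the function name are hypothetical.

```python
import re

ADVERBIALS = ["hier", "vendredi", "sans doute"]  # hypothetical fixed-adverb dictionary


def project_delete(text, expressions):
    """Return the text with every recognized expression deleted."""
    # Longest expressions first, so multi-word entries win over their parts.
    ordered = sorted(expressions, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(e) for e in ordered) + r")\b",
        re.IGNORECASE,
    )
    stripped = pattern.sub("", text)
    return re.sub(r"\s+", " ", stripped).strip()  # tidy the leftover spacing


cleaned = project_delete(
    "La compagnie a déclaré vendredi qu'elle avait créé une filiale",
    ADVERBIALS,
)
```

Removing such adjuncts before learning keeps non-essential elements out of the contexts from which new rules are generated.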
  • This strategy enables the coverage of new positive contexts, illustrated in the window (3600).
  • The state machine leads to the structure represented in the window (3700). This master state machine is inferred from the examples database for the recognition of speech verbs. The inferred state machine is complex: it covers the examples database and will feed the extraction system.
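The idea of covering the positively marked contexts while excluding the negatively marked ones can be sketched with a greedy selection over candidate patterns. This is a simplified stand-in, not the state machine inference of the application: the candidate regular expressions play the role of the linguistic rules of the knowledge base, and the example contexts are invented for illustration.

```python
import re


def select_patterns(candidates, positives, negatives):
    """Greedily pick patterns that match no negative and cover the positives."""
    safe = [p for p in candidates
            if not any(re.search(p, neg) for neg in negatives)]
    uncovered = set(positives)
    chosen = []
    while uncovered and safe:
        best = max(safe, key=lambda p: sum(bool(re.search(p, c)) for c in uncovered))
        gained = {c for c in uncovered if re.search(best, c)}
        if not gained:
            break  # no remaining pattern covers anything new
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


CANDIDATES = [
    r"\b(a|ont) (déclaré|affirmé|dit) que\b",        # verb + "que" clause
    r"\b(a|ont) (déclaré|affirmé|dit) (le|la|les) \w+",  # inverted subject
    r"dire",                                          # too general on purpose
]
POSITIVES = ["Le ministre a déclaré que la loi passera",
             "La loi passera, a affirmé le président"]
NEGATIVES = ["c'est-à-dire que rien ne change"]

chosen, uncovered = select_patterns(CANDIDATES, POSITIVES, NEGATIVES)
```

In this example the over-general pattern is rejected because it matches the negative context, and the two remaining patterns together cover both positive contexts.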

Claims (18)

1. A device for extracting information from a text (10) comprising an extraction module (20) and a learning module (30) cooperating with each other comprising means (212) for automatically selecting in the text (10) the contexts of instance of classes/entities of information to be extracted, for automatically selecting from these contexts those which are relevant for a domain and for enabling the user to modify this latter selection in a manner such that the learning module (30) will improve the next output (70, 80) of the extraction module (20), characterized in that the extraction module (20) additionally comprises means (213) for identifying relations existing in the text (10) between the relevant entities at the output of the means (212).
2. The information extraction device as claimed in claim 1, characterized in that the selection module (20) comprises a program (211) able to recognize the structure of the text (10).
3. The information extraction device as claimed in claim 1 or claim 2, characterized in that the selection module (20) simultaneously applies rules defined a priori and rules calculated by the learning module (30).
4. The information extraction device as claimed in one of the preceding claims, characterized in that the selection module (20) is able to automatically apply similarity rules inferred from the context.
5. The information extraction device as claimed in one of the preceding claims, characterized in that the learning module (30) and the selection module (20) are able to manage homonyms belonging to different classes/entities.
6. The information extraction device as claimed in one of the preceding claims, characterized in that the learning module (30) is capable of not generating new rules from non-essential elements.
7. The information extraction device as claimed in one of the preceding claims, characterized in that the learning module (30) is able to generate new rules from positive selections and from negative selections made by the user.
8. The information extraction device as claimed in one of the preceding claims, characterized in that the outputs of the selection module can be arranged in a file or a database.
9. The information extraction device as claimed in one of the preceding claims, characterized in that the vocabulary and grammar of the domain are represented by finite state machines.
10. The information extraction device as claimed in the preceding claim, characterized in that the finite state machines are represented in the form of graphs to the user.
11. A method for extracting information from a text (10) comprising a learning process (2000) and a selection process (1000), said selection process comprising a step (1100) of automatic selection in the text of contexts of instance of classes/entities of the information to be extracted, a step (1110) of automatic selection from these contexts of those which are relevant for a domain and a step (1130) of modification by the user of outputs of the previous step, the modified outputs being taken into account in the learning process (2000) to improve the next result of the selection process (1000), characterized in that the selection process (1000) additionally comprises steps (1310, 1320, 1330) to identify the relations existing in the text (10) between the relevant entities at the output of the steps (1120, 1130) of the selection process (1000).
12. The information extraction method as claimed in claim 11, characterized in that the selection process (1000) comprises a step for recognizing the structure of the text (10).
13. The information extraction method as claimed in claim 11 or claim 12, characterized in that the selection process (1000) simultaneously applies rules defined a priori and rules calculated by the learning module (30).
14. The information extraction method as claimed in one of claims 11 to 13, characterized in that the selection process (1000) can include the automatic application of similarity rules inferred from the context.
15. The information extraction method as claimed in one of claims 11 to 14, characterized in that the learning process (2000) and the selection process (1000) enable the management of homonyms belonging to different classes.
16. The information extraction method as claimed in one of claims 11 to 15, characterized in that the learning process (2000) is capable of not generating new rules from non-essential elements.
17. The information extraction method as claimed in one of claims 11 to 16, characterized in that the learning process (2000) is able to generate new rules from positive selections and from negative selections made by the user.
18. The information extraction method as claimed in one of claims 11 to 16, characterized in that the outputs of the selection process (1000) can be arranged in a file or a database (80).
US10/467,937 2001-02-20 2002-02-19 Device for retrieving data from a knowledge-based text Abandoned US20040073874A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
FR01/02270 2001-02-20
FR0102270A FR2821186B1 (en) 2001-02-20 2001-02-20 KNOWLEDGE-BASED TEXT INFORMATION EXTRACTION DEVICE
PCT/FR2002/000631 WO2002067142A2 (en) 2001-02-20 2002-02-19 Device for retrieving data from a knowledge-based text

Publications (1)

Publication Number Publication Date
US20040073874A1 true US20040073874A1 (en) 2004-04-15

Family

ID=8860217

Family Applications (1)

Application Number Title Priority Date Filing Date
US10/467,937 Abandoned US20040073874A1 (en) 2001-02-20 2002-02-19 Device for retrieving data from a knowledge-based text

Country Status (4)

Country Link
US (1) US20040073874A1 (en)
EP (1) EP1364316A2 (en)
FR (1) FR2821186B1 (en)
WO (1) WO2002067142A2 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ATE531019T1 (en) 2008-01-21 2011-11-15 Thales Nederland Bv SECURITY AND SECURITY SYSTEM AGAINST MULTIPLE THREATS AND DETERMINATION PROCEDURES THEREFOR

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5841895A (en) * 1996-10-25 1998-11-24 Pricewaterhousecoopers, Llp Method for learning local syntactic relationships for use in example-based information-extraction-pattern learning
US6076088A (en) * 1996-02-09 2000-06-13 Paik; Woojin Information extraction system and method using concept relation concept (CRC) triples
US6965857B1 (en) * 2000-06-02 2005-11-15 Cogilex Recherches & Developpement Inc. Method and apparatus for deriving information from written text

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1072986A3 (en) * 1999-07-30 2004-10-27 Academia Sinica System and method for extracting data from semi-structured text



Also Published As

Publication number Publication date
EP1364316A2 (en) 2003-11-26
FR2821186A1 (en) 2002-08-23
FR2821186B1 (en) 2003-06-20
WO2002067142A2 (en) 2002-08-29
WO2002067142A3 (en) 2003-02-13

Similar Documents

Publication Publication Date Title
US20040073874A1 (en) Device for retrieving data from a knowledge-based text
JP4467184B2 (en) Semantic analysis and selection of documents with knowledge creation potential
US8060357B2 (en) Linguistic user interface
US8832064B2 (en) Answer determination for natural language questioning
US6904429B2 (en) Information retrieval apparatus and information retrieval method
US20150081277A1 (en) System and Method for Automatically Classifying Text using Discourse Analysis
US20040117352A1 (en) System for answering natural language questions
JP2003085190A (en) Method and system for segmenting and discriminating event in image using voice comment
JP2012520527A (en) Question answering system and method based on semantic labeling of user questions and text documents
KR20120001053A (en) System and method for analyzing document sentiment
US20160110415A1 (en) Using question answering (qa) systems to identify answers and evidence of different medium types
JPWO2008023470A1 (en) SENTENCE UNIT SEARCH METHOD, SENTENCE UNIT SEARCH DEVICE, COMPUTER PROGRAM, RECORDING MEDIUM, AND DOCUMENT STORAGE DEVICE
JP2001075966A (en) Data analysis system
CN110609983A (en) Structured decomposition method for policy file
CN112380866A (en) Text topic label generation method, terminal device and storage medium
JP5291351B2 (en) Evaluation expression extraction method, evaluation expression extraction device, and evaluation expression extraction program
Begum et al. Analysis of legal case document automated summarizer
CN116090450A (en) Text processing method and computing device
JP4428703B2 (en) Information retrieval method and system, and computer program
JP7216627B2 (en) INPUT SUPPORT METHOD, INPUT SUPPORT SYSTEM, AND PROGRAM
JPH11259524A (en) Information retrieval system, information processing method in information retrieval system and record medium
JP2001325104A (en) Method and device for inferring language case and recording medium recording language case inference program
CN117633346A (en) Label labeling method, information display method and terminal
Loh et al. Opinion extraction from customer reviews
WO2020054465A1 (en) Problem solution assistance device and method therefor

Legal Events

Date Code Title Description
AS Assignment
Owner name: THALES, FRANCE
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: POIBEAU, THIERRY; SEDOGBO, CELESTIN; REEL/FRAME: 014812/0405
Effective date: 2003-07-25

STCB Information on status: application discontinuation
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION