WO2016194054A1 - Information extraction system, information extraction method, and recording medium - Google Patents

Information extraction system, information extraction method, and recording medium Download PDF

Info

Publication number
WO2016194054A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
filter
word
information
words
Prior art date
Application number
PCT/JP2015/065594
Other languages
French (fr)
Japanese (ja)
Inventor
太亮 尾崎
真 岩山
彬 童
義行 小林
高橋 寿一
新庄 広
Original Assignee
株式会社日立製作所 (Hitachi, Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社日立製作所 (Hitachi, Ltd.)
Priority to PCT/JP2015/065594 priority Critical patent/WO2016194054A1/en
Priority to JP2017521323A priority patent/JP6334062B2/en
Publication of WO2016194054A1 publication Critical patent/WO2016194054A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis

Definitions

  • There are analysis systems that extract information described in a target document in a machine-processable form and perform analysis on various target documents. For example, if so-called proper names such as the manufacturer name, product name, and series name can be extracted from a shopping website that is the target document, the analysis system can perform analyses such as compiling product information statistics for each manufacturer.
  • Patent Document 1 states that "the excerpt unit 101 obtains an excerpt document by excerpting, from the original document, characters that should be displayed relatively large on the screen on which the original document is displayed. When the amount of the excerpt document to be displayed on the screen does not fit within a predetermined amount, the correction unit 103 corrects the relative size criterion by which the excerpt unit 101 excerpts characters" (see abstract).
  • the analysis system extracts information from a non-standard document using, for example, a dictionary or a plurality of templates prepared in advance.
  • However, for atypical documents, appropriate templates for all documents cannot always be prepared in advance, and a dictionary of the words to be extracted is not always readily available.
  • Patent Document 1 discloses an information extraction method based on the display size of text on a website, but there is a problem in that the information necessary for the user is not always described in an appropriate display size in the target document.
  • One aspect of the present invention aims to extract information required by a user with high accuracy from a variety of non-standard documents such as websites and document images, without depending on a dictionary prepared in advance or a logical structure such as HTML.
  • An information extraction system for extracting information from a target document includes a processor that executes a program and a memory accessed by the processor. The processor performs an information extraction process in which it receives input of target information indicating a set of character strings to be extracted; extracts from the target document target expressions, each of which is a character string that matches any of the character strings included in the target information, and neighboring words, which are words arranged within a predetermined distance of each of the target expressions; generates a filter using unsupervised learning based on the appearance frequency of each of the neighboring words in the target document or the coordinates of each of the target expressions in the target document; applies the filter to a filter application target word set including the neighboring words; and outputs the extraction target word set obtained by applying the filter to the filter application target word set.
  • One embodiment of the present invention can extract information required by a user with high accuracy from various non-standard documents without depending on a dictionary prepared in advance or a logical structure such as HTML.
  • FIG. 1 is a block diagram illustrating an example of the overall configuration of the information extraction system in Embodiment 1.
  • FIG. 2A is a diagram illustrating an example of a shopping website, which is an example of a target document in Embodiment 1.
  • FIG. 2B is a diagram illustrating an example of a document image, which is an example of a target document in Embodiment 1.
  • FIG. 3 is a diagram illustrating an example of the data storage method of the storage unit in Embodiment 1.
  • FIG. 4 is a flowchart illustrating a first example of target selection processing in Embodiment 1.
  • FIG. 5 is a flowchart illustrating a second example of target selection processing in Embodiment 1.
  • FIG. 6 is a diagram illustrating an example of a target selection result in Embodiment 1.
  • FIG. 7 is a block diagram illustrating a configuration example of the filter unit in Embodiment 1.
  • FIG. 8 is a flowchart illustrating a first example of filter learning processing in Embodiment 1.
  • FIG. 9 is a flowchart illustrating a first example of filter application processing in Embodiment 1.
  • FIG. 10 is a diagram illustrating an example of a filter application result in Embodiment 1.
  • FIG. 11 is a flowchart illustrating a second example of filter learning processing in Embodiment 1.
  • FIG. 12 is a flowchart illustrating a second example of filter application processing in Embodiment 1.
  • FIG. 13 is a flowchart illustrating a third example of filter learning processing in Embodiment 1.
  • FIG. 14 is a flowchart illustrating a third example of filter application processing in Embodiment 1.
  • FIG. 15 is a diagram illustrating a first example of a user interface in Embodiment 1.
  • FIG. 16 is a diagram illustrating a second example of a user interface in Embodiment 1.
  • FIG. 17 is a block diagram illustrating an example of the overall configuration of the information extraction system in Embodiment 2.
  • The present embodiment describes an information extraction system that extracts information from a target document.
  • When the information extraction system receives from the user input of target information indicating a set of character strings to be extracted, it extracts from the target document target expressions, which are character strings that match any of the character strings included in the target information, and neighboring words, which are words located physically close to each of the target expressions.
  • By acquiring neighboring words, the information extraction system can broadly obtain, without using a dictionary or the like, not only the target expressions directly specified by the user as extraction targets but also information related to the target expressions that may be necessary for the user.
  • the information extraction system generates a filter using unsupervised learning based on the appearance frequency of each nearby word in the target document or the coordinates of each of the target expressions in the target document.
  • By applying the generated filter to a filter application target word set including the neighboring words, the information extraction system can remove neighboring words that are unnecessary for the user without using a dictionary or the like; that is, it can obtain the information the user needs with high accuracy.
  • FIG. 1 shows a configuration example of an information extraction system.
  • the information extraction system 101 includes, for example, a computer having a processor (CPU) 111, a memory 112, an auxiliary storage device 113, and a communication interface 114.
  • the processor 111 executes a program stored in the memory 112.
  • the memory 112 includes a ROM that is a nonvolatile storage element and a RAM that is a volatile storage element.
  • the ROM stores an immutable program (for example, BIOS).
  • the RAM is a high-speed and volatile storage element such as a DRAM (Dynamic Random Access Memory), and temporarily stores a program executed by the processor 111 and data used when the program is executed.
  • The auxiliary storage device 113 is a large-capacity non-volatile storage device such as a magnetic storage device (HDD) or a flash memory (SSD), for example, and stores programs executed by the processor 111 and data used when the programs are executed. That is, a program is read from the memory 112 or the auxiliary storage device 113, loaded into the memory 112, and executed by the processor 111.
  • the information extraction system 101 may include an input interface 115 and an output interface 118.
  • the input interface 115 is an interface to which a keyboard 116, a mouse 117, and the like are connected and receives input from the user.
  • the output interface 118 is an interface to which a display device 119, a printer, or the like is connected, and the execution result of the program is output in a format that can be viewed by the user.
  • the communication interface 114 is a network interface device that controls communication with other devices according to a predetermined protocol.
  • the communication interface 114 includes a serial interface such as USB, for example.
  • A program executed by the processor 111 may be provided to the information extraction system 101 via a removable medium (a computer-readable, portable, non-transitory storage medium such as a CD-ROM or flash memory) or via a network, and may be stored in the nonvolatile auxiliary storage device 113, which is a non-transitory storage medium. For this reason, the information extraction system 101 may have an interface for reading data from a removable medium.
  • The information extraction system 101 is a computer system configured physically on a single computer or on a plurality of logically or physically configured computers; it may operate in separate threads on the same computer, or may operate on a virtual machine constructed on a plurality of physical computer resources.
  • the information extraction system 101 receives input of the target document 102 and the target information 109 via the input interface 115 or the communication interface 114, for example.
  • the target document 102 may be, for example, a document image or a website described in HTML, CSS, or the like.
  • the document image indicates an image obtained by digitizing a document printed on a medium such as paper.
  • the target information 109 indicates information of a character string set serving as a base point for information extraction, and is designated by the user.
  • The target information 109 is, for example, information including at least one of a regular expression, a word, a sentence, a sentence including a wildcard, a part of speech, a target document ID, and a target sentence ID. A character-string pattern containing wildcard characters such as '?' or '*' is an example of a wildcard, and a pattern written with metacharacters and quantifiers such as '\d' and '{2,4}' is an example of a regular expression.
  • the information extraction system 101 extracts information specified by the target information 109 and information based on the target information 109 from the target document 102.
  • The memory 112 includes, for example, a sentence extraction unit 103, a coordinate extraction unit 104, a target selection unit 106, and a result generation unit 108, which are programs.
  • the memory 112 includes an accumulation unit 105 that is an area for storing data. Further, the memory 112 includes a filter unit 107 including an area for storing data and a program.
  • the processor 111 operates as a functional unit that realizes a predetermined function by operating according to a program.
  • the processor 111 functions as a sentence extracting unit by operating according to the sentence extracting unit 103, and functions as a coordinate extracting unit by operating according to the coordinate extracting unit 104.
  • the processor 111 also operates as a functional unit that realizes each of a plurality of processes executed by each program.
  • a computer and a computer system are an apparatus and a system including these functional units.
  • the sentence extraction unit 103 extracts a sentence from each of the input target documents 102.
  • The sentence in the present embodiment indicates each character string, composed of one or more characters, obtained by dividing the character string consisting of all characters included in the target document 102 according to a predetermined rule; it is a concept that does not necessarily coincide with a grammatical sentence.
  • For example, a character string delimited by predetermined characters or symbols, such as a Japanese full stop, a Japanese comma, a comma, a period, or a space, is an example of a sentence.
  • the grammatical sentence included in the target document 102 is an example of the sentence of this embodiment.
  • Each word included in the target document 102 is an example of a sentence.
  • the sentence extraction unit 103 assigns a document ID to each input target document 102 and a sentence ID to each extracted sentence.
  • the coordinate extracting unit 104 extracts coordinate information of each sentence extracted by the sentence extracting unit 103.
  • the coordinate information is represented by, for example, coordinates on the paper surface of the target document 102 or a display device.
  • the coordinates of the two vertices forming the diagonal of the minimum-size rectangle surrounding the entire sentence are an example of the coordinate information of the sentence.
  • One of the sentence extraction unit 103 or the coordinate extraction unit 104 assigns a document ID to the input target document.
  • the sentence extraction unit 103 and the coordinate extraction unit 104 include, for example, a web browser rendering function and an OCR function.
  • the storage unit 105 holds, for example, information indicating the correspondence between the document ID of the target document 102, the extracted sentence, and the sentence ID and coordinate information of the extracted sentence.
  • The target selection unit 106 refers to the information held in the storage unit 105, selects the sentences that match the target information 109, the coordinates of the matching sentences, and the neighboring words of the matching sentences, and transmits the selected sentences, coordinates, and neighboring words to the filter unit 107. Neighboring words will be described later.
  • a sentence that matches the target information 109 selected by the target selection unit 106 is referred to as a target expression.
  • The filter unit 107 removes, from the selected sentences, coordinates, and neighboring words, those that are not to be extracted, and transmits the remaining sentences, coordinates, and neighboring words to the result generation unit 108.
  • the result generation unit 108 outputs the sentence, coordinates, and neighborhood words received from the filter unit 107 in an appropriate format as the information extraction result 110 via the output interface 118.
  • the result generation unit 108 may store the information extraction result 110 in the storage unit 105 as sentence data to be described later with an appropriate document ID.
  • the information extraction system 101 can appropriately output the information extraction result 110 based on the target information 109 input from the user with the above-described configuration. Further, the information extraction system 101 can extract information again from the information extraction result 110 based on the newly set target information 109.
  • FIG. 2A shows an example of a shopping website, which is an example of the target document 102.
  • the shopping website in FIG. 2A lists a plurality of products of the same type, and describes different product information (manufacturer, unique name, price, etc.) for each product.
  • The sentence extraction unit 103 and the coordinate extraction unit 104 extract the sentences and their coordinates using, for example, the rendering function of a web browser.
  • FIG. 2B shows an example of a document image, which is an example of the target document 102.
  • In the document image, the stone name, depth, and details are displayed in various layouts.
  • The sentence extraction unit 103 and the coordinate extraction unit 104 extract the sentences and their coordinates using, for example, the OCR function.
  • FIG. 3 shows an example of a data management method in the storage unit 105.
  • the sentence data 300 is data accumulated by a method called “Key Value Store (KVS)”.
  • the sentence data 300 includes a document ID 301, a sentence ID 302, and sentence information 303.
  • the document ID 301 is information that uniquely identifies the target document 102.
  • the sentence ID 302 is information for uniquely identifying a sentence in each target document.
  • the sentence information 303 includes a sentence with a corresponding sentence ID and annotation information of the sentence.
  • the coordinate information of the sentence and the font information included in the sentence are examples of annotation information.
  • the value for a desired key can be held in a plurality of layers in this way. For example, when a desired document ID or sentence ID is given, the information extraction system 101 can output a corresponding sentence. For example, when only the document ID is given, the information extraction system 101 can output a list of corresponding sentence IDs.
  • the information used by the information extraction system 101 may be expressed in any data structure without depending on the data structure.
  • a data structure appropriately selected from a table, list, database or queue can store the information.
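As an illustration of the layered key-value layout described above, the following Python sketch (not part of the patent; the document IDs, sentence IDs, field names, and annotation keys are invented for the example) stores sentence information under a document ID and a sentence ID and supports the two lookups mentioned above.

```python
# A minimal sketch of the nested key-value layout: document ID -> sentence ID -> sentence info.
# All concrete values and field names are illustrative.
from typing import Any, Dict, List

SentenceData = Dict[str, Dict[str, Dict[str, Any]]]

sentence_data: SentenceData = {
    "doc-001": {
        "s-0001": {
            "text": "ACME Widget Pro 3000",
            "coords": ((120, 80), (430, 102)),   # two diagonal vertices of the bounding rectangle
            "font": {"size": 14, "bold": True},  # example annotation information
        },
        "s-0002": {
            "text": "Price: 19,800 yen",
            "coords": ((120, 110), (300, 128)),
            "font": {"size": 10, "bold": False},
        },
    },
}

def get_sentence(data: SentenceData, doc_id: str, sent_id: str) -> str:
    """Given a document ID and a sentence ID, return the corresponding sentence."""
    return data[doc_id][sent_id]["text"]

def list_sentence_ids(data: SentenceData, doc_id: str) -> List[str]:
    """Given only a document ID, return the list of corresponding sentence IDs."""
    return list(data[doc_id].keys())

print(get_sentence(sentence_data, "doc-001", "s-0001"))  # -> ACME Widget Pro 3000
print(list_sentence_ids(sentence_data, "doc-001"))       # -> ['s-0001', 's-0002']
```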
  • FIG. 4 shows an example of a selection method using regular expressions by the object selection unit 106.
  • the target selection unit 106 receives input of target information 109 including a document ID, a sentence ID, and a regular expression (S401).
  • the target information 109 may not include the document ID and the sentence ID.
  • The target selection unit 106 extracts, from the sentence data 300 of the storage unit 105, the target sentences corresponding to the document IDs and sentence IDs included in the target information 109, and checks whether each target sentence includes an expression that matches the regular expression included in the target information 109, that is, a target expression (S402).
  • The target selection unit 106 may also, in accordance with a user instruction, extract the target sentences again from data obtained by deleting from the sentence data 300 the words that are not included in the extraction target words generated by the result generation unit 108. Thereby, the information extraction system 101 can further apply a filter to data that has already been filtered once, and can improve the accuracy of information extraction.
  • When the target information 109 does not include a document ID or sentence ID, the target selection unit 106 extracts all sentences included in the sentence data 300 as target sentences.
  • When no target sentence includes a target expression (S402: no), the process ends.
  • When a target sentence including a target expression exists, that is, when there is a target sentence containing an expression that matches the target information 109 (S402: yes), the target selection unit 106 acquires the target expression, the coordinates of the target expression, and the neighboring words of the target expression, includes the acquired information together with, for example, the sentence ID and document ID of the sentence containing the target expression in a target selection result data block, and outputs the block to the filter unit 107 (S403).
  • the target selection result data block will be described later.
  • the minimum-size rectangular coordinates surrounding the target expression and the minimum-size rectangular coordinates surrounding the entire target sentence including the target expression are examples of coordinates output by the target selection unit 106 in step S403.
  • the neighborhood word of the target expression indicates a word that exists in a position close to the target expression in coordinates in the document.
  • the target selection unit 106 acquires a predetermined number of words within a predetermined distance from the target expression as neighborhood words of the target expression.
  • the target selection unit 106 can acquire, for example, words that are necessary for the user and are unknown to the user by acquiring the neighborhood word.
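The sketch below illustrates the kind of regex-based selection and neighboring-word acquisition described for FIG. 4 (S401-S403). It is a simplified stand-in rather than the patent's implementation: the input layout matches the toy sentence_data sketched earlier, and the Euclidean distance between rectangle centres and the cut-off values are illustrative assumptions.

```python
# Simplified target selection: find sentences matching a regular expression (target
# expressions) and collect words from sentences whose rectangles lie nearby.
import math
import re
from typing import Dict, List, Tuple

Rect = Tuple[Tuple[float, float], Tuple[float, float]]

def center(rect: Rect) -> Tuple[float, float]:
    (x1, y1), (x2, y2) = rect
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def select_targets(sentences: Dict[str, dict], pattern: str,
                   max_dist: float = 150.0, max_neighbors: int = 5) -> List[dict]:
    """Return records holding the matched target expression, its coordinates,
    and words of nearby sentences (neighboring words)."""
    regex = re.compile(pattern)
    results = []
    for sent_id, info in sentences.items():
        match = regex.search(info["text"])
        if not match:
            continue                      # S402: this sentence holds no target expression
        cx, cy = center(info["coords"])
        neighbors: List[str] = []
        for other_id, other in sentences.items():
            if other_id == sent_id:
                continue
            ox, oy = center(other["coords"])
            if math.hypot(ox - cx, oy - cy) <= max_dist:
                neighbors.extend(other["text"].split())
        results.append({                  # S403: one entry of the target selection result
            "sentence_id": sent_id,
            "target_expression": match.group(0),
            "coords": info["coords"],
            "neighbor_words": neighbors[:max_neighbors],  # a predetermined number of nearby words
        })
    return results

# Usage with the toy data from the previous sketch:
#   select_targets(sentence_data["doc-001"], r"\d[\d,]*\s*yen")
```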
  • FIG. 5 shows an example of a selection method using the part of speech by the object selection unit 106.
  • the target selection unit 106 receives input of target information 109 including a document ID, a sentence ID, and a part of speech (S501).
  • The target selection unit 106 extracts from the storage unit 105 the target sentences corresponding to the document IDs and sentence IDs included in the target information 109, and checks whether each target sentence contains a word whose part of speech matches the part of speech included in the target information (S502).
  • Note that the target information 109 need not include a document ID or sentence ID, and the target selection unit 106 may extract the target sentences from sentence data 300 generated by the result generation unit 108.
  • When no target sentence contains a matching word (S502: no), the process ends.
  • When there is a target sentence containing a word that matches the target information 109 (S502: yes), the target selection unit 106 acquires the target expression, its coordinates, and the neighboring words of the target expression, includes the acquired information together with the sentence ID and document ID of the sentence containing the target expression in a target selection result data block, and outputs the block to the filter unit 107 (S503).
  • the target selection unit 106 may perform word recognition and part-of-speech identification using, for example, a general morphological analysis method.
  • the target selecting unit 106 may extract, for example, a target expression within a predetermined number from the top in the target sentence. Alternatively, all target expressions included in the target sentence may be extracted.
  • Although FIG. 4 shows an example of target selection using regular expressions and FIG. 5 shows an example using parts of speech, the target selection unit 106 can perform selection in the same way using target information 109 that includes wildcards, words, and the like.
  • The target selection unit 106 may appropriately select a target by combining a plurality of types of target information 109 using, for example, a logical sum or a logical product. Specifically, for example, the target selection unit 106 may extract a target expression that matches a specific regular expression and/or includes a specific part of speech.
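As a small illustration of combining several kinds of target information with a logical sum or logical product, the sketch below (hypothetical helper names; the part-of-speech pairs are assumed to come from some morphological analyzer) checks a regular-expression condition and a part-of-speech condition and combines them with OR or AND.

```python
# Combining two target-information conditions with OR / AND; names are illustrative.
import re
from typing import List, Tuple

def matches_regex(text: str, pattern: str) -> bool:
    return re.search(pattern, text) is not None

def contains_pos(words_with_pos: List[Tuple[str, str]], pos: str) -> bool:
    # words_with_pos: (word, part-of-speech) pairs produced by a morphological analyzer.
    return any(p == pos for _, p in words_with_pos)

def is_target(text: str, words_with_pos: List[Tuple[str, str]],
              pattern: str, pos: str, mode: str = "or") -> bool:
    a = matches_regex(text, pattern)
    b = contains_pos(words_with_pos, pos)
    return (a or b) if mode == "or" else (a and b)
```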
  • FIG. 6 shows an example of a target selection result data block generated by the target selection unit 106.
  • the target selection result data block 600 includes, for example, a document ID 601, a target expression ID 602, and target expression information 603, and is data accumulated by, for example, the KVS method.
  • the document ID 601 is information that uniquely identifies the target document 102.
  • the target expression ID 602 is information for uniquely identifying the target expression in the target document 102, and is given by, for example, the target selection unit 106.
  • the target expression information 603 is information related to the target expression, and includes, for example, the target expression, neighborhood words, and coordinates.
  • FIG. 7 shows a configuration example of the filter unit 107.
  • the filter unit 107 includes, for example, a filter learning unit 702 and a filter application unit 704 that are programs, and a filter model storage unit 703 that is an area for storing data.
  • The filter learning unit 702 receives predetermined information included in the target data 701 and a filter model existing in the filter model storage unit 703, and learns a filter model based on the acquired information and model data.
  • the target selection result data block 600 is an example of the target data 701. Note that the filter learning unit 702 does not have to use the filter model of the filter model storage unit 703 when performing filter learning.
  • the filter learning unit 702 transmits the generated filter model to the filter model storage unit 703, and the filter model storage unit 703 stores the filter model.
  • the filter application unit 704 applies an appropriate filter model existing in the filter model storage unit 703 to the target data 701. Finally, the result data 705 to which the filter is applied in the filter application unit 704 is output.
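A minimal structural sketch of this data flow, with purely illustrative names, might look as follows: a learning part that writes models into a model store, and an application part that reads a stored model and returns the filtered result data.

```python
# Structural sketch of the filter unit: learning part, model storage, application part.
from typing import Any, Callable, Dict, List

class FilterUnit:
    def __init__(self) -> None:
        self.model_storage: Dict[str, Any] = {}  # plays the role of the filter model storage unit 703

    def learn(self, name: str, target_data: List[dict],
              learner: Callable[[List[dict], Any], Any]) -> None:
        # Learning part: build a model from the target data (optionally reusing an
        # existing model) and store it under a name.
        existing = self.model_storage.get(name)
        self.model_storage[name] = learner(target_data, existing)

    def apply(self, name: str, target_data: List[dict],
              applier: Callable[[List[dict], Any], List[dict]]) -> List[dict]:
        # Application part: apply a stored model to the target data and return the result data.
        return applier(target_data, self.model_storage[name])
```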
  • FIG. 8 shows an example of filter learning processing by the filter learning unit 702.
  • the filter learning method in FIG. 8 is a so-called unsupervised learning method.
  • the filter learning unit 702 acquires a word included in the target data 701 and acquires the appearance frequency of the word in the sentence data 300 (S801).
  • the neighborhood words included in the target data 701 are words that the filter learning unit 702 acquires in step S801.
  • the filter learning unit 702 may also acquire words obtained by performing morphological analysis on the target expression, for example.
  • The acquired words are denoted by w 1 , ..., w n .
  • Note that the filter learning unit 702 may acquire the words only within a learning range specified by a document ID or the like, and may obtain the appearance frequency of each word within that learning range.
  • the learning range is specified by a user or the like, for example.
  • For each word w i (1 ≤ i ≤ n) acquired in step S801, the filter learning unit 702 sets initial values of the variable χ i (0 or 1), the variables π ij (0 ≤ π ij ≤ 1, 1 ≤ j ≤ n), and the real-valued parameter λ i within their respective domains (S802).
  • For example, the filter learning unit 702 can set all χ i to 1 and set π ij and λ i to predetermined values. Alternatively, the filter learning unit 702 may set each initial value randomly within the range of its domain.
  • P D is the probability that the word w i is a word to be extracted.
  • P N is the probability that the word w i is a filter word.
  • For each word w i , the filter learning unit 702 calculates P D , for example, as follows.
  • χ i is a flag indicating whether the word w i is a word to be extracted.
  • π ij is the probability that the word w i is derived from the word w j .
  • Here, "the word w i is derived from w j " indicates the state in which the sentence extraction unit 103 erroneously extracted the word w j in the target document as the word w i , for example due to an OCR error or the like.
  • d m (w i , w j ) indicates the similarity between the word w i and the word w j ; for example, an edit distance is used as the similarity.
  • By using d m and π ij in the calculation of P D , the filter learning unit 702 can perform filter learning with high accuracy even for words that are erroneously recognized due to OCR errors or the like.
  • The filter learning unit 702 calculates the conditional probability P (d m | ·) using, for example, a Poisson distribution.
  • Other distributions may be used instead, for example members of the exponential family such as the Bernoulli, binomial, multinomial, normal, exponential, t, chi-square, gamma, beta, F, or Laplace distributions.
  • The filter learning unit 702 calculates P N , for example, as follows.
  • The filter learning unit 702 resets the value of the variable χ i to 1 for every word with R (w i ) > 1, resets the value of χ i to 0 for every word with R (w i ) ≤ 1, and resets π ij and λ i based on the reset χ i (S804).
  • In step S804, the filter learning unit 702 resets the value of the variable χ i based on R (w i ) as described above.
  • The threshold value used here may be set to 1 as in the above example, or may be another value within the domain of R (w i ) (0 or a real number).
  • An auxiliary variable indexed by i and k (1 ≤ k ≤ n) is defined as follows.
  • A further auxiliary variable indexed by i is defined as follows.
  • The filter learning unit 702 uses the above values to reset π ij , for example, as follows.
  • The example of resetting the parameter λ k described above corresponds to the case where the Poisson distribution is used for calculating P (d m | ·).
  • In this case, the filter learning unit 702 resets λ k by, for example, solving the update formula for λ k shown below.
  • the filter learning unit 702 calculates the joint probability for the current parameters in all words as follows (S805).
  • The filter learning unit 702 determines whether or not the above joint probability has converged (S806). For example, the filter learning unit 702 determines that the joint probability has converged when the joint probability is a value included in a predetermined range. As another example, the filter learning unit 702 may compare the above joint probability with the joint probability calculated in the previous iteration and determine that the joint probability has converged when it has not increased by more than a certain value or a certain ratio.
  • When the filter learning unit 702 determines that the joint probability has converged (S806: yes), the learning ends; when it determines that the joint probability has not converged (S806: no), the process returns to step S803.
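The formulas referenced in the steps above appear as images in the original and are not reproduced here. As a heavily simplified, hypothetical stand-in for this word-frequency-based unsupervised learning, the sketch below fits a two-component Poisson mixture to word appearance frequencies with an EM loop and splits the vocabulary into an extraction target set and a filter word set; treating the lower-frequency component as the extraction targets, and omitting the similarity term d m entirely, are assumptions made only for the example, not the patent's criteria.

```python
# Hedged sketch: unsupervised separation of words into "extraction target" and "filter"
# sets via an EM-fitted two-component Poisson mixture over appearance frequencies.
import math
from collections import Counter
from typing import Dict, Set, Tuple

def poisson_pmf(k: int, lam: float) -> float:
    return math.exp(-lam) * lam ** k / math.factorial(k)

def learn_filter(freqs: Dict[str, int], iters: int = 50) -> Tuple[Set[str], Set[str]]:
    words = list(freqs)
    counts = sorted(freqs.values())
    lam_low = max(1e-3, float(counts[len(counts) // 4]))   # rate of the low-frequency component
    lam_high = max(lam_low + 1e-3, float(counts[-1]))      # rate of the high-frequency component
    weight_low = 0.5
    resp: Dict[str, float] = {}
    for _ in range(iters):                                 # analogue of the S803-S806 loop
        # E-step: responsibility that each word belongs to the low-frequency component.
        for w in words:
            k = freqs[w]
            p_low = weight_low * poisson_pmf(k, lam_low)
            p_high = (1.0 - weight_low) * poisson_pmf(k, lam_high)
            total = p_low + p_high
            resp[w] = p_low / total if total > 0 else 0.5
        # M-step: re-estimate rates and mixing weight (analogue of the resets in S804).
        r_low = sum(resp.values())
        r_high = len(words) - r_low
        if r_low > 0:
            lam_low = sum(resp[w] * freqs[w] for w in words) / r_low
        if r_high > 0:
            lam_high = sum((1.0 - resp[w]) * freqs[w] for w in words) / r_high
        weight_low = r_low / len(words)
    targets = {w for w in words if resp[w] >= 0.5}         # assumed: rarer words are the targets
    return targets, set(words) - targets

word_counts = Counter(["yen", "yen", "yen", "yen", "price", "price", "price",
                       "ACME", "Widget", "Pro"])
extraction_targets, filter_words = learn_filter(dict(word_counts))
# With these toy counts the rare words ("ACME", "Widget", "Pro") end up as targets and
# the frequent boilerplate ("yen", "price") ends up as filter words.
```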
  • FIG. 9 shows an example of filter application processing by the filter application unit 704.
  • the filter application process in FIG. 9 shows an example using the filter learning process in FIG.
  • the filter application unit 704 acquires a set of extraction target words from the filter model storage unit 703, and acquires a filter application target word set from the target data 701 (S901).
  • The set of extraction target words held by the filter model storage unit 703 is a set obtained by the unsupervised learning shown in FIG. 8.
  • a set of neighboring words included in the target data 701 is an example of a filter application target word set.
  • the filter application unit 704 may include a word obtained by morphological analysis on the target expression included in the target data 701 in the filter application target word set.
  • The filter application unit 704 checks whether any extraction target word is included in the filter application target word set (S902). At this time, the filter application unit 704 may perform the check based on an exact match between each word of the filter application target word set and each extraction target word, or may perform the check using a measure based on similarity between words, such as an edit distance.
  • The filter application unit 704 may check whether or not all of the extraction target words are included, or may check whether or not one or more extraction target words are included. When the filter application unit 704 determines that no extraction target word is included in the filter application target word set (S902: no), all the words of the filter application target word set are filter words, so nothing is output and the process ends. Otherwise (S902: yes), the filter application unit 704 outputs the result data 705 after the filter application.
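A hedged sketch of this word-level filter application (S901-S902): keep the candidate words that match an extraction target word exactly or by a simple similarity measure; difflib's ratio is used here only as a stand-in for an edit-distance-based measure.

```python
# Filter application over words: exact match or a similarity-based match.
import difflib
from typing import Iterable, List, Set

def close_enough(a: str, b: str, threshold: float = 0.8) -> bool:
    # Stand-in for "a measure based on similarity between words such as an edit distance".
    return a == b or difflib.SequenceMatcher(None, a, b).ratio() >= threshold

def apply_word_filter(candidates: Iterable[str], extraction_targets: Set[str]) -> List[str]:
    kept = [w for w in candidates if any(close_enough(w, t) for t in extraction_targets)]
    return kept   # an empty list corresponds to the S902: no branch (everything filtered out)
```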
  • FIG. 10 shows an example of the result of filtering on words by the filter unit 107.
  • “Correct” indicates a word that should actually be a target
  • “Incorrect” indicates a word that is not actually a target.
  • “Acquisition” indicates a word determined to be an extraction target word by the above-described unsupervised learning
  • “non-acquisition” indicates a word determined to be a filter word by the above-described unsupervised learning method.
  • The accuracy, defined as (correct and acquired) / {(correct and acquired) + (incorrect and acquired)}, was 75%, and the recall, defined as (correct and acquired) / {(correct and acquired) + (correct and non-acquired)}, was 56.8%.
  • With the method described above, the information extraction system 101 can narrow many words down to a small number of extraction target words without supervision.
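For concreteness, the two percentages above are consistent with, for example, the following hypothetical counts (the actual counts are not given in the text); the snippet only restates the two definitions, and the first ratio is what is usually called precision.

```python
# Illustrative counts only, chosen to reproduce the reported 75% and 56.8%.
correct_and_acquired = 21
incorrect_and_acquired = 7
correct_and_not_acquired = 16

accuracy = correct_and_acquired / (correct_and_acquired + incorrect_and_acquired)  # 0.75
recall = correct_and_acquired / (correct_and_acquired + correct_and_not_acquired)  # ~0.568
```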
  • FIG. 11 shows a second example of filter learning processing by the filter learning unit 702.
  • This example is a filter learning process for coordinates.
  • the filter learning unit 702 acquires coordinate information of the target expression in the target data 701 (S1101). Note that the filter learning unit 702 may also acquire, for example, the coordinate information of neighboring words in the target data.
  • The filter learning unit 702 sets an initial value of the real-valued parameter of the kernel (S1102).
  • The initial value may be specified in advance, or may be specified by the user, for example.
  • The initial value is preferably specified according to the size of the target document 102; specifically, for example, it is preferably set to a value obtained by substituting the area of one line of the target document 102 into a predetermined increasing function.
  • The parameter may also be adjusted according to the extraction result.
  • the filter learning unit 702 learns the kernel density estimation function p (x) according to the following formula (S1103), outputs the learned result, and ends.
  • p (x) indicates the probability density that the coordinate x is a coordinate to be extracted.
  • N is the number of coordinates acquired in step S1101.
  • D is the dimension of the coordinates.
  • x is a variable indicating an arbitrary coordinate.
  • x n is each coordinate acquired in step S1101.
  • Here, the filter learning unit 702 estimates the probability density using kernel density estimation, but another probability density estimation method, such as a k-nearest neighbor method, a histogram method, or a Gaussian mixture distribution, may be used.
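Since the estimator's formula is given as an image in the original, the sketch below uses a standard D-dimensional Gaussian kernel density estimator as a stand-in, with a bandwidth h playing the role of the real-valued parameter initialised in S1102; the coordinate values are toy data.

```python
# Gaussian kernel density estimate over target-expression coordinates.
import math
from typing import Callable, List, Sequence

def kde(train: List[Sequence[float]], h: float) -> Callable[[Sequence[float]], float]:
    """Return p(x): the estimated density that coordinate x is an extraction-target coordinate."""
    n = len(train)
    d = len(train[0])
    norm = 1.0 / (n * (2.0 * math.pi * h * h) ** (d / 2.0))

    def p(x: Sequence[float]) -> float:
        total = 0.0
        for xn in train:
            sq = sum((xi - xni) ** 2 for xi, xni in zip(x, xn))
            total += math.exp(-sq / (2.0 * h * h))
        return norm * total

    return p

# Usage: coordinates of the target expressions acquired in S1101 (toy values).
p = kde([(120.0, 90.0), (118.0, 300.0), (121.0, 510.0)], h=25.0)
print(p((119.0, 305.0)))   # relatively high density near observed target coordinates
print(p((600.0, 50.0)))    # close to zero far away from them
```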
  • FIG. 12 shows a second example of filter application processing by the filter application unit 704.
  • This example is a process for applying a filter to coordinates, using the filter model learned as shown in FIG. 11.
  • the filter application unit 704 acquires the target expression included in the target data 701, the coordinates of the target expression, and a threshold value (S1201).
  • the threshold value may be given by a user or the like, may be set in advance, or may be set by the filter application unit 704 based on whether the output result is correct or not.
  • The filter application unit 704 calculates the likelihood (probability value) of each acquired coordinate by substituting it into the coordinate filter model p (x) learned as illustrated in FIG. 11, and determines whether or not any likelihood exceeds the acquired threshold (S1202). When it determines that all the calculated likelihoods are smaller than the threshold (S1202: no), the filter application unit 704 ends the process because there are no coordinates to be extracted.
  • When it determines that there is a likelihood equal to or greater than the threshold (S1202: yes), the filter application unit 704 outputs the result data 705 after the filter application (S1203) and ends the process.
  • The target data 701 from which the target expressions, the neighboring words of the target expressions, and the coordinates of the target expressions corresponding to likelihoods below the threshold have been removed is an example of the result data 705 after the filter application.
  • In this case, the target selection unit 106 does not need to acquire the neighboring words of the target expressions.
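A short sketch of this coordinate-based filter application, reusing a density function p(x) such as the one sketched after the FIG. 11 description; the record field name is an illustrative assumption.

```python
# Keep target expressions whose coordinate likelihood reaches the threshold (S1201-S1203).
from typing import Callable, Dict, List, Sequence

def apply_coordinate_filter(targets: List[Dict], p: Callable[[Sequence[float]], float],
                            threshold: float) -> List[Dict]:
    """targets: records carrying a 'coords_center' field (hypothetical name)."""
    kept = [t for t in targets if p(t["coords_center"]) >= threshold]
    return kept   # an empty list corresponds to the S1202: no branch (nothing to extract)
```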
  • FIG. 13 shows a third example of filter learning processing by the filter learning unit 702.
  • This example is a filter learning process that combines a plurality of filter models.
  • the filter learning unit 702 acquires target data 701 and a plurality of filter models (S1301).
  • the filter learning unit 702 initializes a filter combination model generated from the acquired plurality of filter models (S1302).
  • As the filter combination model, for example, machine learning such as linear identification, a support vector machine, or a decision tree, which takes as input the value output by each filter model or a numerical representation of its determination result, can be used.
  • the filter learning unit 702 initializes weights in the initialization of the filter combination model.
  • the filter learning unit 702 learns the filter combination model based on the correct / incorrect information or the weight information (S1303). Hereinafter, an example in which linear identification is used for the filter combination model will be described.
  • the filter learning unit 702 determines that filtering is performed when the following inequality is satisfied, and determines that filtering is not performed when the following inequality is not satisfied.
  • the filter learning unit 702 may accept input of correct / incorrect information for the filter result from the user.
  • The filter learning unit 702 may reset W to an appropriate value by optimizing, based on the input correct/incorrect information, an evaluation function E such as the sum-of-squares error indicated by the following formula (where T is a matrix of the correct/incorrect information).
  • The filter learning unit 702 may represent the weight information as a real-valued matrix W.
  • The filter learning unit 702 may also set a real-valued weight matrix V for the real-valued matrix W in the evaluation function, as shown in the following equation, and perform the optimization.
  • The filter learning unit 702 may receive input of correct/incorrect information for the filter result obtained with the reset W, and repeat the process of resetting W based on the newly received correct/incorrect information.
  • The filtering method described above is not limited to linear identification and can be applied as long as the identification model and its evaluation function are appropriately defined.
  • FIG. 14 shows a third example of filter application processing by the filter application unit 704. This example is a filter application process in the filter combination model.
  • the filter application unit 704 acquires the target data 701, a plurality of filter models, and a filter combination model obtained by combining the plurality of filter models (S1401). Subsequently, the filter application unit 704 inputs the target data 701 to each acquired filter model, and acquires the output value of each filter model (S1402).
  • the filter application unit 704 inputs the output value of each filter model calculated in S1402 to the filter combination model, and acquires the output value of the filter combination model (S1403). Subsequently, the filter application unit 704 determines whether or not the output value of the filter combination model is, for example, greater than or equal to the threshold value U (S1404). If the output value of the filter combination model is smaller than the threshold value U (S1404: no), the process ends.
  • When the output value of the filter combination model is greater than or equal to the threshold value U (S1404: yes), the filter application unit 704 outputs the result data 705 after the filter application and ends the process (S1405).
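The inequality and the evaluation-function formulas referenced above are shown as images in the original and are not reproduced here. As a hedged sketch of a linear filter combination model, the code below fits a weight vector to user-supplied correct/incorrect labels by least squares (NumPy's lstsq standing in for the optimisation of a squared-error function) and thresholds the combined score as in S1404; the threshold value and the example scores are illustrative.

```python
# Linear combination of filter-model outputs, fitted against correct/incorrect labels.
import numpy as np

def learn_combination(filter_scores: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """filter_scores: (n_samples, n_filters); labels: 1 = correct (keep), 0 = incorrect (filter out)."""
    X = np.hstack([filter_scores, np.ones((filter_scores.shape[0], 1))])  # add a bias column
    w, *_ = np.linalg.lstsq(X, labels, rcond=None)                        # least-squares weights
    return w

def apply_combination(filter_scores: np.ndarray, w: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    X = np.hstack([filter_scores, np.ones((filter_scores.shape[0], 1))])
    return X @ w >= threshold     # True corresponds to the S1404: yes branch (output the result)

scores = np.array([[0.9, 0.2], [0.1, 0.8], [0.7, 0.6]])   # e.g. a word filter and a coordinate filter
labels = np.array([1.0, 0.0, 1.0])                        # correct/incorrect information from the user
w = learn_combination(scores, labels)
print(apply_combination(scores, w))                       # -> [ True False  True ]
```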
  • FIG. 15 shows a first example of a user interface to the user.
  • the user interface 1500 includes, for example, a target ID input section 1501, a target information input section 1502, filter adjustment check boxes 1503 to 1505, an extraction result display section 1506, and a correct / incorrect specification section 1507.
  • the target ID input section 1501 accepts input of, for example, a sentence ID included in the sentence data 300, a document ID, and a target ID included in the target selection result data block 600.
  • the target information input section 1502 accepts input of target information 109, for example.
  • Check boxes 1503 to 1504 are check boxes for selecting a filter to be learned and applied.
  • Check box 1503 is a check box for selecting the filter based on coordinates, and check box 1504 is a check box for selecting the filter based on words.
  • a check box 1505 is a check box for selecting whether or not learning is automatically performed based on the correctness determination result.
  • the extraction result display section 1506 lists and displays the extraction results after applying the filter.
  • the extraction result display section 1506 displays, for example, the target expression included in the extraction result, neighboring words of the target expression, and the entire target sentence including the target expression.
  • the extraction result display section 1506 may display the coordinates of the target expression, for example.
  • The extraction result display section 1506 displays the results in, for example, a list format, and the display order in the list may follow a value calculated by the filter unit 107 when the filter is applied (for example, a value such as R (w i )).
  • the correct / incorrect designation section 1507 accepts input of the result of correct / incorrect determination by the user regarding whether or not the extraction result is appropriate, for example.
  • FIG. 16 shows a second example of the user interface to the user.
  • the user interface 1600 includes, for example, filter adjustment sections 1601 to 1602 in addition to the configuration of the user interface 1500.
  • the filter adjustment sections 1601 to 1602 accept input of information related to filter learning and filter application.
  • the filter adjustment section 1601 accepts input of initial values of coordinate weights in a filter combination model based on linear identification, for example.
  • the filter adjustment section 1602 accepts input of initial values of word weights in a filter combination model based on linear identification, for example.
  • With the above configuration, the user can give appropriate target information to any document, sentence, or extraction result and perform information extraction with further filter adjustment. The user can also designate correct/incorrect determinations on the extraction result and change the target information in accordance with the extraction result.
  • Thereby, the information extraction system 101 enables the user to perform information extraction by trial and error without examining the extraction target words or the like in advance. That is, the information extraction system 101 can extract information required by the user with high accuracy from various non-standard documents without depending on a dictionary prepared in advance or a logical structure such as HTML.
  • FIG. 17 shows a second configuration example of the information extraction system.
  • the information extraction system 1701 includes, for example, the same configuration as the information extraction system 101 of the first embodiment.
  • the information extraction system 1701 is different from the information extraction system 101 of the first embodiment in the following points.
  • the target selection unit 106 receives input of the target document 102, selects sentences and coordinates that match the target information 109 from the target document 102, and transmits the target selection results to the sentence extraction unit 103 and the coordinate extraction unit 104.
  • the sentence extraction unit 103 / coordinate extraction unit 104 performs sentence / coordinate extraction from the target selection result instead of the target document 102.
  • the information extraction system 1701 can appropriately output the information extraction result 110 based on the target information 109 from the user.
  • the information extraction system 1701 can extract information by setting the target information anew using the information extraction result 110 as an input.
  • By configuring the information extraction system 1701 as described above, it is possible to realize a system, method, and program that allow a user to perform information extraction by trial and error without examining the extraction target words or the like in advance. As a result, the information extraction system 1701 can extract information required by the user with high accuracy from various non-standard documents without depending on a dictionary prepared in advance or a logical structure such as HTML.
  • The present invention is not limited to the above-described embodiments and includes various modifications.
  • the above-described embodiments have been described in detail for easy understanding of the present invention, and are not necessarily limited to those having all the configurations described.
  • a part of the configuration of a certain embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to the configuration of a certain embodiment.
  • each of the above-described configurations, functions, processing units, processing means, and the like may be realized by hardware by designing a part or all of them with, for example, an integrated circuit.
  • Each of the above-described configurations, functions, and the like may be realized by software by interpreting and executing a program that realizes each function by the processor.
  • Information such as programs, tables, and files that realize each function can be stored in a memory, a hard disk, a recording device such as an SSD (Solid State Drive), or a recording medium such as an IC card, an SD card, or a DVD.
  • control lines and information lines indicate what is considered necessary for the explanation, and not all the control lines and information lines on the product are necessarily shown. Actually, it may be considered that almost all the components are connected to each other.

Abstract

This information extraction system: extracts, from a target document, target expressions, each of which is a character string identical to a character string included in target information indicating a set of character strings to be extracted, and also extracts, from the target document, neighboring words, which are words disposed within a predetermined distance from the target expressions; generates a filter by means of unsupervised learning based on the frequency of occurrence of each neighboring word or the coordinates of each target expression within the target document; applies the filter to a set of words to be filtered including the neighboring words, thereby obtaining a set of words to be extracted; and outputs the obtained set of words to be extracted.

Description

Information extraction system, information extraction method, and recording medium
The present invention relates to an information extraction system, an information extraction method, and a recording medium.
There are analysis systems that extract information described in a target document in a machine-processable form and perform analysis on various target documents. For example, if so-called proper names such as the manufacturer name, product name, and series name can be extracted from a shopping website that is the target document, the analysis system can perform analyses such as compiling product information statistics for each manufacturer.
Thus, techniques for extracting necessary information from non-standard documents or document images are known. As background art in this technical field, there is JP 2013-232127 A (Patent Document 1). Patent Document 1 states that "the excerpt unit 101 obtains an excerpt document by excerpting, from the original document, characters that should be displayed relatively large on the screen on which the original document is displayed. When the amount of the excerpt document to be displayed on the screen does not fit within a predetermined amount, the correction unit 103 corrects the relative size criterion by which the excerpt unit 101 excerpts characters" (see abstract).
JP 2013-232127 A
The analysis system extracts information from a non-standard document using, for example, a dictionary or a plurality of templates prepared in advance. However, for atypical documents, appropriate templates for all documents cannot always be prepared in advance. Also, a dictionary of the words to be extracted is not always readily available.
Patent Document 1 discloses an information extraction method based on the display size of text on a website, but there is a problem in that the information necessary for the user is not always described in an appropriate display size in the target document.
One aspect of the present invention aims to extract information required by a user with high accuracy from a variety of non-standard documents such as websites and document images, without depending on a dictionary prepared in advance or a logical structure such as HTML.
In order to solve the above problems, one aspect of the present invention adopts the following configuration. An information extraction system for extracting information from a target document includes a processor that executes a program and a memory accessed by the processor. The processor performs an information extraction process in which it receives input of target information indicating a set of character strings to be extracted; extracts from the target document target expressions, each of which is a character string that matches any of the character strings included in the target information, and neighboring words, which are words arranged within a predetermined distance of each of the target expressions; generates a filter using unsupervised learning based on the appearance frequency of each of the neighboring words in the target document or the coordinates of each of the target expressions in the target document; applies the filter to a filter application target word set including the neighboring words; and outputs the extraction target word set obtained by applying the filter to the filter application target word set.
One aspect of the present invention can extract information required by a user with high accuracy from various non-standard documents without depending on a dictionary prepared in advance or a logical structure such as HTML.
Issues, configurations, and effects other than those described above will be clarified by the following description of the embodiments.
FIG. 1 is a block diagram illustrating an example of the overall configuration of the information extraction system in Embodiment 1.
FIG. 2A is a diagram illustrating an example of a shopping website, which is an example of a target document in Embodiment 1.
FIG. 2B is a diagram illustrating an example of a document image, which is an example of a target document in Embodiment 1.
FIG. 3 is a diagram illustrating an example of the data storage method of the storage unit in Embodiment 1.
FIG. 4 is a flowchart illustrating a first example of target selection processing in Embodiment 1.
FIG. 5 is a flowchart illustrating a second example of target selection processing in Embodiment 1.
FIG. 6 is a diagram illustrating an example of a target selection result in Embodiment 1.
FIG. 7 is a block diagram illustrating a configuration example of the filter unit in Embodiment 1.
FIG. 8 is a flowchart illustrating a first example of filter learning processing in Embodiment 1.
FIG. 9 is a flowchart illustrating a first example of filter application processing in Embodiment 1.
FIG. 10 is a diagram illustrating an example of a filter application result in Embodiment 1.
FIG. 11 is a flowchart illustrating a second example of filter learning processing in Embodiment 1.
FIG. 12 is a flowchart illustrating a second example of filter application processing in Embodiment 1.
FIG. 13 is a flowchart illustrating a third example of filter learning processing in Embodiment 1.
FIG. 14 is a flowchart illustrating a third example of filter application processing in Embodiment 1.
FIG. 15 is a diagram illustrating a first example of a user interface in Embodiment 1.
FIG. 16 is a diagram illustrating a second example of a user interface in Embodiment 1.
FIG. 17 is a block diagram illustrating an example of the overall configuration of the information extraction system in Embodiment 2.
Hereinafter, embodiments of the present invention will be described with reference to the drawings. The present embodiment describes an information extraction system that extracts information from a target document. When the information extraction system receives from the user input of target information indicating a set of character strings to be extracted, it extracts from the target document target expressions, which are character strings that match any of the character strings included in the target information, and neighboring words, which are words located physically close to each of the target expressions. By acquiring neighboring words, the information extraction system can broadly obtain, without using a dictionary or the like, not only the target expressions directly specified by the user as extraction targets but also information related to the target expressions that may be necessary for the user.
The information extraction system generates a filter using unsupervised learning based on the appearance frequency of each neighboring word in the target document or the coordinates of each of the target expressions in the target document. By applying the generated filter to a filter application target word set including the neighboring words, the information extraction system can remove neighboring words that are unnecessary for the user without using a dictionary or the like; that is, it can obtain the information the user needs with high accuracy.
FIG. 1 shows a configuration example of an information extraction system. The information extraction system 101 includes, for example, a computer having a processor (CPU) 111, a memory 112, an auxiliary storage device 113, and a communication interface 114.
 プロセッサ111は、メモリ112に格納されたプログラムを実行する。メモリ112は、不揮発性の記憶素子であるROM及び揮発性の記憶素子であるRAMを含む。ROMは、不変のプログラム(例えば、BIOS)などを格納する。RAMは、DRAM(Dynamic Random Access Memory)のような高速かつ揮発性の記憶素子であり、プロセッサ111が実行するプログラム及びプログラムの実行時に使用されるデータを一時的に格納する。 The processor 111 executes a program stored in the memory 112. The memory 112 includes a ROM that is a nonvolatile storage element and a RAM that is a volatile storage element. The ROM stores an immutable program (for example, BIOS). The RAM is a high-speed and volatile storage element such as a DRAM (Dynamic Random Access Memory), and temporarily stores a program executed by the processor 111 and data used when the program is executed.
 The auxiliary storage device 113 is a large-capacity, nonvolatile storage device such as a magnetic storage device (HDD) or a flash memory (SSD), and stores programs executed by the processor 111 and data used when the programs are executed. That is, a program is read from the memory 112 or the auxiliary storage device 113, loaded into the memory 112, and executed by the processor 111.
 情報抽出システム101は、入力インターフェース115及び出力インターフェース118を有してもよい。入力インターフェース115は、キーボード116やマウス117などが接続され、利用者からの入力を受けるインターフェースである。出力インターフェース118は、ディスプレイ装置119やプリンタなどが接続され、プログラムの実行結果を利用者が視認可能な形式で出力するインターフェースである。 The information extraction system 101 may include an input interface 115 and an output interface 118. The input interface 115 is an interface to which a keyboard 116, a mouse 117, and the like are connected and receives input from the user. The output interface 118 is an interface to which a display device 119, a printer, or the like is connected, and the execution result of the program is output in a format that can be viewed by the user.
 通信インターフェース114は、所定のプロトコルに従って、他の装置との通信を制御するネットワークインターフェース装置である。また、通信インターフェース114は、例えば、USB等のシリアルインターフェースを含む。 The communication interface 114 is a network interface device that controls communication with other devices according to a predetermined protocol. The communication interface 114 includes a serial interface such as USB, for example.
 A program executed by the processor 111 is provided to the information extraction system 101 via a removable medium (a computer-readable, portable, non-transitory storage medium such as a CD-ROM or a flash memory) or via a network, and may be stored in the nonvolatile auxiliary storage device 113, which is a non-transitory storage medium. For this reason, the information extraction system 101 preferably has an interface for reading data from removable media.
 The information extraction system 101 is a computer system configured on a physically single computer, or on a plurality of logically or physically configured computers; it may operate in separate threads on the same computer, or on a virtual machine built on a plurality of physical computer resources.
 情報抽出システム101は、例えば入力インターフェース115又は通信インターフェース114を介して、対象文書102と対象情報109の入力を受け付ける。対象文書102は、例えば、文書画像であってもよいしHTML及びCSS等で記述されたウェブサイトであってもよい。文書画像とは、紙等の媒体に印刷された文書が電子化された画像を示す。 The information extraction system 101 receives input of the target document 102 and the target information 109 via the input interface 115 or the communication interface 114, for example. The target document 102 may be, for example, a document image or a website described in HTML, CSS, or the like. The document image indicates an image obtained by digitizing a document printed on a medium such as paper.
 The target information 109 indicates a set of character strings that serve as the starting point of information extraction, and is designated by the user. The target information 109 includes, for example, at least one of a regular expression, a word, a sentence, a sentence containing wildcards, a part of speech, a target document ID, and a target sentence ID. "¥?,???- ¥3,*- ¥[1-4],000-" is an example of wildcards, and "¥¥¥d[,].¥d{2,4}-" is an example of a regular expression. The information extraction system 101 extracts, from the target document 102, the information designated by the target information 109 and information based on the target information 109.
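 Purely for illustration (this snippet is not part of the original disclosure, and the patterns shown are simplified stand-ins rather than the literal examples above), target information given as a regular expression or a wildcard could be matched against sentence text as follows in Python:

    import fnmatch
    import re

    # Both patterns below are simplified stand-ins, not the literal examples above.
    target_regex = r"\d{1,3}(?:,\d{3})+-"     # assumed price-like regular expression
    target_wildcard = "*?,???-*"              # assumed wildcard: '?' = one character, '*' = any run

    def matches_target(text):
        return bool(re.search(target_regex, text)) or fnmatch.fnmatch(text, target_wildcard)

    print(matches_target("3,000-"))           # True: matched by both patterns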
 メモリ112は、例えば、プログラムである文抽出部103、座標抽出部104、対象選定部106、及び結果生成部108を含む。また、メモリ112は、データを格納する領域である蓄積部105を含む。また、メモリ112は、データを格納する領域及びプログラムを含むフィルタ部107を含む。 The memory 112 includes, for example, a sentence extraction unit 103 which is a program, a coordinate extraction unit 104, a target selection unit 106, and a result generation unit 108. The memory 112 includes an accumulation unit 105 that is an area for storing data. Further, the memory 112 includes a filter unit 107 including an area for storing data and a program.
 プロセッサ111は、プログラムに従って動作することによって、所定の機能を実現する機能部として動作する。例えば、プロセッサ111は、文抽出部103に従って動作することで文抽出部として機能し、座標抽出部104に従って動作することで座標抽出部として機能する。さらに、プロセッサ111は、各プログラムが実行する複数の処理のそれぞれを実現する機能部としても動作する。計算機及び計算機システムは、これらの機能部を含む装置及びシステムである。 The processor 111 operates as a functional unit that realizes a predetermined function by operating according to a program. For example, the processor 111 functions as a sentence extracting unit by operating according to the sentence extracting unit 103, and functions as a coordinate extracting unit by operating according to the coordinate extracting unit 104. Furthermore, the processor 111 also operates as a functional unit that realizes each of a plurality of processes executed by each program. A computer and a computer system are an apparatus and a system including these functional units.
 The sentence extraction unit 103 extracts sentences from each input target document 102. A sentence in this embodiment refers to each character string of one or more characters obtained by dividing the character string consisting of all characters included in the target document 102 according to a predetermined rule; it is a concept that does not necessarily coincide with a grammatical sentence. A character string delimited by predetermined characters or symbols such as full stops, commas, periods, or spaces is an example of a sentence. A grammatical sentence included in the target document 102 is an example of a sentence in this embodiment, and so is each word included in the target document 102. The sentence extraction unit 103 assigns a document ID to each input target document 102 and a sentence ID to each extracted sentence.
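 As a minimal sketch (not part of the original disclosure; the concrete delimiter set and ID format are assumptions), the splitting rule and ID assignment could be implemented as follows in Python:

    import re

    # The delimiter set is an assumption; the embodiment only requires "predetermined
    # characters or symbols" such as full stops, commas, periods, and spaces.
    DELIMITERS = r"[。、，,.\s]+"

    def extract_sentences(doc_text, doc_id):
        parts = [p for p in re.split(DELIMITERS, doc_text) if p]
        # assign a sentence ID to every extracted "sentence"
        return {doc_id: {"s%d" % i: {"text": p} for i, p in enumerate(parts)}}

    print(extract_sentences("ExampleCo Widget 3000yen. Color: blue", "doc0"))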
 座標抽出部104は、文抽出部103が抽出した文それぞれの座標情報を抽出する。座標情報は、例えば、対象文書102の紙面又は表示装置における座標で表される。文全体を囲う最小サイズの矩形の対角を成す2頂点の座標は、文の座標情報の一例である。文抽出部103又は座標抽出部104の一方は、入力された対象文書に文書IDを付与する。文抽出部103及び座標抽出部104は、例えば、ウェブブラウザのレンダリング機能及びOCR機能を含む。 The coordinate extracting unit 104 extracts coordinate information of each sentence extracted by the sentence extracting unit 103. The coordinate information is represented by, for example, coordinates on the paper surface of the target document 102 or a display device. The coordinates of the two vertices forming the diagonal of the minimum-size rectangle surrounding the entire sentence are an example of the coordinate information of the sentence. One of the sentence extraction unit 103 or the coordinate extraction unit 104 assigns a document ID to the input target document. The sentence extraction unit 103 and the coordinate extraction unit 104 include, for example, a web browser rendering function and an OCR function.
 The storage unit 105 holds, for example, information indicating the correspondence between the document ID of the target document 102, the extracted sentences, and the sentence IDs and coordinate information of the extracted sentences. The target selection unit 106 refers to the information held in the storage unit 105, selects the sentences that match the target information 109, the coordinates of the matching sentences, and the neighboring words of the matching sentences, and transmits the selected sentences, coordinates, and neighboring words to the filter unit 107. Neighboring words are described later. A sentence that matches the target information 109 and is selected by the target selection unit 106 is called a target expression.
 The filter unit 107 removes, based on the sentence coordinates and neighboring words selected by the target selection unit 106, the sentences, neighboring words, and coordinates that are excluded from extraction, and transmits the remaining sentences, coordinates, and neighboring words to the result generation unit 108.
 結果生成部108は、フィルタ部107から受信した文、座標、及び近傍語を適切な形式で、出力インターフェース118を介して、情報抽出結果110として出力する。また、結果生成部108は、蓄積部105に情報抽出結果110を適切な文書IDを付与して後述する文データとして蓄積してもよい。 The result generation unit 108 outputs the sentence, coordinates, and neighborhood words received from the filter unit 107 in an appropriate format as the information extraction result 110 via the output interface 118. The result generation unit 108 may store the information extraction result 110 in the storage unit 105 as sentence data to be described later with an appropriate document ID.
 情報抽出システム101は上述の構成により、利用者から入力された対象情報109に基づき、適切に情報抽出結果110を出力することができる。また、情報抽出システム101は、情報抽出結果110から、新たに設定された対象情報109に基づいて、再度情報抽出を行うことができる。 The information extraction system 101 can appropriately output the information extraction result 110 based on the target information 109 input from the user with the above-described configuration. Further, the information extraction system 101 can extract information again from the information extraction result 110 based on the newly set target information 109.
 図2Aは、対象文書102の一例である、ショッピングウェブサイトの一例を示す。図2Aのショッピングウェブサイトには、複数の同一種類の商品が列挙され、各商品についてそれぞれ異なる商品情報(製造者、固有名、値段等)が記載されている。図2Aのようにウェブサイトが対象文書102である場合、文抽出部103及び座標抽出部104は、例えば、ウェブブラウザのレンダリング機能を利用して、文及び文の座標を抽出する。 FIG. 2A shows an example of a shopping website, which is an example of the target document 102. The shopping website in FIG. 2A lists a plurality of products of the same type, and describes different product information (manufacturer, unique name, price, etc.) for each product. When the website is the target document 102 as shown in FIG. 2A, the sentence extraction unit 103 and the coordinate extraction unit 104 extract the coordinates of the sentence and the sentence by using, for example, a rendering function of the web browser.
 図2Bは、対象文書102の一例である、文書画像の一例を示す。図2Bの文書画像には、石名、深さ、及び詳細が様々なレイアウトで表示されている。図2Bのように文書画像が対象文書102である場合、文抽出部103及び座標抽出部104は、例えば、OCR機能を利用して、文及び文の座標を抽出する。 FIG. 2B shows an example of a document image, which is an example of the target document 102. In the document image of FIG. 2B, the stone name, depth, and details are displayed in various layouts. When the document image is the target document 102 as shown in FIG. 2B, the sentence extraction unit 103 and the coordinate extraction unit 104 extract the coordinates of the sentence and the sentence using, for example, the OCR function.
 図3は、蓄積部105におけるデータ管理方法の一例を示す。文データ300は、Key Value Store(KVS)と呼ばれる方法によって蓄積されたデータである。文データ300は、文書ID301、文ID302、及び文情報303を含む。文書ID301は、対象文書102を一意に識別する情報である。文ID302は、各対象文書内の文を一意に識別する情報である。文情報303は、対応する文IDの文及び当該文のアノテーション情報を含む。文の座標情報、及び文に含まれるフォント情報は、アノテーション情報の一例である。 FIG. 3 shows an example of a data management method in the storage unit 105. The sentence data 300 is data accumulated by a method called “Key Value Store (KVS)”. The sentence data 300 includes a document ID 301, a sentence ID 302, and sentence information 303. The document ID 301 is information that uniquely identifies the target document 102. The sentence ID 302 is information for uniquely identifying a sentence in each target document. The sentence information 303 includes a sentence with a corresponding sentence ID and annotation information of the sentence. The coordinate information of the sentence and the font information included in the sentence are examples of annotation information.
 KVS方式を用いることによって、このように、所望のキーに対し、その値を複数の階層で保持することができる。情報抽出システム101は、例えば、所望の文書ID、又は文IDが与えられた場合、対応する文を出力することができる。また、例えば、文書IDのみが与えられた場合、情報抽出システム101は、対応する文IDのリストを出力することができる。 By using the KVS method, the value for a desired key can be held in a plurality of layers in this way. For example, when a desired document ID or sentence ID is given, the information extraction system 101 can output a corresponding sentence. For example, when only the document ID is given, the information extraction system 101 can output a list of corresponding sentence IDs.
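 A minimal sketch of this two-level key-value layout, with hypothetical document/sentence IDs and fields:

    # Hypothetical document ID, sentence IDs, and fields, for illustration only.
    sentence_data = {
        "doc0": {
            "s0": {"text": "ExampleCo Widget", "coords": (10, 20, 110, 40)},
            "s1": {"text": "3,000-",           "coords": (10, 50, 60, 70)},
        }
    }

    # given a document ID and a sentence ID, the corresponding sentence can be returned
    print(sentence_data["doc0"]["s1"]["text"])
    # given only a document ID, the list of corresponding sentence IDs can be returned
    print(list(sentence_data["doc0"].keys()))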
 なお、本実施形態及び他の実施形態において、情報抽出システム101が使用する情報は、データ構造に依存せずどのようなデータ構造で表現されていてもよい。例えば、テーブル、リスト、データベース又はキューから適切に選択したデータ構造体が、情報を格納することができる。 In this embodiment and other embodiments, the information used by the information extraction system 101 may be expressed in any data structure without depending on the data structure. For example, a data structure appropriately selected from a table, list, database or queue can store the information.
 以下、対象選定部106による、選定方法の例を示す。図4は、対象選定部106による、正規表現を用いた選定方法の例を示す。対象選定部106は、文書ID、文ID、及び正規表現を含む対象情報109の入力を受け付ける(S401)。なお、対象情報109は、文書ID及び文IDを含まなくてもよい。 Hereinafter, an example of a selection method by the target selection unit 106 will be shown. FIG. 4 shows an example of a selection method using regular expressions by the object selection unit 106. The target selection unit 106 receives input of target information 109 including a document ID, a sentence ID, and a regular expression (S401). The target information 109 may not include the document ID and the sentence ID.
 Subsequently, the target selection unit 106 extracts, from the sentence data 300 in the storage unit 105, the target sentences corresponding to the document IDs and sentence IDs included in the target information 109, and checks whether each target sentence contains an expression that matches the regular expression included in the target information 109, that is, a target expression (S402). Note that the target selection unit 106 may, for example in accordance with a user instruction, extract the target sentences again from data in which words not included in the extraction target words generated by the result generation unit 108 have been deleted from the sentence data 300. This allows the information extraction system 101 to apply a further filter to data that has already been filtered once, improving the accuracy of information extraction. If the target information 109 does not include a document ID and a sentence ID, the target selection unit 106 extracts all sentences included in the sentence data 300 as target sentences.
 If no target sentence contains a target expression (S402: no), the process ends. If there is a target sentence that contains a target expression, that is, a target sentence that contains a target expression matching the target information 109 (S402: yes), the target selection unit 106 acquires the target expressions, the coordinates of the target expressions, and the neighboring words of the target expressions, includes the acquired information together with the sentence IDs and document IDs containing the target expressions in a target selection result data block, and outputs it to the filter unit 107 (S403). The target selection result data block is described later.
 The minimum-size rectangle coordinates surrounding a target expression, and the minimum-size rectangle coordinates surrounding the entire target sentence containing the target expression, are examples of the coordinates that the target selection unit 106 outputs in step S403. A neighboring word of a target expression is a word located close to the target expression in terms of coordinates in the document. The target selection unit 106 acquires, for example, up to a predetermined number of words within a predetermined distance of the target expression as the neighboring words of that target expression. By acquiring neighboring words, the target selection unit 106 can obtain, for example, words that are necessary for the user but unknown to the user.
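 A rough sketch of steps S402 and S403, under the assumptions that each stored sentence carries rectangle coordinates and that nearness is measured between rectangle centres; the distance measure and the limits max_dist and max_count are illustrative choices, not values from the disclosure:

    import re

    def centre(box):                              # box = (x1, y1, x2, y2)
        return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

    def neighbouring_words(target_box, words, max_dist=80.0, max_count=5):
        cx, cy = centre(target_box)
        scored = []
        for w in words:                           # words: [{"text": ..., "coords": ...}, ...]
            wx, wy = centre(w["coords"])
            d = ((wx - cx) ** 2 + (wy - cy) ** 2) ** 0.5
            if 0 < d <= max_dist:                 # skip the target itself (d == 0)
                scored.append((d, w["text"]))
        return [text for _, text in sorted(scored)[:max_count]]

    def select_targets(pattern, words):
        result = []
        for w in words:
            if re.search(pattern, w["text"]):     # S402: does the sentence match?
                result.append({"expression": w["text"],
                               "coords": w["coords"],
                               "neighbours": neighbouring_words(w["coords"], words)})
        return result                             # S403: content of a target selection result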
 FIG. 5 shows an example of a selection method by the target selection unit 106 using parts of speech. The target selection unit 106 receives input of target information 109 including a document ID, a sentence ID, and a part of speech (S501). The target selection unit 106 extracts, from the storage unit 105, the target sentences corresponding to the document IDs and sentence IDs included in the target information 109, and checks whether each target sentence contains a word that matches the part of speech included in the target information (S502). As in the description of FIG. 4, the target information 109 need not include a document ID and a sentence ID, and the target selection unit 106 may extract the target sentences from the sentence data 300 generated by the result generation unit 108.
 If no target sentence contains a target expression (S502: no), the process ends. If there is a target sentence that contains a target expression, that is, a target sentence that contains a word matching the part of speech included in the target information 109 (S502: yes), the target selection unit 106 acquires the target expressions, their coordinates, and their neighboring words, includes the acquired information together with the sentence IDs and document IDs containing the target expressions in a target selection result data block, and outputs it to the filter unit 107 (S503). The target selection unit 106 may, for example, use a general morphological analysis method to recognize the words in a sentence and identify their parts of speech.
 なお、ステップS403及びステップS503の処理において、1つの対象文に複数の対象表現が含まれる場合、対象選定部106は、例えば、当該対象文において先頭から所定個数以内の対象表現を抽出してもよいし、当該対象文に含まれる全ての対象表現を抽出してもよい。 Note that, in the processes of step S403 and step S503, when a plurality of target expressions are included in one target sentence, the target selecting unit 106 may extract, for example, a target expression within a predetermined number from the top in the target sentence. Alternatively, all target expressions included in the target sentence may be extracted.
 FIG. 4 shows an example of target selection using a regular expression, and FIG. 5 shows an example of target selection using a part of speech, but the target selection unit 106 can likewise perform target selection using target information 109 that includes wildcards, words, and the like. The target selection unit 106 may also perform target selection by combining multiple types of target information 109 as appropriate, for example using logical OR or logical AND. Specifically, the target selection unit 106 may, for example, extract target expressions that match a specific regular expression and/or include a specific part of speech.
 図6は、対象選定部106が生成する対象選定結果データブロックの例を示す。対象選定結果データブロック600は、例えば、文書ID601、対象表現ID602、及び対象表現情報603を含み、例えば、KVS方式で蓄積されたデータである。文書ID601は、対象文書102を一意に識別する情報である。対象表現ID602は、対象文書102中の対象表現を一意に識別する情報であり、例えば、対象選定部106によって付与される。対象表現情報603は、対象表現に関する情報であり、例えば、対象表現、近傍語、及び座標を含む。このように対象選定結果データブロック600が構成されることにより、情報抽出システム101は、選定された対象毎に実際の表現、近傍語、座標を容易に取得することができる。 FIG. 6 shows an example of a target selection result data block generated by the target selection unit 106. The target selection result data block 600 includes, for example, a document ID 601, a target expression ID 602, and target expression information 603, and is data accumulated by, for example, the KVS method. The document ID 601 is information that uniquely identifies the target document 102. The target expression ID 602 is information for uniquely identifying the target expression in the target document 102, and is given by, for example, the target selection unit 106. The target expression information 603 is information related to the target expression, and includes, for example, the target expression, neighborhood words, and coordinates. By configuring the object selection result data block 600 as described above, the information extraction system 101 can easily acquire an actual expression, a neighborhood word, and coordinates for each selected object.
 図7は、フィルタ部107の構成例を示す。フィルタ部107は、例えば、プログラムであるフィルタ学習部702及びフィルタ適用部704、並びにデータを格納する領域であるフィルタモデル蓄積部703を含む。 FIG. 7 shows a configuration example of the filter unit 107. The filter unit 107 includes, for example, a filter learning unit 702 and a filter application unit 704 that are programs, and a filter model storage unit 703 that is an area for storing data.
 When target data 701 having target expressions, coordinates, and neighboring words is input to the filter unit 107, the filter learning unit 702 acquires predetermined information included in the target data 701 and a filter model present in the filter model storage unit 703, and learns a filter model based on the acquired information and the model data.
 なお、対象選定結果データブロック600は、対象データ701の一例である。なお、フィルタ学習部702は、フィルタ学習に際して、フィルタモデル蓄積部703のフィルタモデルを使用しなくてもよい。フィルタ学習部702は、生成したフィルタモデルを、フィルタモデル蓄積部703に送信し、フィルタモデル蓄積部703はフィルタモデルを蓄積する。 The target selection result data block 600 is an example of the target data 701. Note that the filter learning unit 702 does not have to use the filter model of the filter model storage unit 703 when performing filter learning. The filter learning unit 702 transmits the generated filter model to the filter model storage unit 703, and the filter model storage unit 703 stores the filter model.
 フィルタ適用部704は、フィルタモデル蓄積部703に存在する適切なフィルタモデルを対象データ701に対して適用する。最後にフィルタ適用部704においてフィルタが適用された結果データ705を出力する。 The filter application unit 704 applies an appropriate filter model existing in the filter model storage unit 703 to the target data 701. Finally, the result data 705 to which the filter is applied in the filter application unit 704 is output.
 FIG. 8 shows an example of filter learning processing by the filter learning unit 702. The filter learning method in FIG. 8 is a so-called unsupervised learning method. The filter learning unit 702 acquires the words included in the target data 701 and acquires the appearance frequency of each word in the sentence data 300 (S801). For example, the neighboring words included in the target data 701 are words that the filter learning unit 702 acquires in step S801. In step S801, the filter learning unit 702 may also acquire, for example, words obtained by morphological analysis of the target expressions. Hereinafter, the acquired words are denoted by w_1, ..., w_n.
 In step S801, the filter learning unit 702 may acquire only the words in a learning range designated by document IDs or the like, together with the appearance frequency of each word within that learning range; in that case, the subsequent processing is also performed on that learning range. The learning range is designated, for example, by the user.
 For each word w_i (1 ≤ i ≤ n) acquired in step S801, the filter learning unit 702 sets initial values of a variable χ_i (0 or 1), variables π_ij (0 ≤ π_ij ≤ 1, 1 ≤ j ≤ n), and a real-valued parameter θ_i, each within its domain (S802). In setting the initial values, the filter learning unit 702 can, for example, set all χ_i to 1 and set π_ij and θ_i to predetermined values. The filter learning unit 702 may also set each initial value randomly within its domain.
 Subsequently, the filter learning unit 702 calculates R(w_i) = P_D / P_N for each word w_i (S803). Here, P_D is the probability that w_i is an extraction target word, and P_N is the probability that w_i is a filter word. The methods of calculating P_D and P_N are described below. The filter learning unit 702 calculates P_D for each word w_i, for example, as follows.
[Math 2]
 Here, χ_i is a flag indicating whether the word w_i is an extraction target word: χ_i = 1 indicates that the word w_i is an extraction target word, and χ_i = 0 indicates that the word w_i is not an extraction target word, that is, it is a filter word. π_ij is the probability that the word w_i is derived from the word w_j. "The word w_i is derived from w_j" refers to the state in which the sentence extraction unit 103 has mistakenly extracted the word w_j in the target document as the word w_i, for example due to an OCR error.
 d_m(w_i, w_j) denotes the similarity between the word w_i and the word w_j; for example, the edit distance is used as the similarity. P(w_i | χ_i = 1) denotes the proportion that the appearance frequency of the word w_i occupies in the total appearance frequency of all words with χ_i = 1. By using d_m and π_ij in calculating P_D, the filter learning unit 702 can perform filter learning with high accuracy even for words that have been misrecognized due to OCR errors or the like. The filter learning unit 702 calculates P(d_m | θ), for example, as follows.
[Math 3]
 Here, the filter learning unit 702 calculates P(d_m | θ) using a Poisson distribution, but an appropriate probability density function can be used according to the word generation model. The filter learning unit 702 may use, for example, other distributions of the exponential family, such as the Bernoulli distribution, binomial distribution, multinomial distribution, normal distribution, exponential distribution, t-distribution, chi-squared distribution, gamma distribution, beta distribution, F-distribution, or Laplace distribution. The filter learning unit 702 calculates P_N, for example, as follows.
[Math 4]
 P(w_i | χ_i = 0) denotes the proportion that the appearance frequency of the word w_i occupies in the total appearance frequency of all words with χ_i = 0. The filter learning unit 702 resets the value of the variable χ_i to 1 for all words with R(w_i) > 1, resets the value of the variable χ_i to 0 for all words with R(w_i) ≤ 1, and resets π_ij and θ_i based on the reset χ_i (S804). Alternatively, the filter learning unit 702 may reset the value of the variable χ_i to 1 for all words with R(w_i) ≥ 1 and to 0 for all words with R(w_i) < 1.
 In step S804, the filter learning unit 702 thus resets the value of the variable χ_i based on R(w_i); the threshold used here may be 1 as in the above example, or another value within the range of R(w_i) (a real number greater than or equal to 0). For convenience, the variable γ_ik (1 ≤ k ≤ n) is defined as follows.
[Math 5]
 また、変数Γを以下のように定義する。 The variable Γ i is defined as follows.
[Math 6]
 フィルタ学習部702は、以上の値を用いて、πijを例えば、以下のように再設定する。 The filter learning unit 702 uses the above values to reset π ij as follows, for example.
[Math 7]
 また、フィルタ学習部702は、パラメータθを例えば、以下のように再設定する。 Further, the filter learning unit 702 resets the parameter θ k as follows, for example.
[Math 8]
 なお、上述したパラメータθの再設定の例は、P(d|θ)の算出にポワソン分布が用いられた場合に対応するものである。P(d|θ)の算出にポワソン分布以外の分布が用いられた場合、フィルタ学習部702は、例えば、以下に示すθについての更新式を解くことにより、θを再設定する。 The example of resetting the parameter θ k described above corresponds to the case where the Poisson distribution is used for calculating P (d m | θ). When a distribution other than the Poisson distribution is used to calculate P (d m | θ), the filter learning unit 702 resets θ k by, for example, solving the update formula for θ k shown below.
[Math 9]
 続いて、フィルタ学習部702は、全単語における現在のパラメータに対する同時確率を以下のように計算する(S805)。 Subsequently, the filter learning unit 702 calculates the joint probability for the current parameters in all words as follows (S805).
[Math 10]
 The filter learning unit 702 determines whether the above joint probability has converged (S806). The filter learning unit 702 determines that the joint probability has converged, for example, when the joint probability is a value within a predetermined range. Alternatively, the filter learning unit 702 may, for example, compare the above joint probability with the previously calculated joint probability and determine that the joint probability has converged when it has not increased by a certain value or by a certain ratio or more.
 フィルタ学習部702が、同時確率が収束したと判定した場合(S806;yes)、処理を終了する。フィルタ学習部702が、同時確率が収束していないと判定した場合(S806:no)、ステップS803に戻る。 If the filter learning unit 702 determines that the joint probability has converged (S806; yes), the process ends. When the filter learning unit 702 determines that the joint probability has not converged (S806: no), the process returns to step S803.
 The filter learning unit 702 can select, according to the value of χ_i corresponding to each word w_i at the end of the processing, whether each word w_i is an extraction target word or a filter word. The filter learning unit 702 transmits, for example, the set of extraction target words and the set of filter words to the filter model storage unit 703.
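 The following skeleton illustrates only the control flow of steps S801 to S806; the callables p_target and p_filter are hypothetical placeholders for the definitions of P_D and P_N above (the re-estimation of π_ij and θ is assumed to happen inside them), so this is a sketch rather than the patented computation:

    import math

    def learn_word_filter(words, p_target, p_filter, max_iter=100, tol=1e-6):
        chi = {w: 1 for w in words}                          # S802: start with every word flagged 1
        prev = -math.inf
        for _ in range(max_iter):
            # S803: R(w) = P_D / P_N for every word under the current flags
            r = {w: p_target(w, chi) / max(p_filter(w, chi), 1e-12) for w in words}
            # S804: re-set the flags (threshold 1, as in the text); re-estimation of
            # pi_ij and theta is assumed to be folded into the callables
            chi = {w: (1 if r[w] > 1.0 else 0) for w in words}
            # S805: joint (log) probability of the current assignment
            ll = sum(math.log(max(p_target(w, chi) if chi[w] == 1 else p_filter(w, chi), 1e-12))
                     for w in words)
            if abs(ll - prev) < tol:                         # S806: convergence check
                break
            prev = ll
        targets = [w for w in words if chi[w] == 1]          # extraction target words
        filters = [w for w in words if chi[w] == 0]          # filter words
        return targets, filters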
 図9は、フィルタ適用部704によるフィルタ適用処理の一例を示す。図9におけるフィルタ適用処理は、図8におけるフィルタ学習処理を用いる例を示す。フィルタ適用部704は、フィルタモデル蓄積部703から抽出対象語の集合を取得し、対象データ701からフィルタ適用対象語集合を取得する(S901)。フィルタモデル蓄積部703が保持する抽出対象語の集合は、図8に示した教師なし学習手段によって得られた集合である。対象データ701に含まれる近傍語からなる集合はフィルタ適用対象語集合の一例である。フィルタ適用部704は、例えば、対象データ701に含まれる対象表現に対する形態素解析により得られた単語を、フィルタ適用対象語集合に含めてもよい。 FIG. 9 shows an example of filter application processing by the filter application unit 704. The filter application process in FIG. 9 shows an example using the filter learning process in FIG. The filter application unit 704 acquires a set of extraction target words from the filter model storage unit 703, and acquires a filter application target word set from the target data 701 (S901). The set of extraction target words held by the filter model storage unit 703 is a set obtained by the unsupervised learning means shown in FIG. A set of neighboring words included in the target data 701 is an example of a filter application target word set. For example, the filter application unit 704 may include a word obtained by morphological analysis on the target expression included in the target data 701 in the filter application target word set.
 Subsequently, the filter application unit 704 checks whether the filter application target word set contains any extraction target word (S902). In doing so, the filter application unit 704 may check for exact matches between each word of the filter application target word set and each extraction target word, or may check using a measure based on similarity between words, such as the edit distance.
 The filter application unit 704 may also check whether all of the extraction target words are contained, or whether one or more extraction target words are contained. If the filter application unit 704 determines that the filter application target word set contains no extraction target word (S902: no), all words in the filter application target word set are filter words, so nothing is output and the processing ends.
 If the filter application unit 704 determines that the filter application target word set contains an extraction target word (S902: yes), the filter application unit 704 outputs the result data 705 after filter application (S903) and ends the processing. Data obtained by removing the filter words and the coordinates corresponding to the filter words from the target data 701 is an example of the result data 705 after filter application.
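 A sketch of steps S901 to S903, assuming the similarity check uses a small edit-distance threshold (the threshold of 1 is an illustrative choice):

    def edit_distance(a, b):
        # standard Levenshtein distance with a single rolling row
        dp = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, cb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
        return dp[len(b)]

    def apply_word_filter(filter_application_words, extraction_target_words, max_dist=1):
        kept = [w for w in filter_application_words
                if any(edit_distance(w, t) <= max_dist for t in extraction_target_words)]
        return kept        # an empty list corresponds to the "nothing is output" branch (S902: no)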
 FIG. 10 shows an example of filtering results for words by the filter unit 107. "Correct" indicates words that should actually be targets, and "incorrect" indicates words that are not actually targets. "Acquired" indicates words determined to be extraction target words by the unsupervised learning described above, and "not acquired" indicates words determined to be filter words by the unsupervised learning method described above. For the extraction target words, the precision, defined as (correct and acquired) / {(correct and acquired) + (incorrect and acquired)}, was 75%, and the recall, defined as (correct and acquired) / {(correct and acquired) + (correct and not acquired)}, was 56.8%. By the method described above, the information extraction system 101 can determine a small number of extraction target words from among many words without supervision.
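 Written in conventional notation, the two measures used here are:

    \text{precision} = \frac{|\text{correct} \wedge \text{acquired}|}{|\text{correct} \wedge \text{acquired}| + |\text{incorrect} \wedge \text{acquired}|},
    \qquad
    \text{recall} = \frac{|\text{correct} \wedge \text{acquired}|}{|\text{correct} \wedge \text{acquired}| + |\text{correct} \wedge \text{not acquired}|}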
 図11は、フィルタ学習部702によるフィルタ学習処理の第二の例を示す。本例は、座標に対するフィルタの学習処理である。フィルタ学習部702は、対象データ701中の対象表現の座標情報を取得する(S1101)。なお、フィルタ学習部702は、例えば、対象データ中の近傍語の座標情報を併せて取得してもよい。 FIG. 11 shows a second example of filter learning processing by the filter learning unit 702. This example is a filter learning process for coordinates. The filter learning unit 702 acquires coordinate information of the target expression in the target data 701 (S1101). Note that the filter learning unit 702 may also acquire, for example, the coordinate information of neighboring words in the target data.
 Subsequently, the filter learning unit 702 sets an initial value of a real-valued parameter η (S1102). The initial value of η may be designated in advance, or may be designated by, for example, the user. The initial value of η is preferably designated according to the size of the target document 102; specifically, it is preferably set, for example, to a value obtained by substituting the area of one line of the target document 102 into a predetermined increasing function. η may also be adjusted according to the extraction results. Subsequently, the filter learning unit 702 learns a kernel density estimation function p(x) according to the following formula (S1103), outputs the learned result, and ends. p(x) denotes the probability density that a coordinate x is a coordinate to be extracted.
[Math 11]
 Here, N is the number of coordinates acquired in step S1101, D is the dimension of the coordinates, x is a variable denoting an arbitrary coordinate, and x_n denotes each coordinate acquired in step S1101. In the example of FIG. 11, the filter learning unit 702 estimates the probability density using kernel density estimation, but other probability density estimation methods may be used, such as the k-nearest neighbor method, the histogram method, or a Gaussian mixture distribution.
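 The formula referenced above appears only as an image in this text; a Gaussian-kernel form consistent with the variables named here (N, D, η, x_n) would be the following, offered as an assumed reconstruction rather than the original equation:

    p(x) \;=\; \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi\eta^{2})^{D/2}} \exp\!\left( -\frac{\lVert x - x_{n} \rVert^{2}}{2\eta^{2}} \right)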
 図12は、フィルタ適用部704によるフィルタ適用処理の第二の例を示す。本例は、図11に示した座標に対するフィルタを適用する処理である。フィルタ適用部704は、対象データ701に含まれる対象表現及び対象表現の座標、並びに閾値を取得する(S1201)。閾値は、利用者などより与えられてもよいし、予め設定されていてもよいし、出力結果の正否判定に基づいてフィルタ適用部704によって設定されてもよい。 FIG. 12 shows a second example of filter application processing by the filter application unit 704. This example is processing for applying a filter to the coordinates shown in FIG. The filter application unit 704 acquires the target expression included in the target data 701, the coordinates of the target expression, and a threshold value (S1201). The threshold value may be given by a user or the like, may be set in advance, or may be set by the filter application unit 704 based on whether the output result is correct or not.
 The filter application unit 704 substitutes each acquired coordinate into the coordinate filter model p(x) illustrated in FIG. 11 to calculate the likelihood (probability value) of each acquired coordinate, and determines whether each calculated likelihood is greater than or equal to the acquired threshold (S1202). If the filter application unit 704 determines that all calculated likelihoods are smaller than the threshold (S1202: no), there are no coordinates to be extracted, so the processing ends.
 If the filter application unit 704 determines that there is a likelihood that is greater than or equal to the threshold (S1202: yes), the filter application unit 704 outputs the result data 705 after filter application (S1203) and ends the processing. Target data 701 from which the target expressions whose coordinates correspond to likelihoods below the threshold, the neighboring words of those target expressions, and the coordinates of those target expressions have been removed is an example of the result data 705 after filter application. When the coordinate filter shown in FIGS. 11 and 12 is used, the target selection unit 106 need not acquire the neighboring words of the target expressions.
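 A compact sketch of the coordinate filter of FIGS. 11 and 12 using the assumed Gaussian-kernel density above; the values of η and the threshold are illustrative:

    import math

    def gaussian_kde(points, eta):
        d = len(points[0])
        norm = (2.0 * math.pi * eta ** 2) ** (d / 2.0)
        def p(x):
            return sum(math.exp(-sum((a - b) ** 2 for a, b in zip(x, q)) / (2.0 * eta ** 2))
                       for q in points) / (len(points) * norm)
        return p

    def apply_coordinate_filter(targets, eta=40.0, threshold=1e-5):
        pts = [t["coords"] for t in targets]       # e.g. centre coordinates (x, y) of each target
        p = gaussian_kde(pts, eta)                 # S1103: learned density model
        return [t for t in targets if p(t["coords"]) >= threshold]   # S1202-S1203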
 図13は、フィルタ学習部702によるフィルタ学習処理の第三の例を示す。本例は、複数のフィルタモデルを結合するフィルタ学習処理である。フィルタ学習部702は、対象データ701と複数のフィルタモデルを取得する(S1301)。 FIG. 13 shows a third example of filter learning processing by the filter learning unit 702. This example is a filter learning process that combines a plurality of filter models. The filter learning unit 702 acquires target data 701 and a plurality of filter models (S1301).
 The filter learning unit 702 initializes a filter combination model generated from the acquired filter models (S1302). The filter combination model can use, for example, machine learning such as linear discrimination, a support vector machine, or a decision tree, taking as input the values output by each filter model or numerical representations of their determination results. For example, when the filter combination model is defined as a weighted sum of a plurality of filter models, the filter learning unit 702 initializes the weights when initializing the filter combination model.
 フィルタ学習部702は、正誤情報、又は重み情報に基づき、フィルタ結合モデルを学習する(S1303)。以下、フィルタ結合モデルに線形識別が用いられる例を説明する。フィルタ学習部702は、下記の不等式が成立する場合にフィルタすると判定し、成立しない場合にフィルタしないと判定する。 The filter learning unit 702 learns the filter combination model based on the correct / incorrect information or the weight information (S1303). Hereinafter, an example in which linear identification is used for the filter combination model will be described. The filter learning unit 702 determines that filtering is performed when the following inequality is satisfied, and determines that filtering is not performed when the following inequality is not satisfied.
[Math 12]
 In the linear discrimination indicated by the above inequality, the filter learning unit 702 calculates the inner product S of the score vector X, whose elements are the output values of the respective filter models, and a real-valued vector W set for the filter models, and compares the calculated inner product S with a threshold U. Hereinafter, the inner product S is referred to as the output value of the filter combination model.
 The filter learning unit 702 may accept input of correct/incorrect information on the filter results from the user. Based on the input correct/incorrect information (let T be the matrix form of the correct/incorrect information), the filter learning unit 702 may reset W to an appropriate value by optimizing an evaluation function E, for example the sum-of-squares error shown by the following formula.
[Math 13]
 When weight information is given by the user, the filter learning unit 702 may set the weight information as the real-valued matrix W. When correct/incorrect information is given together with weight information (let V be the matrix form of the weight information), the filter learning unit 702 may set V as a weight on the real-valued matrix W in the evaluation function, as in the following formula, and perform the optimization.
[Math 14]
 The filter learning unit 702 may also repeat a process of again accepting input of correct/incorrect information on the filter results obtained with the reset W, and resetting W based on the newly accepted correct/incorrect information. The filtering method described above is not limited to linear discrimination and can be applied as long as the discrimination model and its evaluation function are appropriately defined.
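 A sketch of the linear-discrimination combination of FIGS. 13 and 14; fitting W by plain gradient descent on the squared error, and the learning rate and threshold values, are assumptions for illustration:

    def dot(w, x):
        return sum(wi * xi for wi, xi in zip(w, x))

    def fit_combination_weights(score_vectors, labels, lr=0.01, epochs=200):
        # score_vectors: one vector X of individual filter outputs per training item;
        # labels: 1 = should be kept, 0 = should be filtered out (correct/incorrect info T)
        w = [0.0] * len(score_vectors[0])
        for _ in range(epochs):
            for x, t in zip(score_vectors, labels):
                err = dot(w, x) - t                     # gradient of the squared error
                w = [wi - lr * err * xi for wi, xi in zip(w, x)]
        return w

    def apply_combined_filter(items, filter_models, w, threshold=0.5):
        kept = []
        for item in items:
            x = [f(item) for f in filter_models]        # S1402: output of each filter model
            if dot(w, x) >= threshold:                  # S1403-S1404: compare W . X with U
                kept.append(item)                       # S1405: part of the result data
        return kept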
 図14は、フィルタ適用部704によるフィルタ適用処理の第三の例を示す。本例は、フィルタ結合モデルにおけるフィルタの適用処理である。 FIG. 14 shows a third example of filter application processing by the filter application unit 704. This example is a filter application process in the filter combination model.
 フィルタ適用部704は、対象データ701、複数のフィルタモデル、及び当該複数のフィルタモデルが結合されたフィルタ結合モデルを取得する(S1401)。続いて、フィルタ適用部704は、対象データ701を取得した各フィルタモデルに入力し、各フィルタモデルの出力値を取得する(S1402)。 The filter application unit 704 acquires the target data 701, a plurality of filter models, and a filter combination model obtained by combining the plurality of filter models (S1401). Subsequently, the filter application unit 704 inputs the target data 701 to each acquired filter model, and acquires the output value of each filter model (S1402).
 続いて、フィルタ適用部704は、S1402で算出した各フィルタモデルの出力値を、フィルタ結合モデルに入力し、フィルタ結合モデルの出力値を取得する(S1403)。続いて、フィルタ適用部704は、フィルタ結合モデルの出力値が、例えば閾値U以上であるか否かを判定する(S1404)。フィルタ結合モデルの出力値が、閾値Uより小さい場合(S1404:no)、処理を終了する。 Subsequently, the filter application unit 704 inputs the output value of each filter model calculated in S1402 to the filter combination model, and acquires the output value of the filter combination model (S1403). Subsequently, the filter application unit 704 determines whether or not the output value of the filter combination model is, for example, greater than or equal to the threshold value U (S1404). If the output value of the filter combination model is smaller than the threshold value U (S1404: no), the process ends.
 フィルタ結合モデルの出力値が、閾値U以上である場合(S1404:yes)、フィルタ適用部704は、フィルタ適用後の結果データ705を出力して終了する(S1405)。 When the output value of the filter combination model is greater than or equal to the threshold value U (S1404: yes), the filter application unit 704 outputs the result data 705 after the filter application and ends (S1405).
 図15は、利用者へのユーザインターフェースの第一の例を示す。ユーザインターフェース1500は、例えば、対象ID入力セクション1501、対象情報入力セクション1502、フィルタ調整用のチェックボックス1503~1505、抽出結果表示セクション1506、及び正誤指定セクション1507を含む。 FIG. 15 shows a first example of a user interface to the user. The user interface 1500 includes, for example, a target ID input section 1501, a target information input section 1502, filter adjustment check boxes 1503 to 1505, an extraction result display section 1506, and a correct / incorrect specification section 1507.
 対象ID入力セクション1501は、例えば、文データ300に含まれる文ID、文書ID、及び対象選定結果データブロック600に含まれる対象ID等の入力を受け付ける。対象情報入力セクション1502は、例えば対象情報109の入力を受け付ける。 The target ID input section 1501 accepts input of, for example, a sentence ID included in the sentence data 300, a document ID, and a target ID included in the target selection result data block 600. The target information input section 1502 accepts input of target information 109, for example.
 チェックボックス1503~1504は、学習及び適用するフィルタを選択するためのチェックボックスである。例えば、チェックボックス1503は座標によるフィルタ、チェックボックス1504は単語によるフィルタ、を選択するためのチェックボックスである。利用者は、例えば、チェックボックス1503、及びチェックボックス1504の双方にチェックを入れることにより、例えば、座標によるフィルタと単語によるフィルタとを結合したフィルタ結合モデルが選択することができる。チェックボックス1505は、正誤判定結果より自動的に学習を行うか否かを選択するためのチェックボックスである。 Check boxes 1503 to 1504 are check boxes for selecting a filter to be learned and applied. For example, a check box 1503 is a check box for selecting a filter by coordinates, and a check box 1504 is a filter for selecting a word. For example, by checking both the check box 1503 and the check box 1504, the user can select, for example, a filter combination model in which a filter by coordinates and a filter by words are combined. A check box 1505 is a check box for selecting whether or not learning is automatically performed based on the correctness determination result.
 The extraction result display section 1506 lists the extraction results after filter application. The extraction result display section 1506 displays, for example, the target expressions included in the extraction results, the neighboring words of those target expressions, and the full text of the target sentences containing those target expressions. The extraction result display section 1506 may also display, for example, the coordinates of the target expressions. The extraction result display section 1506 is displayed, for example, in list form, and the display order in the list may follow the values calculated by the filter unit 107 at filter application (for example, values such as R(w_i)). The correct/incorrect designation section 1507 accepts, for example, input of the user's correct/incorrect judgment as to whether an extraction result is appropriate.
 図16は、利用者へのユーザインターフェースの第二の例を示す。ユーザインターフェース1600は、ユーザインターフェース1500の構成に加え、例えば、フィルタ調整セクション1601~1602を含む。 FIG. 16 shows a second example of the user interface to the user. The user interface 1600 includes, for example, filter adjustment sections 1601 to 1602 in addition to the configuration of the user interface 1500.
 フィルタ調整セクション1601~1602は、フィルタ学習及びフィルタ適用に関する情報の入力を受け付ける。フィルタ調整セクション1601は、例えば、線形識別によるフィルタ結合モデルにおける座標の重みの初期値の入力を受け付ける。フィルタ調整セクション1602は、例えば、線形識別によるフィルタ結合モデルにおける単語の重みの初期値の入力を受け付ける。 The filter adjustment sections 1601 to 1602 accept input of information related to filter learning and filter application. The filter adjustment section 1601 accepts input of initial values of coordinate weights in a filter combination model based on linear identification, for example. The filter adjustment section 1602 accepts input of initial values of word weights in a filter combination model based on linear identification, for example.
 By configuring the user interface as shown in FIG. 15 or FIG. 16, the user can give appropriate target information for any document, sentence, or extraction result, and can perform information extraction while adjusting the filters. The user can also designate correct/incorrect judgments based on the extraction results, and can change the target information according to the extraction results.
 以上、本実施例の情報抽出システム101によって、利用者は抽出対象の単語等を事前に調べることなく、試行錯誤的に情報抽出を行うことができる。つまり情報抽出システム101は、事前に辞書やHTML等の論理構造に依存せず、多様な非定形文書から、利用者が必要とする情報を高精度に抽出することができる。 As described above, the information extraction system 101 according to the present embodiment enables the user to perform information extraction on a trial and error basis without checking the extraction target word or the like in advance. That is, the information extraction system 101 can extract information required by the user with high accuracy from various non-standard documents without depending on a logical structure such as a dictionary or HTML in advance.
 図17には、情報抽出システムの第二の構成例を示す。情報抽出システム1701は、例えば、実施例1の情報抽出システム101と同様の構成を含む。情報抽出システム1701は、以下の点において、実施例1の情報抽出システム101と異なる。対象選定部106が、対象文書102の入力を受け付け、対象文書102から対象情報109に合致する文及び座標を選定し、文抽出部103及び座標抽出部104に対象選定結果を送信する。文抽出部103/座標抽出部104は、対象文書102ではなく対象選定結果から文/座標抽出を行う。 FIG. 17 shows a second configuration example of the information extraction system. The information extraction system 1701 includes, for example, the same configuration as the information extraction system 101 of the first embodiment. The information extraction system 1701 is different from the information extraction system 101 of the first embodiment in the following points. The target selection unit 106 receives input of the target document 102, selects sentences and coordinates that match the target information 109 from the target document 102, and transmits the target selection results to the sentence extraction unit 103 and the coordinate extraction unit 104. The sentence extraction unit 103 / coordinate extraction unit 104 performs sentence / coordinate extraction from the target selection result instead of the target document 102.
 このように、情報抽出システム1701を構成することで、情報抽出システム1701は、利用者からの対象情報109に基づき、適切に情報抽出結果110を出力することができる。また、情報抽出システム1701は、情報抽出結果110を入力として、新たに対象情報を設定して情報抽出を行うことができる。 Thus, by configuring the information extraction system 1701, the information extraction system 1701 can appropriately output the information extraction result 110 based on the target information 109 from the user. In addition, the information extraction system 1701 can extract information by setting the target information anew using the information extraction result 110 as an input.
 By configuring the information extraction system 1701 as described above, a system, method, and program can be realized that allow the user to extract information by trial and error without examining the words to be extracted in advance. As a result, the information extraction system 1701 can extract the information the user needs with high accuracy from a variety of non-standard documents, without depending on a dictionary prepared in advance or on a logical structure such as HTML.
 なお、本発明は上記した実施例に限定されるものではなく、様々な変形例が含まれる。例えば、上記した実施例は本発明を分かりやすく説明するために詳細に説明したものであり、必ずしも説明した全ての構成を備えるものに限定されるものではない。また、ある実施例の構成の一部を他の実施例の構成に置き換えることも可能であり、また、ある実施例の構成に他の実施例の構成を加えることも可能である。また、各実施例の構成の一部について、他の構成の追加・削除・置換をすることが可能である。 In addition, this invention is not limited to the above-mentioned Example, Various modifications are included. For example, the above-described embodiments have been described in detail for easy understanding of the present invention, and are not necessarily limited to those having all the configurations described. Further, a part of the configuration of a certain embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to the configuration of a certain embodiment. Further, it is possible to add, delete, and replace other configurations for a part of the configuration of each embodiment.
 また、上記の各構成、機能、処理部、処理手段等は、それらの一部又は全部を、例えば集積回路で設計する等によりハードウェアで実現してもよい。また、上記の各構成、機能等は、プロセッサがそれぞれの機能を実現するプログラムを解釈し、実行することによりソフトウェアで実現してもよい。各機能を実現するプログラム、テーブル、ファイル等の情報は、メモリや、ハードディスク、SSD(Solid State Drive)等の記録装置、または、ICカード、SDカード、DVD等の記録媒体に置くことができる。 Each of the above configurations, functions, processing units, processing means, and the like may be realized partly or entirely in hardware, for example by designing them as integrated circuits. Each of the above configurations, functions, and the like may also be realized in software by having a processor interpret and execute a program that implements the corresponding function. Information such as the programs, tables, and files that implement each function can be stored in a memory, in a recording device such as a hard disk or SSD (Solid State Drive), or on a recording medium such as an IC card, SD card, or DVD.
 また、制御線や情報線は説明上必要と考えられるものを示しており、製品上必ずしも全ての制御線や情報線を示しているとは限らない。実際には殆ど全ての構成が相互に接続されていると考えてもよい。 The control lines and information lines shown are those considered necessary for the explanation; not all control lines and information lines in an actual product are necessarily shown. In practice, almost all components may be considered to be connected to one another.

Claims (10)

  1.  対象文書から情報を抽出する情報抽出システムであって、
     プログラムを実行するプロセッサと、前記プロセッサがアクセスするメモリと、を含み、
     前記プロセッサは、情報抽出処理を行い、
     前記情報抽出処理において、
     抽出対象の文字列の集合を示す対象情報の入力を受け付け、
     前記対象情報に含まれる文字列のいずれかに合致する文字列である対象表現と、前記対象表現それぞれの所定距離以内に配置された単語である近傍語と、を前記対象文書から抽出し、
     前記近傍語それぞれの前記対象文書中の出現頻度、又は前記対象表現それぞれの前記対象文書中の座標、に基づく教師なし学習を用いてフィルタを生成し、
     前記近傍語を含むフィルタ適用対象語集合に、前記フィルタを適用し、
     前記フィルタ適用対象語集合に前記フィルタを適用して得られた抽出対象語集合を出力する、情報抽出システム。
    An information extraction system that extracts information from a target document,
    A processor that executes a program; and a memory that is accessed by the processor;
    The processor performs information extraction processing,
    In the information extraction process,
    Accept input of target information indicating a set of character strings to be extracted,
    Extracting from the target document a target expression that is a character string that matches any of the character strings included in the target information, and neighboring words that are words arranged within a predetermined distance of each of the target expressions,
    Generating a filter using unsupervised learning based on the appearance frequency of each of the neighborhood words in the target document, or the coordinates in the target document of each of the target expressions,
    Applying the filter to a filter application target word set including the neighboring words;
    An information extraction system that outputs an extraction target word set obtained by applying the filter to the filter application target word set.
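As an illustration of the flow recited in claim 1 (this sketch is not part of the claims; the function names, the token-level matching that stands in for character-string matching, and the fixed distance window are assumptions made for brevity):

```python
def extract_neighbors(document_tokens, target_info, window=3):
    """Collect the target expressions (tokens matching the target information)
    and the neighboring words placed within `window` positions of each one."""
    target_expressions, neighbors = [], []
    for i, token in enumerate(document_tokens):
        if token in target_info:
            target_expressions.append(token)
            lo, hi = max(0, i - window), min(len(document_tokens), i + window + 1)
            neighbors.extend(t for t in document_tokens[lo:hi] if t != token)
    return target_expressions, neighbors

def run_extraction(document_tokens, target_info, build_filter):
    """`build_filter` stands for any unsupervised routine (frequency- or
    coordinate-based, as in the later claims) that turns the neighboring words
    into a keep/discard predicate; applying it to the filter application target
    word set yields the extraction target word set."""
    _, neighbors = extract_neighbors(document_tokens, target_info)
    keep = build_filter(neighbors)
    candidate_set = set(neighbors)          # filter application target word set
    return {w for w in candidate_set if keep(w)}
```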
  2.  請求項1に記載の情報抽出システムであって、
     前記プロセッサは、
     前記フィルタの生成において、前記近傍語それぞれが、抽出対象語であるか、抽出非対象であるフィルタ語であるか、を示すフラグそれぞれの設定処理を繰り返し、
     前記設定処理において、
      前記近傍語それぞれのフラグを取得し、
      前記近傍語のフラグに対する同時確率が収束したと判定した場合、前記近傍語それぞれのフラグに従って、前記近傍語それぞれが抽出対象語であるかフィルタ語であるかを決定して、前記設定処理を終了し、
      前記同時確率が収束していないと判定した場合、
      抽出対象語であることを示すフラグに対応する近傍語の前記対象文書中の総出現頻度のうち、前記近傍語それぞれの前記対象文書中の出現頻度が占める割合に基づいて、前記近傍語それぞれについて当該近傍語が抽出対象語である第1確率を算出し、
      フィルタ語であることを示すフラグに対応する近傍語の前記対象文書中の総出現頻度のうち、前記近傍語それぞれの前記対象文書中の出現頻度が占める割合に基づいて、前記近傍語それぞれについて当該近傍語がフィルタ語である第2確率を算出し、
      前記近傍語それぞれの第1確率と第2確率との比に基づいて、次回の設定処理における前記近傍語それぞれのフラグを決定し、
     前記フィルタの適用において、前記フィルタ適用対象語集合から、前記決定した抽出対象語を抽出する、情報抽出システム。
    The information extraction system according to claim 1,
    The processor is
    In the generation of the filter, the setting process of each flag indicating whether each of the neighboring words is an extraction target word or a filter word that is a non-extraction target is repeated,
    In the setting process,
    Acquiring the flag of each of the neighboring words,
    When it is determined that the joint probability for the flags of the neighboring words has converged, determining, according to the flag of each of the neighboring words, whether each of the neighboring words is an extraction target word or a filter word, and ending the setting process;
    If it is determined that the joint probability has not converged,
    Calculating, for each of the neighboring words, a first probability that the neighboring word is an extraction target word, based on the proportion that the appearance frequency in the target document of the neighboring word occupies in the total appearance frequency in the target document of the neighboring words whose flags indicate extraction target words;
    Calculating, for each of the neighboring words, a second probability that the neighboring word is a filter word, based on the proportion that the appearance frequency in the target document of the neighboring word occupies in the total appearance frequency in the target document of the neighboring words whose flags indicate filter words;
    Determining, based on the ratio between the first probability and the second probability of each of the neighboring words, the flag of each of the neighboring words for the next setting process,
    An information extraction system for extracting the determined extraction target word from the filter application target word set in the application of the filter.
  3.  請求項2に記載の情報抽出システムであって、
     前記プロセッサは、前記設定処理において、前記近傍語それぞれの間の類似度に基づいて、前記近傍語それぞれの第1確率を算出する、情報抽出システム。
    The information extraction system according to claim 2,
    The information extraction system, wherein the processor calculates a first probability of each of the neighboring words based on a similarity between the neighboring words in the setting process.
  4.  請求項2に記載の情報抽出システムであって、
     前記同時確率は、下記数式で表され、
    [Math. 1 (JPOXMLDOC01-appb-M000001): the formula is shown as an image in the original publication]
 上記数式における、i及びjは近傍語の個数以下の自然数を、w_i及びw_jは近傍語を、χ_iは近傍語w_iの前記フラグを、χ_jは近傍語w_jの前記フラグを、π_ijは単語w_iが単語w_jから派生している確率を、d_m(w_i,w_j)は近傍語w_iと近傍語w_jの類似度を、P(w_j|χ_j=1)はχ_j=1である全ての近傍語の総出現頻度に占める近傍語w_jの出現頻度の割合を、P(w_i|χ_i=0)はχ_i=0である全ての近傍語の総出現頻度に占める近傍語w_iの出現頻度の割合を、P(d_m(w_i,w_j)|θ_j)は所定の確率分布の確率密度関数であって、パラメータがθ_jである確率密度関数において確率変数がd_m(w_i,w_j)であるときの確率を、示す、情報抽出システム。
    The information extraction system according to claim 2,
    The joint probability is expressed by the following formula:
    [Math. 1 (JPOXMLDOC01-appb-M000001): the formula is shown as an image in the original publication]
    In the above formula, i and j are natural numbers not greater than the number of neighboring words, w_i and w_j are neighboring words, χ_i is the flag of the neighboring word w_i, χ_j is the flag of the neighboring word w_j, π_ij is the probability that the word w_i is derived from the word w_j, d_m(w_i, w_j) is the similarity between the neighboring words w_i and w_j, P(w_j | χ_j = 1) is the proportion of the appearance frequency of the neighboring word w_j in the total appearance frequency of all neighboring words with χ_j = 1, P(w_i | χ_i = 0) is the proportion of the appearance frequency of the neighboring word w_i in the total appearance frequency of all neighboring words with χ_i = 0, and P(d_m(w_i, w_j) | θ_j) is the probability density function of a predetermined probability distribution with parameter θ_j evaluated at the value d_m(w_i, w_j) of its random variable, the information extraction system.
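The published formula itself is available only as an image, so the sketch below should be read as an approximation of claims 2 to 4 rather than a transcription of the equation: neighboring words are iteratively flagged as extraction target words or filter words, the first probability combines the frequency share among target-flagged words with the similarity term of claim 3, and the loop stops when the (log) joint probability of the flags stops changing. The initialization, the exact similarity weighting, and the convergence test are all assumptions.

```python
import math
from collections import Counter

def fit_filter(neighbors, similarity, max_iter=50, tol=1e-9):
    """Flag each neighboring word as extraction target (1) or filter word (0).
    `similarity(a, b)` is assumed to return a value in [0, 1]."""
    freq = Counter(neighbors)
    words = list(freq)
    # Assumed initialization: the more frequent half starts as target words.
    ranked = sorted(words, key=freq.get, reverse=True)
    flags = {w: 1 if i < len(ranked) / 2 else 0 for i, w in enumerate(ranked)}
    prev = None
    for _ in range(max_iter):
        targets = [w for w in words if flags[w] == 1] or words
        fillers = [w for w in words if flags[w] == 0] or words
        t_total = sum(freq[w] for w in targets)
        f_total = sum(freq[w] for w in fillers)
        p1, p2, log_joint = {}, {}, 0.0
        for w in words:
            # First probability: frequency share among target-flagged words,
            # weighted by average similarity to the other target words.
            sim = sum(similarity(w, t) for t in targets if t != w) / max(len(targets) - 1, 1)
            p1[w] = (freq[w] / t_total) * (0.5 + 0.5 * sim)
            # Second probability: frequency share among filter-flagged words.
            p2[w] = freq[w] / f_total
            log_joint += math.log(p1[w] if flags[w] == 1 else p2[w])
        if prev is not None and abs(log_joint - prev) < tol:
            break                      # joint probability converged: flags are final
        prev = log_joint
        # Next flags follow the ratio of the first and second probabilities.
        flags = {w: 1 if p1[w] >= p2[w] else 0 for w in words}
    return {w for w in words if flags[w] == 1}
```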
  5.  請求項1に記載の情報抽出システムであって、
     前記フィルタ適用対象語集合は前記対象表現を含み、
     前記プロセッサは、
     前記フィルタの生成において、前記対象表現それぞれの前記対象文書中の座標に基づいて、前記対象文書中の抽出対象である座標を示す確率変数の確率密度関数を推定し、
     前記フィルタの適用において、前記推定した確率密度関数に基づいて、前記対象表現それぞれの座標について当該座標が抽出対象座標である確率を算出し、前記算出した確率が閾値以上である対象表現と当該対象表現の近傍語とを、前記フィルタ適用対象語集合から抽出する、情報抽出システム。
    The information extraction system according to claim 1,
    The filter application target word set includes the target expression,
    The processor is
    In the generation of the filter, estimating, based on the coordinates in the target document of each of the target expressions, a probability density function of a random variable representing the coordinates that are extraction targets in the target document,
    In the application of the filter, calculating, based on the estimated probability density function, for the coordinates of each of the target expressions, a probability that the coordinates are extraction target coordinates, and extracting, from the filter application target word set, the target expressions whose calculated probability is equal to or greater than a threshold and the neighboring words of those target expressions, the information extraction system.
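A minimal sketch of this coordinate-based filter, using a single 2-D Gaussian with diagonal covariance as the "predetermined probability distribution" and comparing its density against the threshold; the choice of distribution, the density-as-probability reading, and all names are assumptions.

```python
import math

def coordinate_filter(occurrences, threshold):
    """occurrences: list of (target_expression, neighbor_words, (x, y)) tuples;
    assumes at least one occurrence.  Fit a 2-D Gaussian to the coordinates of
    all target expressions, then keep the occurrences whose coordinate density
    is at least `threshold`, together with their neighboring words."""
    n = len(occurrences)
    coords = [c for _, _, c in occurrences]
    mx = sum(x for x, _ in coords) / n
    my = sum(y for _, y in coords) / n
    vx = sum((x - mx) ** 2 for x, _ in coords) / n or 1.0   # avoid zero variance
    vy = sum((y - my) ** 2 for _, y in coords) / n or 1.0

    def density(x, y):
        return (math.exp(-((x - mx) ** 2 / (2 * vx) + (y - my) ** 2 / (2 * vy)))
                / (2 * math.pi * math.sqrt(vx * vy)))

    # Occurrences whose coordinates cluster with the bulk of the target
    # expressions pass; outliers (e.g., headers or footers) are filtered out.
    return [(expr, words) for expr, words, (x, y) in occurrences
            if density(x, y) >= threshold]
```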
  6.  請求項5に記載の情報抽出システムであって、
     表示装置をさらに含み、
     前記プロセッサは、前記抽出対象語集合と、前記抽出対象語集合に含まれる対象表現の前記対象文書中の座標と、を前記表示装置に表示する、情報抽出システム。
    The information extraction system according to claim 5,
    Further comprising a display device,
    Wherein the processor displays, on the display device, the extraction target word set and the coordinates in the target document of the target expressions included in the extraction target word set, the information extraction system.
  7.  請求項1に記載の情報抽出システムであって、
     前記プロセッサは、前記対象文書から前記抽出対象語集合に含まれない語を削除した対象文書に対して、前記情報抽出処理を再度行う、情報抽出システム。
    The information extraction system according to claim 1,
    The information extraction system, wherein the processor performs the information extraction process again on a target document obtained by deleting, from the target document, the words that are not included in the extraction target word set.
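The re-application in this claim amounts to a small loop: extract, delete the words that were not extracted, and run the information extraction process again on the reduced document. In the sketch below, `extract` stands for any extraction routine such as the one sketched after claim 1; the number of rounds and the decision to keep the target expressions themselves are assumptions.

```python
def iterate_extraction(document_tokens, target_info, extract, rounds=2):
    """Repeatedly run extraction on a document from which the words not in the
    extraction target word set have been deleted."""
    tokens = list(document_tokens)
    extracted = set()
    for _ in range(rounds):
        extracted = extract(tokens, target_info)
        tokens = [t for t in tokens if t in extracted or t in target_info]
    return extracted
```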
  8.  請求項1に記載の情報抽出システムであって、
     前記プロセッサは、
     前記フィルタの生成において、
      前記教師なし学習に基づいて、複数のフィルタを生成し、
      前記複数のフィルタの所定の重み値による重み付き和である第1フィルタ結合モデルを生成し、
      前記フィルタ適用対象語集合に前記第1フィルタ結合モデルを適用し、
      前記フィルタ適用対象語集合に前記第1フィルタ結合モデルを適用して得られた抽出語集合に含まれる抽出語それぞれの正誤を示す正誤情報の入力を受け付け、
      前記第1フィルタ結合モデルと、前記正誤情報と、に基づいて、新たな重み値を決定し、
      前記複数のフィルタの前記決定した新たな重み値による重み付き和である第2フィルタ結合モデルを生成し、
     前記第2フィルタ結合モデルは前記適用するフィルタである、情報抽出システム。
    The information extraction system according to claim 1,
    The processor is
    In generating the filter,
    Generating a plurality of filters based on the unsupervised learning;
    Generating a first filter combination model that is a weighted sum of predetermined weight values of the plurality of filters;
    Applying the first filter combination model to the filter application target word set;
    Receiving correct / incorrect information indicating the correctness of each extracted word included in the extracted word set obtained by applying the first filter combination model to the filter application target word set;
    Determining a new weight value based on the first filter combination model and the correctness information;
    Generating a second filter combination model that is a weighted sum of the determined new weight values of the plurality of filters;
    The information extraction system, wherein the second filter combination model is the filter to be applied.
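A sketch of the two-stage combination in this claim: several filters are combined as a weighted sum, the extracted words are shown to the user, and the correct/incorrect feedback is used to fit new weights for a second combination model. The perceptron-style update, the 0.5 decision threshold, and all names are assumptions; the claim itself does not specify how the new weights are determined from the feedback.

```python
def combine_filters(filters, weights, word):
    """Weighted sum of per-filter scores (each filter returns a score in [0, 1])."""
    return sum(w * f(word) for f, w in zip(filters, weights))

def refit_weights(filters, weights, feedback, lr=0.1):
    """feedback: dict mapping extracted words to True (correct) / False (wrong).
    One perceptron-style pass nudges the weights toward the user's judgments."""
    new_weights = list(weights)
    for word, correct in feedback.items():
        predicted = combine_filters(filters, new_weights, word) >= 0.5
        if predicted != correct:
            direction = 1.0 if correct else -1.0
            new_weights = [w + lr * direction * f(word)
                           for f, w in zip(filters, new_weights)]
    return new_weights

def apply_second_model(filters, new_weights, candidate_words):
    """The second filter combination model is the filter that is finally applied."""
    return {w for w in candidate_words
            if combine_filters(filters, new_weights, w) >= 0.5}
```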
  9.  情報抽出システムが、対象文書から情報を抽出する方法であって、
     前記情報抽出システムは、プログラムを実行するプロセッサと、前記プロセッサがアクセスするメモリと、を含み、
     前記方法は、前記情報抽出システムが、
     抽出対象の文字列の集合を示す対象情報の入力を受け付け、
     前記対象情報に含まれる文字列のいずれかに合致する文字列である対象表現と、前記対象表現それぞれの所定距離以内に配置された単語である近傍語と、を前記対象文書から抽出し、
     前記近傍語それぞれの前記対象文書中の出現頻度、又は前記対象表現それぞれの前記対象文書中の座標、に基づく教師なし学習を用いてフィルタを生成し、
     前記近傍語を含むフィルタ適用対象語集合に、前記フィルタを適用し、
     前記フィルタ適用対象語集合に前記フィルタを適用した結果データを出力する、方法。
    A method by which an information extraction system extracts information from a target document,
    The information extraction system includes a processor that executes a program, and a memory that the processor accesses,
    In the method, the information extraction system:
    Accept input of target information indicating a set of character strings to be extracted,
    Extracting from the target document a target expression that is a character string that matches any of the character strings included in the target information, and neighboring words that are words arranged within a predetermined distance of each of the target expressions,
    Generating a filter using unsupervised learning based on the appearance frequency of each of the neighborhood words in the target document, or the coordinates in the target document of each of the target expressions,
    Applying the filter to a filter application target word set including the neighboring words;
    A method of outputting result data obtained by applying the filter to the filter application target word set.
  10.  対象文書からの情報抽出を、コンピュータに実行させるプログラムを保持する、コンピュータ読み取り可能な非一時的記録媒体であって、
     前記コンピュータは、プログラムを実行するプロセッサと、前記プロセッサがアクセスするメモリと、を含み、
     前記プログラムは、
     抽出対象の文字列の集合を示す対象情報の入力を受け付ける手順と、
     前記対象情報に含まれる文字列のいずれかに合致する文字列である対象表現と、前記対象表現それぞれの所定距離以内に配置された単語である近傍語と、を前記対象文書から抽出する手順と、
     前記近傍語それぞれの前記対象文書中の出現頻度、又は前記対象表現それぞれの前記対象文書中の座標、に基づく教師なし学習を用いてフィルタを生成する手順と、
     前記近傍語を含むフィルタ適用対象語集合に、前記フィルタを適用する手順と、
     前記フィルタ適用対象語集合に前記フィルタを適用した結果データを出力する手順と、を前記コンピュータに実行させる、コンピュータ読み取り可能な非一時的記録媒体。
    A computer-readable non-transitory recording medium holding a program for causing a computer to perform information extraction from a target document,
    The computer includes a processor that executes a program, and a memory that the processor accesses,
    The program comprises:
    A procedure for receiving input of target information indicating a set of character strings to be extracted;
    A procedure for extracting, from the target document, a target expression that is a character string that matches any of the character strings included in the target information, and neighboring words that are words arranged within a predetermined distance of each of the target expressions;
    Generating a filter using unsupervised learning based on the appearance frequency of each of the neighborhood words in the target document or the coordinates in the target document of each of the target expressions;
    Applying the filter to a filter application target word set including the neighboring words;
    A computer-readable non-transitory recording medium that causes the computer to execute a procedure of outputting result data obtained by applying the filter to the filter application target word set.
PCT/JP2015/065594 2015-05-29 2015-05-29 Information extraction system, information extraction method, and recording medium WO2016194054A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2015/065594 WO2016194054A1 (en) 2015-05-29 2015-05-29 Information extraction system, information extraction method, and recording medium
JP2017521323A JP6334062B2 (en) 2015-05-29 2015-05-29 Information extraction system, information extraction method, and recording medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2015/065594 WO2016194054A1 (en) 2015-05-29 2015-05-29 Information extraction system, information extraction method, and recording medium

Publications (1)

Publication Number Publication Date
WO2016194054A1 true WO2016194054A1 (en) 2016-12-08

Family

ID=57441961

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2015/065594 WO2016194054A1 (en) 2015-05-29 2015-05-29 Information extraction system, information extraction method, and recording medium

Country Status (2)

Country Link
JP (1) JP6334062B2 (en)
WO (1) WO2016194054A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000259670A (en) * 1999-03-12 2000-09-22 Dainippon Printing Co Ltd Document analysis system and recording medium
US20100145678A1 (en) * 2008-11-06 2010-06-10 University Of North Texas Method, System and Apparatus for Automatic Keyword Extraction
JP2013140499A (en) * 2012-01-05 2013-07-18 Nippon Telegr & Teleph Corp <Ntt> Method, apparatus, and program for extracting word

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060080321A1 (en) * 2004-09-22 2006-04-13 Whenu.Com, Inc. System and method for processing requests for contextual information
JP5232449B2 (en) * 2007-11-21 2013-07-10 Kddi株式会社 Information retrieval apparatus and computer program

Also Published As

Publication number Publication date
JP6334062B2 (en) 2018-05-30
JPWO2016194054A1 (en) 2017-08-31

Similar Documents

Publication Publication Date Title
DE102017005880A1 (en) Replacement based on optical similarity
CN110765770A (en) Automatic contract generation method and device
JP2019008778A (en) Captioning region of image
RU2613846C2 (en) Method and system for extracting data from images of semistructured documents
CN111428457B (en) Automatic formatting of data tables
CN111512315A (en) Block-wise extraction of document metadata
US9286526B1 (en) Cohort-based learning from user edits
WO2019224891A1 (en) Classification device, classification method, generation method, classification program, and generation program
US9946813B2 (en) Computer-readable recording medium, search support method, search support apparatus, and responding method
JP6492880B2 (en) Machine learning device, machine learning method, and machine learning program
CN111488732B (en) Method, system and related equipment for detecting deformed keywords
US20190303437A1 (en) Status reporting with natural language processing risk assessment
JP2019125353A (en) Method for inferring blocks of text in electronic documents
CN108701126B (en) Theme estimation device, theme estimation method, and storage medium
US9437020B2 (en) System and method to check the correct rendering of a font
JP6334062B2 (en) Information extraction system, information extraction method, and recording medium
JP6303669B2 (en) Document retrieval device, document retrieval system, document retrieval method, and program
JP7448132B2 (en) Handwritten structural decomposition
Chowdhury et al. Implementation of an optical character reader (ocr) for bengali language
CN108733637B (en) Information processing apparatus, information processing method, and computer program
WO2014030258A1 (en) Morphological analysis device, text analysis method, and program for same
JP4545614B2 (en) Document classification program and document classification apparatus
JP2007018158A (en) Character processor, character processing method, and recording medium
US20220092260A1 (en) Information output apparatus, question generation apparatus, and non-transitory computer readable medium
WO2022215433A1 (en) Information representation structure analysis device, and information representation structure analysis method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15894089

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2017521323

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15894089

Country of ref document: EP

Kind code of ref document: A1