WO2016194054A1 - Information extraction system, information extraction method, and recording medium - Google Patents

Information extraction system, information extraction method, and recording medium Download PDF

Info

Publication number
WO2016194054A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
filter
word
information
words
Prior art date
Application number
PCT/JP2015/065594
Other languages
French (fr)
Japanese (ja)
Inventor
太亮 尾崎
真 岩山
彬 童
義行 小林
高橋 寿一
新庄 広
Original Assignee
株式会社日立製作所 (Hitachi, Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社日立製作所 (Hitachi, Ltd.)
Priority to PCT/JP2015/065594 priority Critical patent/WO2016194054A1/en
Priority to JP2017521323A priority patent/JP6334062B2/en
Publication of WO2016194054A1 publication Critical patent/WO2016194054A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis

Definitions

  • There are analysis systems that extract information described in a target document in a machine-processable form and perform analysis on various target documents. For example, if so-called proper names such as the manufacturer name, product name, and series name can be extracted from a shopping website that is the target document, the analysis system can perform analyses such as compiling product information statistics for each manufacturer.
  • Patent Document 1 states that "the excerpt unit 101 obtains an excerpt document by excerpting, from the original document, characters that should be displayed relatively large on the screen on which the original document is displayed. When the amount of the excerpt document to be displayed on the screen does not fit within a predetermined amount, the correction unit 103 corrects the relative size criterion by which the excerpt unit 101 excerpts characters" (see abstract).
  • the analysis system extracts information from a non-standard document using, for example, a dictionary or a plurality of templates prepared in advance.
  • However, for atypical documents, appropriate templates for all documents cannot always be prepared in advance, and a dictionary of the words to be extracted is not always readily available.
  • Patent Document 1 discloses an information extraction method based on the display size of text on a website, but there is a problem in that the information necessary for the user is not always described in an appropriate display size in the target document.
  • One aspect of the present invention aims to extract information required by a user with high accuracy from a variety of non-standard documents such as websites and document images, without depending on a dictionary prepared in advance or a logical structure such as HTML.
  • An information extraction system for extracting information from a target document includes a processor that executes a program and a memory accessed by the processor. The processor performs an information extraction process in which it receives input of target information indicating a set of character strings to be extracted; extracts from the target document target expressions, each of which is a character string that matches any of the character strings included in the target information, and neighboring words, which are words arranged within a predetermined distance of each of the target expressions; generates a filter using unsupervised learning based on the appearance frequency of each of the neighboring words in the target document or the coordinates of each of the target expressions in the target document; applies the filter to a filter application target word set including the neighboring words; and outputs the extraction target word set obtained by applying the filter to the filter application target word set.
  • One embodiment of the present invention can extract information required by a user with high accuracy from various non-standard documents without depending on a dictionary prepared in advance or a logical structure such as HTML.
  • FIG. 1 is a block diagram illustrating an example of the overall configuration of the information extraction system in Embodiment 1.
  • FIG. 2A is a diagram illustrating an example of a shopping website, which is an example of a target document in Embodiment 1.
  • FIG. 2B is a diagram illustrating an example of a document image, which is an example of a target document in Embodiment 1.
  • FIG. 3 is a diagram illustrating an example of the data storage method of the storage unit in Embodiment 1.
  • FIG. 4 is a flowchart illustrating a first example of target selection processing in Embodiment 1.
  • FIG. 5 is a flowchart illustrating a second example of target selection processing in Embodiment 1.
  • FIG. 6 is a diagram illustrating an example of a target selection result in Embodiment 1.
  • FIG. 7 is a block diagram illustrating a configuration example of the filter unit in Embodiment 1.
  • FIG. 8 is a flowchart illustrating a first example of filter learning processing in Embodiment 1.
  • FIG. 9 is a flowchart illustrating a first example of filter application processing in Embodiment 1.
  • FIG. 10 is a diagram illustrating an example of a filter application result in Embodiment 1.
  • FIG. 11 is a flowchart illustrating a second example of filter learning processing in Embodiment 1.
  • FIG. 12 is a flowchart illustrating a second example of filter application processing in Embodiment 1.
  • FIG. 13 is a flowchart illustrating a third example of filter learning processing in Embodiment 1.
  • FIG. 14 is a flowchart illustrating a third example of filter application processing in Embodiment 1.
  • FIG. 15 is a diagram illustrating a first example of a user interface in Embodiment 1.
  • FIG. 16 is a diagram illustrating a second example of a user interface in Embodiment 1.
  • FIG. 17 is a block diagram illustrating an example of the overall configuration of the information extraction system in Embodiment 2.
  • The present embodiment describes an information extraction system that extracts information from a target document.
  • When the information extraction system receives from the user input of target information indicating a set of character strings to be extracted, it extracts from the target document target expressions, which are character strings that match any of the character strings included in the target information, and neighboring words, which are words located physically close to each of the target expressions.
  • By acquiring neighboring words, the information extraction system can broadly obtain, without using a dictionary or the like, not only the target expressions directly specified by the user as extraction targets but also information related to the target expressions that may be necessary for the user.
  • the information extraction system generates a filter using unsupervised learning based on the appearance frequency of each nearby word in the target document or the coordinates of each of the target expressions in the target document.
  • By applying the generated filter to a filter application target word set including the neighboring words, the information extraction system can remove neighboring words that are unnecessary for the user without using a dictionary or the like; that is, it can obtain the information the user needs with high accuracy.
  • FIG. 1 shows a configuration example of an information extraction system.
  • the information extraction system 101 includes, for example, a computer having a processor (CPU) 111, a memory 112, an auxiliary storage device 113, and a communication interface 114.
  • the processor 111 executes a program stored in the memory 112.
  • the memory 112 includes a ROM that is a nonvolatile storage element and a RAM that is a volatile storage element.
  • the ROM stores an immutable program (for example, BIOS).
  • the RAM is a high-speed and volatile storage element such as a DRAM (Dynamic Random Access Memory), and temporarily stores a program executed by the processor 111 and data used when the program is executed.
  • The auxiliary storage device 113 is a large-capacity non-volatile storage device such as a magnetic storage device (HDD) or a flash memory (SSD), for example, and stores programs executed by the processor 111 and data used when the programs are executed. That is, a program is read from the memory 112 or the auxiliary storage device 113, loaded into the memory 112, and executed by the processor 111.
  • the information extraction system 101 may include an input interface 115 and an output interface 118.
  • the input interface 115 is an interface to which a keyboard 116, a mouse 117, and the like are connected and receives input from the user.
  • the output interface 118 is an interface to which a display device 119, a printer, or the like is connected, and the execution result of the program is output in a format that can be viewed by the user.
  • the communication interface 114 is a network interface device that controls communication with other devices according to a predetermined protocol.
  • the communication interface 114 includes a serial interface such as USB, for example.
  • A program executed by the processor 111 may be provided to the information extraction system 101 via a removable medium (a computer-readable, portable, non-transitory storage medium such as a CD-ROM or flash memory) or via a network, and may be stored in the nonvolatile auxiliary storage device 113, which is a non-transitory storage medium. For this reason, the information extraction system 101 may have an interface for reading data from a removable medium.
  • The information extraction system 101 is a computer system configured physically on a single computer or on a plurality of logically or physically configured computers; it may operate in separate threads on the same computer, or may operate on a virtual machine constructed on a plurality of physical computer resources.
  • the information extraction system 101 receives input of the target document 102 and the target information 109 via the input interface 115 or the communication interface 114, for example.
  • the target document 102 may be, for example, a document image or a website described in HTML, CSS, or the like.
  • the document image indicates an image obtained by digitizing a document printed on a medium such as paper.
  • the target information 109 indicates information of a character string set serving as a base point for information extraction, and is designated by the user.
  • The target information 109 is, for example, information including at least one of a regular expression, a word, a sentence, a sentence including a wildcard, a part of speech, a target document ID, and a target sentence ID. A character-string pattern containing wildcard characters such as '?' or '*' is an example of a wildcard, and a pattern written with metacharacters and quantifiers such as '\d' and '{2,4}' is an example of a regular expression.
  • the information extraction system 101 extracts information specified by the target information 109 and information based on the target information 109 from the target document 102.
  • The memory 112 includes, for example, a sentence extraction unit 103, a coordinate extraction unit 104, a target selection unit 106, and a result generation unit 108, which are programs.
  • the memory 112 includes an accumulation unit 105 that is an area for storing data. Further, the memory 112 includes a filter unit 107 including an area for storing data and a program.
  • the processor 111 operates as a functional unit that realizes a predetermined function by operating according to a program.
  • the processor 111 functions as a sentence extracting unit by operating according to the sentence extracting unit 103, and functions as a coordinate extracting unit by operating according to the coordinate extracting unit 104.
  • the processor 111 also operates as a functional unit that realizes each of a plurality of processes executed by each program.
  • a computer and a computer system are an apparatus and a system including these functional units.
  • the sentence extraction unit 103 extracts a sentence from each of the input target documents 102.
  • The sentence in the present embodiment indicates each character string, composed of one or more characters, obtained by dividing the character string consisting of all characters included in the target document 102 according to a predetermined rule; it is a concept that does not necessarily coincide with a grammatical sentence.
  • For example, a character string delimited by predetermined characters or symbols, such as a Japanese full stop, a Japanese comma, a comma, a period, or a space, is an example of a sentence.
  • the grammatical sentence included in the target document 102 is an example of the sentence of this embodiment.
  • Each word included in the target document 102 is an example of a sentence.
  • the sentence extraction unit 103 assigns a document ID to each input target document 102 and a sentence ID to each extracted sentence.
  • the coordinate extracting unit 104 extracts coordinate information of each sentence extracted by the sentence extracting unit 103.
  • the coordinate information is represented by, for example, coordinates on the paper surface of the target document 102 or a display device.
  • the coordinates of the two vertices forming the diagonal of the minimum-size rectangle surrounding the entire sentence are an example of the coordinate information of the sentence.
  • One of the sentence extraction unit 103 or the coordinate extraction unit 104 assigns a document ID to the input target document.
  • the sentence extraction unit 103 and the coordinate extraction unit 104 include, for example, a web browser rendering function and an OCR function.
  • the storage unit 105 holds, for example, information indicating the correspondence between the document ID of the target document 102, the extracted sentence, and the sentence ID and coordinate information of the extracted sentence.
  • The target selection unit 106 refers to the information held in the storage unit 105, selects the sentences that match the target information 109, the coordinates of the matching sentences, and the neighboring words of the matching sentences, and transmits the selected sentences, coordinates, and neighboring words to the filter unit 107. Neighboring words will be described later.
  • a sentence that matches the target information 109 selected by the target selection unit 106 is referred to as a target expression.
  • The filter unit 107 removes, from the selected sentences, coordinates, and neighboring words, those that are not to be extracted, and transmits the remaining sentences, coordinates, and neighboring words to the result generation unit 108.
  • the result generation unit 108 outputs the sentence, coordinates, and neighborhood words received from the filter unit 107 in an appropriate format as the information extraction result 110 via the output interface 118.
  • the result generation unit 108 may store the information extraction result 110 in the storage unit 105 as sentence data to be described later with an appropriate document ID.
  • the information extraction system 101 can appropriately output the information extraction result 110 based on the target information 109 input from the user with the above-described configuration. Further, the information extraction system 101 can extract information again from the information extraction result 110 based on the newly set target information 109.
  • FIG. 2A shows an example of a shopping website, which is an example of the target document 102.
  • the shopping website in FIG. 2A lists a plurality of products of the same type, and describes different product information (manufacturer, unique name, price, etc.) for each product.
  • The sentence extraction unit 103 and the coordinate extraction unit 104 extract the sentences and their coordinates using, for example, the rendering function of a web browser.
  • FIG. 2B shows an example of a document image, which is an example of the target document 102.
  • In the document image, the stone name, depth, and details are displayed in various layouts.
  • The sentence extraction unit 103 and the coordinate extraction unit 104 extract the sentences and their coordinates using, for example, the OCR function.
  • FIG. 3 shows an example of a data management method in the storage unit 105.
  • the sentence data 300 is data accumulated by a method called “Key Value Store (KVS)”.
  • the sentence data 300 includes a document ID 301, a sentence ID 302, and sentence information 303.
  • the document ID 301 is information that uniquely identifies the target document 102.
  • the sentence ID 302 is information for uniquely identifying a sentence in each target document.
  • the sentence information 303 includes a sentence with a corresponding sentence ID and annotation information of the sentence.
  • the coordinate information of the sentence and the font information included in the sentence are examples of annotation information.
  • the value for a desired key can be held in a plurality of layers in this way. For example, when a desired document ID or sentence ID is given, the information extraction system 101 can output a corresponding sentence. For example, when only the document ID is given, the information extraction system 101 can output a list of corresponding sentence IDs.
  • the information used by the information extraction system 101 may be expressed in any data structure without depending on the data structure.
  • a data structure appropriately selected from a table, list, database or queue can store the information.
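As an illustration of the layered key-value layout described above, the following Python sketch (not part of the patent; the document IDs, sentence IDs, field names, and annotation keys are invented for the example) stores sentence information under a document ID and a sentence ID and supports the two lookups mentioned above.

```python
# A minimal sketch of the nested key-value layout: document ID -> sentence ID -> sentence info.
# All concrete values and field names are illustrative.
from typing import Any, Dict, List

SentenceData = Dict[str, Dict[str, Dict[str, Any]]]

sentence_data: SentenceData = {
    "doc-001": {
        "s-0001": {
            "text": "ACME Widget Pro 3000",
            "coords": ((120, 80), (430, 102)),   # two diagonal vertices of the bounding rectangle
            "font": {"size": 14, "bold": True},  # example annotation information
        },
        "s-0002": {
            "text": "Price: 19,800 yen",
            "coords": ((120, 110), (300, 128)),
            "font": {"size": 10, "bold": False},
        },
    },
}

def get_sentence(data: SentenceData, doc_id: str, sent_id: str) -> str:
    """Given a document ID and a sentence ID, return the corresponding sentence."""
    return data[doc_id][sent_id]["text"]

def list_sentence_ids(data: SentenceData, doc_id: str) -> List[str]:
    """Given only a document ID, return the list of corresponding sentence IDs."""
    return list(data[doc_id].keys())

print(get_sentence(sentence_data, "doc-001", "s-0001"))  # -> ACME Widget Pro 3000
print(list_sentence_ids(sentence_data, "doc-001"))       # -> ['s-0001', 's-0002']
```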
  • FIG. 4 shows an example of a selection method using regular expressions by the object selection unit 106.
  • the target selection unit 106 receives input of target information 109 including a document ID, a sentence ID, and a regular expression (S401).
  • the target information 109 may not include the document ID and the sentence ID.
  • The target selection unit 106 extracts, from the sentence data 300 of the storage unit 105, the target sentences corresponding to the document IDs and sentence IDs included in the target information 109, and checks whether each target sentence includes an expression that matches the regular expression included in the target information 109, that is, a target expression (S402).
  • The target selection unit 106 may also, in accordance with a user instruction, extract the target sentences again from data obtained by deleting from the sentence data 300 the words that are not included in the extraction target words generated by the result generation unit 108. Thereby, the information extraction system 101 can further apply a filter to data that has already been filtered once, and can improve the accuracy of information extraction.
  • When the target information 109 does not include a document ID or sentence ID, the target selection unit 106 extracts all sentences included in the sentence data 300 as target sentences.
  • When no target sentence includes a target expression (S402: no), the process ends.
  • When a target sentence including a target expression exists, that is, when there is a target sentence containing an expression that matches the target information 109 (S402: yes), the target selection unit 106 acquires the target expression, the coordinates of the target expression, and the neighboring words of the target expression, includes the acquired information together with, for example, the sentence ID and document ID of the sentence containing the target expression in a target selection result data block, and outputs the block to the filter unit 107 (S403).
  • the target selection result data block will be described later.
  • the minimum-size rectangular coordinates surrounding the target expression and the minimum-size rectangular coordinates surrounding the entire target sentence including the target expression are examples of coordinates output by the target selection unit 106 in step S403.
  • the neighborhood word of the target expression indicates a word that exists in a position close to the target expression in coordinates in the document.
  • the target selection unit 106 acquires a predetermined number of words within a predetermined distance from the target expression as neighborhood words of the target expression.
  • the target selection unit 106 can acquire, for example, words that are necessary for the user and are unknown to the user by acquiring the neighborhood word.
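The sketch below illustrates the kind of regex-based selection and neighboring-word acquisition described for FIG. 4 (S401-S403). It is a simplified stand-in rather than the patent's implementation: the input layout matches the toy sentence_data sketched earlier, and the Euclidean distance between rectangle centres and the cut-off values are illustrative assumptions.

```python
# Simplified target selection: find sentences matching a regular expression (target
# expressions) and collect words from sentences whose rectangles lie nearby.
import math
import re
from typing import Dict, List, Tuple

Rect = Tuple[Tuple[float, float], Tuple[float, float]]

def center(rect: Rect) -> Tuple[float, float]:
    (x1, y1), (x2, y2) = rect
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def select_targets(sentences: Dict[str, dict], pattern: str,
                   max_dist: float = 150.0, max_neighbors: int = 5) -> List[dict]:
    """Return records holding the matched target expression, its coordinates,
    and words of nearby sentences (neighboring words)."""
    regex = re.compile(pattern)
    results = []
    for sent_id, info in sentences.items():
        match = regex.search(info["text"])
        if not match:
            continue                      # S402: this sentence holds no target expression
        cx, cy = center(info["coords"])
        neighbors: List[str] = []
        for other_id, other in sentences.items():
            if other_id == sent_id:
                continue
            ox, oy = center(other["coords"])
            if math.hypot(ox - cx, oy - cy) <= max_dist:
                neighbors.extend(other["text"].split())
        results.append({                  # S403: one entry of the target selection result
            "sentence_id": sent_id,
            "target_expression": match.group(0),
            "coords": info["coords"],
            "neighbor_words": neighbors[:max_neighbors],  # a predetermined number of nearby words
        })
    return results

# Usage with the toy data from the previous sketch:
#   select_targets(sentence_data["doc-001"], r"\d[\d,]*\s*yen")
```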
  • FIG. 5 shows an example of a selection method using the part of speech by the object selection unit 106.
  • the target selection unit 106 receives input of target information 109 including a document ID, a sentence ID, and a part of speech (S501).
  • The target selection unit 106 extracts from the storage unit 105 the target sentences corresponding to the document IDs and sentence IDs included in the target information 109, and checks whether each target sentence contains a word whose part of speech matches the part of speech included in the target information (S502).
  • Note that the target information 109 need not include a document ID or sentence ID, and the target selection unit 106 may extract the target sentences from sentence data 300 generated by the result generation unit 108.
  • When no target sentence contains a matching word (S502: no), the process ends.
  • When there is a target sentence containing a word that matches the target information 109 (S502: yes), the target selection unit 106 acquires the target expression, its coordinates, and the neighboring words of the target expression, includes the acquired information together with the sentence ID and document ID of the sentence containing the target expression in a target selection result data block, and outputs the block to the filter unit 107 (S503).
  • the target selection unit 106 may perform word recognition and part-of-speech identification using, for example, a general morphological analysis method.
  • the target selecting unit 106 may extract, for example, a target expression within a predetermined number from the top in the target sentence. Alternatively, all target expressions included in the target sentence may be extracted.
  • Although FIG. 4 shows an example of target selection using regular expressions and FIG. 5 shows an example using parts of speech, the target selection unit 106 can perform selection in the same way using target information 109 that includes wildcards, words, and the like.
  • The target selection unit 106 may appropriately select a target by combining a plurality of types of target information 109 using, for example, a logical sum or a logical product. Specifically, for example, the target selection unit 106 may extract a target expression that matches a specific regular expression and/or includes a specific part of speech.
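As a small illustration of combining several kinds of target information with a logical sum or logical product, the sketch below (hypothetical helper names; the part-of-speech pairs are assumed to come from some morphological analyzer) checks a regular-expression condition and a part-of-speech condition and combines them with OR or AND.

```python
# Combining two target-information conditions with OR / AND; names are illustrative.
import re
from typing import List, Tuple

def matches_regex(text: str, pattern: str) -> bool:
    return re.search(pattern, text) is not None

def contains_pos(words_with_pos: List[Tuple[str, str]], pos: str) -> bool:
    # words_with_pos: (word, part-of-speech) pairs produced by a morphological analyzer.
    return any(p == pos for _, p in words_with_pos)

def is_target(text: str, words_with_pos: List[Tuple[str, str]],
              pattern: str, pos: str, mode: str = "or") -> bool:
    a = matches_regex(text, pattern)
    b = contains_pos(words_with_pos, pos)
    return (a or b) if mode == "or" else (a and b)
```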
  • FIG. 6 shows an example of a target selection result data block generated by the target selection unit 106.
  • the target selection result data block 600 includes, for example, a document ID 601, a target expression ID 602, and target expression information 603, and is data accumulated by, for example, the KVS method.
  • the document ID 601 is information that uniquely identifies the target document 102.
  • the target expression ID 602 is information for uniquely identifying the target expression in the target document 102, and is given by, for example, the target selection unit 106.
  • the target expression information 603 is information related to the target expression, and includes, for example, the target expression, neighborhood words, and coordinates.
  • FIG. 7 shows a configuration example of the filter unit 107.
  • the filter unit 107 includes, for example, a filter learning unit 702 and a filter application unit 704 that are programs, and a filter model storage unit 703 that is an area for storing data.
  • The filter learning unit 702 receives predetermined information included in the target data 701 and a filter model existing in the filter model storage unit 703, and learns a filter model based on the acquired information and model data.
  • the target selection result data block 600 is an example of the target data 701. Note that the filter learning unit 702 does not have to use the filter model of the filter model storage unit 703 when performing filter learning.
  • the filter learning unit 702 transmits the generated filter model to the filter model storage unit 703, and the filter model storage unit 703 stores the filter model.
  • the filter application unit 704 applies an appropriate filter model existing in the filter model storage unit 703 to the target data 701. Finally, the result data 705 to which the filter is applied in the filter application unit 704 is output.
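A minimal structural sketch of this data flow, with purely illustrative names, might look as follows: a learning part that writes models into a model store, and an application part that reads a stored model and returns the filtered result data.

```python
# Structural sketch of the filter unit: learning part, model storage, application part.
from typing import Any, Callable, Dict, List

class FilterUnit:
    def __init__(self) -> None:
        self.model_storage: Dict[str, Any] = {}  # plays the role of the filter model storage unit 703

    def learn(self, name: str, target_data: List[dict],
              learner: Callable[[List[dict], Any], Any]) -> None:
        # Learning part: build a model from the target data (optionally reusing an
        # existing model) and store it under a name.
        existing = self.model_storage.get(name)
        self.model_storage[name] = learner(target_data, existing)

    def apply(self, name: str, target_data: List[dict],
              applier: Callable[[List[dict], Any], List[dict]]) -> List[dict]:
        # Application part: apply a stored model to the target data and return the result data.
        return applier(target_data, self.model_storage[name])
```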
  • FIG. 8 shows an example of filter learning processing by the filter learning unit 702.
  • the filter learning method in FIG. 8 is a so-called unsupervised learning method.
  • the filter learning unit 702 acquires a word included in the target data 701 and acquires the appearance frequency of the word in the sentence data 300 (S801).
  • the neighborhood words included in the target data 701 are words that the filter learning unit 702 acquires in step S801.
  • the filter learning unit 702 may also acquire words obtained by performing morphological analysis on the target expression, for example.
  • The acquired words are denoted by w 1 , ..., w n .
  • Note that the filter learning unit 702 may acquire the words only within a learning range specified by a document ID or the like, and may obtain the appearance frequency of each word within that learning range.
  • the learning range is specified by a user or the like, for example.
  • For each word w i (1 ≤ i ≤ n) acquired in step S801, the filter learning unit 702 sets initial values of the variable χ i (0 or 1), the variables π ij (0 ≤ π ij ≤ 1, 1 ≤ j ≤ n), and the real-valued parameter λ i within their respective domains (S802).
  • For example, the filter learning unit 702 can set all χ i to 1 and set π ij and λ i to predetermined values. Alternatively, the filter learning unit 702 may set each initial value randomly within the range of its domain.
  • P D is the probability that the word w i is a word to be extracted.
  • P N is the probability that the word w i is a filter word.
  • For each word w i , the filter learning unit 702 calculates P D , for example, as follows.
  • χ i is a flag indicating whether the word w i is a word to be extracted.
  • π ij is the probability that the word w i is derived from the word w j .
  • Here, "the word w i is derived from w j " indicates the state in which the sentence extraction unit 103 erroneously extracted the word w j in the target document as the word w i , for example due to an OCR error or the like.
  • d m (w i , w j ) indicates the similarity between the word w i and the word w j ; for example, an edit distance is used as the similarity.
  • By using d m and π ij in the calculation of P D , the filter learning unit 702 can perform filter learning with high accuracy even for words that are erroneously recognized due to OCR errors or the like.
  • The filter learning unit 702 calculates the conditional probability P (d m | ·) using, for example, a Poisson distribution.
  • Other distributions may be used instead, for example members of the exponential family such as the Bernoulli, binomial, multinomial, normal, exponential, t, chi-square, gamma, beta, F, or Laplace distributions.
  • The filter learning unit 702 calculates P N , for example, as follows.
  • The filter learning unit 702 resets the value of the variable χ i to 1 for every word with R (w i ) > 1, resets the value of χ i to 0 for every word with R (w i ) ≤ 1, and resets π ij and λ i based on the reset χ i (S804).
  • In step S804, the filter learning unit 702 resets the value of the variable χ i based on R (w i ) as described above.
  • The threshold value used here may be set to 1 as in the above example, or may be another value within the domain of R (w i ) (0 or a real number).
  • An auxiliary variable indexed by i and k (1 ≤ k ≤ n) is defined as follows.
  • A further auxiliary variable indexed by i is defined as follows.
  • The filter learning unit 702 uses the above values to reset π ij , for example, as follows.
  • The example of resetting the parameter λ k described above corresponds to the case where the Poisson distribution is used for calculating P (d m | ·).
  • In this case, the filter learning unit 702 resets λ k by, for example, solving the update formula for λ k shown below.
  • the filter learning unit 702 calculates the joint probability for the current parameters in all words as follows (S805).
  • The filter learning unit 702 determines whether or not the above joint probability has converged (S806). For example, the filter learning unit 702 determines that the joint probability has converged when the joint probability is a value included in a predetermined range. As another example, the filter learning unit 702 may compare the above joint probability with the joint probability calculated in the previous iteration and determine that the joint probability has converged when it has not increased by more than a certain value or a certain ratio.
  • When the filter learning unit 702 determines that the joint probability has converged (S806: yes), the learning ends; when it determines that the joint probability has not converged (S806: no), the process returns to step S803.
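The formulas referenced in the steps above appear as images in the original and are not reproduced here. As a heavily simplified, hypothetical stand-in for this word-frequency-based unsupervised learning, the sketch below fits a two-component Poisson mixture to word appearance frequencies with an EM loop and splits the vocabulary into an extraction target set and a filter word set; treating the lower-frequency component as the extraction targets, and omitting the similarity term d m entirely, are assumptions made only for the example, not the patent's criteria.

```python
# Hedged sketch: unsupervised separation of words into "extraction target" and "filter"
# sets via an EM-fitted two-component Poisson mixture over appearance frequencies.
import math
from collections import Counter
from typing import Dict, Set, Tuple

def poisson_pmf(k: int, lam: float) -> float:
    return math.exp(-lam) * lam ** k / math.factorial(k)

def learn_filter(freqs: Dict[str, int], iters: int = 50) -> Tuple[Set[str], Set[str]]:
    words = list(freqs)
    counts = sorted(freqs.values())
    lam_low = max(1e-3, float(counts[len(counts) // 4]))   # rate of the low-frequency component
    lam_high = max(lam_low + 1e-3, float(counts[-1]))      # rate of the high-frequency component
    weight_low = 0.5
    resp: Dict[str, float] = {}
    for _ in range(iters):                                 # analogue of the S803-S806 loop
        # E-step: responsibility that each word belongs to the low-frequency component.
        for w in words:
            k = freqs[w]
            p_low = weight_low * poisson_pmf(k, lam_low)
            p_high = (1.0 - weight_low) * poisson_pmf(k, lam_high)
            total = p_low + p_high
            resp[w] = p_low / total if total > 0 else 0.5
        # M-step: re-estimate rates and mixing weight (analogue of the resets in S804).
        r_low = sum(resp.values())
        r_high = len(words) - r_low
        if r_low > 0:
            lam_low = sum(resp[w] * freqs[w] for w in words) / r_low
        if r_high > 0:
            lam_high = sum((1.0 - resp[w]) * freqs[w] for w in words) / r_high
        weight_low = r_low / len(words)
    targets = {w for w in words if resp[w] >= 0.5}         # assumed: rarer words are the targets
    return targets, set(words) - targets

word_counts = Counter(["yen", "yen", "yen", "yen", "price", "price", "price",
                       "ACME", "Widget", "Pro"])
extraction_targets, filter_words = learn_filter(dict(word_counts))
# With these toy counts the rare words ("ACME", "Widget", "Pro") end up as targets and
# the frequent boilerplate ("yen", "price") ends up as filter words.
```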
  • FIG. 9 shows an example of filter application processing by the filter application unit 704.
  • the filter application process in FIG. 9 shows an example using the filter learning process in FIG.
  • the filter application unit 704 acquires a set of extraction target words from the filter model storage unit 703, and acquires a filter application target word set from the target data 701 (S901).
  • The set of extraction target words held by the filter model storage unit 703 is a set obtained by the unsupervised learning shown in FIG. 8.
  • a set of neighboring words included in the target data 701 is an example of a filter application target word set.
  • the filter application unit 704 may include a word obtained by morphological analysis on the target expression included in the target data 701 in the filter application target word set.
  • The filter application unit 704 checks whether any extraction target word is included in the filter application target word set (S902). At this time, the filter application unit 704 may perform the check based on an exact match between each word of the filter application target word set and each extraction target word, or may perform the check using a measure based on similarity between words, such as an edit distance.
  • The filter application unit 704 may check whether or not all of the extraction target words are included, or may check whether or not one or more extraction target words are included. When the filter application unit 704 determines that no extraction target word is included in the filter application target word set (S902: no), all the words of the filter application target word set are filter words, so nothing is output and the process ends. Otherwise (S902: yes), the filter application unit 704 outputs the result data 705 after the filter application.
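A hedged sketch of this word-level filter application (S901-S902): keep the candidate words that match an extraction target word exactly or by a simple similarity measure; difflib's ratio is used here only as a stand-in for an edit-distance-based measure.

```python
# Filter application over words: exact match or a similarity-based match.
import difflib
from typing import Iterable, List, Set

def close_enough(a: str, b: str, threshold: float = 0.8) -> bool:
    # Stand-in for "a measure based on similarity between words such as an edit distance".
    return a == b or difflib.SequenceMatcher(None, a, b).ratio() >= threshold

def apply_word_filter(candidates: Iterable[str], extraction_targets: Set[str]) -> List[str]:
    kept = [w for w in candidates if any(close_enough(w, t) for t in extraction_targets)]
    return kept   # an empty list corresponds to the S902: no branch (everything filtered out)
```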
  • FIG. 10 shows an example of the result of filtering on words by the filter unit 107.
  • “Correct” indicates a word that should actually be a target
  • “Incorrect” indicates a word that is not actually a target.
  • “Acquisition” indicates a word determined to be an extraction target word by the above-described unsupervised learning
  • “non-acquisition” indicates a word determined to be a filter word by the above-described unsupervised learning method.
  • The accuracy, defined as (correct and acquired) / {(correct and acquired) + (incorrect and acquired)}, was 75%, and the recall, defined as (correct and acquired) / {(correct and acquired) + (correct and non-acquired)}, was 56.8%.
  • With the method described above, the information extraction system 101 can narrow many words down to a small number of extraction target words without supervision.
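For concreteness, the two percentages above are consistent with, for example, the following hypothetical counts (the actual counts are not given in the text); the snippet only restates the two definitions, and the first ratio is what is usually called precision.

```python
# Illustrative counts only, chosen to reproduce the reported 75% and 56.8%.
correct_and_acquired = 21
incorrect_and_acquired = 7
correct_and_not_acquired = 16

accuracy = correct_and_acquired / (correct_and_acquired + incorrect_and_acquired)  # 0.75
recall = correct_and_acquired / (correct_and_acquired + correct_and_not_acquired)  # ~0.568
```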
  • FIG. 11 shows a second example of filter learning processing by the filter learning unit 702.
  • This example is a filter learning process for coordinates.
  • the filter learning unit 702 acquires coordinate information of the target expression in the target data 701 (S1101). Note that the filter learning unit 702 may also acquire, for example, the coordinate information of neighboring words in the target data.
  • The filter learning unit 702 sets an initial value of the real-valued parameter of the kernel (S1102).
  • The initial value may be specified in advance, or may be specified by the user, for example.
  • The initial value is preferably specified according to the size of the target document 102; specifically, for example, it is preferably set to a value obtained by substituting the area of one line of the target document 102 into a predetermined increasing function.
  • The parameter may also be adjusted according to the extraction result.
  • the filter learning unit 702 learns the kernel density estimation function p (x) according to the following formula (S1103), outputs the learned result, and ends.
  • p (x) indicates the probability density that the coordinate x is a coordinate to be extracted.
  • N is the number of coordinates acquired in step S1101.
  • D is the dimension of the coordinates.
  • x is a variable indicating an arbitrary coordinate.
  • x n is each coordinate acquired in step S1101.
  • Here, the filter learning unit 702 estimates the probability density using kernel density estimation, but another probability density estimation method, such as a k-nearest neighbor method, a histogram method, or a Gaussian mixture distribution, may be used.
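Since the estimator's formula is given as an image in the original, the sketch below uses a standard D-dimensional Gaussian kernel density estimator as a stand-in, with a bandwidth h playing the role of the real-valued parameter initialised in S1102; the coordinate values are toy data.

```python
# Gaussian kernel density estimate over target-expression coordinates.
import math
from typing import Callable, List, Sequence

def kde(train: List[Sequence[float]], h: float) -> Callable[[Sequence[float]], float]:
    """Return p(x): the estimated density that coordinate x is an extraction-target coordinate."""
    n = len(train)
    d = len(train[0])
    norm = 1.0 / (n * (2.0 * math.pi * h * h) ** (d / 2.0))

    def p(x: Sequence[float]) -> float:
        total = 0.0
        for xn in train:
            sq = sum((xi - xni) ** 2 for xi, xni in zip(x, xn))
            total += math.exp(-sq / (2.0 * h * h))
        return norm * total

    return p

# Usage: coordinates of the target expressions acquired in S1101 (toy values).
p = kde([(120.0, 90.0), (118.0, 300.0), (121.0, 510.0)], h=25.0)
print(p((119.0, 305.0)))   # relatively high density near observed target coordinates
print(p((600.0, 50.0)))    # close to zero far away from them
```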
  • FIG. 12 shows a second example of filter application processing by the filter application unit 704.
  • This example is a process for applying a filter to coordinates, using the filter model learned as shown in FIG. 11.
  • the filter application unit 704 acquires the target expression included in the target data 701, the coordinates of the target expression, and a threshold value (S1201).
  • the threshold value may be given by a user or the like, may be set in advance, or may be set by the filter application unit 704 based on whether the output result is correct or not.
  • The filter application unit 704 calculates the likelihood (probability value) of each acquired coordinate by substituting it into the coordinate filter model p (x) learned as illustrated in FIG. 11, and determines whether or not any likelihood exceeds the acquired threshold (S1202). When it determines that all the calculated likelihoods are smaller than the threshold (S1202: no), the filter application unit 704 ends the process because there are no coordinates to be extracted.
  • When it determines that there is a likelihood equal to or greater than the threshold (S1202: yes), the filter application unit 704 outputs the result data 705 after the filter application (S1203) and ends the process.
  • The target data 701 from which the target expressions, the neighboring words of the target expressions, and the coordinates of the target expressions corresponding to likelihoods below the threshold have been removed is an example of the result data 705 after the filter application.
  • In this case, the target selection unit 106 does not need to acquire the neighboring words of the target expressions.
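A short sketch of this coordinate-based filter application, reusing a density function p(x) such as the one sketched after the FIG. 11 description; the record field name is an illustrative assumption.

```python
# Keep target expressions whose coordinate likelihood reaches the threshold (S1201-S1203).
from typing import Callable, Dict, List, Sequence

def apply_coordinate_filter(targets: List[Dict], p: Callable[[Sequence[float]], float],
                            threshold: float) -> List[Dict]:
    """targets: records carrying a 'coords_center' field (hypothetical name)."""
    kept = [t for t in targets if p(t["coords_center"]) >= threshold]
    return kept   # an empty list corresponds to the S1202: no branch (nothing to extract)
```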
  • FIG. 13 shows a third example of filter learning processing by the filter learning unit 702.
  • This example is a filter learning process that combines a plurality of filter models.
  • the filter learning unit 702 acquires target data 701 and a plurality of filter models (S1301).
  • the filter learning unit 702 initializes a filter combination model generated from the acquired plurality of filter models (S1302).
  • As the filter combination model, for example, machine learning such as linear identification, a support vector machine, or a decision tree, which takes as input the value output by each filter model or a numerical representation of its determination result, can be used.
  • the filter learning unit 702 initializes weights in the initialization of the filter combination model.
  • the filter learning unit 702 learns the filter combination model based on the correct / incorrect information or the weight information (S1303). Hereinafter, an example in which linear identification is used for the filter combination model will be described.
  • the filter learning unit 702 determines that filtering is performed when the following inequality is satisfied, and determines that filtering is not performed when the following inequality is not satisfied.
  • the filter learning unit 702 may accept input of correct / incorrect information for the filter result from the user.
  • The filter learning unit 702 may reset W to an appropriate value by optimizing, based on the input correct/incorrect information, an evaluation function E such as the sum-of-squares error indicated by the following formula (where T is a matrix of the correct/incorrect information).
  • The filter learning unit 702 may represent the weight information as a real-valued matrix W.
  • The filter learning unit 702 may also set a real-valued weight matrix V for the real-valued matrix W in the evaluation function, as shown in the following equation, and perform the optimization.
  • The filter learning unit 702 may receive input of correct/incorrect information for the filter result obtained with the reset W, and repeat the process of resetting W based on the newly received correct/incorrect information.
  • The filtering method described above is not limited to linear identification and can be applied as long as the identification model and its evaluation function are appropriately defined.
  • FIG. 14 shows a third example of filter application processing by the filter application unit 704. This example is a filter application process in the filter combination model.
  • the filter application unit 704 acquires the target data 701, a plurality of filter models, and a filter combination model obtained by combining the plurality of filter models (S1401). Subsequently, the filter application unit 704 inputs the target data 701 to each acquired filter model, and acquires the output value of each filter model (S1402).
  • the filter application unit 704 inputs the output value of each filter model calculated in S1402 to the filter combination model, and acquires the output value of the filter combination model (S1403). Subsequently, the filter application unit 704 determines whether or not the output value of the filter combination model is, for example, greater than or equal to the threshold value U (S1404). If the output value of the filter combination model is smaller than the threshold value U (S1404: no), the process ends.
  • When the output value of the filter combination model is greater than or equal to the threshold value U (S1404: yes), the filter application unit 704 outputs the result data 705 after the filter application and ends the process (S1405).
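The inequality and the evaluation-function formulas referenced above are shown as images in the original and are not reproduced here. As a hedged sketch of a linear filter combination model, the code below fits a weight vector to user-supplied correct/incorrect labels by least squares (NumPy's lstsq standing in for the optimisation of a squared-error function) and thresholds the combined score as in S1404; the threshold value and the example scores are illustrative.

```python
# Linear combination of filter-model outputs, fitted against correct/incorrect labels.
import numpy as np

def learn_combination(filter_scores: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """filter_scores: (n_samples, n_filters); labels: 1 = correct (keep), 0 = incorrect (filter out)."""
    X = np.hstack([filter_scores, np.ones((filter_scores.shape[0], 1))])  # add a bias column
    w, *_ = np.linalg.lstsq(X, labels, rcond=None)                        # least-squares weights
    return w

def apply_combination(filter_scores: np.ndarray, w: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    X = np.hstack([filter_scores, np.ones((filter_scores.shape[0], 1))])
    return X @ w >= threshold     # True corresponds to the S1404: yes branch (output the result)

scores = np.array([[0.9, 0.2], [0.1, 0.8], [0.7, 0.6]])   # e.g. a word filter and a coordinate filter
labels = np.array([1.0, 0.0, 1.0])                        # correct/incorrect information from the user
w = learn_combination(scores, labels)
print(apply_combination(scores, w))                       # -> [ True False  True ]
```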
  • FIG. 15 shows a first example of a user interface to the user.
  • the user interface 1500 includes, for example, a target ID input section 1501, a target information input section 1502, filter adjustment check boxes 1503 to 1505, an extraction result display section 1506, and a correct / incorrect specification section 1507.
  • the target ID input section 1501 accepts input of, for example, a sentence ID included in the sentence data 300, a document ID, and a target ID included in the target selection result data block 600.
  • the target information input section 1502 accepts input of target information 109, for example.
  • Check boxes 1503 to 1504 are check boxes for selecting a filter to be learned and applied.
  • Check box 1503 is a check box for selecting the filter based on coordinates, and check box 1504 is a check box for selecting the filter based on words.
  • a check box 1505 is a check box for selecting whether or not learning is automatically performed based on the correctness determination result.
  • the extraction result display section 1506 lists and displays the extraction results after applying the filter.
  • the extraction result display section 1506 displays, for example, the target expression included in the extraction result, neighboring words of the target expression, and the entire target sentence including the target expression.
  • the extraction result display section 1506 may display the coordinates of the target expression, for example.
  • The extraction result display section 1506 displays the results in, for example, a list format, and the display order in the list may follow a value calculated by the filter unit 107 when the filter is applied (for example, a value such as R (w i )).
  • the correct / incorrect designation section 1507 accepts input of the result of correct / incorrect determination by the user regarding whether or not the extraction result is appropriate, for example.
  • FIG. 16 shows a second example of the user interface to the user.
  • the user interface 1600 includes, for example, filter adjustment sections 1601 to 1602 in addition to the configuration of the user interface 1500.
  • the filter adjustment sections 1601 to 1602 accept input of information related to filter learning and filter application.
  • the filter adjustment section 1601 accepts input of initial values of coordinate weights in a filter combination model based on linear identification, for example.
  • the filter adjustment section 1602 accepts input of initial values of word weights in a filter combination model based on linear identification, for example.
  • With the above configuration, the user can give appropriate target information to any document, sentence, or extraction result and perform information extraction with further filter adjustment. The user can also designate correct/incorrect determinations on the extraction result and change the target information in accordance with the extraction result.
  • Thereby, the information extraction system 101 enables the user to perform information extraction by trial and error without examining the extraction target words or the like in advance. That is, the information extraction system 101 can extract information required by the user with high accuracy from various non-standard documents without depending on a dictionary prepared in advance or a logical structure such as HTML.
  • FIG. 17 shows a second configuration example of the information extraction system.
  • the information extraction system 1701 includes, for example, the same configuration as the information extraction system 101 of the first embodiment.
  • the information extraction system 1701 is different from the information extraction system 101 of the first embodiment in the following points.
  • the target selection unit 106 receives input of the target document 102, selects sentences and coordinates that match the target information 109 from the target document 102, and transmits the target selection results to the sentence extraction unit 103 and the coordinate extraction unit 104.
  • the sentence extraction unit 103 / coordinate extraction unit 104 performs sentence / coordinate extraction from the target selection result instead of the target document 102.
  • the information extraction system 1701 can appropriately output the information extraction result 110 based on the target information 109 from the user.
  • the information extraction system 1701 can extract information by setting the target information anew using the information extraction result 110 as an input.
  • By configuring the information extraction system 1701 as described above, it is possible to realize a system, method, and program that allow a user to perform information extraction by trial and error without examining the extraction target words or the like in advance. As a result, the information extraction system 1701 can extract information required by the user with high accuracy from various non-standard documents without depending on a dictionary prepared in advance or a logical structure such as HTML.
  • The present invention is not limited to the above-described embodiments and includes various modifications.
  • the above-described embodiments have been described in detail for easy understanding of the present invention, and are not necessarily limited to those having all the configurations described.
  • a part of the configuration of a certain embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to the configuration of a certain embodiment.
  • each of the above-described configurations, functions, processing units, processing means, and the like may be realized by hardware by designing a part or all of them with, for example, an integrated circuit.
  • Each of the above-described configurations, functions, and the like may be realized by software by interpreting and executing a program that realizes each function by the processor.
  • Information such as programs, tables, and files that realize each function can be stored in a memory, a hard disk, a recording device such as an SSD (Solid State Drive), or a recording medium such as an IC card, an SD card, or a DVD.
  • control lines and information lines indicate what is considered necessary for the explanation, and not all the control lines and information lines on the product are necessarily shown. Actually, it may be considered that almost all the components are connected to each other.

Abstract

This information extraction system: extracts, from a target document, target expressions, each of which is a character string identical to a character string included in target information indicating a set of character strings to be extracted, and also extracts, from the target document, neighboring words, which are words disposed within a predetermined distance from the target expressions; generates a filter by means of unsupervised learning based on the frequency of occurrence of each neighboring word or the coordinates of each target expression within the target document; applies the filter to a set of words to be filtered including the neighboring words, thereby obtaining a set of words to be extracted; and outputs the obtained set of words to be extracted.

Description

Information extraction system, information extraction method, and recording medium
The present invention relates to an information extraction system, an information extraction method, and a recording medium.
There are analysis systems that extract information described in a target document in a machine-processable form and perform analysis on various target documents. For example, if so-called proper names such as the manufacturer name, product name, and series name can be extracted from a shopping website that is the target document, the analysis system can perform analyses such as compiling product information statistics for each manufacturer.
Thus, techniques for extracting necessary information from non-standard documents or document images are known. As background art in this technical field, there is JP 2013-232127 A (Patent Document 1). Patent Document 1 states that "the excerpt unit 101 obtains an excerpt document by excerpting, from the original document, characters that should be displayed relatively large on the screen on which the original document is displayed. When the amount of the excerpt document to be displayed on the screen does not fit within a predetermined amount, the correction unit 103 corrects the relative size criterion by which the excerpt unit 101 excerpts characters" (see abstract).
JP 2013-232127 A
The analysis system extracts information from a non-standard document using, for example, a dictionary or a plurality of templates prepared in advance. However, for atypical documents, appropriate templates for all documents cannot always be prepared in advance. Also, a dictionary of the words to be extracted is not always readily available.
Patent Document 1 discloses an information extraction method based on the display size of text on a website, but there is a problem in that the information necessary for the user is not always described in an appropriate display size in the target document.
One aspect of the present invention aims to extract information required by a user with high accuracy from a variety of non-standard documents such as websites and document images, without depending on a dictionary prepared in advance or a logical structure such as HTML.
In order to solve the above problems, one aspect of the present invention adopts the following configuration. An information extraction system for extracting information from a target document includes a processor that executes a program and a memory accessed by the processor. The processor performs an information extraction process in which it receives input of target information indicating a set of character strings to be extracted; extracts from the target document target expressions, each of which is a character string that matches any of the character strings included in the target information, and neighboring words, which are words arranged within a predetermined distance of each of the target expressions; generates a filter using unsupervised learning based on the appearance frequency of each of the neighboring words in the target document or the coordinates of each of the target expressions in the target document; applies the filter to a filter application target word set including the neighboring words; and outputs the extraction target word set obtained by applying the filter to the filter application target word set.
One aspect of the present invention can extract information required by a user with high accuracy from various non-standard documents without depending on a dictionary prepared in advance or a logical structure such as HTML.
Issues, configurations, and effects other than those described above will be clarified by the following description of the embodiments.
FIG. 1 is a block diagram illustrating an example of the overall configuration of the information extraction system in Embodiment 1.
FIG. 2A is a diagram illustrating an example of a shopping website, which is an example of a target document in Embodiment 1.
FIG. 2B is a diagram illustrating an example of a document image, which is an example of a target document in Embodiment 1.
FIG. 3 is a diagram illustrating an example of the data storage method of the storage unit in Embodiment 1.
FIG. 4 is a flowchart illustrating a first example of target selection processing in Embodiment 1.
FIG. 5 is a flowchart illustrating a second example of target selection processing in Embodiment 1.
FIG. 6 is a diagram illustrating an example of a target selection result in Embodiment 1.
FIG. 7 is a block diagram illustrating a configuration example of the filter unit in Embodiment 1.
FIG. 8 is a flowchart illustrating a first example of filter learning processing in Embodiment 1.
FIG. 9 is a flowchart illustrating a first example of filter application processing in Embodiment 1.
FIG. 10 is a diagram illustrating an example of a filter application result in Embodiment 1.
FIG. 11 is a flowchart illustrating a second example of filter learning processing in Embodiment 1.
FIG. 12 is a flowchart illustrating a second example of filter application processing in Embodiment 1.
FIG. 13 is a flowchart illustrating a third example of filter learning processing in Embodiment 1.
FIG. 14 is a flowchart illustrating a third example of filter application processing in Embodiment 1.
FIG. 15 is a diagram illustrating a first example of a user interface in Embodiment 1.
FIG. 16 is a diagram illustrating a second example of a user interface in Embodiment 1.
FIG. 17 is a block diagram illustrating an example of the overall configuration of the information extraction system in Embodiment 2.
Hereinafter, embodiments of the present invention will be described with reference to the drawings. The present embodiment describes an information extraction system that extracts information from a target document. When the information extraction system receives from the user input of target information indicating a set of character strings to be extracted, it extracts from the target document target expressions, which are character strings that match any of the character strings included in the target information, and neighboring words, which are words located physically close to each of the target expressions. By acquiring neighboring words, the information extraction system can broadly obtain, without using a dictionary or the like, not only the target expressions directly specified by the user as extraction targets but also information related to the target expressions that may be necessary for the user.
The information extraction system generates a filter using unsupervised learning based on the appearance frequency of each neighboring word in the target document or the coordinates of each of the target expressions in the target document. By applying the generated filter to a filter application target word set including the neighboring words, the information extraction system can remove neighboring words that are unnecessary for the user without using a dictionary or the like; that is, it can obtain the information the user needs with high accuracy.
FIG. 1 shows a configuration example of an information extraction system. The information extraction system 101 includes, for example, a computer having a processor (CPU) 111, a memory 112, an auxiliary storage device 113, and a communication interface 114.
 プロセッサ111は、メモリ112に格納されたプログラムを実行する。メモリ112は、不揮発性の記憶素子であるROM及び揮発性の記憶素子であるRAMを含む。ROMは、不変のプログラム(例えば、BIOS)などを格納する。RAMは、DRAM(Dynamic Random Access Memory)のような高速かつ揮発性の記憶素子であり、プロセッサ111が実行するプログラム及びプログラムの実行時に使用されるデータを一時的に格納する。 The processor 111 executes a program stored in the memory 112. The memory 112 includes a ROM that is a nonvolatile storage element and a RAM that is a volatile storage element. The ROM stores an immutable program (for example, BIOS). The RAM is a high-speed and volatile storage element such as a DRAM (Dynamic Random Access Memory), and temporarily stores a program executed by the processor 111 and data used when the program is executed.
 The auxiliary storage device 113 is a large-capacity, nonvolatile storage device such as a magnetic storage device (HDD) or a flash memory (SSD), and stores programs executed by the processor 111 and data used when the programs are executed. That is, a program is read from the memory 112 or the auxiliary storage device 113, loaded into the memory 112, and executed by the processor 111.
 情報抽出システム101は、入力インターフェース115及び出力インターフェース118を有してもよい。入力インターフェース115は、キーボード116やマウス117などが接続され、利用者からの入力を受けるインターフェースである。出力インターフェース118は、ディスプレイ装置119やプリンタなどが接続され、プログラムの実行結果を利用者が視認可能な形式で出力するインターフェースである。 The information extraction system 101 may include an input interface 115 and an output interface 118. The input interface 115 is an interface to which a keyboard 116, a mouse 117, and the like are connected and receives input from the user. The output interface 118 is an interface to which a display device 119, a printer, or the like is connected, and the execution result of the program is output in a format that can be viewed by the user.
 通信インターフェース114は、所定のプロトコルに従って、他の装置との通信を制御するネットワークインターフェース装置である。また、通信インターフェース114は、例えば、USB等のシリアルインターフェースを含む。 The communication interface 114 is a network interface device that controls communication with other devices according to a predetermined protocol. The communication interface 114 includes a serial interface such as USB, for example.
 A program executed by the processor 111 is provided to the information extraction system 101 via a removable medium (a computer-readable, portable, non-transitory storage medium such as a CD-ROM or a flash memory) or via a network, and may be stored in the nonvolatile auxiliary storage device 113, which is a non-transitory storage medium. For this reason, the information extraction system 101 preferably has an interface for reading data from removable media.
 The information extraction system 101 is a computer system configured on a physically single computer, or on a plurality of logically or physically configured computers; it may operate in separate threads on the same computer, or on a virtual machine built on a plurality of physical computer resources.
 情報抽出システム101は、例えば入力インターフェース115又は通信インターフェース114を介して、対象文書102と対象情報109の入力を受け付ける。対象文書102は、例えば、文書画像であってもよいしHTML及びCSS等で記述されたウェブサイトであってもよい。文書画像とは、紙等の媒体に印刷された文書が電子化された画像を示す。 The information extraction system 101 receives input of the target document 102 and the target information 109 via the input interface 115 or the communication interface 114, for example. The target document 102 may be, for example, a document image or a website described in HTML, CSS, or the like. The document image indicates an image obtained by digitizing a document printed on a medium such as paper.
 The target information 109 indicates a set of character strings that serve as the starting point of information extraction, and is designated by the user. The target information 109 includes, for example, at least one of a regular expression, a word, a sentence, a sentence containing wildcards, a part of speech, a target document ID, and a target sentence ID. "¥?,???- ¥3,*- ¥[1-4],000-" is an example of wildcards, and "¥¥¥d[,].¥d{2,4}-" is an example of a regular expression. The information extraction system 101 extracts, from the target document 102, the information designated by the target information 109 and information based on the target information 109.
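 Purely for illustration (this snippet is not part of the original disclosure, and the patterns shown are simplified stand-ins rather than the literal examples above), target information given as a regular expression or a wildcard could be matched against sentence text as follows in Python:

    import fnmatch
    import re

    # Both patterns below are simplified stand-ins, not the literal examples above.
    target_regex = r"\d{1,3}(?:,\d{3})+-"     # assumed price-like regular expression
    target_wildcard = "*?,???-*"              # assumed wildcard: '?' = one character, '*' = any run

    def matches_target(text):
        return bool(re.search(target_regex, text)) or fnmatch.fnmatch(text, target_wildcard)

    print(matches_target("3,000-"))           # True: matched by both patterns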
 メモリ112は、例えば、プログラムである文抽出部103、座標抽出部104、対象選定部106、及び結果生成部108を含む。また、メモリ112は、データを格納する領域である蓄積部105を含む。また、メモリ112は、データを格納する領域及びプログラムを含むフィルタ部107を含む。 The memory 112 includes, for example, a sentence extraction unit 103 which is a program, a coordinate extraction unit 104, a target selection unit 106, and a result generation unit 108. The memory 112 includes an accumulation unit 105 that is an area for storing data. Further, the memory 112 includes a filter unit 107 including an area for storing data and a program.
 プロセッサ111は、プログラムに従って動作することによって、所定の機能を実現する機能部として動作する。例えば、プロセッサ111は、文抽出部103に従って動作することで文抽出部として機能し、座標抽出部104に従って動作することで座標抽出部として機能する。さらに、プロセッサ111は、各プログラムが実行する複数の処理のそれぞれを実現する機能部としても動作する。計算機及び計算機システムは、これらの機能部を含む装置及びシステムである。 The processor 111 operates as a functional unit that realizes a predetermined function by operating according to a program. For example, the processor 111 functions as a sentence extracting unit by operating according to the sentence extracting unit 103, and functions as a coordinate extracting unit by operating according to the coordinate extracting unit 104. Furthermore, the processor 111 also operates as a functional unit that realizes each of a plurality of processes executed by each program. A computer and a computer system are an apparatus and a system including these functional units.
 The sentence extraction unit 103 extracts sentences from each input target document 102. A sentence in this embodiment refers to each character string of one or more characters obtained by dividing the character string consisting of all characters included in the target document 102 according to a predetermined rule; it is a concept that does not necessarily coincide with a grammatical sentence. A character string delimited by predetermined characters or symbols such as full stops, commas, periods, or spaces is an example of a sentence. A grammatical sentence included in the target document 102 is an example of a sentence in this embodiment, and so is each word included in the target document 102. The sentence extraction unit 103 assigns a document ID to each input target document 102 and a sentence ID to each extracted sentence.
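 As a minimal sketch (not part of the original disclosure; the concrete delimiter set and ID format are assumptions), the splitting rule and ID assignment could be implemented as follows in Python:

    import re

    # The delimiter set is an assumption; the embodiment only requires "predetermined
    # characters or symbols" such as full stops, commas, periods, and spaces.
    DELIMITERS = r"[。、，,.\s]+"

    def extract_sentences(doc_text, doc_id):
        parts = [p for p in re.split(DELIMITERS, doc_text) if p]
        # assign a sentence ID to every extracted "sentence"
        return {doc_id: {"s%d" % i: {"text": p} for i, p in enumerate(parts)}}

    print(extract_sentences("ExampleCo Widget 3000yen. Color: blue", "doc0"))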
 座標抽出部104は、文抽出部103が抽出した文それぞれの座標情報を抽出する。座標情報は、例えば、対象文書102の紙面又は表示装置における座標で表される。文全体を囲う最小サイズの矩形の対角を成す2頂点の座標は、文の座標情報の一例である。文抽出部103又は座標抽出部104の一方は、入力された対象文書に文書IDを付与する。文抽出部103及び座標抽出部104は、例えば、ウェブブラウザのレンダリング機能及びOCR機能を含む。 The coordinate extracting unit 104 extracts coordinate information of each sentence extracted by the sentence extracting unit 103. The coordinate information is represented by, for example, coordinates on the paper surface of the target document 102 or a display device. The coordinates of the two vertices forming the diagonal of the minimum-size rectangle surrounding the entire sentence are an example of the coordinate information of the sentence. One of the sentence extraction unit 103 or the coordinate extraction unit 104 assigns a document ID to the input target document. The sentence extraction unit 103 and the coordinate extraction unit 104 include, for example, a web browser rendering function and an OCR function.
 The storage unit 105 holds, for example, information indicating the correspondence between the document ID of the target document 102, the extracted sentences, and the sentence IDs and coordinate information of the extracted sentences. The target selection unit 106 refers to the information held in the storage unit 105, selects the sentences that match the target information 109, the coordinates of the matching sentences, and the neighboring words of the matching sentences, and transmits the selected sentences, coordinates, and neighboring words to the filter unit 107. Neighboring words are described later. A sentence that matches the target information 109 and is selected by the target selection unit 106 is called a target expression.
 The filter unit 107 removes, based on the sentence coordinates and neighboring words selected by the target selection unit 106, the sentences, neighboring words, and coordinates that are excluded from extraction, and transmits the remaining sentences, coordinates, and neighboring words to the result generation unit 108.
 結果生成部108は、フィルタ部107から受信した文、座標、及び近傍語を適切な形式で、出力インターフェース118を介して、情報抽出結果110として出力する。また、結果生成部108は、蓄積部105に情報抽出結果110を適切な文書IDを付与して後述する文データとして蓄積してもよい。 The result generation unit 108 outputs the sentence, coordinates, and neighborhood words received from the filter unit 107 in an appropriate format as the information extraction result 110 via the output interface 118. The result generation unit 108 may store the information extraction result 110 in the storage unit 105 as sentence data to be described later with an appropriate document ID.
 情報抽出システム101は上述の構成により、利用者から入力された対象情報109に基づき、適切に情報抽出結果110を出力することができる。また、情報抽出システム101は、情報抽出結果110から、新たに設定された対象情報109に基づいて、再度情報抽出を行うことができる。 The information extraction system 101 can appropriately output the information extraction result 110 based on the target information 109 input from the user with the above-described configuration. Further, the information extraction system 101 can extract information again from the information extraction result 110 based on the newly set target information 109.
 図2Aは、対象文書102の一例である、ショッピングウェブサイトの一例を示す。図2Aのショッピングウェブサイトには、複数の同一種類の商品が列挙され、各商品についてそれぞれ異なる商品情報(製造者、固有名、値段等)が記載されている。図2Aのようにウェブサイトが対象文書102である場合、文抽出部103及び座標抽出部104は、例えば、ウェブブラウザのレンダリング機能を利用して、文及び文の座標を抽出する。 FIG. 2A shows an example of a shopping website, which is an example of the target document 102. The shopping website in FIG. 2A lists a plurality of products of the same type, and describes different product information (manufacturer, unique name, price, etc.) for each product. When the website is the target document 102 as shown in FIG. 2A, the sentence extraction unit 103 and the coordinate extraction unit 104 extract the coordinates of the sentence and the sentence by using, for example, a rendering function of the web browser.
 図2Bは、対象文書102の一例である、文書画像の一例を示す。図2Bの文書画像には、石名、深さ、及び詳細が様々なレイアウトで表示されている。図2Bのように文書画像が対象文書102である場合、文抽出部103及び座標抽出部104は、例えば、OCR機能を利用して、文及び文の座標を抽出する。 FIG. 2B shows an example of a document image, which is an example of the target document 102. In the document image of FIG. 2B, the stone name, depth, and details are displayed in various layouts. When the document image is the target document 102 as shown in FIG. 2B, the sentence extraction unit 103 and the coordinate extraction unit 104 extract the coordinates of the sentence and the sentence using, for example, the OCR function.
 図3は、蓄積部105におけるデータ管理方法の一例を示す。文データ300は、Key Value Store(KVS)と呼ばれる方法によって蓄積されたデータである。文データ300は、文書ID301、文ID302、及び文情報303を含む。文書ID301は、対象文書102を一意に識別する情報である。文ID302は、各対象文書内の文を一意に識別する情報である。文情報303は、対応する文IDの文及び当該文のアノテーション情報を含む。文の座標情報、及び文に含まれるフォント情報は、アノテーション情報の一例である。 FIG. 3 shows an example of a data management method in the storage unit 105. The sentence data 300 is data accumulated by a method called “Key Value Store (KVS)”. The sentence data 300 includes a document ID 301, a sentence ID 302, and sentence information 303. The document ID 301 is information that uniquely identifies the target document 102. The sentence ID 302 is information for uniquely identifying a sentence in each target document. The sentence information 303 includes a sentence with a corresponding sentence ID and annotation information of the sentence. The coordinate information of the sentence and the font information included in the sentence are examples of annotation information.
 KVS方式を用いることによって、このように、所望のキーに対し、その値を複数の階層で保持することができる。情報抽出システム101は、例えば、所望の文書ID、又は文IDが与えられた場合、対応する文を出力することができる。また、例えば、文書IDのみが与えられた場合、情報抽出システム101は、対応する文IDのリストを出力することができる。 By using the KVS method, the value for a desired key can be held in a plurality of layers in this way. For example, when a desired document ID or sentence ID is given, the information extraction system 101 can output a corresponding sentence. For example, when only the document ID is given, the information extraction system 101 can output a list of corresponding sentence IDs.
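 A minimal sketch of this two-level key-value layout, with hypothetical document/sentence IDs and fields:

    # Hypothetical document ID, sentence IDs, and fields, for illustration only.
    sentence_data = {
        "doc0": {
            "s0": {"text": "ExampleCo Widget", "coords": (10, 20, 110, 40)},
            "s1": {"text": "3,000-",           "coords": (10, 50, 60, 70)},
        }
    }

    # given a document ID and a sentence ID, the corresponding sentence can be returned
    print(sentence_data["doc0"]["s1"]["text"])
    # given only a document ID, the list of corresponding sentence IDs can be returned
    print(list(sentence_data["doc0"].keys()))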
 なお、本実施形態及び他の実施形態において、情報抽出システム101が使用する情報は、データ構造に依存せずどのようなデータ構造で表現されていてもよい。例えば、テーブル、リスト、データベース又はキューから適切に選択したデータ構造体が、情報を格納することができる。 In this embodiment and other embodiments, the information used by the information extraction system 101 may be expressed in any data structure without depending on the data structure. For example, a data structure appropriately selected from a table, list, database or queue can store the information.
 以下、対象選定部106による、選定方法の例を示す。図4は、対象選定部106による、正規表現を用いた選定方法の例を示す。対象選定部106は、文書ID、文ID、及び正規表現を含む対象情報109の入力を受け付ける(S401)。なお、対象情報109は、文書ID及び文IDを含まなくてもよい。 Hereinafter, an example of a selection method by the target selection unit 106 will be shown. FIG. 4 shows an example of a selection method using regular expressions by the object selection unit 106. The target selection unit 106 receives input of target information 109 including a document ID, a sentence ID, and a regular expression (S401). The target information 109 may not include the document ID and the sentence ID.
 Subsequently, the target selection unit 106 extracts, from the sentence data 300 in the storage unit 105, the target sentences corresponding to the document IDs and sentence IDs included in the target information 109, and checks whether each target sentence contains an expression that matches the regular expression included in the target information 109, that is, a target expression (S402). Note that the target selection unit 106 may, for example in accordance with a user instruction, extract the target sentences again from data in which words not included in the extraction target words generated by the result generation unit 108 have been deleted from the sentence data 300. This allows the information extraction system 101 to apply a further filter to data that has already been filtered once, improving the accuracy of information extraction. If the target information 109 does not include a document ID and a sentence ID, the target selection unit 106 extracts all sentences included in the sentence data 300 as target sentences.
 If no target sentence contains a target expression (S402: no), the process ends. If there is a target sentence that contains a target expression, that is, a target sentence that contains a target expression matching the target information 109 (S402: yes), the target selection unit 106 acquires the target expressions, the coordinates of the target expressions, and the neighboring words of the target expressions, includes the acquired information together with the sentence IDs and document IDs containing the target expressions in a target selection result data block, and outputs it to the filter unit 107 (S403). The target selection result data block is described later.
 The minimum-size rectangle coordinates surrounding a target expression, and the minimum-size rectangle coordinates surrounding the entire target sentence containing the target expression, are examples of the coordinates that the target selection unit 106 outputs in step S403. A neighboring word of a target expression is a word located close to the target expression in terms of coordinates in the document. The target selection unit 106 acquires, for example, up to a predetermined number of words within a predetermined distance of the target expression as the neighboring words of that target expression. By acquiring neighboring words, the target selection unit 106 can obtain, for example, words that are necessary for the user but unknown to the user.
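 A rough sketch of steps S402 and S403, under the assumptions that each stored sentence carries rectangle coordinates and that nearness is measured between rectangle centres; the distance measure and the limits max_dist and max_count are illustrative choices, not values from the disclosure:

    import re

    def centre(box):                              # box = (x1, y1, x2, y2)
        return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

    def neighbouring_words(target_box, words, max_dist=80.0, max_count=5):
        cx, cy = centre(target_box)
        scored = []
        for w in words:                           # words: [{"text": ..., "coords": ...}, ...]
            wx, wy = centre(w["coords"])
            d = ((wx - cx) ** 2 + (wy - cy) ** 2) ** 0.5
            if 0 < d <= max_dist:                 # skip the target itself (d == 0)
                scored.append((d, w["text"]))
        return [text for _, text in sorted(scored)[:max_count]]

    def select_targets(pattern, words):
        result = []
        for w in words:
            if re.search(pattern, w["text"]):     # S402: does the sentence match?
                result.append({"expression": w["text"],
                               "coords": w["coords"],
                               "neighbours": neighbouring_words(w["coords"], words)})
        return result                             # S403: content of a target selection result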
 FIG. 5 shows an example of a selection method by the target selection unit 106 using parts of speech. The target selection unit 106 receives input of target information 109 including a document ID, a sentence ID, and a part of speech (S501). The target selection unit 106 extracts, from the storage unit 105, the target sentences corresponding to the document IDs and sentence IDs included in the target information 109, and checks whether each target sentence contains a word that matches the part of speech included in the target information (S502). As in the description of FIG. 4, the target information 109 need not include a document ID and a sentence ID, and the target selection unit 106 may extract the target sentences from the sentence data 300 generated by the result generation unit 108.
 If no target sentence contains a target expression (S502: no), the process ends. If there is a target sentence that contains a target expression, that is, a target sentence that contains a word matching the part of speech included in the target information 109 (S502: yes), the target selection unit 106 acquires the target expressions, their coordinates, and their neighboring words, includes the acquired information together with the sentence IDs and document IDs containing the target expressions in a target selection result data block, and outputs it to the filter unit 107 (S503). The target selection unit 106 may, for example, use a general morphological analysis method to recognize the words in a sentence and identify their parts of speech.
 なお、ステップS403及びステップS503の処理において、1つの対象文に複数の対象表現が含まれる場合、対象選定部106は、例えば、当該対象文において先頭から所定個数以内の対象表現を抽出してもよいし、当該対象文に含まれる全ての対象表現を抽出してもよい。 Note that, in the processes of step S403 and step S503, when a plurality of target expressions are included in one target sentence, the target selecting unit 106 may extract, for example, a target expression within a predetermined number from the top in the target sentence. Alternatively, all target expressions included in the target sentence may be extracted.
 FIG. 4 shows an example of target selection using a regular expression, and FIG. 5 shows an example of target selection using a part of speech, but the target selection unit 106 can likewise perform target selection using target information 109 that includes wildcards, words, and the like. The target selection unit 106 may also perform target selection by combining multiple types of target information 109 as appropriate, for example using logical OR or logical AND. Specifically, the target selection unit 106 may, for example, extract target expressions that match a specific regular expression and/or include a specific part of speech.
 図6は、対象選定部106が生成する対象選定結果データブロックの例を示す。対象選定結果データブロック600は、例えば、文書ID601、対象表現ID602、及び対象表現情報603を含み、例えば、KVS方式で蓄積されたデータである。文書ID601は、対象文書102を一意に識別する情報である。対象表現ID602は、対象文書102中の対象表現を一意に識別する情報であり、例えば、対象選定部106によって付与される。対象表現情報603は、対象表現に関する情報であり、例えば、対象表現、近傍語、及び座標を含む。このように対象選定結果データブロック600が構成されることにより、情報抽出システム101は、選定された対象毎に実際の表現、近傍語、座標を容易に取得することができる。 FIG. 6 shows an example of a target selection result data block generated by the target selection unit 106. The target selection result data block 600 includes, for example, a document ID 601, a target expression ID 602, and target expression information 603, and is data accumulated by, for example, the KVS method. The document ID 601 is information that uniquely identifies the target document 102. The target expression ID 602 is information for uniquely identifying the target expression in the target document 102, and is given by, for example, the target selection unit 106. The target expression information 603 is information related to the target expression, and includes, for example, the target expression, neighborhood words, and coordinates. By configuring the object selection result data block 600 as described above, the information extraction system 101 can easily acquire an actual expression, a neighborhood word, and coordinates for each selected object.
 図7は、フィルタ部107の構成例を示す。フィルタ部107は、例えば、プログラムであるフィルタ学習部702及びフィルタ適用部704、並びにデータを格納する領域であるフィルタモデル蓄積部703を含む。 FIG. 7 shows a configuration example of the filter unit 107. The filter unit 107 includes, for example, a filter learning unit 702 and a filter application unit 704 that are programs, and a filter model storage unit 703 that is an area for storing data.
 When target data 701 having target expressions, coordinates, and neighboring words is input to the filter unit 107, the filter learning unit 702 acquires predetermined information included in the target data 701 and a filter model present in the filter model storage unit 703, and learns a filter model based on the acquired information and the model data.
 なお、対象選定結果データブロック600は、対象データ701の一例である。なお、フィルタ学習部702は、フィルタ学習に際して、フィルタモデル蓄積部703のフィルタモデルを使用しなくてもよい。フィルタ学習部702は、生成したフィルタモデルを、フィルタモデル蓄積部703に送信し、フィルタモデル蓄積部703はフィルタモデルを蓄積する。 The target selection result data block 600 is an example of the target data 701. Note that the filter learning unit 702 does not have to use the filter model of the filter model storage unit 703 when performing filter learning. The filter learning unit 702 transmits the generated filter model to the filter model storage unit 703, and the filter model storage unit 703 stores the filter model.
 フィルタ適用部704は、フィルタモデル蓄積部703に存在する適切なフィルタモデルを対象データ701に対して適用する。最後にフィルタ適用部704においてフィルタが適用された結果データ705を出力する。 The filter application unit 704 applies an appropriate filter model existing in the filter model storage unit 703 to the target data 701. Finally, the result data 705 to which the filter is applied in the filter application unit 704 is output.
 FIG. 8 shows an example of filter learning processing by the filter learning unit 702. The filter learning method in FIG. 8 is a so-called unsupervised learning method. The filter learning unit 702 acquires the words included in the target data 701 and acquires the appearance frequency of each word in the sentence data 300 (S801). For example, the neighboring words included in the target data 701 are words that the filter learning unit 702 acquires in step S801. In step S801, the filter learning unit 702 may also acquire, for example, words obtained by morphological analysis of the target expressions. Hereinafter, the acquired words are denoted by w_1, ..., w_n.
 In step S801, the filter learning unit 702 may acquire only the words in a learning range designated by document IDs or the like, together with the appearance frequency of each word within that learning range; in that case, the subsequent processing is also performed on that learning range. The learning range is designated, for example, by the user.
 For each word w_i (1 ≤ i ≤ n) acquired in step S801, the filter learning unit 702 sets initial values of a variable χ_i (0 or 1), variables π_ij (0 ≤ π_ij ≤ 1, 1 ≤ j ≤ n), and a real-valued parameter θ_i, each within its domain (S802). In setting the initial values, the filter learning unit 702 can, for example, set all χ_i to 1 and set π_ij and θ_i to predetermined values. The filter learning unit 702 may also set each initial value randomly within its domain.
 Subsequently, the filter learning unit 702 calculates R(w_i) = P_D / P_N for each word w_i (S803). Here, P_D is the probability that w_i is an extraction target word, and P_N is the probability that w_i is a filter word. The methods of calculating P_D and P_N are described below. The filter learning unit 702 calculates P_D for each word w_i, for example, as follows.
[Math 2]
 Here, χ_i is a flag indicating whether the word w_i is an extraction target word: χ_i = 1 indicates that the word w_i is an extraction target word, and χ_i = 0 indicates that the word w_i is not an extraction target word, that is, it is a filter word. π_ij is the probability that the word w_i is derived from the word w_j. "The word w_i is derived from w_j" refers to the state in which the sentence extraction unit 103 has mistakenly extracted the word w_j in the target document as the word w_i, for example due to an OCR error.
 d_m(w_i, w_j) denotes the similarity between the word w_i and the word w_j; for example, the edit distance is used as the similarity. P(w_i | χ_i = 1) denotes the proportion that the appearance frequency of the word w_i occupies in the total appearance frequency of all words with χ_i = 1. By using d_m and π_ij in calculating P_D, the filter learning unit 702 can perform filter learning with high accuracy even for words that have been misrecognized due to OCR errors or the like. The filter learning unit 702 calculates P(d_m | θ), for example, as follows.
[Math 3]
 Here, the filter learning unit 702 calculates P(d_m | θ) using a Poisson distribution, but an appropriate probability density function can be used according to the word generation model. The filter learning unit 702 may use, for example, other distributions of the exponential family, such as the Bernoulli distribution, binomial distribution, multinomial distribution, normal distribution, exponential distribution, t-distribution, chi-squared distribution, gamma distribution, beta distribution, F-distribution, or Laplace distribution. The filter learning unit 702 calculates P_N, for example, as follows.
[Math 4]
 P(w_i | χ_i = 0) denotes the proportion that the appearance frequency of the word w_i occupies in the total appearance frequency of all words with χ_i = 0. The filter learning unit 702 resets the value of the variable χ_i to 1 for all words with R(w_i) > 1, resets the value of the variable χ_i to 0 for all words with R(w_i) ≤ 1, and resets π_ij and θ_i based on the reset χ_i (S804). Alternatively, the filter learning unit 702 may reset the value of the variable χ_i to 1 for all words with R(w_i) ≥ 1 and to 0 for all words with R(w_i) < 1.
 In step S804, the filter learning unit 702 thus resets the value of the variable χ_i based on R(w_i); the threshold used here may be 1 as in the above example, or another value within the range of R(w_i) (a real number greater than or equal to 0). For convenience, the variable γ_ik (1 ≤ k ≤ n) is defined as follows.
[Math 5]
 また、変数Γを以下のように定義する。 The variable Γ i is defined as follows.
[Math 6]
 フィルタ学習部702は、以上の値を用いて、πijを例えば、以下のように再設定する。 The filter learning unit 702 uses the above values to reset π ij as follows, for example.
[Math 7]
 また、フィルタ学習部702は、パラメータθを例えば、以下のように再設定する。 Further, the filter learning unit 702 resets the parameter θ k as follows, for example.
[Math 8]
 なお、上述したパラメータθの再設定の例は、P(d|θ)の算出にポワソン分布が用いられた場合に対応するものである。P(d|θ)の算出にポワソン分布以外の分布が用いられた場合、フィルタ学習部702は、例えば、以下に示すθについての更新式を解くことにより、θを再設定する。 The example of resetting the parameter θ k described above corresponds to the case where the Poisson distribution is used for calculating P (d m | θ). When a distribution other than the Poisson distribution is used to calculate P (d m | θ), the filter learning unit 702 resets θ k by, for example, solving the update formula for θ k shown below.
[Math 9]
 続いて、フィルタ学習部702は、全単語における現在のパラメータに対する同時確率を以下のように計算する(S805)。 Subsequently, the filter learning unit 702 calculates the joint probability for the current parameters in all words as follows (S805).
[Math 10]
 The filter learning unit 702 determines whether the above joint probability has converged (S806). The filter learning unit 702 determines that the joint probability has converged, for example, when the joint probability is a value within a predetermined range. Alternatively, the filter learning unit 702 may, for example, compare the above joint probability with the previously calculated joint probability and determine that the joint probability has converged when it has not increased by a certain value or by a certain ratio or more.
 フィルタ学習部702が、同時確率が収束したと判定した場合(S806;yes)、処理を終了する。フィルタ学習部702が、同時確率が収束していないと判定した場合(S806:no)、ステップS803に戻る。 If the filter learning unit 702 determines that the joint probability has converged (S806; yes), the process ends. When the filter learning unit 702 determines that the joint probability has not converged (S806: no), the process returns to step S803.
 The filter learning unit 702 can select, according to the value of χ_i corresponding to each word w_i at the end of the processing, whether each word w_i is an extraction target word or a filter word. The filter learning unit 702 transmits, for example, the set of extraction target words and the set of filter words to the filter model storage unit 703.
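 The following skeleton illustrates only the control flow of steps S801 to S806; the callables p_target and p_filter are hypothetical placeholders for the definitions of P_D and P_N above (the re-estimation of π_ij and θ is assumed to happen inside them), so this is a sketch rather than the patented computation:

    import math

    def learn_word_filter(words, p_target, p_filter, max_iter=100, tol=1e-6):
        chi = {w: 1 for w in words}                          # S802: start with every word flagged 1
        prev = -math.inf
        for _ in range(max_iter):
            # S803: R(w) = P_D / P_N for every word under the current flags
            r = {w: p_target(w, chi) / max(p_filter(w, chi), 1e-12) for w in words}
            # S804: re-set the flags (threshold 1, as in the text); re-estimation of
            # pi_ij and theta is assumed to be folded into the callables
            chi = {w: (1 if r[w] > 1.0 else 0) for w in words}
            # S805: joint (log) probability of the current assignment
            ll = sum(math.log(max(p_target(w, chi) if chi[w] == 1 else p_filter(w, chi), 1e-12))
                     for w in words)
            if abs(ll - prev) < tol:                         # S806: convergence check
                break
            prev = ll
        targets = [w for w in words if chi[w] == 1]          # extraction target words
        filters = [w for w in words if chi[w] == 0]          # filter words
        return targets, filters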
 図9は、フィルタ適用部704によるフィルタ適用処理の一例を示す。図9におけるフィルタ適用処理は、図8におけるフィルタ学習処理を用いる例を示す。フィルタ適用部704は、フィルタモデル蓄積部703から抽出対象語の集合を取得し、対象データ701からフィルタ適用対象語集合を取得する(S901)。フィルタモデル蓄積部703が保持する抽出対象語の集合は、図8に示した教師なし学習手段によって得られた集合である。対象データ701に含まれる近傍語からなる集合はフィルタ適用対象語集合の一例である。フィルタ適用部704は、例えば、対象データ701に含まれる対象表現に対する形態素解析により得られた単語を、フィルタ適用対象語集合に含めてもよい。 FIG. 9 shows an example of filter application processing by the filter application unit 704. The filter application process in FIG. 9 shows an example using the filter learning process in FIG. The filter application unit 704 acquires a set of extraction target words from the filter model storage unit 703, and acquires a filter application target word set from the target data 701 (S901). The set of extraction target words held by the filter model storage unit 703 is a set obtained by the unsupervised learning means shown in FIG. A set of neighboring words included in the target data 701 is an example of a filter application target word set. For example, the filter application unit 704 may include a word obtained by morphological analysis on the target expression included in the target data 701 in the filter application target word set.
 Subsequently, the filter application unit 704 checks whether the filter application target word set contains any extraction target word (S902). In doing so, the filter application unit 704 may check for exact matches between each word of the filter application target word set and each extraction target word, or may check using a measure based on similarity between words, such as the edit distance.
 The filter application unit 704 may also check whether all of the extraction target words are contained, or whether one or more extraction target words are contained. If the filter application unit 704 determines that the filter application target word set contains no extraction target word (S902: no), all words in the filter application target word set are filter words, so nothing is output and the processing ends.
 If the filter application unit 704 determines that the filter application target word set contains an extraction target word (S902: yes), the filter application unit 704 outputs the result data 705 after filter application (S903) and ends the processing. Data obtained by removing the filter words and the coordinates corresponding to the filter words from the target data 701 is an example of the result data 705 after filter application.
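 A sketch of steps S901 to S903, assuming the similarity check uses a small edit-distance threshold (the threshold of 1 is an illustrative choice):

    def edit_distance(a, b):
        # standard Levenshtein distance with a single rolling row
        dp = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, cb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
        return dp[len(b)]

    def apply_word_filter(filter_application_words, extraction_target_words, max_dist=1):
        kept = [w for w in filter_application_words
                if any(edit_distance(w, t) <= max_dist for t in extraction_target_words)]
        return kept        # an empty list corresponds to the "nothing is output" branch (S902: no)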
 FIG. 10 shows an example of filtering results for words by the filter unit 107. "Correct" indicates words that should actually be targets, and "incorrect" indicates words that are not actually targets. "Acquired" indicates words determined to be extraction target words by the unsupervised learning described above, and "not acquired" indicates words determined to be filter words by the unsupervised learning method described above. For the extraction target words, the precision, defined as (correct and acquired) / {(correct and acquired) + (incorrect and acquired)}, was 75%, and the recall, defined as (correct and acquired) / {(correct and acquired) + (correct and not acquired)}, was 56.8%. By the method described above, the information extraction system 101 can determine a small number of extraction target words from among many words without supervision.
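 Written in conventional notation, the two measures used here are:

    \text{precision} = \frac{|\text{correct} \wedge \text{acquired}|}{|\text{correct} \wedge \text{acquired}| + |\text{incorrect} \wedge \text{acquired}|},
    \qquad
    \text{recall} = \frac{|\text{correct} \wedge \text{acquired}|}{|\text{correct} \wedge \text{acquired}| + |\text{correct} \wedge \text{not acquired}|}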
 図11は、フィルタ学習部702によるフィルタ学習処理の第二の例を示す。本例は、座標に対するフィルタの学習処理である。フィルタ学習部702は、対象データ701中の対象表現の座標情報を取得する(S1101)。なお、フィルタ学習部702は、例えば、対象データ中の近傍語の座標情報を併せて取得してもよい。 FIG. 11 shows a second example of filter learning processing by the filter learning unit 702. This example is a filter learning process for coordinates. The filter learning unit 702 acquires coordinate information of the target expression in the target data 701 (S1101). Note that the filter learning unit 702 may also acquire, for example, the coordinate information of neighboring words in the target data.
 Subsequently, the filter learning unit 702 sets an initial value of a real-valued parameter η (S1102). The initial value of η may be designated in advance, or may be designated by, for example, the user. The initial value of η is preferably designated according to the size of the target document 102; specifically, it is preferably set, for example, to a value obtained by substituting the area of one line of the target document 102 into a predetermined increasing function. η may also be adjusted according to the extraction results. Subsequently, the filter learning unit 702 learns a kernel density estimation function p(x) according to the following formula (S1103), outputs the learned result, and ends. p(x) denotes the probability density that a coordinate x is a coordinate to be extracted.
[Math 11]
 Here, N is the number of coordinates acquired in step S1101, D is the dimension of the coordinates, x is a variable denoting an arbitrary coordinate, and x_n denotes each coordinate acquired in step S1101. In the example of FIG. 11, the filter learning unit 702 estimates the probability density using kernel density estimation, but other probability density estimation methods may be used, such as the k-nearest neighbor method, the histogram method, or a Gaussian mixture distribution.
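 The formula referenced above appears only as an image in this text; a Gaussian-kernel form consistent with the variables named here (N, D, η, x_n) would be the following, offered as an assumed reconstruction rather than the original equation:

    p(x) \;=\; \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi\eta^{2})^{D/2}} \exp\!\left( -\frac{\lVert x - x_{n} \rVert^{2}}{2\eta^{2}} \right)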
 図12は、フィルタ適用部704によるフィルタ適用処理の第二の例を示す。本例は、図11に示した座標に対するフィルタを適用する処理である。フィルタ適用部704は、対象データ701に含まれる対象表現及び対象表現の座標、並びに閾値を取得する(S1201)。閾値は、利用者などより与えられてもよいし、予め設定されていてもよいし、出力結果の正否判定に基づいてフィルタ適用部704によって設定されてもよい。 FIG. 12 shows a second example of filter application processing by the filter application unit 704. This example is processing for applying a filter to the coordinates shown in FIG. The filter application unit 704 acquires the target expression included in the target data 701, the coordinates of the target expression, and a threshold value (S1201). The threshold value may be given by a user or the like, may be set in advance, or may be set by the filter application unit 704 based on whether the output result is correct or not.
 The filter application unit 704 substitutes each acquired coordinate into the coordinate filter model p(x) illustrated in FIG. 11 to calculate the likelihood (probability value) of each acquired coordinate, and determines whether each calculated likelihood is greater than or equal to the acquired threshold (S1202). If the filter application unit 704 determines that all calculated likelihoods are smaller than the threshold (S1202: no), there are no coordinates to be extracted, so the processing ends.
 If the filter application unit 704 determines that there is a likelihood that is greater than or equal to the threshold (S1202: yes), the filter application unit 704 outputs the result data 705 after filter application (S1203) and ends the processing. Target data 701 from which the target expressions whose coordinates correspond to likelihoods below the threshold, the neighboring words of those target expressions, and the coordinates of those target expressions have been removed is an example of the result data 705 after filter application. When the coordinate filter shown in FIGS. 11 and 12 is used, the target selection unit 106 need not acquire the neighboring words of the target expressions.
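 A compact sketch of the coordinate filter of FIGS. 11 and 12 using the assumed Gaussian-kernel density above; the values of η and the threshold are illustrative:

    import math

    def gaussian_kde(points, eta):
        d = len(points[0])
        norm = (2.0 * math.pi * eta ** 2) ** (d / 2.0)
        def p(x):
            return sum(math.exp(-sum((a - b) ** 2 for a, b in zip(x, q)) / (2.0 * eta ** 2))
                       for q in points) / (len(points) * norm)
        return p

    def apply_coordinate_filter(targets, eta=40.0, threshold=1e-5):
        pts = [t["coords"] for t in targets]       # e.g. centre coordinates (x, y) of each target
        p = gaussian_kde(pts, eta)                 # S1103: learned density model
        return [t for t in targets if p(t["coords"]) >= threshold]   # S1202-S1203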
 図13は、フィルタ学習部702によるフィルタ学習処理の第三の例を示す。本例は、複数のフィルタモデルを結合するフィルタ学習処理である。フィルタ学習部702は、対象データ701と複数のフィルタモデルを取得する(S1301)。 FIG. 13 shows a third example of filter learning processing by the filter learning unit 702. This example is a filter learning process that combines a plurality of filter models. The filter learning unit 702 acquires target data 701 and a plurality of filter models (S1301).
 The filter learning unit 702 initializes a filter combination model generated from the acquired filter models (S1302). The filter combination model can use, for example, machine learning such as linear discrimination, a support vector machine, or a decision tree, taking as input the values output by each filter model or numerical representations of their determination results. For example, when the filter combination model is defined as a weighted sum of a plurality of filter models, the filter learning unit 702 initializes the weights when initializing the filter combination model.
 フィルタ学習部702は、正誤情報、又は重み情報に基づき、フィルタ結合モデルを学習する(S1303)。以下、フィルタ結合モデルに線形識別が用いられる例を説明する。フィルタ学習部702は、下記の不等式が成立する場合にフィルタすると判定し、成立しない場合にフィルタしないと判定する。 The filter learning unit 702 learns the filter combination model based on the correct / incorrect information or the weight information (S1303). Hereinafter, an example in which linear identification is used for the filter combination model will be described. The filter learning unit 702 determines that filtering is performed when the following inequality is satisfied, and determines that filtering is not performed when the following inequality is not satisfied.
[Math 12]
 In the linear discrimination indicated by the above inequality, the filter learning unit 702 calculates the inner product S of the score vector X, whose elements are the output values of the respective filter models, and a real-valued vector W set for the filter models, and compares the calculated inner product S with a threshold U. Hereinafter, the inner product S is referred to as the output value of the filter combination model.
 The filter learning unit 702 may accept input of correct/incorrect information on the filter results from the user. Based on the input correct/incorrect information (let T be the matrix form of the correct/incorrect information), the filter learning unit 702 may reset W to an appropriate value by optimizing an evaluation function E, for example the sum-of-squares error shown by the following formula.
[Math 13]
 When weight information is given by the user, the filter learning unit 702 may set the weight information as the real-valued matrix W. When correct/incorrect information is given together with weight information (let V be the matrix form of the weight information), the filter learning unit 702 may set V as a weight on the real-valued matrix W in the evaluation function, as in the following formula, and perform the optimization.
[Math 14]
 The filter learning unit 702 may also repeat a process of again accepting input of correct/incorrect information on the filter results obtained with the reset W, and resetting W based on the newly accepted correct/incorrect information. The filtering method described above is not limited to linear discrimination and can be applied as long as the discrimination model and its evaluation function are appropriately defined.
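 A sketch of the linear-discrimination combination of FIGS. 13 and 14; fitting W by plain gradient descent on the squared error, and the learning rate and threshold values, are assumptions for illustration:

    def dot(w, x):
        return sum(wi * xi for wi, xi in zip(w, x))

    def fit_combination_weights(score_vectors, labels, lr=0.01, epochs=200):
        # score_vectors: one vector X of individual filter outputs per training item;
        # labels: 1 = should be kept, 0 = should be filtered out (correct/incorrect info T)
        w = [0.0] * len(score_vectors[0])
        for _ in range(epochs):
            for x, t in zip(score_vectors, labels):
                err = dot(w, x) - t                     # gradient of the squared error
                w = [wi - lr * err * xi for wi, xi in zip(w, x)]
        return w

    def apply_combined_filter(items, filter_models, w, threshold=0.5):
        kept = []
        for item in items:
            x = [f(item) for f in filter_models]        # S1402: output of each filter model
            if dot(w, x) >= threshold:                  # S1403-S1404: compare W . X with U
                kept.append(item)                       # S1405: part of the result data
        return kept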
 図14は、フィルタ適用部704によるフィルタ適用処理の第三の例を示す。本例は、フィルタ結合モデルにおけるフィルタの適用処理である。 FIG. 14 shows a third example of filter application processing by the filter application unit 704. This example is a filter application process in the filter combination model.
 フィルタ適用部704は、対象データ701、複数のフィルタモデル、及び当該複数のフィルタモデルが結合されたフィルタ結合モデルを取得する(S1401)。続いて、フィルタ適用部704は、対象データ701を取得した各フィルタモデルに入力し、各フィルタモデルの出力値を取得する(S1402)。 The filter application unit 704 acquires the target data 701, a plurality of filter models, and a filter combination model obtained by combining the plurality of filter models (S1401). Subsequently, the filter application unit 704 inputs the target data 701 to each acquired filter model, and acquires the output value of each filter model (S1402).
 続いて、フィルタ適用部704は、S1402で算出した各フィルタモデルの出力値を、フィルタ結合モデルに入力し、フィルタ結合モデルの出力値を取得する(S1403)。続いて、フィルタ適用部704は、フィルタ結合モデルの出力値が、例えば閾値U以上であるか否かを判定する(S1404)。フィルタ結合モデルの出力値が、閾値Uより小さい場合(S1404:no)、処理を終了する。 Subsequently, the filter application unit 704 inputs the output value of each filter model calculated in S1402 to the filter combination model, and acquires the output value of the filter combination model (S1403). Subsequently, the filter application unit 704 determines whether or not the output value of the filter combination model is, for example, greater than or equal to the threshold value U (S1404). If the output value of the filter combination model is smaller than the threshold value U (S1404: no), the process ends.
 フィルタ結合モデルの出力値が、閾値U以上である場合(S1404:yes)、フィルタ適用部704は、フィルタ適用後の結果データ705を出力して終了する(S1405)。 When the output value of the filter combination model is greater than or equal to the threshold value U (S1404: yes), the filter application unit 704 outputs the result data 705 after the filter application and ends (S1405).
 図15は、利用者へのユーザインターフェースの第一の例を示す。ユーザインターフェース1500は、例えば、対象ID入力セクション1501、対象情報入力セクション1502、フィルタ調整用のチェックボックス1503~1505、抽出結果表示セクション1506、及び正誤指定セクション1507を含む。 FIG. 15 shows a first example of a user interface to the user. The user interface 1500 includes, for example, a target ID input section 1501, a target information input section 1502, filter adjustment check boxes 1503 to 1505, an extraction result display section 1506, and a correct / incorrect specification section 1507.
 対象ID入力セクション1501は、例えば、文データ300に含まれる文ID、文書ID、及び対象選定結果データブロック600に含まれる対象ID等の入力を受け付ける。対象情報入力セクション1502は、例えば対象情報109の入力を受け付ける。 The target ID input section 1501 accepts input of, for example, a sentence ID included in the sentence data 300, a document ID, and a target ID included in the target selection result data block 600. The target information input section 1502 accepts input of target information 109, for example.
 チェックボックス1503~1504は、学習及び適用するフィルタを選択するためのチェックボックスである。例えば、チェックボックス1503は座標によるフィルタ、チェックボックス1504は単語によるフィルタ、を選択するためのチェックボックスである。利用者は、例えば、チェックボックス1503、及びチェックボックス1504の双方にチェックを入れることにより、例えば、座標によるフィルタと単語によるフィルタとを結合したフィルタ結合モデルが選択することができる。チェックボックス1505は、正誤判定結果より自動的に学習を行うか否かを選択するためのチェックボックスである。 Check boxes 1503 to 1504 are check boxes for selecting a filter to be learned and applied. For example, a check box 1503 is a check box for selecting a filter by coordinates, and a check box 1504 is a filter for selecting a word. For example, by checking both the check box 1503 and the check box 1504, the user can select, for example, a filter combination model in which a filter by coordinates and a filter by words are combined. A check box 1505 is a check box for selecting whether or not learning is automatically performed based on the correctness determination result.
 The extraction result display section 1506 lists the extraction results after filter application. The extraction result display section 1506 displays, for example, the target expressions included in the extraction results, the neighboring words of those target expressions, and the full text of the target sentences containing those target expressions. The extraction result display section 1506 may also display, for example, the coordinates of the target expressions. The extraction result display section 1506 is displayed, for example, in list form, and the display order in the list may follow the values calculated by the filter unit 107 at filter application (for example, values such as R(w_i)). The correct/incorrect designation section 1507 accepts, for example, input of the user's correct/incorrect judgment as to whether an extraction result is appropriate.
 図16は、利用者へのユーザインターフェースの第二の例を示す。ユーザインターフェース1600は、ユーザインターフェース1500の構成に加え、例えば、フィルタ調整セクション1601~1602を含む。 FIG. 16 shows a second example of the user interface to the user. The user interface 1600 includes, for example, filter adjustment sections 1601 to 1602 in addition to the configuration of the user interface 1500.
 フィルタ調整セクション1601~1602は、フィルタ学習及びフィルタ適用に関する情報の入力を受け付ける。フィルタ調整セクション1601は、例えば、線形識別によるフィルタ結合モデルにおける座標の重みの初期値の入力を受け付ける。フィルタ調整セクション1602は、例えば、線形識別によるフィルタ結合モデルにおける単語の重みの初期値の入力を受け付ける。 The filter adjustment sections 1601 to 1602 accept input of information related to filter learning and filter application. The filter adjustment section 1601 accepts input of initial values of coordinate weights in a filter combination model based on linear identification, for example. The filter adjustment section 1602 accepts input of initial values of word weights in a filter combination model based on linear identification, for example.
 By configuring the user interface as shown in FIG. 15 or FIG. 16, the user can give appropriate target information for any document, sentence, or extraction result, and can perform information extraction while adjusting the filters. The user can also designate correct/incorrect judgments based on the extraction results, and can change the target information according to the extraction results.
 以上、本実施例の情報抽出システム101によって、利用者は抽出対象の単語等を事前に調べることなく、試行錯誤的に情報抽出を行うことができる。つまり情報抽出システム101は、事前に辞書やHTML等の論理構造に依存せず、多様な非定形文書から、利用者が必要とする情報を高精度に抽出することができる。 As described above, the information extraction system 101 according to the present embodiment enables the user to perform information extraction on a trial and error basis without checking the extraction target word or the like in advance. That is, the information extraction system 101 can extract information required by the user with high accuracy from various non-standard documents without depending on a logical structure such as a dictionary or HTML in advance.
 図17には、情報抽出システムの第二の構成例を示す。情報抽出システム1701は、例えば、実施例1の情報抽出システム101と同様の構成を含む。情報抽出システム1701は、以下の点において、実施例1の情報抽出システム101と異なる。対象選定部106が、対象文書102の入力を受け付け、対象文書102から対象情報109に合致する文及び座標を選定し、文抽出部103及び座標抽出部104に対象選定結果を送信する。文抽出部103/座標抽出部104は、対象文書102ではなく対象選定結果から文/座標抽出を行う。 FIG. 17 shows a second configuration example of the information extraction system. The information extraction system 1701 includes, for example, the same configuration as the information extraction system 101 of the first embodiment. The information extraction system 1701 is different from the information extraction system 101 of the first embodiment in the following points. The target selection unit 106 receives input of the target document 102, selects sentences and coordinates that match the target information 109 from the target document 102, and transmits the target selection results to the sentence extraction unit 103 and the coordinate extraction unit 104. The sentence extraction unit 103 / coordinate extraction unit 104 performs sentence / coordinate extraction from the target selection result instead of the target document 102.
 このように、情報抽出システム1701を構成することで、情報抽出システム1701は、利用者からの対象情報109に基づき、適切に情報抽出結果110を出力することができる。また、情報抽出システム1701は、情報抽出結果110を入力として、新たに対象情報を設定して情報抽出を行うことができる。 Thus, by configuring the information extraction system 1701, the information extraction system 1701 can appropriately output the information extraction result 110 based on the target information 109 from the user. In addition, the information extraction system 1701 can extract information by setting the target information anew using the information extraction result 110 as an input.
 By configuring the information extraction system 1701 as described above, a system, method, and program can be realized that allow the user to extract information by trial and error without examining the words to be extracted in advance. As a result, the information extraction system 1701 can extract the information the user needs with high accuracy from a variety of non-standard documents, without depending on a dictionary prepared in advance or on a logical structure such as HTML.
 なお、本発明は上記した実施例に限定されるものではなく、様々な変形例が含まれる。例えば、上記した実施例は本発明を分かりやすく説明するために詳細に説明したものであり、必ずしも説明した全ての構成を備えるものに限定されるものではない。また、ある実施例の構成の一部を他の実施例の構成に置き換えることも可能であり、また、ある実施例の構成に他の実施例の構成を加えることも可能である。また、各実施例の構成の一部について、他の構成の追加・削除・置換をすることが可能である。 In addition, this invention is not limited to the above-mentioned Example, Various modifications are included. For example, the above-described embodiments have been described in detail for easy understanding of the present invention, and are not necessarily limited to those having all the configurations described. Further, a part of the configuration of a certain embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to the configuration of a certain embodiment. Further, it is possible to add, delete, and replace other configurations for a part of the configuration of each embodiment.
 また、上記の各構成、機能、処理部、処理手段等は、それらの一部又は全部を、例えば集積回路で設計する等によりハードウェアで実現してもよい。また、上記の各構成、機能等は、プロセッサがそれぞれの機能を実現するプログラムを解釈し、実行することによりソフトウェアで実現してもよい。各機能を実現するプログラム、テーブル、ファイル等の情報は、メモリや、ハードディスク、SSD(Solid State Drive)等の記録装置、または、ICカード、SDカード、DVD等の記録媒体に置くことができる。 Each of the above configurations, functions, processing units, processing means, and the like may be realized partly or entirely in hardware, for example by designing them as integrated circuits. Each of the above configurations, functions, and the like may also be realized in software by having a processor interpret and execute a program that implements the corresponding function. Information such as the programs, tables, and files that implement each function can be stored in a memory, in a recording device such as a hard disk or SSD (Solid State Drive), or on a recording medium such as an IC card, SD card, or DVD.
 また、制御線や情報線は説明上必要と考えられるものを示しており、製品上必ずしも全ての制御線や情報線を示しているとは限らない。実際には殆ど全ての構成が相互に接続されていると考えてもよい。 The control lines and information lines shown are those considered necessary for the explanation; not all control lines and information lines in an actual product are necessarily shown. In practice, almost all components may be considered to be connected to one another.

Claims (10)

  1.  対象文書から情報を抽出する情報抽出システムであって、
     プログラムを実行するプロセッサと、前記プロセッサがアクセスするメモリと、を含み、
     前記プロセッサは、情報抽出処理を行い、
     前記情報抽出処理において、
     抽出対象の文字列の集合を示す対象情報の入力を受け付け、
     前記対象情報に含まれる文字列のいずれかに合致する文字列である対象表現と、前記対象表現それぞれの所定距離以内に配置された単語である近傍語と、を前記対象文書から抽出し、
     前記近傍語それぞれの前記対象文書中の出現頻度、又は前記対象表現それぞれの前記対象文書中の座標、に基づく教師なし学習を用いてフィルタを生成し、
     前記近傍語を含むフィルタ適用対象語集合に、前記フィルタを適用し、
     前記フィルタ適用対象語集合に前記フィルタを適用して得られた抽出対象語集合を出力する、情報抽出システム。
    An information extraction system that extracts information from a target document,
    A processor that executes a program; and a memory that is accessed by the processor;
    The processor performs information extraction processing,
    In the information extraction process,
    Accept input of target information indicating a set of character strings to be extracted,
    Extracting from the target document a target expression that is a character string that matches any of the character strings included in the target information, and neighboring words that are words arranged within a predetermined distance of each of the target expressions,
    Generating a filter using unsupervised learning based on the appearance frequency of each of the neighborhood words in the target document, or the coordinates in the target document of each of the target expressions,
    Applying the filter to a filter application target word set including the neighboring words;
    An information extraction system that outputs an extraction target word set obtained by applying the filter to the filter application target word set.
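As an illustration of the flow recited in claim 1 (this sketch is not part of the claims; the function names, the token-level matching that stands in for character-string matching, and the fixed distance window are assumptions made for brevity):

```python
def extract_neighbors(document_tokens, target_info, window=3):
    """Collect the target expressions (tokens matching the target information)
    and the neighboring words placed within `window` positions of each one."""
    target_expressions, neighbors = [], []
    for i, token in enumerate(document_tokens):
        if token in target_info:
            target_expressions.append(token)
            lo, hi = max(0, i - window), min(len(document_tokens), i + window + 1)
            neighbors.extend(t for t in document_tokens[lo:hi] if t != token)
    return target_expressions, neighbors

def run_extraction(document_tokens, target_info, build_filter):
    """`build_filter` stands for any unsupervised routine (frequency- or
    coordinate-based, as in the later claims) that turns the neighboring words
    into a keep/discard predicate; applying it to the filter application target
    word set yields the extraction target word set."""
    _, neighbors = extract_neighbors(document_tokens, target_info)
    keep = build_filter(neighbors)
    candidate_set = set(neighbors)          # filter application target word set
    return {w for w in candidate_set if keep(w)}
```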
  2.  請求項1に記載の情報抽出システムであって、
     前記プロセッサは、
     前記フィルタの生成において、前記近傍語それぞれが、抽出対象語であるか、抽出非対象であるフィルタ語であるか、を示すフラグそれぞれの設定処理を繰り返し、
     前記設定処理において、
      前記近傍語それぞれのフラグを取得し、
      前記近傍語のフラグに対する同時確率が収束したと判定した場合、前記近傍語それぞれのフラグに従って、前記近傍語それぞれが抽出対象語であるかフィルタ語であるかを決定して、前記設定処理を終了し、
      前記同時確率が収束していないと判定した場合、
      抽出対象語であることを示すフラグに対応する近傍語の前記対象文書中の総出現頻度のうち、前記近傍語それぞれの前記対象文書中の出現頻度が占める割合に基づいて、前記近傍語それぞれについて当該近傍語が抽出対象語である第1確率を算出し、
      フィルタ語であることを示すフラグに対応する近傍語の前記対象文書中の総出現頻度のうち、前記近傍語それぞれの前記対象文書中の出現頻度が占める割合に基づいて、前記近傍語それぞれについて当該近傍語がフィルタ語である第2確率を算出し、
      前記近傍語それぞれの第1確率と第2確率との比に基づいて、次回の設定処理における前記近傍語それぞれのフラグを決定し、
     前記フィルタの適用において、前記フィルタ適用対象語集合から、前記決定した抽出対象語を抽出する、情報抽出システム。
    The information extraction system according to claim 1,
    The processor is
    In the generation of the filter, the setting process of each flag indicating whether each of the neighboring words is an extraction target word or a filter word that is a non-extraction target is repeated,
    In the setting process,
    Acquiring the flag of each of the neighboring words,
    When it is determined that the joint probability for the flags of the neighboring words has converged, determining, according to the flag of each of the neighboring words, whether each of the neighboring words is an extraction target word or a filter word, and ending the setting process;
    If it is determined that the joint probability has not converged,
    Calculating, for each of the neighboring words, a first probability that the neighboring word is an extraction target word, based on the proportion that the appearance frequency in the target document of the neighboring word occupies in the total appearance frequency in the target document of the neighboring words whose flags indicate extraction target words;
    Calculating, for each of the neighboring words, a second probability that the neighboring word is a filter word, based on the proportion that the appearance frequency in the target document of the neighboring word occupies in the total appearance frequency in the target document of the neighboring words whose flags indicate filter words;
    Determining, based on the ratio between the first probability and the second probability of each of the neighboring words, the flag of each of the neighboring words for the next setting process,
    An information extraction system for extracting the determined extraction target word from the filter application target word set in the application of the filter.
  3.  請求項2に記載の情報抽出システムであって、
     前記プロセッサは、前記設定処理において、前記近傍語それぞれの間の類似度に基づいて、前記近傍語それぞれの第1確率を算出する、情報抽出システム。
    The information extraction system according to claim 2,
    The information extraction system, wherein the processor calculates a first probability of each of the neighboring words based on a similarity between the neighboring words in the setting process.
  4.  請求項2に記載の情報抽出システムであって、
     前記同時確率は、下記数式で表され、
    [Math. 1 (JPOXMLDOC01-appb-M000001): the formula is shown as an image in the original publication]
 上記数式における、i及びjは近傍語の個数以下の自然数を、w_i及びw_jは近傍語を、χ_iは近傍語w_iの前記フラグを、χ_jは近傍語w_jの前記フラグを、π_ijは単語w_iが単語w_jから派生している確率を、d_m(w_i,w_j)は近傍語w_iと近傍語w_jの類似度を、P(w_j|χ_j=1)はχ_j=1である全ての近傍語の総出現頻度に占める近傍語w_jの出現頻度の割合を、P(w_i|χ_i=0)はχ_i=0である全ての近傍語の総出現頻度に占める近傍語w_iの出現頻度の割合を、P(d_m(w_i,w_j)|θ_j)は所定の確率分布の確率密度関数であって、パラメータがθ_jである確率密度関数において確率変数がd_m(w_i,w_j)であるときの確率を、示す、情報抽出システム。
    The information extraction system according to claim 2,
    The joint probability is expressed by the following formula:
    [Math. 1 (JPOXMLDOC01-appb-M000001): the formula is shown as an image in the original publication]
    In the above formula, i and j are natural numbers not greater than the number of neighboring words, w_i and w_j are neighboring words, χ_i is the flag of the neighboring word w_i, χ_j is the flag of the neighboring word w_j, π_ij is the probability that the word w_i is derived from the word w_j, d_m(w_i, w_j) is the similarity between the neighboring words w_i and w_j, P(w_j | χ_j = 1) is the proportion of the appearance frequency of the neighboring word w_j in the total appearance frequency of all neighboring words with χ_j = 1, P(w_i | χ_i = 0) is the proportion of the appearance frequency of the neighboring word w_i in the total appearance frequency of all neighboring words with χ_i = 0, and P(d_m(w_i, w_j) | θ_j) is the probability density function of a predetermined probability distribution with parameter θ_j evaluated at the value d_m(w_i, w_j) of its random variable, the information extraction system.
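The published formula itself is available only as an image, so the sketch below should be read as an approximation of claims 2 to 4 rather than a transcription of the equation: neighboring words are iteratively flagged as extraction target words or filter words, the first probability combines the frequency share among target-flagged words with the similarity term of claim 3, and the loop stops when the (log) joint probability of the flags stops changing. The initialization, the exact similarity weighting, and the convergence test are all assumptions.

```python
import math
from collections import Counter

def fit_filter(neighbors, similarity, max_iter=50, tol=1e-9):
    """Flag each neighboring word as extraction target (1) or filter word (0).
    `similarity(a, b)` is assumed to return a value in [0, 1]."""
    freq = Counter(neighbors)
    words = list(freq)
    # Assumed initialization: the more frequent half starts as target words.
    ranked = sorted(words, key=freq.get, reverse=True)
    flags = {w: 1 if i < len(ranked) / 2 else 0 for i, w in enumerate(ranked)}
    prev = None
    for _ in range(max_iter):
        targets = [w for w in words if flags[w] == 1] or words
        fillers = [w for w in words if flags[w] == 0] or words
        t_total = sum(freq[w] for w in targets)
        f_total = sum(freq[w] for w in fillers)
        p1, p2, log_joint = {}, {}, 0.0
        for w in words:
            # First probability: frequency share among target-flagged words,
            # weighted by average similarity to the other target words.
            sim = sum(similarity(w, t) for t in targets if t != w) / max(len(targets) - 1, 1)
            p1[w] = (freq[w] / t_total) * (0.5 + 0.5 * sim)
            # Second probability: frequency share among filter-flagged words.
            p2[w] = freq[w] / f_total
            log_joint += math.log(p1[w] if flags[w] == 1 else p2[w])
        if prev is not None and abs(log_joint - prev) < tol:
            break                      # joint probability converged: flags are final
        prev = log_joint
        # Next flags follow the ratio of the first and second probabilities.
        flags = {w: 1 if p1[w] >= p2[w] else 0 for w in words}
    return {w for w in words if flags[w] == 1}
```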
  5.  請求項1に記載の情報抽出システムであって、
     前記フィルタ適用対象語集合は前記対象表現を含み、
     前記プロセッサは、
     前記フィルタの生成において、前記対象表現それぞれの前記対象文書中の座標に基づいて、前記対象文書中の抽出対象である座標を示す確率変数の確率密度関数を推定し、
     前記フィルタの適用において、前記推定した確率密度関数に基づいて、前記対象表現それぞれの座標について当該座標が抽出対象座標である確率を算出し、前記算出した確率が閾値以上である対象表現と当該対象表現の近傍語とを、前記フィルタ適用対象語集合から抽出する、情報抽出システム。
    The information extraction system according to claim 1,
    The filter application target word set includes the target expression,
    The processor is
    In the generation of the filter, estimating, based on the coordinates in the target document of each of the target expressions, a probability density function of a random variable representing the coordinates that are extraction targets in the target document,
    In the application of the filter, calculating, based on the estimated probability density function, for the coordinates of each of the target expressions, a probability that the coordinates are extraction target coordinates, and extracting, from the filter application target word set, the target expressions whose calculated probability is equal to or greater than a threshold and the neighboring words of those target expressions, the information extraction system.
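A minimal sketch of this coordinate-based filter, using a single 2-D Gaussian with diagonal covariance as the "predetermined probability distribution" and comparing its density against the threshold; the choice of distribution, the density-as-probability reading, and all names are assumptions.

```python
import math

def coordinate_filter(occurrences, threshold):
    """occurrences: list of (target_expression, neighbor_words, (x, y)) tuples;
    assumes at least one occurrence.  Fit a 2-D Gaussian to the coordinates of
    all target expressions, then keep the occurrences whose coordinate density
    is at least `threshold`, together with their neighboring words."""
    n = len(occurrences)
    coords = [c for _, _, c in occurrences]
    mx = sum(x for x, _ in coords) / n
    my = sum(y for _, y in coords) / n
    vx = sum((x - mx) ** 2 for x, _ in coords) / n or 1.0   # avoid zero variance
    vy = sum((y - my) ** 2 for _, y in coords) / n or 1.0

    def density(x, y):
        return (math.exp(-((x - mx) ** 2 / (2 * vx) + (y - my) ** 2 / (2 * vy)))
                / (2 * math.pi * math.sqrt(vx * vy)))

    # Occurrences whose coordinates cluster with the bulk of the target
    # expressions pass; outliers (e.g., headers or footers) are filtered out.
    return [(expr, words) for expr, words, (x, y) in occurrences
            if density(x, y) >= threshold]
```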
  6.  請求項5に記載の情報抽出システムであって、
     表示装置をさらに含み、
     前記プロセッサは、前記抽出対象語集合と、前記抽出対象語集合に含まれる対象表現の前記対象文書中の座標と、を前記表示装置に表示する、情報抽出システム。
    The information extraction system according to claim 5,
    Further comprising a display device,
    Wherein the processor displays, on the display device, the extraction target word set and the coordinates in the target document of the target expressions included in the extraction target word set, the information extraction system.
  7.  請求項1に記載の情報抽出システムであって、
     前記プロセッサは、前記対象文書から前記抽出対象語集合に含まれない語を削除した対象文書に対して、前記情報抽出処理を再度行う、情報抽出システム。
    The information extraction system according to claim 1,
    The information extraction system, wherein the processor performs the information extraction process again on a target document obtained by deleting, from the target document, the words that are not included in the extraction target word set.
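The re-application in this claim amounts to a small loop: extract, delete the words that were not extracted, and run the information extraction process again on the reduced document. In the sketch below, `extract` stands for any extraction routine such as the one sketched after claim 1; the number of rounds and the decision to keep the target expressions themselves are assumptions.

```python
def iterate_extraction(document_tokens, target_info, extract, rounds=2):
    """Repeatedly run extraction on a document from which the words not in the
    extraction target word set have been deleted."""
    tokens = list(document_tokens)
    extracted = set()
    for _ in range(rounds):
        extracted = extract(tokens, target_info)
        tokens = [t for t in tokens if t in extracted or t in target_info]
    return extracted
```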
  8.  請求項1に記載の情報抽出システムであって、
     前記プロセッサは、
     前記フィルタの生成において、
      前記教師なし学習に基づいて、複数のフィルタを生成し、
      前記複数のフィルタの所定の重み値による重み付き和である第1フィルタ結合モデルを生成し、
      前記フィルタ適用対象語集合に前記第1フィルタ結合モデルを適用し、
      前記フィルタ適用対象語集合に前記第1フィルタ結合モデルを適用して得られた抽出語集合に含まれる抽出語それぞれの正誤を示す正誤情報の入力を受け付け、
      前記第1フィルタ結合モデルと、前記正誤情報と、に基づいて、新たな重み値を決定し、
      前記複数のフィルタの前記決定した新たな重み値による重み付き和である第2フィルタ結合モデルを生成し、
     前記第2フィルタ結合モデルは前記適用するフィルタである、情報抽出システム。
    The information extraction system according to claim 1,
    The processor is
    In generating the filter,
    Generating a plurality of filters based on the unsupervised learning;
    Generating a first filter combination model that is a weighted sum of predetermined weight values of the plurality of filters;
    Applying the first filter combination model to the filter application target word set;
    Receiving correct / incorrect information indicating the correctness of each extracted word included in the extracted word set obtained by applying the first filter combination model to the filter application target word set;
    Determining a new weight value based on the first filter combination model and the correctness information;
    Generating a second filter combination model that is a weighted sum of the determined new weight values of the plurality of filters;
    The information extraction system, wherein the second filter combination model is the filter to be applied.
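A sketch of the two-stage combination in this claim: several filters are combined as a weighted sum, the extracted words are shown to the user, and the correct/incorrect feedback is used to fit new weights for a second combination model. The perceptron-style update, the 0.5 decision threshold, and all names are assumptions; the claim itself does not specify how the new weights are determined from the feedback.

```python
def combine_filters(filters, weights, word):
    """Weighted sum of per-filter scores (each filter returns a score in [0, 1])."""
    return sum(w * f(word) for f, w in zip(filters, weights))

def refit_weights(filters, weights, feedback, lr=0.1):
    """feedback: dict mapping extracted words to True (correct) / False (wrong).
    One perceptron-style pass nudges the weights toward the user's judgments."""
    new_weights = list(weights)
    for word, correct in feedback.items():
        predicted = combine_filters(filters, new_weights, word) >= 0.5
        if predicted != correct:
            direction = 1.0 if correct else -1.0
            new_weights = [w + lr * direction * f(word)
                           for f, w in zip(filters, new_weights)]
    return new_weights

def apply_second_model(filters, new_weights, candidate_words):
    """The second filter combination model is the filter that is finally applied."""
    return {w for w in candidate_words
            if combine_filters(filters, new_weights, w) >= 0.5}
```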
  9.  情報抽出システムが、対象文書から情報を抽出する方法であって、
     前記情報抽出システムは、プログラムを実行するプロセッサと、前記プロセッサがアクセスするメモリと、を含み、
     前記方法は、前記情報抽出システムが、
     抽出対象の文字列の集合を示す対象情報の入力を受け付け、
     前記対象情報に含まれる文字列のいずれかに合致する文字列である対象表現と、前記対象表現それぞれの所定距離以内に配置された単語である近傍語と、を前記対象文書から抽出し、
     前記近傍語それぞれの前記対象文書中の出現頻度、又は前記対象表現それぞれの前記対象文書中の座標、に基づく教師なし学習を用いてフィルタを生成し、
     前記近傍語を含むフィルタ適用対象語集合に、前記フィルタを適用し、
     前記フィルタ適用対象語集合に前記フィルタを適用した結果データを出力する、方法。
    A method by which an information extraction system extracts information from a target document,
    The information extraction system includes a processor that executes a program, and a memory that the processor accesses,
    In the method, the information extraction system:
    Accept input of target information indicating a set of character strings to be extracted,
    Extracting from the target document a target expression that is a character string that matches any of the character strings included in the target information, and neighboring words that are words arranged within a predetermined distance of each of the target expressions,
    Generating a filter using unsupervised learning based on the appearance frequency of each of the neighborhood words in the target document, or the coordinates in the target document of each of the target expressions,
    Applying the filter to a filter application target word set including the neighboring words;
    A method of outputting result data obtained by applying the filter to the filter application target word set.
  10.  対象文書からの情報抽出を、コンピュータに実行させるプログラムを保持する、コンピュータ読み取り可能な非一時的記録媒体であって、
     前記コンピュータは、プログラムを実行するプロセッサと、前記プロセッサがアクセスするメモリと、を含み、
     前記プログラムは、
     抽出対象の文字列の集合を示す対象情報の入力を受け付ける手順と、
     前記対象情報に含まれる文字列のいずれかに合致する文字列である対象表現と、前記対象表現それぞれの所定距離以内に配置された単語である近傍語と、を前記対象文書から抽出する手順と、
     前記近傍語それぞれの前記対象文書中の出現頻度、又は前記対象表現それぞれの前記対象文書中の座標、に基づく教師なし学習を用いてフィルタを生成する手順と、
     前記近傍語を含むフィルタ適用対象語集合に、前記フィルタを適用する手順と、
     前記フィルタ適用対象語集合に前記フィルタを適用した結果データを出力する手順と、を前記コンピュータに実行させる、コンピュータ読み取り可能な非一時的記録媒体。
    A computer-readable non-transitory recording medium holding a program for causing a computer to perform information extraction from a target document,
    The computer includes a processor that executes a program, and a memory that the processor accesses,
    The program comprises:
    A procedure for receiving input of target information indicating a set of character strings to be extracted;
    A procedure for extracting, from the target document, a target expression that is a character string that matches any of the character strings included in the target information, and neighboring words that are words arranged within a predetermined distance of each of the target expressions;
    Generating a filter using unsupervised learning based on the appearance frequency of each of the neighborhood words in the target document or the coordinates in the target document of each of the target expressions;
    Applying the filter to a filter application target word set including the neighboring words;
    A computer-readable non-transitory recording medium that causes the computer to execute a procedure of outputting result data obtained by applying the filter to the filter application target word set.
PCT/JP2015/065594 2015-05-29 2015-05-29 Information extraction system, information extraction method, and recording medium WO2016194054A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2015/065594 WO2016194054A1 (en) 2015-05-29 2015-05-29 Information extraction system, information extraction method, and recording medium
JP2017521323A JP6334062B2 (en) 2015-05-29 2015-05-29 Information extraction system, information extraction method, and recording medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2015/065594 WO2016194054A1 (en) 2015-05-29 2015-05-29 Information extraction system, information extraction method, and recording medium

Publications (1)

Publication Number Publication Date
WO2016194054A1 true WO2016194054A1 (en) 2016-12-08

Family

ID=57441961

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2015/065594 WO2016194054A1 (en) 2015-05-29 2015-05-29 Information extraction system, information extraction method, and recording medium

Country Status (2)

Country Link
JP (1) JP6334062B2 (en)
WO (1) WO2016194054A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000259670A (en) * 1999-03-12 2000-09-22 Dainippon Printing Co Ltd Document analysis system and recording medium
US20100145678A1 (en) * 2008-11-06 2010-06-10 University Of North Texas Method, System and Apparatus for Automatic Keyword Extraction
JP2013140499A (en) * 2012-01-05 2013-07-18 Nippon Telegr & Teleph Corp <Ntt> Method, apparatus, and program for extracting word

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060080321A1 (en) * 2004-09-22 2006-04-13 Whenu.Com, Inc. System and method for processing requests for contextual information
JP5232449B2 (en) * 2007-11-21 2013-07-10 Kddi株式会社 Information retrieval apparatus and computer program

Also Published As

Publication number Publication date
JP6334062B2 (en) 2018-05-30
JPWO2016194054A1 (en) 2017-08-31

Similar Documents

Publication Publication Date Title
DE102017005880A1 (en) Replacement based on optical similarity
CN110765770A (en) Automatic contract generation method and device
JP2019008778A (en) Captioning region of image
RU2613846C2 (en) Method and system for extracting data from images of semistructured documents
CN111428457B (en) Automatic formatting of data tables
CN111512315A (en) Block-wise extraction of document metadata
US9286526B1 (en) Cohort-based learning from user edits
WO2019224891A1 (en) Classification device, classification method, generation method, classification program, and generation program
US9946813B2 (en) Computer-readable recording medium, search support method, search support apparatus, and responding method
JP6492880B2 (en) Machine learning device, machine learning method, and machine learning program
CN111488732B (en) Method, system and related equipment for detecting deformed keywords
US20190303437A1 (en) Status reporting with natural language processing risk assessment
JP2019125353A (en) Method for inferring blocks of text in electronic documents
CN108701126B (en) Theme estimation device, theme estimation method, and storage medium
US9437020B2 (en) System and method to check the correct rendering of a font
JP6334062B2 (en) Information extraction system, information extraction method, and recording medium
JP6303669B2 (en) Document retrieval device, document retrieval system, document retrieval method, and program
JP7448132B2 (en) Handwritten structural decomposition
Chowdhury et al. Implementation of an optical character reader (ocr) for bengali language
CN108733637B (en) Information processing apparatus, information processing method, and computer program
WO2014030258A1 (en) Morphological analysis device, text analysis method, and program for same
JP4545614B2 (en) Document classification program and document classification apparatus
JP2007018158A (en) Character processor, character processing method, and recording medium
US20220092260A1 (en) Information output apparatus, question generation apparatus, and non-transitory computer readable medium
WO2022215433A1 (en) Information representation structure analysis device, and information representation structure analysis method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15894089

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2017521323

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15894089

Country of ref document: EP

Kind code of ref document: A1