WO2012101169A1

WO2012101169A1 - Automatic extraction of information about semantic relationships from a document pool using a neural system

Info

Publication number: WO2012101169A1
Application number: PCT/EP2012/051134
Authority: WO
Inventors: Eckart Schröder-Bergen; Solveig HOFMANN; Maria Winkler
Original assignee: SUPERWISE Technologies AG
Priority date: 2011-01-25
Filing date: 2012-01-25
Publication date: 2012-08-02
Also published as: DE102011009378A1

Abstract

The present invention relates to a method for the automated production and/or recognition of context information in plain-text portfolios by means of a computer using a neural network, wherein the neural network is structured on a plurality of levels, comprising the following steps: plain-text portfolios with an arbitrary multiplicity of words are supplied, particularly on a scale of more than 10 million lines of text and preferably 100 million lines of text; individual neurons (2) are associated with words on at least one first level (3), wherein individual neurons are associated with a specific word for training and updating purposes using association relations between words from the plain-text portfolio and wherein every new word has an additional word neuron (2) created for it; individual neurons (5, 7, 9, 11) are associated with at least one group of words by means of synapses on at least one further level (4, 6, 8, 10), wherein different words in a respective textual context can be updated by means of new and/or existing association relations which have been produced between said words, and wherein additional synapses are produced in the neural network, wherein, for the purpose of representing the association relations when training with plain text, the synapses have a respective updatable relevance value associated with them which defines the strength of the association relation; association relations are formed between each word from a unit of a plain-text portfolio and all further words from this text unit by setting up synapse connections between the neurons which are associated with the respective words from the first and/or further level(s), wherein the association relations which have been freshly produced and/or which have had their relevance value updated are stored in a temporary memory; and the association relations and synapse connections to the neurons on the levels and between the levels are stored in a permanent memory after a predetermined time to form structured knowledge in a database in which a search query using the association relations and synapse connections can be used to perform a search for any words and the corresponding context.

Description

Automatic extraction of information about semantic relationships from a pool of documents with a neural system

The present invention relates to a method for the automated generation and / or recognition of context information in plain text files by means of a computer using a neural network, wherein the neural network is structured in several levels, so-called layers.

Numerous methods for the automated classification of texts are known from the prior art. The existing procedures each relate to individual words, each word being related to its textual environment. This achieves basic forms of a context analysis.

However, the common methods have the disadvantage that the semantic relations between different words are generally unknown. Tabular knowledge about related word forms and synonyms does exist. Because of the large number of possible pair relationships between words, however, this knowledge remains incomplete, as it could not previously be generated by an automated, mass-based method.

A particular problem arises from specialized language areas. On the one hand, these are culturally, temporally or spatially determined, as for example in dialects. Of particular importance, on the other hand, are the subject-specific languages of the different specialist fields. Since the ability to adapt systems for automatic context analysis in a simple way to specific linguistic areas has so far been lacking, their function remains severely limited.

The object of the invention is therefore to provide a method with which, in a simple and advantageous manner and using a neural network, a pool of formalized basic knowledge about meaning relations between different words, i.e. so-called "association knowledge", is generated.

This object is achieved by a method having the features according to claim 1.

The inventive method for the automated generation and / or recognition of context information in plain text files by means of a computer using a neural network, wherein the neural network is structured in several levels, comprises the following steps:

- supplying plain text files with any number of words, in particular with a volume of more than 10 million lines of text and preferably 100 million lines of text;

- assigning individual neurons to words in at least a first level, wherein the assignment of individual neurons to a specific word is done by training and updating using association relations between words of the plain text stock, and wherein an additional word neuron is created for each new word;

- assigning individual neurons to at least one group of words by means of synapses in at least one further level, wherein different words in a respective textual context can be updated by means of new and / or existing association relations generated between these words, and additional synapses are generated in the neural network, wherein the synapses, for representing the association relations during training with plain text, are each assigned an updatable relevance value which defines the strength of the association relation;

- forming association relations between each word of a unit of a plain text stock and all other words of this text unit by establishing synapse connections between the neurons which are assigned to the respective words from the first and / or further level(s), wherein the association relations which are newly created and / or updated in their relevance value are stored in a temporary memory; and

- storing the association relations and synapse connections to the neurons in the levels and between the levels in a permanent memory after a predeterminable time to form structured knowledge in a database, in which a search for arbitrary words and the corresponding context can be carried out by means of the association relations and synapse connections.

This advantageously makes it possible to search automatically for contexts that are described with synonymous or related words. Furthermore, the search for unknown terms is possible. It is also possible to automatically generate taxonomies, i.e. classification schemes. Existing or generated formalized association knowledge is to be extended with the help of special text corpora, so that texts from special language areas can be optimally treated and a knowledge base can thus be created. In addition, text documents are to be sorted automatically into different thematic categories. It is also possible to generate statistics based on plain text sources and to place plain text entries in a database grid. For automated translation systems, the appropriate translation for an ambiguous word can be found. On the device side, the method steps can be carried out on a system on whose memory and processing power only low demands need to be made.

In particular, the context information is automatically recognized in plain text stocks. This is done using the neural network, where the assignment to the respective context is realized via association relations between different words. Each such word-pair relationship has the meaning: "Word B also comes to mind with word A." The association relations are represented by synapses in the neural network. In a first level of the neural network there are neurons which are each associated with a particular word. This level of the network is referred to as a "word layer". Whenever a previously unknown word is found during a training process, another neuron is advantageously created automatically in the word layer and is permanently assigned to this new word.

The association relationships are advantageously generated automatically from the trained texts. This determines which words occur together in a sentence. If the words A and B occur in a sentence, a relationship A - B and a relationship B - A are built up or amplified.

Each association relationship also has a quantitative relevance value that indicates the strength of the coupling between the words. For association relationships with a comparatively high relevance value, the words therefore have a particularly close relationship to one another.
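The pairwise build-up described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the dictionary representation, the function name `train_sentence` and the increment `REINFORCE` are assumptions, since the description gives no numeric values.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical increment; the description does not specify a number.
REINFORCE = 1.0

def train_sentence(relevance, words):
    """Build or amplify A -> B and B -> A for all distinct words of one sentence."""
    for a, b in permutations(sorted(set(words)), 2):
        relevance[(a, b)] += REINFORCE

# Directed relation (source word, target word) -> relevance value.
relevance = defaultdict(float)
train_sentence(relevance, ["adenauer", "influence", "chancellor"])
train_sentence(relevance, ["adenauer", "chancellor"])
```

After these two sentences the relations between "adenauer" and "chancellor" carry twice the relevance value of the relations involving "influence", in both directions.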

Advantageous embodiments of the invention are the subject of the dependent claims.

Preferably, the assignment of individual neurons and the formation of association relations and synapse connections takes place within a subsection of the plaintext, in particular within a sentence, a text line or a tabular element. Thus, for example, an association relation is formed between each word of a text section or sentence and all other words of the text unit. This happens through the formation of synapses between the neurons associated with the words.

Preferably, individual neurons in the at least one further level are assigned to at least one word family with the same word stem, wherein synapse connections are established between these neurons and the neurons of all words belonging to the respective word family. In the neural network there are thus neurons which are each assigned to a word family. This part of the network is called a "lemma layer". Synapse connections exist between a neuron of this layer and the neurons of all the words belonging to the word family.

The assignment of individual neurons in one of the further levels preferably takes place to at least one synonym group, whereby synaptic connections are established between the neurons and the neurons of all the words belonging to the respective synonym group. In the neural network there are thus neurons which are each assigned to a group of synonyms. This part or level of the network is called a "synonym layer." There are synapse connections between a neuron of this layer and the neurons of the words of the synonym group.

Preferably, individual neurons in the at least one further level are assigned to at least one multi-word concept and / or one compound word, wherein synapse connections are established between these neurons and the neurons of all words belonging to the respective multi-word concept and / or compound, and the context information of the synapses is stored in the order of the respective words. In the neural network there are thus neurons which are each assigned to a multi-word term or a compound. This part or level of the network is called a "lexeme layer". There are synapse connections between a neuron of this layer and the neurons of the words that make up the lexeme. For word components for which a neuron exists in the lemma layer, the synapse is advantageously additionally connected to the corresponding lemma neuron. The synapses of the lexeme neurons are stored in such a way that the information about the order of the contained terms is preserved.

Preferably, individual neurons in one of the further levels are assigned to at least one group of free word associations, wherein, between the neurons of the respective words, additional synapse connections can be produced between at least one associative neuron and the at least one word neuron, which can be selected from all neurons of the words belonging to a respective word family and / or from all neurons of the words belonging to a respective multi-word concept and / or compound. In the neural network, there are thus neurons each associated with a group of word associations. This part or level of the neural network is called the "associative layer". Advantageously, every single word association is realized via a synapse between the associative neuron and a word neuron. If a neuron exists in the lemma layer for the target word, the synapse is connected to the corresponding lemma neuron. If there is a matching lexeme neuron, the synapse is linked to the lexeme neuron.
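The layered structure described above can be pictured with a small data-structure sketch. The class names `Neuron` and `Network`, the method names and the example word family are illustrative assumptions; the redirection of synapses to lemma or lexeme neurons is deliberately left out to keep the sketch short.

```python
from dataclasses import dataclass, field

@dataclass
class Neuron:
    layer: str   # "word", "lemma", "synonym", "lexeme" or "associative"
    label: str
    synapses: list = field(default_factory=list)  # ordered target neurons

class Network:
    def __init__(self):
        self.neurons = {}  # (layer, label) -> Neuron

    def neuron(self, layer, label):
        # A neuron is created the first time its label is seen,
        # mirroring "an additional word neuron for each new word".
        return self.neurons.setdefault((layer, label), Neuron(layer, label))

    def group(self, layer, label, members):
        """Connect one higher-layer neuron to the word neurons of its members, in order."""
        g = self.neuron(layer, label)
        g.synapses = [self.neuron("word", w) for w in members]
        return g

net = Network()
net.group("lemma", "think", ["think", "thinks", "thought"])
lexeme = net.group("lexeme", "Federal Republic of Germany",
                   ["Federal", "Republic", "of", "Germany"])
print([n.label for n in lexeme.synapses])
```

Keeping the synapses of a group neuron in a list rather than a set is what preserves the word order required for the lexeme layer.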

Generating association relations from a plain text inventory preferably comprises the following automated steps:

- splitting the plain text content into text units and / or words;

- forming association relations between each word of a text unit and all other words of that text unit by establishing synapse connections between the neurons which are assigned, in several levels, to the respective words, the respective word families and / or the respective multi-word concepts and / or compounds; and

- generating free word associations by means of associative neurons of an additional level between words for which associations are known.

Preferably, the free word associations are generated according to a neural learning rule using a positive or negative weighting, whereby the assignment of the synaptic connections can be changed in dependence on time based on the relevance values. Each association relationship thus has a quantitative relevance value that defines the strength of the coupling. Association relations with a high relevance value have a particularly close relationship between the words. For the procedure for building or amplifying the synapse connection, the Hebbian learning rule is advantageously used.

Preferably, the relevance value of the association relation is increased if a word pair relationship is repeatedly determined for an existing synapse at different plaintext positions.

Alternatively, the relevance value of the association relation is reduced if, for an existing synapse at a plaintext position, its source word but not the target word of the association concerned is found and / or an association relation is deleted if the relevance value falls below a predeterminable threshold value.

For the calculation of the relevance values of the association relations, the part of speech of the source and target words of the association is preferably taken into account and / or weighted. Advantageously, nouns and verbs increase the relevance value at full strength, while adjectives are taken into account with a reduced factor.
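A part-of-speech weighting of this kind might look as follows. The tag names, the numeric factors and the choice to multiply the source and target factors are all assumptions made for illustration; the description only states that nouns and verbs count fully and adjectives with a reduced factor.

```python
# Hypothetical factors; only the ordering (noun/verb > adjective) comes from the text.
POS_WEIGHT = {"NOUN": 1.0, "VERB": 1.0, "ADJ": 0.5}
DEFAULT_WEIGHT = 0.25  # assumed factor for all other parts of speech

def pair_weight(source_pos, target_pos):
    """Factor applied to a relevance update for one directed word pair."""
    source = POS_WEIGHT.get(source_pos, DEFAULT_WEIGHT)
    target = POS_WEIGHT.get(target_pos, DEFAULT_WEIGHT)
    return source * target

print(pair_weight("NOUN", "VERB"))  # full strength
print(pair_weight("NOUN", "ADJ"))   # reduced
```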

The entire processing and evaluation of text stocks is most efficient with neural mechanisms. To prepare for this, the context information is stored in a neural network.

For this purpose, a neural network is used whose structure has been optimized for text processing.

The structures mentioned are an implementation of a semantic-syntactic knowledge base, which is used particularly advantageously for the processes described below.

For example, a general text corpus is first entered, or "trained", into the neural network; this corpus can advantageously be extensive, for example 100 million lines of text. For special applications, texts from special language areas are additionally trained in, thereby further increasing the linguistic competence of the network. During training, the texts are broken down into sentences and words, and abbreviations are expanded.

If the word A appears in a sentence, but not B, then the relevance value of the relationship A - B is reduced, while B - A is not affected. As a result, the system has the option of partially or completely "forgetting" an initially overestimated assignment A - B when the word A is later found predominantly without the word B. If the weakening reduces the relevance value below a specified threshold value, the association relation is deleted. This avoids a continual increase in the number of stored association relationships, so that lower demands are placed on the executing computer system.
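The one-sided weakening and the deletion below a threshold can be sketched like this. The constants `WEAKEN` and `THRESHOLD` and the function name are hypothetical, and the sketch only covers positive relevance values.

```python
WEAKEN = 0.4      # hypothetical decrement per sentence
THRESHOLD = 0.5   # hypothetical deletion threshold

def weaken_missing(relevance, words):
    """Reduce A -> B when A occurs without B; delete the relation below the threshold."""
    present = set(words)
    for (a, b) in list(relevance):
        if a in present and b not in present:
            relevance[(a, b)] -= WEAKEN
            if relevance[(a, b)] < THRESHOLD:
                del relevance[(a, b)]

relevance = {("A", "B"): 1.0, ("B", "A"): 1.0}
weaken_missing(relevance, ["A", "C"])  # A -> B drops to roughly 0.6
weaken_missing(relevance, ["A"])       # A -> B falls below the threshold and is removed
print(relevance)
```

The relation B -> A is untouched throughout, because B never appears without A in these sentences.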

Thus, the association relations are directed. The relation A - B generally has a different relevance value than the relation B - A. If A - B is significantly larger than B - A, or if B - A is removed again, this has the meaning: "If I think of A, B comes to mind. But when I think of B, A does not come to mind."

For this purpose, a modified Hebb algorithm is advantageously used. It is characterized by the fact that the action potential of the participating neurons only gradually decreases. This achieves a stronger similarity to the learning processes in biological brains.

The above-described steps or actions are thus also carried out, in a weaker form, for the subsequent sentences, until the action potentials have fallen below a predetermined threshold value. After a relation has been reinforced by the co-occurrence of two words in a sentence, the non-occurrence of these words in the following sentences therefore leads to an attenuation. This subsequent weakening is dimensioned so that it is significantly smaller than the original reinforcement.

The synaptic connections are initially kept in a temporary memory area ("short-term memory"), in which they can be completely or partially forgotten by the attenuation mechanisms described above. After a certain time, the connections are transferred into a permanent memory ("long-term memory"). The information is then learned permanently. From then on, the relevance values can at most be amplified, but not attenuated again.
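A minimal sketch of this two-stage memory is given below. Measuring "time" in training steps, the constant `CONSOLIDATE_AFTER` and the names `Synapse`, `consolidate` and `update` are assumptions for illustration only.

```python
from dataclasses import dataclass

CONSOLIDATE_AFTER = 3  # hypothetical: training steps before transfer to long-term memory

@dataclass
class Synapse:
    relevance: float
    created_at: int
    permanent: bool = False  # False = short-term memory, True = long-term memory

def consolidate(synapses, now):
    """Move synapses that are old enough from short-term to long-term memory."""
    for s in synapses.values():
        if not s.permanent and now - s.created_at >= CONSOLIDATE_AFTER:
            s.permanent = True

def update(synapse, delta):
    """Apply a relevance change; permanent synapses are never attenuated."""
    if synapse.permanent and delta < 0:
        return
    synapse.relevance += delta

s = Synapse(relevance=1.0, created_at=0)
update(s, -0.25)              # still short-term: attenuation applies
consolidate({"A-B": s}, now=3)
update(s, -0.25)              # ignored after consolidation
update(s, 0.5)                # amplification is still possible
print(s)
```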

If a negation word, for example "not" or "none", occurs in a sentence, the word-pair relationships to be trained for this sentence are inverted in their effect. They thus lead to a weakening of the relevant word associations. If this gives the relevance value a negative sign, the connection is interpreted as an inhibitory synapse.
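The inversion by negation words can be illustrated as follows. The negation list, the step size `STEP` and the decision to exclude the negation words themselves from the pairs are assumptions; "never" is taken from the example sentence used later in the description.

```python
NEGATIONS = {"not", "none", "never"}  # illustrative list, not exhaustive
STEP = 1.0                            # hypothetical update size

def train(relevance, words):
    """Update all directed pairs of a sentence; a negation word inverts the sign."""
    sign = -1.0 if any(w in NEGATIONS for w in words) else 1.0
    # Assumption: negation words do not take part in pair relations themselves.
    content = sorted(set(words) - NEGATIONS)
    for a in content:
        for b in content:
            if a != b:
                relevance[(a, b)] = relevance.get((a, b), 0.0) + sign * STEP

def is_inhibitory(relevance, a, b):
    """A negative relevance value is read as an inhibitory synapse."""
    return relevance.get((a, b), 0.0) < 0

relevance = {}
train(relevance, ["gottschalk", "never", "chancellor"])
print(is_inhibitory(relevance, "gottschalk", "chancellor"))
```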

As already mentioned, pair relationships are established for those words which occur together in one sentence. A sentence is usually terminated by an appropriate punctuation mark such as ".", "!" or "?". If texts are to be trained that do not have such a sentence structure, such as tabular information, the unit within which the pair relationships are created may also be a line, a paragraph, or the entire record.
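Splitting a text into such units might be sketched as follows. The function names and the `mode` parameter are hypothetical, and the simple punctuation rule does not handle abbreviations, which the description says are expanded beforehand.

```python
import re

def split_units(text, mode="sentence"):
    """Split plain text into the units within which pair relations are formed."""
    if mode == "sentence":
        parts = re.split(r"(?<=[.!?])\s+", text)  # break after ., ! or ?
    elif mode == "line":
        parts = text.splitlines()
    elif mode == "paragraph":
        parts = re.split(r"\n\s*\n", text)
    else:  # "record": the entire text is one unit
        parts = [text]
    return [p.strip() for p in parts if p.strip()]

def words(unit):
    """Lower-cased word tokens of one unit."""
    return re.findall(r"\w+", unit.lower())

print(split_units("Adenauer was chancellor. Who came next?"))
print(split_units("name city\nmeier bonn", mode="line"))
```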

Different parts of speech enter the association generation with different weights. For example, nouns and verbs are included in the relevance value calculation at full strength, while adjectives are used at lesser weight.

The invention will now be explained in more detail with reference to the figures in conjunction with simple words and sentences. The figures show:

Fig. 1 representation of synapse connections between a neuron of the lemma layer and the neurons of all words belonging to a word family;

Fig. 2 representation of synapse connections between a neuron of the synonym layer and the neurons of all words with the same meaning;

Fig. 3 representation of synapse connections between a neuron of the lexeme layer and the neurons of the components of a multi-word term with assigned order;

Fig. 4 representation of synapse connections between a neuron of the associative layer and the neurons of all associated words;

Fig. 5 representation of synapse connections between word neurons to form an association relation;

Fig. 6 enhancement of the relevance values of association relations;

Fig. 7 different relevance values of two association relations; and

Fig. 8 inhibiting synapse connections between two words.

Fig. 1 shows diagrammatically the synapse connections, drawn as arrows, between a neuron 5 of the lemma layer 4 and the neurons 2 of all words belonging to a word family, which are part of the word layer 3. The so-called lemma layer 4 is provided for the representation of word families. A lemma neuron 5 represents the group of all words sharing a word stem.

Fig. 2 shows the synapse connections, drawn as arrows, that exist between a neuron 7 of the synonym layer 6 and the neurons of all words with the same meaning, which are part of the word layer 3.

According to FIG. 3, there is a neuron 9 in the lexeme layer 8 for each multi-word concept or compound. This also applies to multi-word terms which are written as separate words, for example "Federal Republic of Germany". Synapse connections exist between a neuron 9 of the lexeme layer 8 and the neurons of the components of a compound term, with the synapses being assigned an order.

Free word associations are illustrated in FIG. 4, in which an associative neuron 11 exists in an associative layer 10 for each word known to have associations with other words. Every single word association is realized via a synapse between the associative neuron and a word neuron.

The creation of the association relations between the words "Federal Chancellor" and "Adenauer" will now be demonstrated on an example text in conjunction with FIGS. 5, 6 and 7.

For associative training, the following sentence is used first:

Adenauer has left a lasting influence as chancellor.

The influence on the word-pair relationships is illustrated in FIGS. 5 to 7 by way of example for the words "Adenauer" and "Chancellor".

When the words "Federal Chancellor" and "Adenauer" are first found in the same sentence, association relations between these two words are created. The connection from "Federal Chancellor" to "Adenauer" and the connection from "Adenauer" to "Federal Chancellor" initially have a medium relevance value that is the same for both directions, cf. Fig. 5. In this case, each word of this sentence is, for example, connected to every other word by a provisional association relation. For this purpose, synapse connections are generated between the relevant word neurons.

After that, the text contains, for example, the following sentence:

So for many years everyone immediately thought of "Adenauer" on hearing the word "Chancellor".

Since the words "Federal Chancellor" and "Adenauer" appear again in the next sentence, the relevance values of the relations are strengthened. They are still the same for both directions, cf. Fig. 6. Another sentence is:

But when we think of "Chancellor" today, we tend to think of other names.

Here the word "Federal Chancellor" now appears without "Adenauer", cf. Fig. 7. This means that the connection from "Federal Chancellor" to "Adenauer" is weakened, while the connection from "Adenauer" to "Federal Chancellor" remains unchanged, so that the two relations now have different relevance values.

After associative training, the system thus carries strong associations from "Adenauer" to "Federal Chancellor". The association of "Federal Chancellor" to "Adenauer" is weaker. It could be omitted after further reduction steps.
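The three training sentences can be traced numerically with a small sketch. The values `INITIAL`, `REINFORCE` and `WEAKEN` are hypothetical, each sentence is reduced to the two words of interest, and the labels "adenauer" and "chancellor" stand for the two word neurons.

```python
INITIAL, REINFORCE, WEAKEN = 1.0, 0.5, 0.25  # hypothetical values

def train(relevance, words):
    """Weaken A -> B where A occurs without B, then create or reinforce co-occurring pairs."""
    present = set(words)
    for (a, b) in relevance:
        if a in present and b not in present:
            relevance[(a, b)] -= WEAKEN
    for a in present:
        for b in present:
            if a != b:
                if (a, b) in relevance:
                    relevance[(a, b)] += REINFORCE
                else:
                    relevance[(a, b)] = INITIAL

relevance = {}
train(relevance, ["adenauer", "chancellor"])  # Fig. 5: both directions start equal
train(relevance, ["adenauer", "chancellor"])  # Fig. 6: both directions reinforced
train(relevance, ["chancellor"])              # Fig. 7: only chancellor -> adenauer weakened
print(relevance)
```

With these assumed values the relation from "adenauer" to "chancellor" ends at 1.5, while the reverse direction ends at 1.25, reproducing the asymmetry described in the text.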

The following sentence refers to FIG. 8:

But despite his popularity, Thomas Gottschalk never became Chancellor.

The words "Chancellor" and "Gottschalk" occur together with the negation word "never". Therefore, inhibitory connections are created between "Chancellor" and "Gottschalk".

"Gottschalk" thus stands in an inhibitory connection to "Chancellor". This means that an association with "Chancellor" that would otherwise arise in a given context is reduced to some extent if "Gottschalk" is also mentioned in that context.

All features disclosed in the application documents are claimed as essential to the invention.

Claims

1. A method for the automated generation and / or recognition of context information in plain text files by means of a computer using a neural network, wherein the neural network is structured in multiple levels, comprising the following steps:

- supplying plain text files with any plurality of words, in particular with a volume of more than 10 million lines of text and preferably 100 million lines of text;

- assigning individual neurons (2) to words in at least a first level (3), wherein the assignment of individual neurons to a specific word is done by training and updating using association relations between words of the plain text stock, and wherein for each new word an additional word neuron (2) is created;

- assigning individual neurons (5, 7, 9, 11) to at least one group of words by means of synapses in at least one further level (4, 6, 8, 10), wherein different words in a respective textual context are updatable by means of new and / or existing association relations generated between these words, and wherein additional synapses are generated in the neural network, wherein the synapses, for representing the association relations during training with plain text, are each assigned an updatable relevance value defining the strength of the association relation;

- forming association relations between each word of a unit of a plain text stock and all other words of this text unit by establishing synapse connections between the neurons associated with the respective words from the first and / or further level(s), wherein the association relations which are newly generated and / or updated in their relevance value are stored in a temporary memory; and

- storing the association relations and synapse connections to the neurons in the levels and between the levels in a permanent memory after a predeterminable time to form structured knowledge in a database, in which a search for arbitrary words and the corresponding context can be carried out by means of the association relations and synapse connections.

2. The method according to claim 1, characterized in that the assignment of individual neurons (2, 5, 7, 9, 11) and the formation of association relations and synapse connections within a subsection of the plaintext occurs, in particular within a sentence, a text line or a tabular element.

3. The method according to any one of claims 1 or 2, characterized in that individual neurons (5) in the at least one further level (4) are assigned to at least one word family with the same word stem, wherein synapse connections are established between these neurons and the neurons of all words belonging to the respective word family.

4. The method according to any one of claims 1 to 3, characterized in that individual neurons (7) in the at least one further level (6) are assigned to at least one synonym group, wherein synapse connections are established between these neurons and the neurons of all words belonging to the respective synonym group.

5. The method according to any one of claims 1 to 4, characterized in that individual neurons (9) in the at least one further level (8) are assigned to at least one multi-word concept and / or one compound word, wherein synapse connections are established between these neurons and the neurons of all words belonging to the respective multi-word concept and / or compound, and the context information of the synapses is stored in the order of the respective words.

6. The method according to any one of claims 1 to 5, characterized in that individual neurons (11) in one or more further levels (10) are assigned to at least one group of free word associations, wherein, between the neurons of the respective words, additional synapse connections can be established between at least one associative neuron and the at least one word neuron, which can be selected from all neurons of the words belonging to a respective word family and / or from all neurons of the words belonging to a respective multi-word concept and / or compound.

7. The method according to any one of claims 1 to 6, characterized in that the generation of association relations from a plain text stock preferably comprises the following automated steps:

- splitting the plain text content into text units and / or words;

- forming association relations between each word of a text unit and all other words of this text unit by establishing synapse connections between the neurons which are assigned, in several levels (3, 4, 6, 8), to the respective words, the respective word families and / or the respective multi-word concepts and / or compounds; and

- generating free word associations by means of associative neurons of an additional level (10) between words for which associations are known.

8. The method as claimed in claim 7, characterized in that the free word associations are generated according to a neural learning rule using a positive or negative weighting, wherein the association of the synaptic connections can be changed in a time-dependent manner based on the relevance values.

9. The method according to any one of claims 1 to 8, characterized in that the relevance value of the association relation is increased if a word pair relationship is repeatedly determined for an existing synapse at different plain text positions.

10. The method according to any one of claims 1 to 8, characterized in that the relevance value of the association relation is reduced if, for an existing synapse at a plain text position, its source word but not the target word of the association concerned is found, and / or that an association relation is deleted if the relevance value falls below a predeterminable threshold value.

11. The method according to any one of claims 1 to 10, characterized in that, for the calculation of the relevance values of the association relations, the part of speech of the source and target words of the association is taken into account and / or weighted.