US20080215313A1 - Speech and Textual Analysis Device and Corresponding Method - Google Patents

Speech and Textual Analysis Device and Corresponding Method


Publication number
US20080215313A1
US20080215313A1 (application Ser. No. 11/659,955)
Authority
US
United States
Prior art keywords
language
text
data
text analysis
basis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US11/659,955
Inventor
Paul Waelti
Carlo A. Trugenberger
Frank Cuypers
Christoph R. Waelti
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Infocodex AG
Original Assignee
Swiss Reinsurance Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Swiss Reinsurance Co Ltd filed Critical Swiss Reinsurance Co Ltd
Assigned to SWISS REINSURANCE COMPANY reassignment SWISS REINSURANCE COMPANY ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: WAELTI, PAUL, CUYPERS, FRANK, TRUGENBERGER, CARLO A., WAELTI, CHRISTOPH P.
Publication of US20080215313A1 publication Critical patent/US20080215313A1/en
Assigned to INFOCODEX AG reassignment INFOCODEX AG ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: SWISS REINSURANCE COMPANY
Abandoned legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/088: Non-supervised learning, e.g. competitive learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35: Clustering; Classification
    • G06F 16/355: Class or cluster creation or modification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/237: Lexical tools
    • G06F 40/247: Thesauruses; Synonyms
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis

Description

  • the invention relates to a system and a method for automated language and text analysis by the formation of a search and/or classification catalog, with data records being recorded by means of a linguistic databank and with speech and/or text data being classified and/or sorted on the basis of the data records (keywords and/or search terms).
  • the invention relates in particular to a computer program product for carrying out this method.
  • in recent years, the importance of large databanks, in particular databanks linked in a decentralized form, for example by networks such as the worldwide backbone network Internet, has increased exponentially. More and more information, goods and/or services are being offered via such databanks or networks. This is evident just from the omnipresence of the Internet nowadays. The availability and amount of such data has now resulted, for example, in Internet tools for searching for and finding relevant documents and/or for classification of documents that have been found acquiring enormous importance. Tools such as these for decentralized databank structures or databanks in general are known. In this context, the expression “search engines” is frequently used in the Internet, such as the known Google™, Alta Vista™ or structured presorted link tables such as Yahoo™.
  • the problems involved in searching for and/or cataloging text documents in one or more databanks include the following: (1) indexing or cataloging the content of the documents to be processed (content synthesis); (2) processing a search request over the indexed and/or catalogued documents (content retrieval).
  • the data to be indexed and/or catalogued normally comprises unstructured documents, such as texts, descriptions and links.
  • in more complex databanks, the documents may also comprise multimedia data, such as images, voice/audio data, video data, etc. In the Internet, this may for example be data which can be downloaded from a website by means of links.
  • U.S. Pat. No. 6,714,939 discloses a method and a system such as this for conversion of plain text or text documents to structured data.
  • the system according to the prior art can be used in particular to check for and/or to find data in a databank.
  • Neural networks are known in the prior art and are used, for example, to solve optimization tasks, for pattern recognition and for artificial intelligence.
  • Corresponding to biological nerve networks, a neural network comprises a large number of network nodes, so-called neurons, which are connected to one another via weighted links (synapses).
  • the neurons are organized and interconnected in network layers.
  • the individual neurons are activated as a function of their input signals and produce a corresponding output signal.
  • a neuron is activated by the summation over its input signals, with each input acting via an individual weighting factor.
  • Neural networks such as these have a learning capability in that the weighting factors are varied systematically as a function of predetermined exemplary input and output values until the neural network produces a desired response in a defined predictable error range, such as the prediction of output values for future input values. Neural networks therefore have adaptive capabilities for learning and storage of knowledge, and associated capabilities for comparison of new information with stored knowledge.
  • the neurons can assume a rest state or an energized state.
  • Each neuron has a plurality of inputs and one, and only one, output, which is connected to the inputs of other neurons in the next network layer or, in the case of an output node, represents a corresponding output value.
  • a neuron changes to the energized state when a sufficient number of inputs to the neuron are energized above a specific threshold value of that neuron, that is to say when the summation over the inputs reaches a specific threshold value.
  • the knowledge is stored by adaptation in the weightings of the inputs of a neuron and in the threshold value of that neuron.
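  • purely as an illustration (this sketch is not part of the patent text), the threshold behavior described above can be written in a few lines of Python; all names and values are chosen freely:

      import numpy as np

      def neuron_state(inputs, weights, threshold):
          # energized (1) when the weighted summation over the inputs
          # reaches the neuron's threshold value, otherwise rest state (0)
          return 1 if float(np.dot(inputs, weights)) >= threshold else 0

      print(neuron_state(np.array([1.0, 0.0, 1.0]), np.array([0.4, 0.9, 0.3]), 0.5))  # -> 1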
  • the weightings of a neural network are trained by means of a learning process (see for example G. Cybenko, “Approximation by Superpositions of a Sigmoidal Function”, Mathematics of Control, Signals and Systems, 2, 1989, pp. 303-314; M. T. Hagan, M. B. Menhaj, “Training Feedforward Networks with the Marquardt Algorithm”, IEEE Transactions on Neural Networks, Vol. 5, No. 6, pp. 989-993, November 1994; K. Hornik, M. Stinchcombe, H. White, “Multilayer Feedforward Networks are Universal Approximators”, Neural Networks, 2, 1989, pp. 359-366; etc.).
  • in contrast to supervised learning neural nets, no desired output pattern is predetermined for the neural network in the learning process of unsupervised learning neural nets.
  • the neural network itself attempts to achieve a representation of the input data that is as sensible as possible.
  • So-called topological feature maps (TFM) such as Kohonen maps are known, for example, in the prior art.
  • the network attempts to distribute the input data as sensibly as possible over a predetermined number of classes. In this case, it is therefore used as a classifier.
  • Classifiers attempt to subdivide a feature space, that is to say a set of input data, as sensibly as possible into a total of N sub-groups. In most cases, the number of sub-groups or classes is defined in advance.
  • a large number of different interpretations can be attached to the word “sensible”.
  • one normal interpretation for a classifier would be: “form the classes such that the sum of the distances between the feature vectors and the class center points of the classes with which they are associated is as small as possible.”
  • a criterion is thus introduced which is intended to be either minimized or maximized.
  • the object of the classification algorithm is to carry out the classification process for this criterion and the given input data in the shortest possible time.
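  • as a hedged sketch of the quoted criterion (not taken from the patent text), the quantity to be minimized can be computed as follows; the feature vectors, class labels and class center points are illustrative:

      import numpy as np

      def classification_cost(features, labels, centers):
          # sum of the distances between each feature vector and the
          # center point of the class with which it is associated
          return float(sum(np.linalg.norm(f - centers[c]) for f, c in zip(features, labels)))

      x = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]])
      centers = np.array([[0.05, 0.0], [1.0, 1.0]])
      print(classification_cost(x, [0, 0, 1], centers))  # -> 0.1 for this 'sensible' assignment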
  • Topological feature maps such as Kohonen maps allow a multi-dimensional feature space to be mapped into one with fewer dimensions, while retaining the most important characteristics. They differ from other classes of neural network in that no explicit or implicit output pattern is predetermined for an input pattern in the learning phase. During the learning phase of topological feature maps, they themselves adapt the characteristics of the feature space being used.
  • the link between a classical classifier and a self-organizing neural network or a topological feature map (TFM) is that the output pattern of a topological feature map generally comprises a single energized neuron. The input pattern is associated with the same class as the energized output neuron. In the case of topological feature maps in which a plurality of neurons in the output layer can be energized, the one having the highest energization level is generally simply assessed as the class associated with the input pattern. The continuous model of a classifier, in which a feature is associated with specific grades of a class, is thus changed to a discrete model.
  • One object of this invention is to propose a novel system and automated method for the formation of a search and/or classification catalog, which does not have the abovementioned disadvantages of the prior art.
  • One particular aim is to propose an automated, simple and economic method in order to classify and/or to sort, and/or to index for a search request, a large amount of language and/or text data which, for example, is stored such that it can be accessed via one or more databanks.
  • the aim of the invention is to produce an indexing method for efficient and reliable thematic searching, that is to say for finding documents which are as similar as possible to a given request, comprising an entire text document or individual keywords.
  • a further aim of the invention is to produce a clearly defined measure for objective assessment of the similarity of two documents when compared, and for the ranking of documents.
  • An additional aim of the invention is to produce a method for identification of associated document clusters, that is to say of documents which are virtually identical (different versions of the same document with minor changes).
  • according to the present invention, these aims are achieved in particular by the elements of the independent claims. Further advantageous embodiments are also disclosed in the dependent claims and in the description.
  • a language and text analysis apparatus is used to form a search and/or classification catalog which has at least one linguistic databank for association of linguistic terms with data records, in which case the linguistic terms comprise at least keywords and/or search terms, and language and/or text data can be classified and/or sorted corresponding to the data records, in that the language and text analysis apparatus has a taxonomy table with variable taxon nodes on the basis of the linguistic databank, in which case one or more data records can be associated with one taxon node in the taxonomy table, and in which case each data record has a variable significance factor for weighting of the terms on the basis of at least filling words and/or linking words and/or keywords, in that each taxon node additionally has a weighting parameter for recording of frequencies of occurrence of terms within the language and/or text data to be sorted and/or to be classified, in that the language and/or text analysis apparatus has an integration module for determination of a predefinable number of agglomerates on the basis of the weighting parameters of the taxon nodes in the taxonomy table, with one agglomerate comprising at least one taxon node, and in that the language and/or text analysis apparatus has at least one neural network module for classification and/or for sorting of the language and/or text data on the basis of the agglomerates in the taxonomy table.
  • the linguistic databank may, for example, comprise multilingual data records.
  • This embodiment variant has the advantage, inter alia, that documents, collections or entirely general data items can be grouped logically for example in databanks, in particular decentralized databanks without human intervention (for example no training of a network, no preparation of content-specific taxonomy, etc.).
  • it is simple to produce an overview of the thematic content of a document collection by means of a topological map.
  • This apparatus and automated method can thus be regarded as considerable progress for “table of content” methods.
  • the invention produces an extremely reliable and efficient tool for thematic searching (identification of documents on the basis of a search input in a natural language) in addition to conventional searching from the prior art by means of a combination of search terms.
  • search results can very easily be displayed clearly by means of the projection onto the topological and/or geographic map as a so-called “heat map” display, in contrast to the conventional uncategorized list formats.
  • the invention produces a measure which can be checked easily for comparison and/or similarity assessment of documents.
  • the invention likewise produces real multilingual knowledge management with search functions in more than one language. Until now, this has not been possible with the prior art.
  • the invention even allows automated generation of “descriptors”, in which case descriptors reflect the content characteristic of a document (also with cross-language characteristics).
  • the invention thus produces an indexing method for efficient and reliable thematic searching, that is to say for finding documents which are as similar as possible to a given request, comprising an entire text document or individual keywords.
  • the invention likewise produces a clearly defined measure for objective assessment of the similarity of two documents for the comparison and ranking of documents.
  • the invention produces a method for identification of associated document clusters, that is to say documents which are virtually identical (different versions of the same document with minor changes).
  • the neural network module has at least one or more self-organizing Kohonen maps.
  • This embodiment variant has the same advantages, inter alia, as the previous embodiment variant.
  • the use of self-organizing network techniques, for example SOM or Kohonen maps, allows further automation of the method.
  • the language and text analysis apparatus has an entropy module for determination of an entropy parameter, which can be stored in a memory module, on the basis of the distribution of a data record in the language and/or text data.
  • This embodiment variant has the advantage, inter alia, that an additional relevance parameter can be determined.
  • a term which appears in a well scattered form over all of the language and/or text data and/or over all the documents has a high “entropy” and will contribute little to the process of distinguishing between the documents. The entropy can thus make a considerable contribution to the efficiency of the apparatus and/or method according to the invention.
  • the apparatus has a hash table which is associated with the linguistic databank, in which case a hash value can be used to identify linguistically linked data records in the hash table.
  • This embodiment variant has the advantage, inter alia, that linguistically linked data records such as “common”, “sense” and “common sense” can be found much more quickly and considerably more efficiently.
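  • a minimal sketch of such a lookup, assuming the hash table is represented as a Python dict keyed by term hashes; the record ids are hypothetical:

      # hypothetical record ids; a real table would be built from the linguistic databank 22
      records = {"common": 101, "sense": 102, "common sense": 103}
      table = {hash(term): rec_id for term, rec_id in records.items()}

      def lookup_term(words, start, max_words=3):
          # try the longest compound first, so "common sense" wins over "common"
          for n in range(min(max_words, len(words) - start), 0, -1):
              candidate = " ".join(words[start:start + n])
              rec_id = table.get(hash(candidate))
              if rec_id is not None:
                  return candidate, rec_id
          return None

      print(lookup_term("this is common sense obviously".split(), 2))  # -> ('common sense', 103)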
  • the data records for example, can be associated with a language by means of a language parameter and can be identified as a synonym in the taxonomy table.
  • This embodiment variant has the advantage, inter alia, that multilingual text or language data can also be classified and/or sorted by means of the language and text analysis apparatus.
  • the agglomerates form an n-dimensional content space where, for example, n may be equal to 100. However, any other natural number may likewise be worthwhile for specific applications.
  • this embodiment variant has the advantage, inter alia, that it for the first time allows efficient association with the self-organizing networks, since otherwise the content space would have either too many or too few degrees of freedom for the association to remain valid.
  • the language and text analysis apparatus has descriptors which can be used to determine constraints which correspond to definable descriptors for a subject group.
  • This embodiment variant has the advantage, inter alia, that the documents are brought to the correct global region by means of the SOM technique.
  • the taxon nodes in the taxonomy table are produced on the basis of a universal, subject-independent linguistic databank, which the apparatus holds as its linguistic databank.
  • This embodiment variant has the advantage, inter alia, that this makes it possible for the first time to carry out cataloguing and/or indexing in a completely automated manner on the basis of a taxonomy which is not subject-specific and therefore does not need to be defined in advance.
  • the present invention relates not only to the method according to the invention but also to an apparatus for carrying out this method. Furthermore, it is not restricted to the stated system and method, but likewise relates to a computer program product for implementation of the method according to the invention.
  • embodiment variants of the present invention are described in the following text with reference to examples. The examples of the embodiments are illustrated by the following attached figures:
  • FIG. 1 shows a block diagram which schematically illustrates the method and/or system according to the invention.
  • FIG. 2 likewise shows a block diagram, which illustrates the use of an apparatus according to the invention in a network with decentralized databanks and/or data sources for thematic recording and/or cataloging and/or monitoring of the data flows in the network.
  • FIG. 3 shows a block diagram which illustrates the structure of a taxonomy table 21.
  • FIG. 4 shows a block diagram which schematically illustrates the formation of agglomeration clusters in the taxonomy table.
  • FIG. 5 shows a block diagram which schematically illustrates one example of the combination of an agglomeration cluster into subject regions.
  • FIG. 6 shows a block diagram which schematically illustrates an information map or Kohonen map. The documents to be analyzed, that is to say all of the text and/or language data 10, are grouped by means of the SOM technique, with constraints, by the neural network module 26 into a two-dimensional array of neurons (=information map).
  • FIG. 7 shows a flowchart which illustrates method steps for the initial analysis of document collections, as a so-called text mining step.
  • FIG. 8 shows a scheme for the generation of clusters in a neuron.
  • DocEps corresponds to a tolerance which can be defined for the maximum distance between the members of a cluster.
  • FIGS. 1 to 6 schematically illustrate an architecture which can be used for implementation of the invention.
  • the language and text analysis apparatus has at least one linguistic databank 22 for association of linguistic terms with data records in order to form a search and/or classification catalog.
  • the linguistic databank 22 may also include, for example, multilingual data records.
  • the data records can be associated with a language and, for example, can be identified as synonyms in the taxonomy table by means of a language parameter.
  • the linguistic databank 22 may, for example, have an associated hash table, in which case linguistically linked data records in the hash table can be identified by means of a hash value.
  • the language and text analysis apparatus can be used to classify and/or sort language and/or text data 10 on the basis of the data records.
  • the linguistic terms comprise at least keywords and/or search terms.
  • it is important to note that the language and/or text data may also include data of an entirely general nature, such as multimedia data, that is to say inter alia digital data such as texts, graphics, images, maps, animations, moving images, video, QuickTime, audio records, programs (software), data accompanying programs and hyperlinks or references to multimedia data.
  • this also includes, for example, the MPx (MP3) or MPEGx (MPEG4 or MPEG7) standards, as defined by the Moving Picture Experts Group.
  • the language and text analysis apparatus has a taxonomy table 21 with variable taxon nodes.
  • One or more data records can be associated with one taxon node in the taxonomy table 21.
  • Each data record has a variable significance factor for weighting of the terms on the basis of at least filling words and/or linking words and/or keywords.
  • the language and text analysis apparatus has a weighting module 23.
  • In addition, a weighting parameter is stored, associated with each taxon node, in order to record the frequencies of occurrence of terms within the language and/or text data 10 to be classified and/or sorted.
  • the language and/or text analysis apparatus has an integration module 24 in order to determine a predefinable number of agglomerates on the basis of the weighting parameters of the taxon nodes in the taxonomy table 21.
  • An agglomerate has at least one taxon node.
  • the agglomerates may, for example, form an n-dimensional content space. As an exemplary embodiment, n may be chosen to be equal to 100, for example.
  • the language and/or text analysis apparatus has at least one neural network module 26 for classification and/or sorting of the language and/or text data 10 on the basis of the agglomerates in the taxonomy table 21.
  • the neural network module 26 may, for example, have at least one topological feature map (TFM), such as a self-organizing Kohonen map.
  • Appropriate constraints, for example for a subject group, can be determined by means of definable descriptors.
  • the language and text analysis apparatus may additionally have, for example, an entropy module 25 in order to determine an entropy parameter, which is stored in a memory module, on the basis of the distribution of a data record in the language and/or text data 10.
  • the entropy module 25 may, for example, be in the form of software and/or hardware.
  • the entropy parameter may, for example, be given by:
  • Entropy_DR = ln(freqsum_DR) − Σ F_DR · ln(F_DR) / freqsum_DR
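  • a minimal sketch of this entropy parameter, assuming F_DR is the frequency of a data record in one document and freqsum_DR its total frequency over the collection (these meanings are an interpretation, as the text does not define them explicitly):

      import math

      def entropy_parameter(per_doc_freq):
          # Entropy_DR = ln(freqsum_DR) - sum(F_DR * ln(F_DR)) / freqsum_DR
          freqs = [f for f in per_doc_freq if f > 0]
          freqsum = sum(freqs)
          if freqsum == 0:
              return 0.0
          return math.log(freqsum) - sum(f * math.log(f) for f in freqs) / freqsum

      print(entropy_parameter([1] * 100))  # well scattered term: high entropy (~4.6)
      print(entropy_parameter([100]))      # term concentrated in one document: 0.0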
  • the results may, for example, be displayed for a user on an output unit 28, for example additionally via a network 40, 41, 42.
  • for the analysis and search functions, the text or language data to be analyzed can be subdivided into the following components: a) an n-dimensional vector for characterization of the thematic content of the document, for example n may be chosen to be equal to 100; b) m descriptors which are characteristic of a document and represent constraints for optimization, where the number of descriptors may, for example, be m=20; c) a set of metadata which can be extracted automatically from the document, that is to say for example the title of the document, the author, the dates of publication of the document, the location or address of the document such as a URL (Uniform Resource Locator), the file format such as PDF (Portable Document Format), Microsoft Word, HTML (Hypertext Markup Language), HDML (Handheld Device Markup Language), WML (Wireless Markup Language), VRML (Virtual Reality Modeling Language), XML (Extensible Markup Language), JPEG (Joint Photographic Experts Group) or MPEG (Moving Picture Experts Group), the number of words and/or terms, the number of integer and/or rational numbers, the language of the majority of the terms in the document, additional rules or characteristics, etc.
  • the axes of the n-dimensional content space depend on the thematic linking and/or on the inner links between all of the language and/or text data 10 to be analyzed.
  • the axes can sensibly be constructed such that the relevant subject areas of the language and/or text data 10 are reproduced as well as possible, and irrelevant background (noise) is not displayed, or is greatly suppressed.
  • the generation of the axes and the projection are based on the linguistic and, for example, multilingual databank 22 that has been mentioned, which is associated with a universal taxonomy and/or a universal taxonomy tree. Universal means that there is no need to predefine a specific region by means of the taxonomy before the cataloging and/or indexing of the text and/or language data 10. Until now, this has not been possible in this way in the prior art.
  • Words, expressions and/or terms which occur in a text document are compared with a large list of words which are stored in the linguistic databank 22 .
  • “terms” is intended to mean linked words such as the expressions “nuclear power plant”, “Commission on Human Rights”, or “European Patent Office”.
  • 2.2 million entries have been found to be sufficient for a linguistic databank 22 such as this for the languages of English, French, German and Italian, although the databank 22 may, of course, also have a greater or a lesser number of entries, as required.
  • Words/terms with the same meaning can be linked for example in synonym groups (synsets), for example also jointly for all languages.
  • FIG. 3 shows such a structure for a taxonomy table 21.
  • entries for each language can be structured as follows:
  • the method steps for an initial analysis of the language and/or text data 10 may appear as follows: (1) input a document, that is to say language and/or text data 10; (2) make a first estimate of the document; (3) word processing: i) extraction of the expression/term; ii) comparison with the entries in the linguistic databank, taking account of the appropriate language and lexicography rules for correct association, with generation of the synset and hyperonym codes, of the significance and of the language by means of the databank; iii) generation of new expressions and/or synsets for expressions or terms that have not been found; iv) determination of the frequency of the expression/term per document; v) adaptation of the language, if necessary; (4) associated storage of the information; (5) next document or next language and/or text data 10.
  • the frequency is calculated for each synset (isyn) and each language and/or text data item 10 or document (idoc) on the following basis:
  • F_isyn(idoc) = norm(idoc) · Σ_(word ∈ isyn) f_word · sig_word,
  • where f_word is the frequency of the word in idoc,
  • sig_word is the significance of the word on the basis of the linguistic databank (0, …, 4), and
  • norm(idoc) = min(weighted number of expressions in idoc, 500) / (weighted number of expressions in idoc),
  • where the weight is given by sig_word.
  • the factor norm may be introduced, for example, in order to prevent very large documents from having a dominant effect on a specific data link.
  • the factor may be determined empirically.
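  • a sketch of this frequency calculation under the stated definitions; representing the significance mapping (0, …, 4) from the linguistic databank 22 as a plain dict is an assumption:

      from collections import Counter

      def synset_frequency(doc_words, synset_words, significance, cap=500):
          # F_isyn(idoc) = norm(idoc) * sum over the synset's words of f_word * sig_word
          counts = Counter(doc_words)
          weighted = sum(significance.get(w, 1) * c for w, c in counts.items())
          if weighted == 0:
              return 0.0
          norm = min(weighted, cap) / weighted  # damps very large documents
          return norm * sum(counts[w] * significance.get(w, 0) for w in synset_words)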
  • a synset which appears to be well scattered over all of the language and/or text data items 10 and/or over all of the documents has a high “entropy” and will contribute little to distinguishing between the documents.
  • a term such as “Neue Zürcher Zeitung”, for example, will appear in all of or a large number of that newspaper's articles without having any distinguishing power for the content of the documents.
  • the term “Relevance Index” RI_isyn can be defined as a measure for a general relevance of a synset isyn by:
  • the relevance of a hyperonym is determined by integrating the relevance indices over all of the text and/or language data items 10 to be analyzed.
  • This relevance is a measure of the total hit frequency of a taxon node by all of the text and/or language data 10. This measure indicates which subject region and/or subject regions is or are predominant in a document collection.
  • in principle, each taxon node can be associated with one axis in the content space. However, this would result in a content space with a dimension of more than 4000, which would correspond to an enormous overhead and, furthermore, far too many degrees of freedom for content determination.
  • the cluster is formed at the lowest possible level of the taxonomy tree or taxonomy table. This method can be compared, for example, with the formation of agglomerates in demography.
  • Each cluster (with all of the corresponding synsets which refer to it) is associated with one axis in the n-dimensional content space.
  • Axis n−1 is used, for example, for synsets which do not refer to any of the agglomeration clusters, and axis n is reserved for numbers.
  • FIG. 4 shows, schematically, the formation of agglomeration clusters such as these in the taxonomy table.
  • ntop subject regions are formed which are each composed of a specific sub-group of agglomeration clusters (ntop may, for example, be in the order of magnitude of 10 to 20).
  • the agglomerates are formed in such a way that the taxon nodes of an agglomeration cluster which is associated with the same subject region (topic) have a common mother node in the hierarchy of the taxonomy table.
  • the transformation rule which results from this may, for example, be as follows: each synset refers to one of the selected agglomeration clusters, corresponding to one axis in the content space, or to axis n−1. A large number of synsets in turn refer to one of the ntop subject regions at a higher aggregation level.
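  • this transformation rule can be sketched as follows; since the exact definition of the vector components is not reproduced in this text, simply summing the synset frequencies per axis is an assumption:

      import numpy as np

      def content_vector(synset_weights, axis_of_synset, n=100):
          # synset_weights: {synset: F_isyn(idoc)}; axis_of_synset maps a synset
          # to the (0-based) axis of its agglomeration cluster
          c = np.zeros(n)
          for isyn, weight in synset_weights.items():
              if isyn.isdigit():                       # numbers: axis n (index n-1)
                  c[n - 1] += weight
              else:                                    # cluster axis, else axis n-1 (index n-2)
                  c[axis_of_synset.get(isyn, n - 2)] += weight
          return c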
  • FIG. 5 shows one example of the combination of an agglomeration cluster into subject regions.
  • the vector component c i for the i-th axis of the content space can be defined for each document idoc by:
  • the unit (metric) for the n-dimensional space is determined by means of the overall entropy of all of the synsets which refer to one axis i (isyn ∈ Synsets of axis i), in which case the overall entropy can be determined in a way analogous to the entropy of the synsets as defined above.
  • the weights g i for the i-th component can then be determined, for example, by:
  • a synset relevance value Relev_isyn is determined for the choice of the m most typical descriptors of a document, that is to say specific language and/or text data 10, for each synset isyn in the document idoc, for example by:
  • the m synsets with the highest relevance value Relev_isyn may be selected, for example, as the m descriptors which are most typical for a document idoc. These descriptors, which can for example be stored associated with their corresponding hyperonyms, are used for cataloging and/or indexing. They include the most important characteristics of a document even in those situations in which the content of a specific document is reproduced in a non-optimal manner by the projection onto the content space.
  • the method mentioned above, which is based on the statistical and/or linguistic analysis described, is combined with one or more neural network modules 26.
  • This statistical and/or linguistic analysis method is used, as described, to produce a comprehensive universal taxonomy table 21 for identification of the subject content.
  • the results of the linguistic analysis are combined with neural network technologies.
  • so-called self-organizing map (SOM) techniques, for example Kohonen maps, can be very highly suitable.
  • other neural network techniques may also be worthwhile or more suitable for specific applications without restricting the scope of protection of the patent in any way.
  • This method can considerably speed up the iteration process and can minimize the risk of the SOM technique ending in a non-optimum configuration (for example a local minimum).
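  • a minimal, generic Kohonen training loop as an illustration (a sketch, not the patented constrained variant; the constraints enter here only through the distance function that is passed in, all other names and parameters are chosen freely):

      import numpy as np

      def train_som(docs, grid_w, grid_h, distance, steps=10000, lr=0.5, radius=3.0, seed=0):
          rng = np.random.default_rng(seed)
          units = rng.random((grid_w * grid_h, docs.shape[1]))  # one weight vector per neuron
          coords = np.array([(x, y) for x in range(grid_w) for y in range(grid_h)], dtype=float)
          for t in range(steps):
              doc = docs[rng.integers(len(docs))]
              bmu = min(range(len(units)), key=lambda u: distance(doc, units[u]))  # best-matching unit
              decay = np.exp(-3.0 * t / steps)                  # shrink learning rate and radius
              d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)    # squared grid distance to the BMU
              h = np.exp(-d2 / (2 * (radius * decay) ** 2))
              units += (lr * decay) * h[:, None] * (doc - units)
          return units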
  • the distance between two vectors (documents idoc) a and b can be determined for the SOM algorithm for example by:
  • KL_a,b is the Kullback-Leibler distance between two documents, in the sense that the assignment of a document idoc with content vector c to a subject region jtop is measured using
  • DescriptorPart = Σ_(Descriptors_jtop) (Relev_isyn(idoc))²
  • where Σ_(Descriptors_jtop) runs over all of the descriptors which refer to jtop.
  • ErrMS is the estimate of the mean square error (discrepancy) where, for example, ErrMS = 10^−5.
  • the Kullback-Leibler distance between two documents idoc and kdoc with the content vectors a and b is given, for example, by
  • KL_a,b = (1/ntop) · Σ_jtop (P_jtop,a − P_jtop,b) · ln(P_jtop,a / P_jtop,b)
  • the Kullback-Leibler part in the total distance ensures that the documents are moved to the correct global region by the SOM technique.
  • the Kullback-Leibler part thus acts as a constraint of the SOM technique.
  • the metric part in the total distance is responsible for local placing in the individual neurons in a subject region.
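  • a sketch of the two distance parts, assuming P_jtop,a and P_jtop,b are the subject-region assignments of the two documents; how the metric part and the Kullback-Leibler part are combined exactly is not reproduced in this text, so the plain sum below is an assumption:

      import numpy as np

      def kl_distance(p_a, p_b, eps=1e-12):
          # KL_a,b = (1/ntop) * sum_jtop (P_jtop,a - P_jtop,b) * ln(P_jtop,a / P_jtop,b)
          p_a = np.clip(np.asarray(p_a, dtype=float), eps, None)
          p_b = np.clip(np.asarray(p_b, dtype=float), eps, None)
          return float(((p_a - p_b) * np.log(p_a / p_b)).mean())

      def som_distance(a, b, g, p_a, p_b):
          # metric part (local placing) plus Kullback-Leibler part (global subject region)
          metric = float((np.asarray(g) * (np.asarray(a) - np.asarray(b)) ** 2).sum())
          return metric + kl_distance(p_a, p_b)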
  • FIG. 6 shows the result of an information map or Kohonen map such as this.
  • the documents in a neuron are thus similar to one another, in terms of their subject content.
  • the neurons are grouped in such a manner that they are located in the global subject region with which they are mainly associated, and neurons linked by subject are located close to one another (see FIG. 6 with the subject regions a, . . . ,k).
  • a search request may, for example, comprise a few search expressions or a text document in a natural language.
  • the search text may, for example, comprise the entire content of a document, in order to search for similar documents in the indexed and/or catalogued document collection.
  • the search text may, however, also for example include only a small portion of the relevant document. For this reason, in some circumstances, the metric distance between the search text and the documents may not be a reliable criterion for finding the documents which are closest to the search text.
  • a more reliable measure for the comparison and the hierarchical assessment is produced by the scalar product of the content vectors. A measure such as this guarantees that the common parts between the search text and the documents are effectively taken into account.
  • a similarity measure between the search text and a document may be defined, for example, by
  • DescrSim comprises the weighted sum of different descriptor pairs, in which case pairs with identical descriptors in the search text and in the searched document can be weighted, for example, with 100 points.
  • Pairs with descriptors which relate to a common hyperonym may, for example, be weighted with 30 points, if the common taxon node is the direct taxon node of the descriptors, with 10 points if the common taxon node is a hierarchy level above this, three points if the common taxon node is two hierarchy levels above this, and one point if the common taxon node is three hierarchy levels above this.
  • with Relev_isyn(idoc) as the relevance value of the descriptors in a document, it is possible, for example, to determine that:
  • the scalar product in the similarity measure as defined above corresponds to the similarity between a neuron (a partial collection of the documents) and the search text.
  • the term DescrSim quantifies the details for the individual documents in a given neuron.
  • the factor “0.01” in the definition of DescrSim may, for example, be determined on an empirical basis. For example, it can be determined in such a manner that the scalar product (coarse positioning) and the individual extensions (DescrSim) contribute in a balanced manner.
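  • the point scheme above can be sketched directly; combining the points with the relevance values as a product is an assumption, since the exact formula for DescrSim is not reproduced in this text:

      def descriptor_points(level):
          # 0: identical descriptors, 1: common direct taxon node,
          # 2/3/4: common taxon node one/two/three hierarchy levels above
          return {0: 100, 1: 30, 2: 10, 3: 3, 4: 1}.get(level, 0)

      def descr_sim(pairs, relev_query, relev_doc):
          # pairs: (search-text descriptor, document descriptor, level as above)
          total = sum(descriptor_points(lvl) * relev_query[q] * relev_doc[d]
                      for q, d, lvl in pairs)
          return 0.01 * total  # empirical factor "0.01" from the text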
  • in order to find the nDoc documents which are closest to a specific search text, the subarea with the neurons having the highest scalar products is searched until the number of selected documents exceeds, for example, the limit value of 3 · nDoc.
  • the selected documents are then sorted on the basis of their similarity values (including the extension DescrSim) in decreasing order.
  • the first nDoc documents form the desired documents in the assessment order.
  • the selection can be made by, for example, using the search index for the individual synsets within a document.
  • the similarity measure defined further above may, for example, extend from 0 to 2.
  • the transformation to a weighting percentage can be achieved, for example, using:
  • Weighting percentage = (Similarity / 2)^(1/3) · 100%
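  • the described search scheme and the percentage transformation as a sketch; the data layout assumed for the neurons is illustrative only:

      def find_closest(neurons, similarity, n_doc):
          # neurons: list of (scalar product to the search text, documents in the neuron)
          candidates = []
          for _, docs in sorted(neurons, key=lambda nrn: nrn[0], reverse=True):
              candidates.extend(docs)           # walk neurons by decreasing scalar product
              if len(candidates) > 3 * n_doc:   # stop above the limit value of 3 * nDoc
                  break
          ranked = sorted(candidates, key=similarity, reverse=True)  # incl. DescrSim
          return ranked[:n_doc]

      def weighting_percentage(similarity_value):
          return (similarity_value / 2.0) ** (1.0 / 3.0) * 100.0  # (Similarity/2)^(1/3) * 100%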
  • the identification of document derivatives means the identification of clusters of documents whose content is virtually identical. By way of example, these may be different copies of the same document with minor changes, such as those which may apply to patent texts in a patent family, whose text and/or scope of protection may vary slightly from one country to another.
  • the apparatus according to the invention and/or the method allow/allows the automated identification of document clusters with virtually identical documents. Furthermore, this makes it possible to suppress older document versions, and may be a tool in order to manage document collections such as these and to keep them up to date (for example by means of a regular clean-up).
  • the similarity measure which is used for comparison and/or weighting of the documents for a search text may, in certain circumstances, not produce satisfactory results for discovering document clusters such as these.
  • the distance between two documents idoc1 and idoc2 with their content vectors a and b is measured by
  • DocDist = Σ_i g_i · (a_i/‖a‖ − b_i/‖b‖)² + DescrDist
  • DescrDist is the weighted sum over the deviating (non-matching) descriptor pairs.
  • non-matching descriptor pairs are weighted with one point if they have one direct common taxon node, with two points if they have one common taxon node in a hierarchy level above this, and five points for the other cases.
  • with Relev_isyn(idoc) as the relevance value of the descriptor within a document, it is possible, for example, to determine that:
  • the factor “0.1” in the definition of DescrDist may, for example, be determined empirically, for example by weighting the metric distance and the descriptor deviations in a balanced manner with respect to one another.
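  • the document distance as a sketch under the formula above; g, the content vectors and DescrDist are assumed to be supplied by the earlier steps:

      import numpy as np

      def doc_dist(a, b, g, descr_dist):
          # DocDist = sum_i g_i * (a_i/|a| - b_i/|b|)^2 + DescrDist
          a, b, g = (np.asarray(v, dtype=float) for v in (a, b, g))
          ua, ub = a / np.linalg.norm(a), b / np.linalg.norm(b)
          return float((g * (ua - ub) ** 2).sum()) + descr_dist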
  • the SOM algorithm with constraints guarantees that the candidates for a specific document cluster are placed in the same neuron. This makes it possible to achieve the clustering for each neuron individually.
  • the distance matrix (a symmetrical matrix with all zero elements on the diagonal) can be determined with DocDist for the documents within one neuron.
  • FIG. 8 shows a scheme for the generation of clusters in one neuron. DocEps corresponds to a tolerance which can be defined for the maximum distance between the members of a cluster.
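  • a sketch of the per-neuron clustering; the text does not fix the linkage rule, so merging any two documents whose DocDist lies within the tolerance DocEps (single linkage) is an assumption:

      def clusters_in_neuron(dist, n_docs, doc_eps):
          cluster = list(range(n_docs))  # start: every document is its own cluster
          for i in range(n_docs):
              for j in range(i + 1, n_docs):
                  if dist[i][j] <= doc_eps:  # near-identical documents
                      old, new = cluster[j], cluster[i]
                      cluster = [new if c == old else c for c in cluster]
          return cluster

      # documents 0 and 1 are near-identical versions, document 2 is unrelated
      print(clusters_in_neuron([[0, 0.02, 0.9], [0.02, 0, 0.8], [0.9, 0.8, 0]], 3, doc_eps=0.05))  # -> [0, 0, 2]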
  • the present invention can be used not only as a language and text analysis apparatus 20 for the formation of a search and/or classification catalog.
  • the possible applications are wide-ranging from any point of view. For example, it is thus possible to automatically identify data within one or more networks 40, 41, 42, such as the Internet, and to associate them with one region.
  • the apparatus can in fact be used, for example, not only to find specific data, but also for automated monitoring and/or control of data flows in networks.
  • the invention can thus also be used for antiterrorism purposes (for example early identification of an act of terror) or for combating other criminality over the Internet (for example malware, pedophilia, etc.).

Abstract

A speech and textual analysis device and method for forming a search and/or classification catalog. The device is based on a linguistic database and includes a taxonomy table containing variable taxon nodes. The speech and textual analysis device includes a weighting module, a weighting parameter being additionally assigned to each taxon node to register the recurrence frequency of terms in the linguistic and/or textual data that is to be classified and/or sorted. The speech and/or textual analysis device includes an integration module for determining a predefinable number of agglomerates based on the weighting parameters of the taxon nodes in the taxonomy table and at least one neural network module for classifying and/or sorting the speech and/or textual data based on the agglomerates in the taxonomy table.

Description

  • The invention relates to a system and a method for automated language and text analysis by the formation of a search and/or classification catalog, with data records being recorded by means of a linguistic databank and with speech and/or text data being classified and/or sorted on the basis of the data records (keywords and/or search terms). The invention relates in particular to a computer program product for carrying out this method.
  • In recent years, the importance of large databanks, in particular databanks linked in a decentralized form, for example by networks such as the world wide backbone network Internet, has increased exponentially. More and more information, goods and/or services are being offered via such databanks or networks. This is evident just from the omnipresence of the Internet nowadays. The availability and amount of such data in particular has now resulted, for example, in Internet tools for searching for and finding relevant documents and/or for classification of documents that have been found having incredible importance. Tools such as these for decentralized databank structures or databanks in general are known. In this context, the expression “search engines” is frequently used in the Internet, such as the known Google™, Alta Vista™ or structured presorted link tables such as Yahoo™.
  • The problem involved in searching for and/or cataloging of text documents in one or more databanks include the following: (1) indexing or cataloging of the content of the documents to be processed (content synthesis), (2) processing of a search request of the indexed and/or catalogued documents (content retrieval). The data to be indexed and/or catalogued normally comprises unstructured documents, such as text, descriptions, links. In more complex databanks, the documents may also multimedia data, such as images, voice/audio data, video data, etc. In the Internet, this may for example be data which can be downloaded from a website by means of links.
  • U.S. Pat. No. 6,714,939 discloses a method and a system such as this for conversion of plain text or text documents to structured data. The system according to the prior art can be used in particular to check for and/or to find data in a databank.
  • Neural networks are known in the prior art and are used, for example, to solve optimization tasks, for pattern recognition, and for artificial intelligence, etc. Corresponding to biological nerve networks, a neural network comprises a large number of network nodes, so-called neurons, which are connected to one another via weighted links (synapses). The neurons are organized and interconnected in network layers. The individual neurons are activated as a function of their input signals and produce a corresponding output signal. A neuron is activated via an individual weighting factor by the summation over the input signals. Neural networks such as these have a learning capability in that the weighting factors are varied systematically as a function of predetermined exemplary input and output values until the neural network produces a desired response in a defined predictable error range, such as the prediction of output values for future input values. Neural networks therefore have adaptive capabilities for learning and storage of knowledge, and associated capabilities for comparison of new information with stored knowledge. The neurons (network nodes) can assume a rest state or an energized state. Each neuron has a plurality of inputs and one, and only one, output, which is connected to the inputs of other neurons in the next network layer or, in the case of an output node, represents a corresponding output value. A neuron changes to the energized state when a sufficient number of inputs to the neuron are energized above a specific threshold value of that neuron, that is to say when the summation over the inputs reaches a specific threshold value. The knowledge is stored by adaptation in the weightings of the inputs of a neuron and in the threshold value of that neuron.
  • The weightings of a neural network are trained by means of a learning process (see for example G. Cybenko, “Approximation by Superpositions of a sigmoidal function”, Math. Control, Sig. Syst., 2, 1989, pp 303-314; M. T. Hagan, M. B. Menjaj, “Training Feedforward Networks with the Marquardt Algorithm”, IEEE Transactions on Neural Networks, Vol. 5, No. 6, pp 989-993, November 1994; K. Hornik, M. Stinchcombe, H. White, “Multilayer Feedforward Networks are universal Approximators”, Neural Networks, 2, 1989, pp 359-366 etc.).
  • In contrast to supervised learning neural nets, no desired output pattern is predetermined for the neural network for the learning process of unsupervised learning neural nets. In this case, the neural network itself attempts to achieve a representation of the input data that is as sensible as possible. So-called topological feature maps (TFM) such as Kohonen maps are known, for example, in the prior art. In the case of topological feature maps, the network attempts to distribute the input data as sensibly as possible over a predetermined number of classes. In this case, it is therefore used as a classifier. Classifiers attempt to subdivide a feature space, that is to say a set of input data, as sensibly as possible into a total of N sub-groups. In most cases, the number of sub-groups or classes is defined in advance. A large number of undefined interpretations can be used for the word “sensible”. By way of example, one normal interpretation for a classifier would be: “form the classes such that the sum of the distances between the feature vectors and the class center points of the classes with which they are associated is as small as possible.” A criterion is thus introduced which is intended to be either minimized or maximized. The object of the classification algorithm is to carry out the classification process for this criterion and the given input data in the shortest possible time.
  • Topological feature maps such as Kohonen maps allow a multi-dimensional feature space to be mapped into one with fewer dimensions, while retaining the most important characteristics. They differ from other classes of neural network in that no explicit or implicit output pattern is predetermined for an input pattern in the learning phase. During the learning phase of topological feature maps, they themselves adapt the characteristics of the feature space being used. The link between a classical classifier and a self-organizing neural network or a topological feature map (TFM) is that the output pattern of a topological feature map generally comprises a single energized neuron. The input pattern is associated with the same class as the energized output neuron. In the case of topological feature maps in which a plurality of neurons in the output layer can be energized, that having the highest energization level is generally simply assessed as the class associated with the input pattern. The continuous model of a classifier in which a feature is associated with specific grades of a class is thus changed to a discrete model.
  • One object of this invention is to propose a novel system and automated method for the formation of a search and/or classification catalog, which does not have the abovementioned disadvantages of the prior art. One particular aim is to propose an automated, simple and economic method in order to classify and/or to sort, and/or to index for a search request, a large amount of language and/or text data which, for example, is stored such that it can be accessed via one or more databanks. The aim of the invention is to produce an indexing method for efficient and reliable thematic searching, that is to say for finding documents which are as similar as possible to a given request, comprising an entire text document or individual keywords. A further aim of the invention is to produce a clearly defined measure for objective assessment of the similarity of two documents when compared, and for the ranking of documents. An additional aim of the invention is to produce a method for identification of associated document clusters, that is to say of documents which are virtually identical (different versions of the same document with minor changes).
  • According to the present invention, this aim is achieved in particular by the elements of the independent claims. Further advantageous embodiments are also disclosed in the dependent claims and in the description.
  • In particular, these aims are achieved by the invention in that a language and text analysis apparatus is used to form a search and/or classification catalog which has at least one linguistic databank for association of linguistic terms with data records, in which case the linguistic terms comprise at least keywords and/or search terms, and language and/or text data can be classified and/or sorted corresponding to the data records, in that the language and text analysis apparatus has a taxonomy table with variable taxon nodes on the basis of the linguistic databank, in which case one or more data records can be associated with one taxon node in the taxonomy table, and in which case each data record has a variable significance factor for weighting of the terms on the basis of at least filling words and/or linking words and/or keywords, in that each taxon node additionally has a weighting parameter for recording of frequencies of occurrence of terms within the language and/or text data to be sorted and/or to be classified, in that the language and/or text analysis apparatus has an integration module for determination of a predefinable number of agglomerates on the basis of the weighting parameters of the taxon nodes in the taxonomy table, with one agglomerate comprising at least one taxon node, and in that the language and/or text analysis apparatus has at least one neural network module for classification and/or for sorting of the language and/or text data on the basis of the agglomerates in the taxonomy table. The linguistic databank may, for example, comprise multilingual data records. This embodiment variant has the advantage, inter alia, that documents, collections or entirely general data items can be grouped logically for example in databanks, in particular decentralized databanks without human intervention (for example no training of a network, no preparation of content-specific taxonomy, etc.). Furthermore, it is simple to produce an overview of the thematic content of a document collection by means of a topological map. This apparatus and automated method can thus be regarded as considerable progress for “table of content” methods. In particular, the invention produces an extremely reliable and efficient tool for thematic searching (identification of documents on the basis of a search input in a natural language) in addition to conventional searching from the prior art by means of a combination of search terms. In particular, search results can very easily be displayed clearly by means of the projection onto the topological and/or geographic map as a so-called “heat map” display, in contrast to the conventional uncategorized list formats. Furthermore, the invention produces a measure which can be checked easily for comparison and/or similarity assessment of documents. The invention likewise produces real multilingual knowledge management with search functions in more than one language. Until now, this has not been possible with the prior art. In conclusion, the invention even allows automated generation of “descriptors”, in which case descriptors reflect the contact characteristic of a document (also with cross-language characteristics). The invention thus produces an indexing method for efficient and reliable thematic searching, that is to say for finding documents which are as similar as possible to a given request, comprising an entire text document or individual keywords. 
The invention likewise produces a clearly defined measure for objective assessment of the similarity of two documents for the comparison and ranking of documents. In addition, the invention produces a method for identification of associated document clusters, that is to say documents which are virtually identical (different versions of the same document with minor changes).
  • In one embodiment variant, the neural network module has at least one or more self-organizing Kohonen maps. This embodiment variant has the same advantages, inter alia, as the previous embodiment variant. In addition, the use of self-organizing network techniques, for example SON or Kohonen maps, allows further automation of the method.
  • In another embodiment variant, the language and text analysis apparatus has an entropy module for determination of an entropy parameter, which can be stored in a memory module, on the basis of the distribution of a data record in the language and/or text data. The entropy parameter may be given, for example, by EntropyDR=In (freqsumDR)−Σ FDR In(FDR)/freqsumDR. This embodiment variant has the advantage, inter alia, that an additional relevance parameter can be determined. A term which appears in a well scattered form over all of the language and/or text data and/or over all the documents has a high “entropy” and will contribute little to the process of distinguishing between the documents. The entropy can thus make a considerable contribution to the efficiency of the apparatus and/or method according to the invention.
  • In yet another embodiment variant, the apparatus has a hash table which is associated with the linguistic databank in which case a hash value can be used to identify linguistically linked data records in the hash table. This embodiment variant has the advantage, inter alia, that linguistically linked data records such as “common”, “sense” and “common sense” can be found much more quickly and considerably more efficiently.
  • In a further embodiment variant, the data records, for example, can be associated with a language by means of a language parameter and can be identified as a synonym in the taxonomy table. This embodiment variant has the advantage, inter alia, that multilingual text or language data can also be classified and/or sorted by means of the language and text analysis apparatus.
  • In one embodiment variant, the agglomerates form an n-dimensional content space and, for example, n may be equal to 100. However, it should be noted that any other desired natural number may likewise be worthwhile for specific applications. This embodiment variant has the advantage, inter alia, that it for the first time allows efficient association with the self-organizing networks since, otherwise, the content space has too many degrees of freedom than that for which it would still be valid, or too few, so that it is likewise no longer valid.
  • In another embodiment variant, the language and text analysis apparatus has descriptors which can be used to determine constraints which correspond to defineable descriptors for a subject group. This embodiment variant has the advantage, inter alia, that the documents are brought to the correct global region by means of the SOM technique.
  • In a further embodiment variant, the taxon nodes in the taxonomy table are produced on the basis of a universal, subject-independent, linguistic databank, with the databank having the universal, subject-independent, linguistic databank. This embodiment variant has the advantage, inter alia, that this makes it possible for the first time to carry out cataloguing and/or indexing in a completely automated manner on the basis of a taxonomy which is not subject-specific and therefore does not need to be defined in advance.
  • At this point, it should be stated that the present invention relates not only to the method according to the invention but also to an apparatus for carrying out this method. Furthermore, it is not restricted to the stated system and method, but likewise relates to a computer program product for implementation of the method according to the invention.
  • Embodiment variants of the present invention will be described in the following text with reference to examples. The examples of the embodiments are illustrated by the following attached figures:
  • FIG. 1 shows a block diagram which schematically illustrates the method and/or system according to the invention.
  • FIG. 2 likewise shows a block diagram, which illustrates the use of an apparatus according to the invention in a network with decentralized databanks and/or data sources for thematic recording and/or cataloging and/or monitoring of the data flows in the network.
  • FIG. 3 shows a block diagram which illustrates the structure of a taxonomy table 21.
  • FIG. 4 shows a block diagram which schematically illustrates the formation of agglomeration clusters in the taxonomy table.
  • FIG. 5 shows a block diagram which schematically illustrates one example of the combination of an agglomeration cluster into subject regions.
  • FIG. 6 shows a block diagram which schematically illustrates an information map or Kohonen map. The documents to be analyzed, that is to say all of the text and/or language data 10 are grouped by means of the SOM technique, with constraints, by the neural network module 26 into a two-dimensional array of neurons (=information map).
  • FIG. 7 shows a flowchart which illustrates method steps for the initial analysis of document collections, as a so-called text mining step.
  • FIG. 8 shows a scheme for the generation of clusters in a neuron. DocEps corresponds to a tolerance which can be defined for the maximum distance between the members of a cluster.
  • FIGS. 1 to 6 schematically illustrate an architecture which can be used for implementation of the invention.
  • In this exemplary embodiment, the language and text analysis apparatus has at least one linguistic databank 22 for association of linguistic terms with data records in order to form a search and/or classification catalog. The linguistic databank 22 may also include, for example, multilingual data records. The data records can be associated with a language and, for example, can be identified as synonyms in the taxonomy table by means of a language parameter. The linguistic databank 22 may, for example, have an associated hash table, in which case linguistically linked data records in the hash table can be identified by means of a hash value. The language and text analysis apparatus can be used to classify and/or sort language and/or text data 10 on the basis of the data records. The linguistic terms comprise at least keywords and/or search terms. It is important to note that the language and/or text data may also include data of an entirely general nature, such as multimedia data, that is to say inter alia digital data such as texts, graphics, images, maps, animations, moving images, video, QuickTime, audio records, programs (software), data accompanying programs and hyperlinks or references to multimedia data. This also includes, for example, MPx (MP3) or MPEGx (MPEG4 or 7) Standards, as are defined by the Moving Picture Experts Group.
  • The language and text analysis apparatus has a taxonomy table 21 with variable taxon nodes. One or more data records can be associated with one taxon node in the taxonomy table 21. Each data record has a variable significance factor for weighting of the terms on the basis at least of filling words and/or linking words and/or keywords. The language and text analysis apparatus has a weighting module 23. In addition, a weighting parameter is stored, associated with each taxon node, in order to record the frequencies of occurrence of terms within the language and/or text data 10 to be classified and/or sorted. The language and/or text analysis apparatus has an integration module 24 in order to determine a predefinable number of agglomerates on the basis of weighting parameters of the taxon nodes in the taxonomy table 21. An agglomerate has at least one taxon node. The agglomerates may, for example, form an n-dimensional content space. As an exemplary embodiment, n may be chosen to be equal to 100, for example. The language and/or text analysis apparatus has at least one neural network module 26 for classification and/or sorting of the language and/or text data 10 on the basis of the agglomerates in the taxonomy table 21. The neural network module 26 may, for example, have at least one topological feature map (TFM), such as a self-organizing Kohonen map. Appropriate constraints, for example for a subject group, can be determined by means of definable descriptors.
  • The language and text analysis apparatus may additionally have, for example, an entropy module 25 in order to determine an entropy parameter, which is stored in a memory module, on the basis of the distribution of a data record in the language and/or text data 10. The entropy module 25 may, for example, be in the form of software and/or hardware. The entropy parameter may, for example, be given by

  • $\mathrm{Entropy}_{DR} = \ln(\mathrm{freqsum}_{DR}) - \sum F_{DR} \ln(F_{DR}) \,/\, \mathrm{freqsum}_{DR}$
  • The results, that is to say the output, may for example be displayed on an output unit 28 for a user, for example additionally via a network 40, 41, 42.
  • For the analysis and search functions, the text or language data to be analyzed, such as a pure text document, can be subdivided into the following components: a) an n-dimensional vector for characterization of the thematic content of the document, for example n may be chosen to be equal to 100; b) m descriptors which are characteristic of a document and represent constraints for optimization. The number of descriptors may, for example, be m=20; c) a set of meta data, which can be automatically extracted from the document, that is to say for example the title of the document, the author, the dates of publication of the document, the location or address of the document, such as a URL (Uniform Resource Locator), the file format, such as PDF (Portable Document Format), Microsoft Word, HTML (Hyper Text Markup Language), HDML (Handheld Device Markup Language), WML (Wireless Markup Language), VRML (Virtual Reality Modeling Language), XML (Extensible Markup Language), JPEG (Joint Photographic Experts Group) etc., MPEG (Moving Picture Experts Group), the number of words and/or terms, the number of integer and/or rational numbers, the language of the majority of the terms in the document, additional rules or characteristics, etc. A sketch of such a document representation is given below.
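  • Purely by way of illustration, this three-part decomposition can be sketched as a simple data structure. The following Python sketch is not part of the patent text; the class and field names are hypothetical and assume the exemplary values n=100 and m=20:

```python
from dataclasses import dataclass, field

@dataclass
class AnalyzedDocument:
    """Hypothetical container for the three components described above."""
    content_vector: list          # a) n-dimensional thematic content vector (e.g. n = 100)
    descriptors: list             # b) the m most characteristic descriptors (e.g. m = 20)
    metadata: dict = field(default_factory=dict)  # c) automatically extracted meta data

doc = AnalyzedDocument(
    content_vector=[0.0] * 100,
    descriptors=["nuclear power plant", "reactor"],
    metadata={"title": "Example", "language": "en", "format": "PDF"},
)
```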
  • The axes of the n-dimensional content space depend on the thematic linking and/or on the inner links between all of the language and/or text data 10 to be analyzed. The axes can sensibly be constructed such that the relevant subject areas of the language and/or text data 10 are reproduced as well as possible, and irrelevant background (noise) is not displayed, or is greatly suppressed. The generation of the axes and the projection are based on the aforementioned linguistic and, for example, multilingual databank 22, which is associated with a universal taxonomy and/or a universal taxonomy tree. Universal means that there is no need to predefine a specific subject region by means of the taxonomy before the cataloging and/or indexing of the text and/or language data 10. Until now, this has not been possible in this way in the prior art.
  • Words, expressions and/or terms which occur in a text document are compared with a large list of words which are stored in the linguistic databank 22. In this context, “terms” is intended to mean linked words such as the expression “nuclear power plant”, “Commission of Human Rights”, or “European Patent Office”. In the exemplary embodiment, 2.2 million entries have been found to be sufficient for a linguistic databank 22 such as this for the languages of English, French, German and Italian, although the databank 22 may, of course, also have a greater or a lesser number of entries, as required. Words/terms with the same meaning (synonyms) can be linked for example in synonym groups (synsets), for example also jointly for all languages. These synsets are then associated with a taxon node in the hierarchical taxonomy table or taxonomy tree. The distribution of the taxon node hits (entries) for specific language and/or text data 10 or for a document to be analyzed is a reliable measure for its subject content.
  • FIG. 3 shows such a structure for a taxonomy table 21. By way of example, entries for each language can be structured as follows:
    Taxon node entries:

    Column  Format  Content
    1       N       Classification code (for example decimal) for the taxon node (universal code for all languages)
    2       T35     Name of the taxon node (hyperonym/generic term)
    3       N1      Hierarchy level in the taxonomy tree
    4       N1.3    Statistical weighting of the node (governed by means of local entropy, specifically for long document collections that are rich in content)

    Expression/synset entries:

    Column  Format  Content
    1       N6      Synset code (the same for all languages)
    2       N2      Sequential number within a synset and a language (0 may, for example, correspond to a major term within the group per language)
    3       N1      Type of expression or term (1 = noun / 2 = verb / 3 = adjective / 4 = adverb, pronouns etc. / 5 = name)
    4       N1      Significance of the expression/word (0 = filling word ["glue" word] / 1 = low significance / 2 = medium / 3 = high / 4 = very high significance)
    5       N       64-bit hash value for faster identification of terms (expressions comprising a plurality of words)
    6       T35     Expression/term
    7       N       Hyperonym code (association with a taxon node in the taxonomy table with a (for example decimal) classification)
    8       N1      Language code (0 = language-independent name / 1 = English / 2 = German / 3 = French / 4 = Italian)
    9       N2      Flag for expressions/terms which appear in more than one synset (synonym group)*
    *The expression “gift” exists in both English and German, but has a completely different meaning in the two languages. In addition, expressions exist with different meanings within the same language. The English word “fly”, for example, is used as “flight” or “trouser fly”. The expression “window” means an opening/window, but “windows” can relate to openings or to an operating system. “Windows XP”, on the other hand, is in turn unique.
  • By way of example, the method steps for an initial analysis of the language and/or text data 10 may appear as follows: (1) input a document, that is to say language and/or text data 10; (2) a first estimate of the document; (3) word processing: i) extraction of the expression/term. ii) comparison with the entries in the linguistic databank taking account of the appropriate language and lexicography rules for correct association. Generation of the synset and hyperonym codes, of the significance and language by means of the databank. iii) generation of new expressions and/or synsets for expressions or terms that have not been found. iv) determination of the frequency of the expression/term per document. v) adaptation of the language, if necessary; (4) associated storage of the information; (5) next document or next language and/or text data 10.
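  • A minimal sketch of this initial analysis loop may, for example, look as follows. The in-memory dictionary standing in for the linguistic databank 22 is hypothetical, and steps (2), (iii) and (v) are omitted for brevity:

```python
import re
from collections import Counter

# Hypothetical stand-in for the linguistic databank 22:
# term -> (synset code, hyperonym code, significance 0..4, language).
LINGUISTIC_DB = {
    "nuclear power plant": (1001, 501, 4, "en"),
    "reactor":             (1002, 501, 3, "en"),
    "the":                 (9000, 999, 0, "en"),  # filling ("glue") word
}

def initial_analysis(text):
    """Steps (3)(i), (ii) and (iv): extract expressions, look them up in the
    databank (multi-word terms first) and count the frequency per synset."""
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    freq = Counter()
    i = 0
    while i < len(tokens):
        for span in (3, 2, 1):                      # greedy multi-word matching
            term = " ".join(tokens[i:i + span])
            if term in LINGUISTIC_DB:
                freq[LINGUISTIC_DB[term][0]] += 1   # count per synset code
                i += span
                break
        else:
            i += 1   # step (iii) would generate a new expression/synset here
    return freq

print(initial_analysis("The nuclear power plant uses the reactor."))
```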
  • In order to determine the entropy and a relevance index per synset (synonym group), the frequency is calculated for each synset (isyn) and each language and/or text data item 10 or document (idoc) on the following basis:
  • $F_{isyn}(idoc) = \mathrm{norm}(idoc) \cdot \sum_{word \in isyn} f_{word} \cdot \mathrm{sig}_{word}$,
  • where $f_{word}$ = frequency of the word in idoc and $\mathrm{sig}_{word}$ = significance of the word on the basis of the linguistic databank $(0, \ldots, 4)$, and
  • $\mathrm{norm}(idoc) = \min(\text{weighted number of expressions in } idoc,\ 500) \,/\, (\text{weighted number of expressions in } idoc)$,
  • where the weighting is given by $\mathrm{sig}_{word}$.
  • The factor norm (idoc) may be introduced, for example, in order to prevent very large documents from having a dominant effect on a specific data link. By way of example, the factor may be determined empirically.
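  • By way of illustration, the calculation of $F_{isyn}(idoc)$ including the norm factor can be sketched as follows; the function and dictionary names are hypothetical:

```python
def synset_frequencies(doc_word_freqs, word_synset, word_sig):
    """F_isyn(idoc) = norm(idoc) * sum of f_word * sig_word over the words of a synset.

    doc_word_freqs: {word: frequency in the document idoc}
    word_synset:    {word: synset code}        (from the linguistic databank)
    word_sig:       {word: significance 0..4}  (from the linguistic databank)
    """
    # Weighted number of expressions in idoc, the weight being sig_word.
    total = sum(f * word_sig[w] for w, f in doc_word_freqs.items())
    norm = min(total, 500) / max(total, 1)   # caps the influence of very large documents
    freqs = {}
    for w, f in doc_word_freqs.items():
        isyn = word_synset[w]
        freqs[isyn] = freqs.get(isyn, 0.0) + norm * f * word_sig[w]
    return freqs

F = synset_frequencies(
    {"reactor": 3, "plant": 2, "the": 10},
    {"reactor": 1002, "plant": 1001, "the": 9000},
    {"reactor": 3, "plant": 2, "the": 0},
)
print(F)   # {1002: 9.0, 1001: 4.0, 9000: 0.0}
```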
  • The information-theory entropy of a synset isyn can thus be defined by:
  • $\mathrm{Entropy}_{isyn} = \log(\mathrm{freqsum}_{isyn}) - \sum_{idoc,\,F_{isyn} \neq 0} F_{isyn} \cdot \log(F_{isyn}) \,/\, \mathrm{freqsum}_{isyn}$, where $\mathrm{freqsum}_{isyn} = \sum_{idoc,\,F_{isyn} \neq 0} F_{isyn}$
  • A synset which appears well scattered over all of the language and/or text data items 10 and/or over all of the documents has a high “entropy” and will contribute little to distinguishing between the documents. For example, if documents/articles in a databank from the Neue Zürcher Zeitung [New Zurich Newspaper] are to be analyzed, it is obvious that “Neue Zürcher Zeitung” will appear in all of or a large number of the articles without having any distinguishing power for the content of the documents. For example, the term “Relevance Index” $RI_{isyn}$ can be defined as a measure for the general relevance of a synset isyn by:

  • $RI_{isyn} = \mathrm{freqsum}_{isyn} \,/\, \mathrm{Entropy}_{isyn}$
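  • For illustration, the entropy and the relevance index can be computed, for example, as follows (hypothetical function names; natural logarithms are assumed):

```python
import math

def synset_entropy(freqs_per_doc):
    """Entropy_isyn = ln(freqsum) - sum(F * ln(F)) / freqsum over documents with F != 0."""
    nonzero = [f for f in freqs_per_doc if f > 0]
    freqsum = sum(nonzero)
    return math.log(freqsum) - sum(f * math.log(f) for f in nonzero) / freqsum

def relevance_index(freqs_per_doc):
    """RI_isyn = freqsum_isyn / Entropy_isyn (undefined for a synset that occurs
    in a single document only, whose entropy is 0)."""
    return sum(f for f in freqs_per_doc if f > 0) / synset_entropy(freqs_per_doc)

print(synset_entropy([5.0, 5.0, 5.0, 5.0]))   # evenly spread -> high entropy (~1.39)
print(synset_entropy([19.0, 1.0, 0.0, 0.0]))  # concentrated  -> low entropy  (~0.20)
print(relevance_index([19.0, 1.0, 0.0, 0.0])) # -> high relevance (~101)
```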
  • In order to define the axes of the n-dimensional content space (in this exemplary embodiment, n was chosen to be equal to 100), the relevance of a hyperonym (taxon node in the taxonomy table 21) is determined by summing the relevance indices over all of the text and/or language data items 10 to be analyzed. This relevance is a measure of the total hit frequency of a taxon node over all of the text and/or language data 10. This measure indicates which subject region and/or subject regions is or are predominant in a document collection. Theoretically, each taxon node could be associated with one axis in the content space. However, this would result in a content space with a dimension of more than 4000, which would correspond to an enormous overhead and, furthermore, far too many degrees of freedom for content determination.
  • For this reason, the taxon nodes may, for example, be clustered, for example into n−2 (for example n−2=98) different clusters, by means of the condition that the accumulated relevance of the “mother node” of a cluster of taxon nodes and all its sub-nodes corresponds to at least one predefinable threshold value (for example 0.5%) of the total relevance. The cluster is formed at the lowest possible level of the taxonomy tree or taxonomy table. This method can be compared, for example, with the formation of agglomerates in demography. Each cluster (with all of the corresponding synsets which refer to it) is associated with one axis in the n-dimensional content space. Axis n−1 is used, for example, for synsets which do not refer to one of the agglomeration clusters, and axis n is reserved for numbers. FIG. 4 shows, schematically, the formation of agglomeration clusters such as these in the taxonomy table.
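  • The agglomeration step can be illustrated, for example, by the following simplified sketch. It is not the exact procedure of the embodiment: here, a node becomes a cluster root if its accumulated relevance reaches the threshold while none of its children does so on its own, and non-qualifying subtrees would simply fall to the residual axis n−1 instead of being merged upwards; all names and relevance values are hypothetical:

```python
# Hypothetical taxonomy: node -> (mother node, own relevance from the relevance indices).
TAXONOMY = {
    "science":  (None,      0.0),
    "physics":  ("science", 2.0),
    "nuclear":  ("physics", 30.0),
    "optics":   ("physics", 1.0),
    "biology":  ("science", 0.5),
    "genetics": ("biology", 0.2),
}

def cluster_roots(taxonomy, threshold_pct=0.5):
    """Form agglomeration clusters at the lowest possible level of the taxonomy tree."""
    children = {}
    for node, (parent, _) in taxonomy.items():
        if parent is not None:
            children.setdefault(parent, []).append(node)

    def accumulated(node):   # relevance of the node plus all of its sub-nodes
        return taxonomy[node][1] + sum(accumulated(c) for c in children.get(node, []))

    threshold = sum(rel for _, rel in taxonomy.values()) * threshold_pct / 100.0
    return [node for node in taxonomy
            if accumulated(node) >= threshold
            and all(accumulated(c) < threshold for c in children.get(node, []))]

print(cluster_roots(TAXONOMY, threshold_pct=5.0))   # -> ['nuclear']
```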
  • Finally, for example, ntop subject regions are formed, each of which is composed of a specific sub-group of agglomeration clusters (ntop may, for example, be in the order of magnitude of 10 to 20). The agglomerates are formed in such a way that the taxon nodes of the agglomeration clusters which are associated with the same subject region (topic) have a common mother node in the hierarchy of the taxonomy table. The transformation rule which results from this may, for example, be as follows: each synset refers to one of the selected agglomeration clusters, corresponding to one axis in the content space, or to axis n−1. A large number of synsets in turn refer to one of the ntop subject regions at a higher aggregation level. FIG. 5 shows one example of the combination of agglomeration clusters into subject regions.
  • For projection of the documents to be analyzed, that is to say the language and/or text data 10, onto the n-dimensional content space, the vector component ci for the i-th axis of the content space can be defined for each document idoc by:
  • $c_i = \log(1 + w_i)$, where $w_i = \sum_{\forall\,\mathrm{Synsets} \to \mathrm{Axis}\,i} F_{isyn}(idoc)$
  • and $F_{isyn}(idoc)$ is given by the formula above.
  • The unit (metric) for the n-dimensional space is determined by means of the overall entropy of all of the synsets which refer to one axis $i$ ($\forall\,\mathrm{Synsets} \to \mathrm{Axis}\,i$), in which case the overall entropy can be determined in a manner analogous to the entropy of the synsets as defined above. The weights $g_i$ for the i-th component can then be determined, for example, by:

  • $g_i = 1 \,/\, (\text{overall entropy of the } i\text{-th component})$
  • This definition results, for example, in components with a low entropy (that is to say with a low degree of distribution (high discrimination effect)) having a correspondingly high weight.
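  • A sketch of the projection and of the weighting may, for example, look as follows; it assumes that every synset has already been mapped to one of the n axes, and all names are hypothetical:

```python
import math

def project(synset_freqs, synset_axis, n=100):
    """c_i = log(1 + w_i), where w_i sums F_isyn(idoc) over the synsets of axis i;
    synsets outside every agglomeration cluster fall to the residual axis n-1."""
    w = [0.0] * n
    for isyn, f in synset_freqs.items():
        w[synset_axis.get(isyn, n - 2)] += f   # index n-2 = axis "n-1" in 1-based counting
    return [math.log(1.0 + wi) for wi in w]

def axis_weights(axis_entropies):
    """g_i = 1 / (overall entropy of the i-th component); assumes positive entropies,
    so that low-entropy (highly discriminating) axes receive a high weight."""
    return [1.0 / e for e in axis_entropies]

c = project({1001: 4.0, 1002: 9.0}, {1001: 0, 1002: 0}, n=100)
print(round(c[0], 2))   # log(1 + 13) ~ 2.64
```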
  • A synset relevance value $\mathrm{Relev}_{isyn}$ is determined for the choice of the m most typical descriptors of a document, that is to say of specific language and/or text data 10, for each synset isyn in the document idoc, for example by:

  • $\mathrm{Relev}_{isyn}(idoc) = \left( \ln(1 + F_{isyn}(idoc)) \,/\, \ln(1 + \mathrm{freqsum}_{isyn}) \right) \,/\, \mathrm{Entropy}_{isyn}$
  • The m synsets with the highest relevance value $\mathrm{Relev}_{isyn}$ may be selected, for example, as the m descriptors which are most typical for a document idoc. These descriptors, which can, for example, be stored in association with their corresponding hyperonyms, are used for cataloging and/or indexing. They capture the most important characteristics of a document even in situations in which the projection onto the content space reproduces the content of a specific document in a non-optimal manner.
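  • The selection of the m most typical descriptors can be sketched, for example, as follows (hypothetical names; freqsum and entropy are assumed to have been computed over the whole collection as described above):

```python
import math

def top_descriptors(synset_freqs, freqsum, entropy, m=20):
    """Select the m synsets of the document idoc with the highest relevance value
    Relev_isyn(idoc) = (ln(1 + F_isyn(idoc)) / ln(1 + freqsum_isyn)) / Entropy_isyn."""
    relev = {isyn: (math.log(1 + f) / math.log(1 + freqsum[isyn])) / entropy[isyn]
             for isyn, f in synset_freqs.items()}
    return sorted(relev, key=relev.get, reverse=True)[:m]

print(top_descriptors({1001: 4.0, 1002: 9.0},
                      freqsum={1001: 40.0, 1002: 10.0},
                      entropy={1001: 2.0, 1002: 0.3},
                      m=1))   # -> [1002]: frequent here, discriminating overall
```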
  • For automated cataloging and/or indexing, the statistical and/or linguistic analysis method described above is combined with one or more neural network modules 26. This statistical and/or linguistic analysis method is used, as described, to produce a comprehensive universal taxonomy table 21 for identification of the subject content. In order now to provide an overview of all of the text and/or language data 10, that is to say of all of the documents idoc to be analyzed, and in order on the other hand to generate a function for the similarity comparison, the results of the linguistic analysis are combined with neural network technologies. It has been found that so-called self-organizing map (SOM) techniques, for example Kohonen maps, can be very highly suitable. However, it is obvious to a person skilled in the art that other neural network techniques may also be worthwhile or more suitable for specific applications, without restricting the scope of protection of the patent in any way.
  • The SOM technique can be applied to the projection method described above for the text and/or language data 10 to be analyzed, that is to say to the projection of the documents idoc onto the n-dimensional content space (for example n=100). Before the neural network iterations are started by means of the neural network module 26 (unsupervised learning), it is possible, for example, to use a rough compensation method for the group, in order to obtain a reliable initial estimate for the SOM technique. This method can considerably speed up the iteration process and can minimize the risk of the SOM technique ending in a non-optimum configuration (for example a local minimum). The distance between two vectors (documents idoc) a and b can be determined for the SOM algorithm, for example, by:
  • $\mathrm{Distance} = \sum_i g_i \left( \frac{a_i}{|a|} - \frac{b_i}{|b|} \right)^2 + KL_{a,b}$
  • where $KL_{a,b}$ is the Kullback-Leibler distance between the two documents, in the sense that the assignment of a document idoc with content vector c to a subject region jtop is measured using
  • $h_{jtop,c} = \mathrm{VectorPart} + \mathrm{DescriptorPart} + \mathrm{ErrMS}$, where $\mathrm{VectorPart} = \sum_{\forall\,\mathrm{Components} \to jtop} g_i \left( c_i / |c| \right)^2$
  • and $\forall\,\mathrm{Components} \to jtop$ corresponds to all of the components which refer to jtop.
  • $\mathrm{DescriptorPart} = \sum_{\forall\,\mathrm{Descriptors} \to jtop} \left( \mathrm{Relev}_{isyn}(idoc) \right)^2$
  • where, once again, $\forall\,\mathrm{Descriptors} \to jtop$ corresponds to all of the descriptors which refer to jtop. $\mathrm{ErrMS}$ is the estimate of the mean square error (discrepancy), where, for example, $\mathrm{ErrMS} \geq 10^{-5}$. The normalized dimensions
  • $P_{jtop,c} = h_{jtop,c} \Big/ \sum_{itop} h_{itop,c}$
  • may, for example, be interpreted as probabilities of the document idoc belonging to a specific subject region jtop. The Kullback-Leibler distance between two documents idoc and kdoc with the content vectors a and b is given, for example, by
  • $KL_{a,b} = \frac{1}{ntop} \sum_{jtop} \left( P_{jtop,a} - P_{jtop,b} \right) \ln\!\left( P_{jtop,a} / P_{jtop,b} \right)$
  • The Kullback-Leibler part in the total distance ensures that the documents are moved to the correct global region by the SOM technique; it thus acts as a constraint on the SOM technique. The metric part in the total distance, in contrast, is responsible for local placement in the individual neurons within a subject region. The SOM technique with constraints is used by the neural network module 26 to group the documents to be analyzed, that is to say all of the text and/or language data 10, in a two-dimensional array of neurons (=information map). FIG. 6 shows the result of an information map or Kohonen map such as this. The documents in a neuron are thus similar to one another in terms of their subject content. The neurons are grouped in such a manner that they are located in the global subject region with which they are mainly associated, and neurons linked by subject are located close to one another (see FIG. 6 with the subject regions a, . . . ,k).
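  • By way of illustration, the constrained distance can be computed, for example, as follows; the sketch assumes strictly positive subject-region probabilities computed as above, and all names are hypothetical:

```python
import math

def kl_part(p_a, p_b):
    """KL_{a,b}: symmetrized Kullback-Leibler term over the ntop subject regions."""
    return sum((x - y) * math.log(x / y) for x, y in zip(p_a, p_b)) / len(p_a)

def som_distance(a, b, g, p_a, p_b):
    """Metric part (local placement) plus Kullback-Leibler part (global constraint)."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    metric = sum(gi * (ai / na - bi / nb) ** 2 for gi, ai, bi in zip(g, a, b))
    return metric + kl_part(p_a, p_b)

d = som_distance([1.0, 0.0], [0.8, 0.6], g=[1.0, 1.0],
                 p_a=[0.9, 0.1], p_b=[0.7, 0.3])
print(round(d, 3))   # 0.4 (metric) + ~0.135 (KL) -> ~0.535
```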
  • In the comparison and assessment method, a search request may, for example, comprise a pair of search expressions or a text document in a natural language. The search text may, for example, comprise the entire content of a document, in order to search for similar documents in the indexed and/or catalogued document collection. The search text may, however, also include, for example, only a small portion of the relevant document. For this reason, in some circumstances, the metric distance between the search text and the documents may not be a reliable criterion for finding the documents which are closest to the search text. A more reliable measure for the comparison and the hierarchical assessment is produced by the scalar product of the content vectors. A measure such as this guarantees that the common parts between the search text and the documents are effectively taken into account. A similarity measure between the search text and a document may be defined, for example, by
  • $\mathrm{Similarity} = \frac{\sum_i q_i \cdot c_i}{|q| \cdot |c|} + \mathrm{DescrSim}$
  • where q is the content vector of the search text, c is the content vector of the neuron in which the document is placed, and DescrSim is the measure of the similarity between the m descriptors of the search text and the document (for example m=20), as described further below. The term DescrSim comprises the weighted sum of the different descriptor pairs, in which case pairs with identical descriptors in the search text and in the searched document can be weighted, for example, with 100 points. Pairs with descriptors which relate to a common hyperonym (taxon node in the taxonomy table) may, for example, be weighted with 30 points if the common taxon node is the direct taxon node of the descriptors, with 10 points if the common taxon node is one hierarchy level above this, three points if the common taxon node is two hierarchy levels above this, and one point if the common taxon node is three hierarchy levels above this. With $\mathrm{Relev}_{isyn}()$ as the relevance value of the descriptors in a document, it is possible, for example, to determine that:
  • $\mathrm{DescrSim} = \frac{0.01}{S_{norm}} \sum_{\mathrm{Pairs}} (\text{weighting for pair } isyn1/isyn2) \cdot \mathrm{weight}_{isyn1,isyn2}$, where $\mathrm{weight}_{isyn1,isyn2} = \mathrm{Relev}_{isyn1}(\text{search text}) \cdot \mathrm{Relev}_{isyn2}(\text{document})$ and $S_{norm} = (m / m_1) \cdot \sum_{\mathrm{Pairs}} \mathrm{weight}_{isyn1,isyn2}$
  • where $m_1$ = number of matching pairs ($m_1 < m$). The scalar product in the similarity measure as defined above corresponds to the similarity between a neuron (a partial collection of the documents) and the search text. The term DescrSim quantifies the details for the individual documents in a given neuron. The factor “0.01” in the definition of DescrSim may, for example, be determined on an empirical basis. For example, it can be determined in such a manner that the scalar product (coarse positioning) and the individual refinements (DescrSim) contribute in a balanced form.
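  • For illustration, the similarity measure with the descriptor extension can be sketched, for example, as follows; the point scheme follows the description above, and the function names are hypothetical:

```python
import math

def similarity(q, c, descr_sim):
    """Similarity = (q . c) / (|q| |c|) + DescrSim: the scalar product compares the
    search text with a neuron, DescrSim refines the ranking of individual documents."""
    dot = sum(qi * ci for qi, ci in zip(q, c))
    nq = math.sqrt(sum(x * x for x in q))
    nc = math.sqrt(sum(x * x for x in c))
    return dot / (nq * nc) + descr_sim

def pair_points(levels_up):
    """Points for one descriptor pair: 'identical' descriptors score 100; a common
    taxon node scores 30 / 10 / 3 / 1 depending on how many hierarchy levels above
    the descriptors it sits (0 = direct common node); anything else scores 0."""
    return {"identical": 100, 0: 30, 1: 10, 2: 3, 3: 1}.get(levels_up, 0)

print(round(similarity([1.0, 2.0], [2.0, 4.0], descr_sim=0.15), 2))  # -> 1.15
print(pair_points(1))   # common taxon node one level above -> 10 points
```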
  • With the similarity measure as defined above, the comparison and assessment method is straightforward. For example, the nDoc documents which are closest to a specific search text are to be found. First of all, the subarea with the neurons with the highest scalar products is searched until the number of selected documents exceeds, for example, the limit value of 3·nDoc. The selected documents are then sorted on the basis of their similarity values (including the extension DescrSim) in decreasing order. The first nDoc documents form the desired documents in the assessment order. In the situation in which the subject search does not produce any meaningful result, that is to say, for example, if the search request is composed of only a few words which do not contribute any distinguishing content, the selection can be made, for example, by using the search index for the individual synsets within a document. The similarity measure defined further above may, for example, extend from 0 to 2. The transformation to a weighting percentage can be achieved, for example, using:
  • $\text{Weighting percentage} = \left( \frac{\mathrm{Similarity}}{2} \right)^{1/3} \cdot 100\%$
  • The identification of document derivatives means the identification of clusters of documents whose content is virtually identical. By way of example, these may be different copies of the same document with minor changes, such as those which may apply to patent texts in a patent family, whose text and/or scope of protection may vary slightly from one country to another. The apparatus according to the invention and/or the method allow/allows the automated identification of document clusters with virtually identical documents. Furthermore, this makes it possible to suppress older document versions, and may be a tool in order to manage document collections such as these and to keep them up to date (for example by means of a regular clean-up).
  • In the case of cluster identification, the similarity measure which is used for comparison and/or weighting of the documents for a search text may, in certain circumstances, not produce satisfactory results for discovering document clusters such as these. For document clustering, the distance between two documents idoc1 and idoc2 with their content vectors a and b is therefore measured by
  • $\mathrm{DocDist} = \sum_i g_i \left( \frac{a_i}{|a|} - \frac{b_i}{|b|} \right)^2 + \mathrm{DescrDist}$
  • where DescrDist is the weighted sum of the derivative of the descriptors. In this case, for example, it is possible to determine that matching descriptor pairs from two sets of m descriptors (for example m=20) contribute nothing, while non-matching descriptor pairs are weighted with one point if they have one direct common taxon node, with two points if they have one common taxon node in a hierarchy level above this, and five points for the other cases. Using Relevisyn( ) as the relevance value of the descriptor within a document, it is possible, for example, to determine that:
  • $\mathrm{DescrDist} = \frac{0.1}{D_{norm}} \sum_{\mathrm{Pairs}} (\text{result for pair } isyn1/isyn2) \cdot \mathrm{Relev}_{isyn1}(idoc1) \cdot \mathrm{Relev}_{isyn2}(idoc2)$, where $D_{norm} = \sum_{\mathrm{Pairs}} \mathrm{Relev}_{isyn1}(idoc1) \cdot \mathrm{Relev}_{isyn2}(idoc2)$
  • The factor “0.1” in the definition of DescrDist may, for example, be determined empirically, for example by weighting the metric distance and the derivatives of the descriptors in a balanced manner with respect to one another.
  • The SOM algorithm with constraints guarantees that the candidates for a specific document cluster are placed in the same neuron. This makes it possible to achieve the clustering for each neuron individually. For example, as described above, the distance matrix can be determined (symmetrical matrix with all zero elements on the diagonal) with DocDist for the documents within one neuron. FIG. 8 shows a scheme for the generation of clusters in one neuron. DocEps corresponds to a tolerance which can be defined for the maximum distance between the members of a cluster.
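  • The per-neuron cluster generation can be sketched, for example, by the following greedy single-linkage pass over the distance matrix; the names are hypothetical:

```python
def cluster_neuron(distance_matrix, doc_eps):
    """Group the documents of one neuron: two documents belong to the same cluster
    if DocDist <= DocEps (cf. FIG. 8); implemented as a small union-find."""
    n = len(distance_matrix)
    cluster_of = list(range(n))            # initially every document is its own cluster

    def find(i):
        while cluster_of[i] != i:
            i = cluster_of[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if distance_matrix[i][j] <= doc_eps:
                cluster_of[find(j)] = find(i)   # merge the two clusters
    return [find(i) for i in range(n)]

# Documents 0 and 1 are near-duplicates (e.g. two versions of the same patent text).
D = [[0.00, 0.02, 0.90],
     [0.02, 0.00, 0.80],
     [0.90, 0.80, 0.00]]
print(cluster_neuron(D, doc_eps=0.05))   # -> [0, 0, 2]
```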
  • It should be noted that the present invention can be used not only as a language and text analysis apparatus 20 for the formation of a search and/or classification catalog. The possible applications are wide-ranging. For example, it is thus possible to automatically identify data within one or more networks 40, 41, 42, such as the Internet, and to associate them with a subject region. Until now, this has not been possible in the prior art, since the use of a universal taxonomy table was not possible in conjunction with automated cataloging and/or indexing. The communication networks 40, 41, 42 are in the form, for example, of a GSM network or a UMTS network, or a satellite-based mobile radio network, and/or one or more landline networks, for example the public switched telephone network, the worldwide Internet or a suitable LAN (Local Area Network) or WAN (Wide Area Network). In particular, this also includes ISDN and xDSL links. Users may, for example, access the one or more networks 40, 41, 42 by means of any network-compatible terminals 30, 31, 32, 33, such as any CPE (Customer Premise Equipment), personal computers 30, laptops 31, PDAs 32, mobile radio devices 33, etc. The apparatus can in fact be used, for example, not only to find specific data, but also for automated monitoring and/or control of data flows in networks. For example, the invention can thus also be used for antiterrorism purposes (for example early identification of an act of terror) or for combating other criminality over the Internet (for example racism, pedophilia, etc.).

Claims (23)

1-22. (canceled)
23. A language and text analysis apparatus for formation of a search and/or classification catalog, comprising:
at least one linguistic databank for association of linguistic terms with data records, in which case the language and text analysis apparatus can be used to classify and/or to sort language and/or text data corresponding to the data records, and in which the linguistic terms comprise at least keywords and/or search terms,
wherein:
the language and text analysis apparatus includes a taxonomy table with variable taxon nodes on the basis of the linguistic databank, in which case one or more data records can be associated with one taxon node in the taxonomy table, and in which case each data record includes a variable significance factor for weighting of terms on the basis of at least filling words and/or linking words and/or keywords,
the language and text analysis apparatus includes a weighting module, in which a weighting parameter for recording of frequencies of occurrence of terms within the language and/or text data to be sorted and/or to be classified is additionally stored associated with each taxon node,
the language and/or text analysis apparatus includes an integration module for determination of a predefinable number of agglomerates on the basis of the weighting parameters of the taxon nodes in the taxonomy table, with one agglomerate including at least one taxon node, and
the language and/or text analysis apparatus includes at least one neural network module for classification and/or for sorting of the language and/or text data on the basis of the agglomerates in the taxonomy table.
24. The language and text analysis apparatus as claimed in claim 23, wherein the neural network module includes at least one self-organizing Kohonen map.
25. The language and text analysis apparatus as claimed in claim 23, wherein the language and text analysis apparatus includes an entropy module for determination of an entropy parameter, which can be stored in a memory module, on the basis of distribution of a data record in the language and/or text data.
26. The language and text analysis apparatus as claimed in claim 23, wherein the linguistic databank includes multilingual data records.
27. The language and text analysis apparatus as claimed in claim 23, wherein the language and text analysis apparatus includes a hash table that is associated with the linguistic databank in which case a hash value can be used to identify linguistically linked data records in the hash table.
28. The language and text analysis apparatus as claimed in claim 23, wherein a language parameter can be used to associate the data records with a language and can be identified as a synonym in the taxonomy table.
29. The language and text analysis apparatus as claimed in claim 23, wherein the entropy parameter is given by:

$\mathrm{Entropy}_{DR} = \ln(\mathrm{freqsum}_{DR}) - \sum F_{DR} \ln(F_{DR}) \,/\, \mathrm{freqsum}_{DR}$
in which $\mathrm{freqsum}_{isyn} = \sum F_{isyn}$ and $F_{isyn}$ is the frequency for each synonym group and each item of language and/or text data.
30. The language and text analysis apparatus as claimed in claim 23, wherein the agglomerates form an n-dimensional content space.
31. The language and text analysis apparatus as claimed in claim 30, wherein n is equal to 100.
32. The language and text analysis apparatus as claimed in claim 23, wherein the language and text analysis apparatus includes descriptors by which constraints that correspond to definable descriptors can be determined for a subject group.
33. An automated language and text analysis method for formation of a search and/or classification catalog, with a linguistic databank being used to record data records and to classify and/or sort language and/or text data on the basis of the data records, wherein:
the data records in the linguistic databank are stored associated with a taxon node in a taxonomy table, with each data record including a variable significance factor for weighting of terms based at least on filling words and/or linking words and/or keywords,
the language and/or text data is recorded on the basis of the taxonomy table, with the frequency of individual data records in the language and/or text data being determined by a weighting module and being associated with a weighting parameter for the taxon node,
an integration module is used to determine a determinable number of agglomerates in the taxonomy table on the basis of the weighting parameters of one or more taxon nodes,
a neural network module is used to classify and/or sort the language and/or text data on the basis of the agglomerates in the taxonomy table.
34. The automated language and text analysis method as claimed in claim 33, wherein the neural network module includes at least one self-organizing Kohonen map.
35. The automated language and text analysis method as claimed in claim 33, wherein an entropy module is used to determine an entropy factor on the basis of distribution of a data record in the language and/or text data.
36. The automated language and text analysis method as claimed in claim 33, wherein the linguistic databank includes multilingual data records.
37. The automated language and text analysis method as claimed in claim 33, wherein a hash table is stored associated with the linguistic databank, with the hash table including an identification of linked data records by a hash value.
38. The automated language and text analysis method as claimed in claim 33, wherein the data records can be associated with a language and can be weighted synonymously in the taxonomy table by a language parameter.
39. The automated language and text analysis method as claimed in claim 33, wherein the entropy factor is given by the term

$\mathrm{Entropy}_{DR} = \ln(\mathrm{freqsum}_{DR}) - \sum F_{DR} \ln(F_{DR}) \,/\, \mathrm{freqsum}_{DR}$
in which $\mathrm{freqsum}_{isyn} = \sum F_{isyn}$ and $F_{isyn}$ is the frequency for each synonym group and each item of language and/or text data.
40. The automated language and text analysis method as claimed in claim 33, wherein the agglomerates form an n-dimensional content space.
41. The automated language and text analysis method as claimed in claim 40, wherein n is equal to 100.
42. The automated language and text analysis method as claimed in claim 33, wherein definable descriptors can be used to determine corresponding constraints for a subject group.
43. A computer program product on a computer-readable medium with computer program code means contained therein to control one or more processors in a computer-based system for automated language and text analysis by formation of a search and/or classification catalog, with data records being recorded on the basis of a linguistic databank, and with language and/or text data being classified and/or sorted on the basis of the data records,
wherein:
the computer program product can be used to store the data records in the linguistic databank associated with a taxon node in a taxonomy table, with each data record including a variable significance factor for weighting of terms on the basis at least of filling words and/or linking words and/or keywords,
the computer program product can be used to record the language and/or text data on the basis of the taxonomy table, with frequency of individual data records in the language and/or text data determining a weighting parameter for the taxon nodes,
the computer program product can be used to determine a determinable number of agglomerates in the taxonomy table on the basis of the weighting parameter of one or more taxon nodes,
the computer program product can be used to generate a neural network, which can be used to classify and/or sort the language and/or text data on the basis of the agglomerates in the taxonomy table.
44. A computer program product which can be loaded in an internal memory of a digital computer and includes software code sections by which the operation as claimed in claim 33 can be carried out when the product is run on a computer, in which case the neural networks can be generated with software and/or hardware.
US11/659,955 2004-08-13 2004-08-13 Speech and Textual Analysis Device and Corresponding Method Abandoned US20080215313A1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2004/051798 WO2006018041A1 (en) 2004-08-13 2004-08-13 Speech and textual analysis device and corresponding method

Publications (1)

Publication Number Publication Date
US20080215313A1 true US20080215313A1 (en) 2008-09-04

Family

ID=34958240

Family Applications (2)

Application Number Title Priority Date Filing Date
US11/659,955 Abandoned US20080215313A1 (en) 2004-08-13 2004-08-13 Speech and Textual Analysis Device and Corresponding Method
US11/659,962 Active 2029-08-07 US8428935B2 (en) 2004-08-13 2005-08-09 Neural network for classifying speech and textural data based on agglomerates in a taxonomy table

Family Applications After (1)

Application Number Title Priority Date Filing Date
US11/659,962 Active 2029-08-07 US8428935B2 (en) 2004-08-13 2005-08-09 Neural network for classifying speech and textural data based on agglomerates in a taxonomy table

Country Status (6)

Country Link
US (2) US20080215313A1 (en)
EP (2) EP1779263A1 (en)
AT (1) ATE382903T1 (en)
DE (1) DE502005002442D1 (en)
ES (1) ES2300046T3 (en)
WO (2) WO2006018041A1 (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GR20050100216A (el) * 2005-05-04 2006-12-18 i-sieve Method for probabilistic information fusion to filter multi-lingual, semi-structured and multimedia electronic content.
JP4803709B2 (en) * 2005-07-12 2011-10-26 独立行政法人情報通信研究機構 Word usage difference information acquisition program and apparatus
US8340957B2 (en) * 2006-08-31 2012-12-25 Waggener Edstrom Worldwide, Inc. Media content assessment and control systems
US20090319505A1 (en) * 2008-06-19 2009-12-24 Microsoft Corporation Techniques for extracting authorship dates of documents
US8560298B2 (en) * 2008-10-21 2013-10-15 Microsoft Corporation Named entity transliteration using comparable CORPRA
US20100153366A1 (en) * 2008-12-15 2010-06-17 Motorola, Inc. Assigning an indexing weight to a search term
US8332205B2 (en) * 2009-01-09 2012-12-11 Microsoft Corporation Mining transliterations for out-of-vocabulary query terms
EP2545462A1 (en) * 2010-03-12 2013-01-16 Telefonaktiebolaget LM Ericsson (publ) System and method for matching entities and synonym group organizer used therein
JP2012027723A (en) 2010-07-23 2012-02-09 Sony Corp Information processor, information processing method and information processing program
DE102011009376A1 (en) * 2011-01-25 2012-07-26 SUPERWISE Technologies AG Automatic classification of a document pool with a neural system
DE102011009378A1 (en) * 2011-01-25 2012-07-26 SUPERWISE Technologies AG Automatic extraction of information about semantic relationships from a pool of documents with a neural system
US9495352B1 (en) 2011-09-24 2016-11-15 Athena Ann Smyros Natural language determiner to identify functions of a device equal to a user manual
US9721039B2 (en) * 2011-12-16 2017-08-01 Palo Alto Research Center Incorporated Generating a relationship visualization for nonhomogeneous entities
US10163063B2 (en) * 2012-03-07 2018-12-25 International Business Machines Corporation Automatically mining patterns for rule based data standardization systems
US20140279598A1 (en) * 2013-03-15 2014-09-18 Desire2Learn Incorporated Systems and methods for automating collection of information
WO2014197730A1 (en) * 2013-06-08 2014-12-11 Apple Inc. Application gateway for providing different user interfaces for limited distraction and non-limited distraction contexts
WO2017019705A1 (en) * 2015-07-27 2017-02-02 Texas State Technical College System Systems and methods for domain-specific machine-interpretation of input data
CN105868236A (en) * 2015-12-09 2016-08-17 乐视网信息技术(北京)股份有限公司 Synonym data mining method and system
US10043070B2 (en) * 2016-01-29 2018-08-07 Microsoft Technology Licensing, Llc Image-based quality control
US20190207946A1 (en) * 2016-12-20 2019-07-04 Google Inc. Conditional provision of access by interactive assistant modules
US10127227B1 (en) 2017-05-15 2018-11-13 Google Llc Providing access to user-controlled resources by automated assistants
US11436417B2 (en) 2017-05-15 2022-09-06 Google Llc Providing access to user-controlled resources by automated assistants
CN112262381B (en) 2018-08-07 2024-04-09 谷歌有限责任公司 Compiling and evaluating automatic assistant responses to privacy questions

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6711585B1 (en) * 1999-06-15 2004-03-23 Kanisa Inc. System and method for implementing a knowledge management system
US6886010B2 (en) * 2002-09-30 2005-04-26 The United States Of America As Represented By The Secretary Of The Navy Method for data and text mining and literature-based discovery

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5778362A (en) * 1996-06-21 1998-07-07 Kdl Technologies Limited Method and system for revealing information structures in collections of data items
US6446061B1 (en) * 1998-07-31 2002-09-03 International Business Machines Corporation Taxonomy generation for document collections
US6263334B1 (en) * 1998-11-11 2001-07-17 Microsoft Corporation Density-based indexing method for efficient execution of high dimensional nearest-neighbor queries on large databases
US20030069873A1 (en) * 1998-11-18 2003-04-10 Kevin L. Fox Multiple engine information retrieval and visualization system
IT1303603B1 (en) * 1998-12-16 2000-11-14 Giovanni Sacco DYNAMIC TAXONOMY PROCEDURE FOR FINDING INFORMATION ON LARGE HETEROGENEOUS DATABASES.
US6278987B1 (en) * 1999-07-30 2001-08-21 Unisys Corporation Data processing method for a semiotic decision making system used for responding to natural language queries and other purposes
US7451075B2 (en) * 2000-12-29 2008-11-11 Microsoft Corporation Compressed speech lexicon and method and apparatus for creating and accessing the speech lexicon
AUPR958901A0 (en) * 2001-12-18 2002-01-24 Telstra New Wave Pty Ltd Information resource taxonomy

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080172595A1 (en) * 2006-12-29 2008-07-17 Olaf Schmidt Document Link Management
US7844890B2 (en) * 2006-12-29 2010-11-30 Sap Ag Document link management
US20080275694A1 (en) * 2007-05-04 2008-11-06 Expert System S.P.A. Method and system for automatically extracting relations between concepts included in text
US7899666B2 (en) * 2007-05-04 2011-03-01 Expert System S.P.A. Method and system for automatically extracting relations between concepts included in text
US9031926B2 (en) 2007-06-04 2015-05-12 Linguamatics Ltd. Extracting and displaying compact and sorted results from queries over unstructured or semi-structured text
US20090024616A1 (en) * 2007-07-19 2009-01-22 Yosuke Ohashi Content retrieving device and retrieving method
WO2011096969A1 (en) * 2010-02-02 2011-08-11 Alibaba Group Holding Limited Method and system for text classification
US8478054B2 (en) 2010-02-02 2013-07-02 Alibaba Group Holding Limited Method and system for text classification

Also Published As

Publication number Publication date
ES2300046T3 (en) 2008-06-01
EP1779271B1 (en) 2008-01-02
WO2006018411A3 (en) 2006-06-08
DE502005002442D1 (en) 2008-02-14
ATE382903T1 (en) 2008-01-15
US8428935B2 (en) 2013-04-23
WO2006018041A1 (en) 2006-02-23
EP1779271A2 (en) 2007-05-02
WO2006018411A2 (en) 2006-02-23
US20070282598A1 (en) 2007-12-06
EP1779263A1 (en) 2007-05-02

Legal Events

Date Code Title Description
AS Assignment

Owner name: SWISS REINSURANCE COMPANY, SWITZERLAND

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:WAELTI, PAUL;TRUGENBERGER, CARLO A.;CUYPERS, FRANK;AND OTHERS;REEL/FRAME:020124/0120;SIGNING DATES FROM 20070501 TO 20070515

Owner name: SWISS REINSURANCE COMPANY, SWITZERLAND

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:WAELTI, PAUL;TRUGENBERGER, CARLO A.;CUYPERS, FRANK;AND OTHERS;SIGNING DATES FROM 20070501 TO 20070515;REEL/FRAME:020124/0120

AS Assignment

Owner name: INFOCODEX AG, SWITZERLAND

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:SWISS REINSURANCE COMPANY;REEL/FRAME:025319/0026

Effective date: 20100927

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION