WO2008037848A1 - Filtering of data items - Google Patents

Filtering of data items


Publication number
WO2008037848A1
WO2008037848A1 · PCT/FI2007/050511
Authority
WO
WIPO (PCT)
Prior art keywords
terms
relevancy
semantic
vector
comparison
Prior art date
Application number
PCT/FI2007/050511
Other languages
French (fr)
Inventor
Juha Makkonen
Original Assignee
Whitevector Oy
Priority date
Filing date
Publication date
Application filed by Whitevector Oy filed Critical Whitevector Oy
Priority to EP07823148A priority Critical patent/EP2080126A4/en
Publication of WO2008037848A1 publication Critical patent/WO2008037848A1/en

Classifications

    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G06F16/93 Document management systems
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G06F16/35 Clustering; Classification
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/9536 Search customisation based on social or collaborative filtering

Definitions

  • the present invention relates to data processing, and particularly to a method for filtering of data items and a corresponding system.
  • Data refers herein to information that is operable by an information system.
  • a data item refers to a set of data that an information system may recognize to be mutually related and thus process as a separately treatable unit.
  • Operations on data items typically involve filtering, in the sense that during filtering a defined characteristic of the data item is analyzed, and on the basis of the analysis the data item is subjected to predefined procedures. Filtering is typically responsive to the similarity between one or more filter criteria and the data item under analysis. When the analysis indicates a match with a defined filter criterion, the data item is considered relevant in respect of that filter.
  • Data items are often text files that include words, and filter criteria are formed by search terms provided by users of the information systems.
  • the relevancy of an analyzed text file is judged by hits between the search words and the words in the text files.
  • the public material is accessed through portals and search engines, and thus the capabilities of indexing have had a huge impact on availability of information.
  • The indexing involves a delay that varies from operator to operator, typical refresh rates being between a day and some tens of days.
  • These delayed strategies work for archived information streams, but they do not really address the need for more dynamic monitoring, for example, monitoring on a daily basis.
  • Yet all news agencies and papers publish news online and new stories are published throughout the day.
  • Some of the news material is syndicated specifically as a web feed, while most of the daily material is published on a web page - a digital analogy of a traditional newspaper.
  • Some web portals contain news from news agencies and news sites thus producing a collection of daily reportage. Discussion forums are updated when participants post comments, which can be several times a day.
  • An object of the present invention is thus to provide a solution so as to alleviate the above disadvantages.
  • the objects of the invention are achieved by a method, a node, a system, a computer program product, and a computer program distribution medium that are characterized by what is stated in the independent claims.
  • the preferred embodiments of the invention are disclosed in the dependent claims.
  • the invention is based on the idea of determining the semantic relevancy of a target data item by comparing it with a predefined group of comparison terms.
  • the terms are associated with semantic classes and the comparison is implemented in all applicable semantic classes using a similarity function specific for the semantic class.
  • the result of the comparison provides a result value that is used to select a function from a predefined group of functions associated with specific result values.
  • An advantage of the invention is that it enables fast semantic filtering of data items and improves accuracy of the filtering decisions in comparison with state-of-the-art solutions.
  • Figure 1 illustrates an embodiment of a system according to the invention
  • Figure 2 illustrates a term space comprising semantic classes
  • Figure 3A illustrates positioning of terms in the structure of a levelled hierarchy
  • Figure 3B shows a corresponding simplified hierarchic structure for the location terms of Figure 3A.
  • Figure 4 illustrates the decision on similarity being based on the sign of the distance to the hyperplane
  • Figure 5 illustrates a procedural model of the operation of the embodied system
  • Figure 6 illustrates steps of the embodied method according to the invention.
  • Figure 7 illustrates the dynamic adjustment of the filtering.
  • Figure 1 illustrates an embodiment of a system according to the invention.
  • the structural block chart of Figure 1 represents a system 10 suitable for receiving data items, performing an analysis on the received data items and, on the basis of the analysis, determining an action for further processing of the data item.
  • a data item refers here to a set of data that a system may recognize to be mutually related and thus treat as a separately treatable unit.
  • a data item can comprise, for example, a stream of one or more bits, a group of one or more bit streams, a record in a file, a group of records in a file, a complete file, and the like.
  • the data may be of any media type, including text, audio, video, and the like.
  • The system 10 comprises an interface unit 11 that provides an input interface 12 for feeding data items into the system 10 for processing, and an output interface 13 for outputting data from the system 10 to connected systems and/or processes.
  • The implementation of an interface unit 11 varies according to the application, and may be a simple boundary for, for example, an external physical connection, or a sophisticated functional element that provides connection to a plurality of network nodes over various communication access types.
  • the client/server model provides a convenient way to interconnect sources of data items and users of the system that are distributed across different locations.
  • Advantageous system architectures comprise at least client-server, master/slave, and peer-to-peer configurations.
  • the system 10 also comprises a control unit 14, an element that comprises an arithmetic logic unit, a number of special registers and control circuits.
  • Connected to the control unit 14 is a memory unit 15, a data medium where computer-readable data or programs or application data can be stored.
  • the memory unit typically comprises data storages that allow both reading and writing (RAM), and a memory whose contents can only be read (ROM).
  • the system 10 also comprises a user interface unit 16 with input unit 17 for inputting data by the user for internal processing in the system, and output unit 18 for outputting user data from the processes of the system 10.
  • The input unit 17 may comprise a keypad, a touch screen, a microphone, or the like, or a combination of them.
  • The output unit 18 may comprise a screen, a touch screen, a loudspeaker, or the like, or a combination of them.
  • The interface unit 11, the control unit 14, the memory unit 15, and the user interface unit 16 are electrically interconnected for performing systematic execution of operations on the received and/or stored data according to the predefined, essentially programmed processes.
  • the operations comprise functions for implementing the operations and interfaces of the system 10. These operations are described in more detail with Figures 2 to 6.
  • a data item to be processed by the system comprises a group of one or more terms.
  • A term is a data unit that has a meaning, a message that is intended or expressed or signified by the appearance of the term.
  • The system is also configured with a group of one or more comparison terms. During operation of the system the meanings of the terms of the data item, hereinafter called target terms, are compared with the meanings of the comparison terms, and a subsequent routing action is determined on the basis of the comparison result. In different semantic environments, terms may have different meanings. Using an analogy from card games, it is commonly known that a card has a meaning defined by the rules of the game.
  • The meaning of a card does not depend directly on its name but on its relationship with the other cards in the game.
  • the rules of the game define the relationships between the cards, and the role of colours, suits and values may vary from game to game.
  • a game thus establishes a semantic environment, a hand of cards corresponds with a data item, and an individual card corresponds with a term. It is clear that due to the different rules, a good hand in one game may be a disaster in another game.
  • Interpretation of a term may be based on a variety of criteria, and the meaning of a term lies in the way it is interpreted in each semantic environment.
  • The exemplary system embodied in Figure 1 operates on data items that are input to the system in the form of bit streams.
  • a data item can be downloaded to the system through the interface unit from an external source, like an external database, or removable communication media, or retrieved from an internal database stored in the memory unit.
  • the control unit interprets the bit stream of the data unit according to the predefined data protocol of the system into audio, video or text data units, or the like, and identifies terms formed by these data units. For example, a bit stream that corresponds with a text data file may typically be mapped into characters.
  • the embodied system analyses the character strings of one or more characters within the text data file and extracts terms that correspond with the recognisable character strings.
  • A space may be considered a character, and a term may correspond with a character string that comprises one or more spaces. If a word equals a character string without spaces, a term may correspond with one word or a combination of words, spaces and other used symbols between them.
  • In languages that do not employ white space as a word boundary (for example, Chinese, Japanese, or Korean), words may be split and combined into terms in various ways. The extraction of terms from documents is routinely used in document indexing, and is thus generally known to a person skilled in the art. The selected procedure for generating terms is, as such, basically not relevant to the invention.
  • The term extraction may employ state-of-the-art parsers that are capable of syntactical, morphological and functional dependency parsing. For example, for English-language documents, a variety of term extractors are available.
  • A semantic environment is associated with a term space W that includes a plurality of terms t_i that signify a specific meaning and are thus relevant in the selected semantic environment.
  • a system applying the semantic environment needs to be able to detect and interpret these terms in the incoming data items. For example, if the system of Figure 1 is configured to operate on English language text documents, it is able to recognise character strings, like 'week' or 'last week' in the data item, and map these character strings into appropriate terms in W.
  • the term space W is divided into one or more semantic classes.
  • semantic classes are defined so that terms share a specific semantic characteristic.
  • This is illustrated by three exemplary classes N (names), L (locations), and M (miscellaneous).
  • the terms within a semantic class may be fully identical (cf. 'cat' and 'cat'), closely similar (cf. 'cat' and 'kitten') or remotely similar (cf. 'cat' and 'tiger'), so a pair of terms in the same semantic class can be assigned a real value that denotes the similarity between the meanings of the terms in that particular semantic class.
  • a semantic class is associated with a specific similarity function that, in a way appropriate for the semantic class, on the basis of two input terms returns a value that represents the similarity of these two terms.
  • A data item employing these classes and functions may be called a Λ-structure. The comparison of two Λ-structure data items is thus possible, because they have the same structure defined by the language Λ.
  • The similarity function is based on a specific predefined structure that may be used to map the terms into a value that represents the mutual relationship of the terms within the semantic class.
  • A widely applied method for computing the term-term similarities is term frequency - inverse document frequency (TFIDF), where the term frequency in the data item is multiplied by the informativeness of the term.
  • The informativeness, or the specificity, of the term is expressed as the logarithm of the inverse of the document frequency. If we denote with A a vector of a data item comprising terms a_1, a_2, ..., a_s in semantic class S_i, the weight of a term a_i occurring in vector A is

    weight_A(a_i) = f_A(a_i) · log(N_d / d(a_i)),

    where f_A(a_i) equals the number of occurrences of term a_i in vector A, N_d is the total number of documents, and d(a_i) equals the number of documents in which the term a_i occurs.
  • The term-term similarity function σ_TFIDF(a, b) is simply the multiplication of the term weights, i.e. σ_TFIDF(a, b) = weight_A(a) · weight_B(b).
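As a sketch of how the TFIDF weight and the term-term similarity above could be computed (the dictionary-based vector representation and the corpus statistics here are illustrative assumptions, not the patent's implementation):

```python
import math

# Illustrative corpus statistics: total document count and per-term
# document frequencies, used for the inverse-document-frequency factor.
N_DOCS = 1000
DOC_FREQ = {'cat': 100, 'headbutt': 10}

def tfidf_weight(term, vector):
    """Term frequency in the vector times log of the inverse document frequency."""
    return vector.get(term, 0) * math.log(N_DOCS / DOC_FREQ[term])

def sigma_tfidf(term, vec_a, vec_b):
    """Term-term similarity: the product of the two TFIDF weights."""
    return tfidf_weight(term, vec_a) * tfidf_weight(term, vec_b)
```

For a term appearing twice in the data item and three times in the comparison vector, with document frequency 100 out of 1000, the similarity is 2·log(10) × 3·log(10) ≈ 31.8.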
  • COVER is a similarity measure based on the common path of two terms with respect to the path lengths to the terms. COVER is a function

    σ_COVER(a, b) = f_A(a) · f_B(b) · j(a, b),

    where B is a vector of comparison terms comprising terms b_1, b_2, ..., b_s in semantic class S_b, f_A(a) and f_B(b) are the frequencies of terms a and b, and j(a, b) is the Jaccard coefficient of the lengths of the two paths.
  • Figure 3A shows a simplified example on how the terms in the semantic class L are positioned in the structure of the 4-level hierarchy.
  • When the system detects a term t_i in class L, it also determines the type of the location, and positions the term into the correct level in the structure of Figure 3A.
  • the first level corresponds with one node, the world; the second level corresponds with continents, the third level with countries and the fourth level with cities. It is clear that any level of detail and other information structures are applicable within the scope of protection.
  • Figure 3B shows a corresponding simplified hierarchic structure for the location terms of Figure 3A.
  • Each node in the tree stands for a location.
  • the common path of terms 'Paris' and 'Lyon' starts from the root and covers 'Europe' and 'France'.
  • The length of the common path is thus 2, and the lengths of the paths of the individual terms from the root are 3.
  • Here f(a) is the frequency of the location term a in the background corpus and T is the number of all location term occurrences.
  • An occurrence of a location term is also an occurrence of all locations above it all the way to the root. For example, on encountering 'Paris', we consider it also an occurrence of 'France' and 'Europe'. This way, the frequency of the root node ('World') equals T and thus its information content is zero. Furthermore, the probability can only increase (and the information content decrease) as one moves up in the taxonomy tree.
  • the similarity of two terms is based on the common node furthest from the root, or equivalently, the common node with the lowest probability and thus the highest information content. If S(a,b) comprises the common path of location terms a and b, the term-term similarity is a function
  • Here f_A(a) and f_B(b) are the raw term frequencies of a and b.
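The path computations above can be sketched over a small location tree; the `parent` table and the Jaccard-of-path-lengths definition follow the 'Paris'/'Lyon' example (common path length 2, individual path lengths 3):

```python
# Simplified location hierarchy: child -> parent (the root 'World' has no entry).
PARENT = {'Paris': 'France', 'Lyon': 'France',
          'France': 'Europe', 'Europe': 'World'}

def depth(node, parent=PARENT):
    """Number of edges from the node up to the root."""
    d = 0
    while node in parent:
        node = parent[node]
        d += 1
    return d

def deepest_common_ancestor(a, b, parent=PARENT):
    """Common node furthest from the root on the two root paths."""
    ancestors_a = {a}
    node = a
    while node in parent:
        node = parent[node]
        ancestors_a.add(node)
    node = b
    while node not in ancestors_a:
        node = parent[node]
    return node

def jaccard_of_path_lengths(a, b, parent=PARENT):
    """Jaccard coefficient of the path lengths: common / (|a| + |b| - common)."""
    common = depth(deepest_common_ancestor(a, b, parent), parent)
    return common / (depth(a, parent) + depth(b, parent) - common)
```

'Paris' and 'Lyon' share the path World-Europe-France (length 2) and each lie at depth 3, giving 2 / (3 + 3 - 2) = 0.5.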
  • the similarity function for a semantic class advantageously reflects the similarity in a way that is relevant for the purpose and/or for the subject of the terms in the semantic class.
  • similarity of geographical names may be based on their absolute geospatial distance.
  • Comparing 'New York' to 'Newark' would result in higher similarity than comparing 'New York' to 'York' or 'Toronto', because 'Newark' is located closer than the other two.
  • The similarity could be based, for example, on the ontology of a particular field (for example, taxonomy of bacteria, music genres), on physical measures (for example, spatial distance, temporal distance), on statistical models (for example, bigram models, distributional similarity), or on product names (for example, EM64T could be a processor that is 'almost' similar to IA64).
  • a data item may comprise recognizable terms in one or more semantic classes, and for the purposes of the analysis the terms may be written out as a combination of vectors, each distinct for a specific semantic class.
  • This combination of vectors corresponding with a data item is hereinafter called a multivector.
  • The comparison of two multivectors is carried out for each semantic class separately. While the comparison of two traditional document vectors produces one real-valued similarity, the result of class-wise comparison of two multivectors is a similarity vector v.
  • The similarity σ_k of two vectors A_k and B_k of a semantic class S_k may thus be determined from the equation

    σ_k(A_k, B_k) = ( Σ_{i=1..m} Σ_{j=1..m} f_A(a_i) · f_B(b_j) · σ_k(a_i, b_j) ) / ( |A_k| · |B_k| ),

    where m is the number of terms in the semantic class S_k and σ_k(a_i, b_j) is the class-specific pairwise similarity of terms a_i and b_j. If σ_k is based on identity, the function σ_k is equivalent to the cosine of the two vectors.
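A minimal sketch of the class-wise comparison, with vectors represented as term-to-frequency maps (an illustrative simplification): with an identity term similarity the per-class value reduces to the cosine of the two vectors, matching the remark above.

```python
import math

def class_similarity(vec_a, vec_b, term_sim):
    """Normalized sum of pairwise term-term similarities within one class."""
    num = sum(fa * fb * term_sim(a, b)
              for a, fa in vec_a.items()
              for b, fb in vec_b.items())
    norm = (math.sqrt(sum(f * f for f in vec_a.values()))
            * math.sqrt(sum(f * f for f in vec_b.values())))
    return num / norm if norm else 0.0

def similarity_vector(multi_a, multi_b, sim_funcs):
    """Compare two multivectors class by class -> one similarity per class."""
    return [class_similarity(multi_a.get(k, {}), multi_b.get(k, {}), fn)
            for k, fn in sim_funcs.items()]

# Identity term similarity: 1 for the same term, 0 otherwise.
identity = lambda a, b: 1.0 if a == b else 0.0
```

Plugging in a class-specific function such as a path-based location similarity instead of `identity` yields the class-wise behaviour the text describes.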
  • Figure 4 shows a simplified illustration of a multivector A of the target terms, and a multivector B of the comparison terms.
  • the semantic environment comprises three semantic classes N (names), L (locations), and M (miscellaneous).
  • the terms of the two multivectors are divided into these semantic classes, and the comparison of the two multivectors is performed class per class.
  • In each class, the specific similarity function is used, for example TFIDF in the semantic class L.
  • The threshold is also substantially multidimensional. In an environment with a number n of semantic classes, an n-dimensional threshold, a hyperplane, is needed to separate similarity from dissimilarity. The assumption is that in a semantic environment, similar documents are fundamentally similar in the same way, and the similar and non-similar vectors form learnable clusters in the specific vector space.
  • On the basis of analyzed training material that comprises relevant and non-relevant pairs of multivectors, it is possible to determine a hyperplane to be utilized for subsequent classification of similarity vectors.
  • the similarity vectors from a plurality of comparisons between two multivectors are mapped into the space of similarity vectors, and the decision on similarity is based on the relation between the hyperplane and the coordinates of the similarity vector.
  • the decision on similarity may be based on the sign of the distance to the hyperplane.
  • The plus signs are used to denote coordinates that correspond with a similarity vector computed on the basis of multivectors of target terms and comparison terms, and where the multivectors are considered appropriately similar for relevancy.
  • the minus signs denote coordinates corresponding with a similarity vector of multivectors that are considered dissimilar for relevancy.
  • the similarity space is divided into parts and the comparison with the hyperplane returns a result value R that indicates which part of the similarity space the similarity vector is associated with.
  • the hyperplane divides the similarity space into two parts, and one only needs to know which of the two parts the similarity vector belongs to.
  • The weight factor for a semantic class may be determined by running comparisons on pairs of documents in training data, and drawing stratified random samples of a plurality of comparisons. The samples may be used to train a perceptron, and the trained perceptron is then used against all other samples. The cross-validation produces a number of candidate weight vectors that may be averaged for the weight vector w to be used in the evaluations between the similarity vector and the hyperplane.
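The training step could be sketched as a plain perceptron over labelled similarity vectors; the training data, learning rate, and epoch count below are illustrative assumptions, not values from the patent.

```python
def train_perceptron(samples, epochs=50, lr=0.1):
    """samples: list of (similarity_vector, label), label +1 relevant / -1 not.
    Learns a weight vector w and bias b so that sign(w . v + b) separates them."""
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for v, y in samples:
            # Misclassified (or on the boundary): nudge the hyperplane toward v.
            if y * (sum(wi * vi for wi, vi in zip(w, v)) + b) <= 0:
                w = [wi + lr * y * vi for wi, vi in zip(w, v)]
                b += lr * y
    return w, b

def is_relevant(v, w, b):
    """Decision by the sign of the signed distance to the hyperplane."""
    return sum(wi * vi for wi, vi in zip(w, v)) + b > 0
```

The cross-validation the text describes would repeat this training on stratified samples and average the resulting candidate weight vectors into the final w.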
  • Figure 5 illustrates a procedural model of the operation of the embodied system.
  • The process P1 denotes the stage of comparing the multivectors A and B in semantic classes with specific similarity functions in order to compute the similarity vector v, as described above.
  • the process P2 denotes the stage of comparing the similarity vector v with the hyperplane in order to determine the part of the vector space the similarity vector v belongs to, as described above.
  • The process P3 denotes a subsequent stage of determining a function F to be performed by the system according to the derived result value R.
  • the system is configured with a group of pre-programmed functions, each corresponding with a result value such that whenever a result value R is derived as a result of procedures P1 and P2, a corresponding function F is initiated by the system.
  • the procedures of F may vary considerably according to the application.
  • the system serves to determine whether an incoming document is relevant in view of the group of comparison terms.
  • The result value R may be determined from the inner product of the similarity vector and the weight, as described above, and a positive value relates to relevant documents and a negative value relates to irrelevant documents.
  • The system is thus configured with a function F1 that is initiated in response to positive values of R and a function F2 that is initiated in response to negative values of R.
  • F1 may comprise a procedure for creating a record of the analyzed data item and storing the record in the list of relevant documents.
  • the record comprises at least an address in which the data item is accessible for retrieval, and advantageously additional metadata applicable for the purpose of organizing the relevant documents according to a defined classification factor (date, source, etc.).
  • F2 may comprise a procedure for rejecting the analyzed data item.
  • the retrieval address may subsequently be used as a routing address for the upload request initiated by a user clicking a hyperlink in a result view of relevant documents in the user interface. This enables a method of providing users with straightforward access to a continuously updated list of documents relevant in respect of the filter criteria formed by the comparison terms.
  • F1 may comprise a procedure for routing the analyzed data item to a first address and F2 may comprise a procedure for routing the analyzed data item to a second address.
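The stage P3 can be sketched as a dispatch from the sign of R to pre-programmed functions; here a positive R is taken to mark relevancy, in line with the inner-product interpretation above, and the record fields and addresses are illustrative assumptions.

```python
relevant_records = []

def f1_record(doc):
    """Relevant item: record the retrieval address plus metadata for organizing."""
    relevant_records.append({'address': doc['address'],
                             'date': doc.get('date'),
                             'source': doc.get('source')})

def f2_reject(doc):
    """Irrelevant item: discard it."""
    pass

def dispatch(result_value, doc):
    """Initiate the function corresponding with the derived result value R."""
    (f1_record if result_value > 0 else f2_reject)(doc)
```

The recorded address could then serve as the routing address for a user's upload request, as the text describes for the result view of relevant documents.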
  • the comparison terms in the multivector B may correspond to any group of input terms.
  • the comparison terms may be formed from, for example, a group of search terms input by the user of the system, or extracted from one or more documents referred to by the user of the system.
  • The flow chart of Figure 6 illustrates a simplified embodiment of the method implemented in the embodied system according to the invention. A more detailed description of the steps may be found in the corresponding descriptions of Figures 1 to 4. In the beginning, the system is configured (step 60) with a group of comparison terms.
  • The group of comparison terms may, for example, correspond with a group of search terms input by a user of the system, or they may be extracted from a document or averaged from a group of documents.
  • The analysis of a new data item begins with reception (step 61) of a data item DOC_A.
  • the data item is parsed (step 62) and the terms relevant in the selected semantic environment are extracted from the data.
  • These target terms extracted from DOC A are compiled (step 63) into a multivector A, and arranged into semantic classes applicable in the selected semantic environment.
  • The group of predefined comparison terms, arranged into a multivector B applying the same semantic classes, is retrieved for processing.
  • a comparison is performed in each applied semantic class according to the similarity function specific to the semantic class, and as a result a similarity vector v illustrating the semantic similarity between multivector A of the target terms and multivector B of the comparison terms is determined (step 64).
  • Each applicable comparison value R corresponds with a predefined function of the system and a function F(R) corresponding with the derived result value R is initiated (step 66).
  • the system then moves to a standby state where reception or input of a new data item is checked (step 67).
  • In case of a new data item (step 68), the procedure continues from step 61.
  • The filter is formed by the multivector of the comparison terms, and the multivector is computed as an average of terms from a group of documents that deal with a particular subject. For example, one may collect a group of documents recognizably dealing with a defined subject, for example 'Zidane headbutt'. From the documents one may extract and derive a group of terms that form the average vector of the documents relevant in respect of the defined incident. The average vector thus does not represent any single comparison document but forms a combination of terms that optimally capture the semantic context around the subject. Use of an average vector also allows weighting the averaged terms in a selected manner.
  • All terms in the group of documents considered relevant for a particular incident may be included in the group of comparison terms.
  • the weight of each of the terms may then be set to the average of the times the term appears in the documents. For example, in case the term 'headbutt' appears in one of the documents once and in another one of the documents seven times, the weight of the term in the average vector would be the average of one and seven, i.e. four.
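The weighting just described can be sketched as follows, with each document represented as a term-to-count map (an illustrative assumption); the weight of a term is the mean of its counts over the documents that contain it, reproducing the 'headbutt' example:

```python
def average_vector(docs):
    """Average term vector over a group of relevant documents."""
    counts = {}
    for doc in docs:
        for term, n in doc.items():
            counts.setdefault(term, []).append(n)
    # Mean of the per-document counts for each term.
    return {term: sum(ns) / len(ns) for term, ns in counts.items()}
```

'headbutt' appearing once in one document and seven times in another averages to (1 + 7) / 2 = 4.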
  • Using the average vector as the vector of the comparison terms enables the system to detect and record new documents dealing with the incident.
  • Use of average vectors facilitates a further advantage in the form of dynamic adaptivity of the system. Adaptivity in this context means that the semantic interpretation of the terms may be adapted through the results of the performed analysis.
  • a monitored subject typically evolves, and the amount of data accumulates during the monitoring period.
  • the average vector is recalculated using the new relevant document as one of the relevant documents.
  • the new terms and phrases used in describing the monitored instance are dynamically updated to the semantic interpretation of the data items.
  • the semantic interpretation of the subject may be continuously focused, which further improves the operation of the system.
  • the system is configured (step 70) with a group of comparison terms that represent an average of terms determined from a group of data items known relevant in respect of the defined subject.
  • Steps 71 to 75 correspond substantially with steps 61 to 65 of Figure 6.
  • the system checks (step 76) whether the result is applicable for adjusting the average vector in use.
  • The condition for the adjustment decision may vary according to the implementation. For example, the adjustment may be done on the basis of a result value that indicates great similarity with the average vector (adjustment according to a positive finding).
  • The adjustment may also be done on the basis of a result value that indicates very weak similarity with the average vector (adjustment according to a negative finding).
  • If the system decides (step 77) that the data item can be used for focusing the average vector, it recalculates the average vector, whereafter the new updated average vector will be used in the comparison with the target terms (step 78). If the update is considered unnecessary, the system will continue through steps 79 to 81 as in the corresponding steps 66 to 68 of Figure 6.
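The adaptive loop of steps 74 to 78 in miniature: on a sufficiently strong positive finding the item joins the relevant set and the average vector is refreshed. The cosine comparison and the threshold value are illustrative simplifications of the class-wise machinery described earlier, not the patent's exact computation.

```python
import math

def average_vector(docs):
    """Average term vector over the current group of relevant documents."""
    counts = {}
    for doc in docs:
        for term, n in doc.items():
            counts.setdefault(term, []).append(n)
    return {t: sum(ns) / len(ns) for t, ns in counts.items()}

def cosine(a, b):
    dot = sum(fa * b.get(t, 0) for t, fa in a.items())
    na = math.sqrt(sum(f * f for f in a.values()))
    nb = math.sqrt(sum(f * f for f in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def process(item, relevant, threshold=0.9):
    """Compare the item with the current average vector; on a strong
    positive finding, add it and recalculate the average (step 78)."""
    if cosine(average_vector(relevant), item) > threshold:
        relevant.append(item)  # focus the filter on the new evidence
    return average_vector(relevant)
```

New terms and phrases appearing in accepted items thereby flow into the average vector, which is how the semantic interpretation stays focused on the evolving subject.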
  • The use of an average vector also allows adjustment of the semantic interpretation on the basis of computational criteria that are relatively easy to determine during the operation of the system, or are even available as a by-product of the computation of the similarity functions.
  • the informativeness of a term may be determined from its frequency in the series of consecutive data items. At times, a term that fairly seldom appears in the data items may suddenly begin to appear very frequently, and at those times the term itself becomes far less informative.
  • the term 'Lebanon' may for a long period be relatively rarely cited, and the informativeness of the term remains at a defined level.
  • the amount of data items including the term 'Lebanon' increases, but the data items may actually be relevant in several semantic subjects, for example, 'Oil damages', 'UN operations', 'tourism'.
  • the informativeness of the term 'Lebanon' may be decreased and increased in relation with the frequency of the appearance of the term in the consecutive data items.
  • as the average vector represents a compilation of common terms in the averaged relevant documents, it is possible to adjust the filtering by focusing the average on the most common terms.
  • the vector v is expanded with a power set of its components.
  • data items A and B may be processed into multivectors A(n1, l1, m1) and B(n2, l2, m2), as described above.
  • a weight vector w with which the newly determined similarity vectors may be detected to represent relevancy or irrelevancy.
  • the decision may be based on the distance from v to w, calculated with the inner product.
  • a positive value of the inner product may be interpreted to indicate relevancy, a negative value irrelevancy.
  • the vectors may be expanded.
  • the expansion is made with dimensions N*M, N*L, M*L, N*M*L.
  • determination of the distance between the similarity vector and the hyperplane requires that the distance of the hyperplane w is taken into consideration.
  • the invention provides a computer program product encoding a computer program of instructions for executing a computer process.
  • the invention provides a computer program distribution medium readable by a computer and encoding a computer program of instructions for executing a computer process.
  • the distribution medium may include a computer readable medium, a program storage medium, a record medium, a computer readable memory, a computer readable software distribution package, a computer readable signal, a computer readable telecommunications signal, and/or a computer readable compressed software package.
  • Embodiments of the computer process are shown and described in conjunction with Figure 6.
  • the computer program may be executed in the control unit of a computing node receiving input data items.
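The power-set expansion of the similarity vector listed above (dimensions N*M, N*L, M*L, N*M*L for three classes) can be sketched as follows. This is an illustration only, not the claimed implementation; the function name is hypothetical:

```python
from itertools import combinations
from math import prod

def expand(v):
    """Expand a similarity vector with the products of every subset of
    two or more of its components, e.g. for v = (n, l, m) the added
    dimensions are n*l, n*m, l*m and n*l*m."""
    extra = [prod(subset)
             for size in range(2, len(v) + 1)
             for subset in combinations(v, size)]
    return list(v) + extra

v = (0.39, 0.35, 0.73)
print(len(expand(v)))  # 7 components: 3 original + 3 pairwise + 1 triple
```

The expanded vector can then be compared against a correspondingly expanded weight vector w with the same inner-product decision rule.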

Abstract

A system for filtering data items. Terms in a semantic environment are divided into semantic classes. A relevancy vector is computed between comparison terms and target terms from a data item with a relevancy function specific to a semantic class. The position of the relevancy vector is compared with a predefined relevancy hyperplane that divides the relevancy vector space into parts. Each relevancy vector space part encloses a subspace of the relevancy vector space and corresponds with a specific filtering function. The solution enables fast semantic filtering of data items and improves accuracy of the filtering decisions.

Description

FILTERING OF DATA ITEMS
FIELD OF THE INVENTION
The present invention relates to data processing, and particularly to a method for filtering of data items and a corresponding system.
BACKGROUND OF THE INVENTION
Data refers herein to information that is operable by an information system. Correspondingly, a data item refers to a set of data that an information system may recognize to be mutually related and thus process as a separately treatable unit. Operations on data items typically involve filtering, in the sense that during filtering a defined characteristic of the data item is analyzed, and on the basis of the analysis the data item is subjected to predefined procedures. Filtering is typically responsive to the similarity between one or more filter criteria and the data item under analysis. When the analysis indicates a match with a defined filter criterion, the data item is considered relevant in respect of that filter.
In conventional systems data items are often text files that include words, and filter criteria are formed by search terms provided by users of the information systems. The relevancy of an analyzed text file is judged by hits between the search words and the words in the text files. In the wealth of digital information sources available and becoming available, these conventional systems are becoming less and less effective, especially in terms of applicability to continuous online monitoring of new subjects.
In 2005, the number of publicly available web pages was estimated to close in on 12 billion, up from a 1999 estimate of 800 million pages. The number of non-public pages with restricted access, like intranet, extranet, and subscribed material, was estimated to be 400 to 550 times the public web. The rate at which new pages are introduced to the web has been estimated at 8% per week, which practically means a doubling of the size of the web in less than every three months. At the same time, pages disappear from the web, and after a year only 20% of them persist.
The public material is accessed through portals and search engines, and thus the capabilities of indexing have had a huge impact on availability of information. The indexing, however, involves a delay that varies from operator to operator, typical refresh rates being between a day and some tens of days. At best, these delayed strategies work for archived information streams, but they do not really address the need of more dynamic monitoring, for example, monitoring on a daily basis. Still, practically all news agencies and papers publish news online and new stories are published throughout the day. Some of the news material is syndicated specifically as a web feed, while most of the daily material is published on a web page - a digital analogy of a traditional newspaper. Some web portals contain news from news agencies and news sites, thus producing a collection of daily reportage. Discussion forums are updated when participants post comments, which can be several times a day.
In consequence, it is clear that data is produced more rapidly than it is indexed. This means that relevant information remains undetected for some time before becoming adequately indexed, or, even after indexing, becomes buried in a mass of less relevant and completely irrelevant documents.
SUMMARY OF THE INVENTION
An object of the present invention is thus to provide a solution so as to alleviate the above disadvantages. The objects of the invention are achieved by a method, a node, a system, a computer program product, and a computer program distribution medium that are characterized by what is stated in the independent claims. The preferred embodiments of the invention are disclosed in the dependent claims. The invention is based on the idea of determining the semantic relevancy of a target data item by comparing it with a predefined group of comparison terms. The terms are associated with semantic classes and the comparison is implemented in all applicable semantic classes using a similarity function specific for the semantic class. The result of the comparison provides a result value that is used to select a function from a predefined group of functions associated with specific result values.
An advantage of the invention is that it enables fast semantic filtering of data items and improves accuracy of the filtering decisions in comparison with the state of the art solutions.
BRIEF DESCRIPTION OF THE DRAWINGS
In the following the invention will be described in greater detail by means of preferred embodiments with reference to the attached drawings, in which
Figure 1 illustrates an embodiment of a system according to the invention; Figure 2 illustrates a term space comprising semantic classes;
Figure 3A illustrates positioning of terms in the structure of a levelled hierarchy;
Figure 3B shows a corresponding simplified hierarchic structure for the location terms of Figure 3A;
Figure 4 illustrates the decision on similarity being based on the sign of the distance to the hyperplane;
Figure 5 illustrates a procedural model of the operation of the embodied system; Figure 6 illustrates steps of the embodied method according to the invention; and
Figure 7 illustrates the dynamic adjustment of the filtering.
DETAILED DESCRIPTION OF THE INVENTION
The following embodiments are exemplary implementations of the present invention. Although the specification may refer to "an", "one", or "some" embodiment(s), reference is not necessarily made to the same embodiment(s), and/or a feature does not apply to a single embodiment only. Single features of different embodiments of this specification may be combined to provide further embodiments. Figure 1 illustrates an embodiment of a system according to the invention. The structural block chart of Figure 1 represents a system 10 suitable for receiving data items, performing an analysis on the received data items and, on the basis of the analysis, determining an action for further processing of the data item. A data item refers here to a set of data that a system may recognize to be mutually related and thus treat as a separately treatable unit. A data item can comprise, for example, a stream of one or more bits, a group of one or more bit streams, a record in a file, a group of records in a file, a complete file, and the like. The data may be of any media type, including text, audio, video, and the like. The system 10 comprises an interface unit 11 that provides an input interface 12 for feeding data items into the system 10 for processing, and an output interface 13 for outputting data from the system 10 to connected systems and/or processes. The implementation of an interface unit 11 varies according to the application, and may be a simple boundary for, for example, an external physical connection, or a sophisticated functional element that provides connection to a plurality of network nodes over various communication access types. In network systems, the client/server model provides a convenient way to interconnect sources of data items and users of the system that are distributed across different locations. Advantageous system architectures comprise at least client-server, master/slave, and peer-to-peer configurations.
The system 10 also comprises a control unit 14, an element that comprises an arithmetic logic unit, a number of special registers and control circuits. Connected to the control unit 14 is a memory unit 15, a data medium where computer-readable data or programs or application data can be stored. The memory unit typically comprises data storages that allow both reading and writing (RAM), and a memory whose contents can only be read (ROM).
Advantageously, the system 10 also comprises a user interface unit 16 with input unit 17 for inputting data by the user for internal processing in the system, and output unit 18 for outputting user data from the processes of the system 10. Examples of said input unit comprise a keypad, a touch screen, a microphone, or the like, or the combination of them. Examples of said output unit comprise a screen, a touch screen, a loudspeaker, or the like, or the combination of them.
The interface unit 11 , the control unit 14, the memory unit 15, and the user interface unit 16 are electrically interconnected for performing systematic execution of operations on the received and/or stored data according to the predefined, essentially programmed processes. In solutions according to the invention, the operations comprise functions for implementing the operations and interfaces of the system 10. These operations are described in more detail with Figures 2 to 6.
The operation of the system is based on a semantic analysis of the input data items on the basis of a selected semantic environment. A data item to be processed by the system comprises a group of one or more terms. A term is a data unit that has a meaning, a message that is intended or expressed or signified by the appearance of the term. The system is also configured with a group of one or more comparison terms. During operation of the system the meanings of the terms of the data item, hereinafter called target terms, are compared with the meanings of the comparison terms, and a subsequent routing action is determined on the basis of the comparison result. In different semantic environments, terms may have a different meaning. Using an analogy from card games, it is commonly known that a card has a meaning defined by the rules of the game. The meaning of a card does not depend directly on its name but on its relationship with the other cards in the game. The rules of the game define the relationships between the cards, and the role of colours, suits and values may vary from game to game. A game thus establishes a semantic environment, a hand of cards corresponds with a data item, and an individual card corresponds with a term. It is clear that due to the different rules, a good hand in one game may be a disaster in another game. Correspondingly, interpretation of a term may be based on a variety of criteria, and the meaning of a term lies in the way it is interpreted in each semantic environment.
The exemplary system embodied in Figure 1 operates on data items that are input to the system in the form of bit streams. A data item can be downloaded to the system through the interface unit from an external source, like an external database, or removable communication media, or retrieved from an internal database stored in the memory unit. The control unit interprets the bit stream of the data unit according to the predefined data protocol of the system into audio, video or text data units, or the like, and identifies terms formed by these data units. For example, a bit stream that corresponds with a text data file may typically be mapped into characters. The embodied system analyses the character strings of one or more characters within the text data file and extracts terms that correspond with the recognisable character strings. A space may be considered a character, and a term may correspond with a character string that comprises one or more spaces. If a word equals a character string without spaces, a term may correspond with one word or a combination of words, spaces and other used symbols between them. In languages that do not employ white space as a word boundary (for example, Chinese, Japanese, Korean), words may be split and combined into terms in various ways. The extraction of terms from documents is routinely used in document indexing, and is thus generally known to a person skilled in the art. The selected procedure for generating terms is, as such, basically not relevant to the invention.
It should be noted that the given examples are merely intended to illustrate the possible implementations, and should not be used to restrict the interpretation of the scope. The data items and terms do not need to be in form of bitstreams but may be stored and transferred in any other format identifiable to the system. In addition to text data, any other communication media types, for example audio and video data, are also applicable in the implementation of the system.
In the embodied system that uses character strings, the term extraction may employ state-of-the-art parsers that are capable of syntactical, morphological and functional dependency parsing. For example, for English language documents, a variety of term extractors are available.
The easiest comparison between two terms is naturally a one-to-one comparison. In conventional search engines, a text document is parsed into words and, where necessary, the words are returned into some sort of basic form (for example, 'Helsinkiin', 'Helsingissä' ~ 'Helsinki'). Words and their combinations (for example, 'Helena' and 'St. Helena') are mapped into applicable terms, and a match between a search term and a term in the target document may be determined by simple comparison of the character strings of the two terms. This is a straightforward method, but typically produces a considerable amount of matches and thereby documents for the manual analysis by the user. On the other hand, in any semantic environment there are groups of terms (for example, 'cat' and 'tabby') that cannot be linked by this kind of one-to-one character string similarity, but do have the same kind of meaning. Consequently, documents that deal with similar or the same issues, but that are written using another type of vocabulary, do not come appropriately out through conventional searches.
When operating on text data, the meaning of a term is in the way people use it to communicate and convey ideas successfully. The communication between people is based on the semantic environment of languages, each of which has an inherent logic that defines valid and invalid uses of words. The rules of the language form a predefined convention that is known to members that are familiar with the language. The grammar of an applied language provides a group of syntactical combination rules that are to an extent used in conventional search engines. In the embodied system also an ontological similarity between the terms is recognized.
As illustrated in Figure 2, a semantic environment is associated with a term space W that includes a plurality of terms ti that signify a specific meaning and are thus relevant in the selected semantic environment. A system applying the semantic environment needs to be able to detect and interpret these terms in the incoming data items. For example, if the system of Figure 1 is configured to operate on English language text documents, it is able to recognise character strings, like 'week' or 'last week' in the data item, and map these character strings into appropriate terms in W.
In the system according to the invention the term space W is divided into one or more semantic classes. Typically semantic classes are defined so that terms share a specific semantic characteristic. In Figure 2, this is illustrated by three exemplary classes N (names), L (locations), and M (miscellaneous). The terms within a semantic class may be fully identical (cf. 'cat' and 'cat'), closely similar (cf. 'cat' and 'kitten') or remotely similar (cf. 'cat' and 'tiger'), so a pair of terms in the same semantic class can be assigned a real value that denotes the similarity between the meanings of the terms in that particular semantic class. A semantic class is associated with a specific similarity function that, in a way appropriate for the semantic class, on the basis of two input terms returns a value that represents the similarity of these two terms.
Accordingly, we may define a term-term similarity measure of a semantic class Si as a function δi : Si × Si → ℝ. The term-term similarity maps each pair of terms in a semantic class onto the real axis. The more the two terms are related, the higher is their mutual similarity. For example, the conventional one-to-one comparison could thus be expressed as a function δ(a, b) that returns 1 when a = b, and 0 otherwise.
If with Σ we denote a language comprising semantic classes S1, S2, ..., Sn and similarity functions δ1, δ2, ..., δn, a data item employing these classes and functions may be called a Σ structure. This implicates that in the system the group of comparison terms and the group of target terms are structured according to the same semantic environment (Σ structure); the terms are classified according to the same unary relations (S1, S2, ..., Sn), and their similarity is determined with the same similarity functions (δ1, δ2, ..., δn). The comparison of two Σ structure data items is thus possible, because they have the same structure defined by the language Σ.
For example, in the exemplary class L, several different types of similarity functions are applicable. Data items relevant in respect of a specific location in Siberia could comprise terms such as 'Russia', 'Lena', 'Vilyuy', 'Lensk', and 'Yakutsk'. Clearly, these terms have nothing in common on one-to-one term comparison; their relevance cannot be understood without a defined geographical ontology. In this case the similarity function is based on a specific predefined structure that may be used to map the terms into a value that represents the mutual relationship of the terms within the semantic class.
The range of similarity functions and corresponding similarity measures varies from application to application. In the following, some similarity functions are described briefly without, however, limiting the scope to the particular exemplary measures and functions.
TFIDF
A widely applied method for computing the term-term similarities is a term-frequency - inverted document frequency (TFIDF) where the term frequency in the data item is multiplied by the informativeness of the term. The informativeness, or the specificity of the term, is expressed as the logarithm of the inverse of the document frequency. If we denote with A a vector of a data item comprising terms a1, a2, ..., as in semantic class Si, the weight of the term ai occurring in vector A is

wA(ai) = fA(ai) · log(Nd / d(ai))

where fA(ai) equals the number of occurrences of term ai in vector A, Nd is the total number of documents and d(ai) equals the number of documents in which the term ai occurs. The term-term similarity function δk TFIDF(a, b) is simply the multiplication of the term weights, i.e.

δk TFIDF(a, b) = wA(a) · wB(b) · δid(a, b)

The TFIDF is based on the identity δid, so the term-term similarity is positive if and only if a = b.
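The TFIDF weighting and term-term similarity described above can be sketched in Python. This is an illustration only, not the claimed implementation; the corpus statistics below are hypothetical:

```python
import math

def tfidf_weight(term, vector_counts, total_docs, doc_freq):
    """TFIDF weight of a term in a document vector: term frequency
    multiplied by the log of the inverse document frequency."""
    tf = vector_counts.get(term, 0)
    df = doc_freq.get(term, 0)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(total_docs / df)

def tfidf_similarity(term_a, term_b, counts_a, counts_b, total_docs, doc_freq):
    """Term-term similarity: product of the two TFIDF weights.
    Based on identity, so it is non-zero only when the terms match."""
    if term_a != term_b:
        return 0.0
    return (tfidf_weight(term_a, counts_a, total_docs, doc_freq) *
            tfidf_weight(term_b, counts_b, total_docs, doc_freq))

# Hypothetical statistics: 1000 documents; 'lebanon' occurs in 50 of them.
doc_freq = {'lebanon': 50}
a = {'lebanon': 2}
b = {'lebanon': 1}
print(tfidf_similarity('lebanon', 'lebanon', a, b, 1000, doc_freq) > 0)  # True
print(tfidf_similarity('lebanon', 'beirut', a, b, 1000, doc_freq))       # 0.0
```

A term that occurs in many documents receives a small log factor, which captures the informativeness described in the text.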
Cover
Assuming a taxonomy of terms, COVER is a similarity measure based on the common path of two terms with respect to the path lengths to the terms. COVER is a function

δk COVER(a, b) = fA(a) · fB(b) · j(a, b)

where B is a vector of comparison terms, comprising terms b1, b2, ..., bs in semantic class Sb, fA(a) and fB(b) are the frequencies of terms a and b, and j(a,b) is the Jaccard coefficient of the lengths of the two paths,

j(a, b) = c(a, b) / (l(a) + l(b) - c(a, b))

where c(a, b) is the length of the common path of a and b, and l(a) and l(b) are the lengths of the paths from the root to a and b.
As a simplified example, geographical locations applicable in the embodied system may be divided into multileveled ontology. Figure 3A shows a simplified example on how the terms in the semantic class L are positioned in the structure of the 4-level hierarchy. When the system detects a term tι in class L, it also determines the type of the location, and positions the term into correct level in the structure of Figure 3A. The first level corresponds with one node, the world; the second level corresponds with continents, the third level with countries and the fourth level with cities. It is clear that any level of detail and other information structures are applicable within the scope of protection.
Figure 3B shows a corresponding simplified hierarchic structure for the location terms of Figure 3A. Each node in the tree stands for a location. Considering the simple geographical taxonomy of Figure 3B, the common path of terms 'Paris' and 'Lyon' starts from the root and covers 'Europe' and 'France'. The length of the common path is thus 2, and the lengths of the paths of the individual terms from the root are 3. The Jaccard coefficient of the terms is thus 2/(3+3-2) = 0.5. As a further example, similarity between 'China' and 'Paris' yields 0/(2+3-0) = 0, similarity between 'Paris' and 'Germany' yields 1/(3+2-1) = 0.25, and similarity between 'Paris' and 'France' yields 2/(2+3-2) = 0.67. Clearly, the higher the Jaccard coefficient is, the closer the terms reside in the taxonomy; they 'cover' the same path.
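The Jaccard coefficient computation above can be sketched as follows; the paths mirror the Figure 3B taxonomy, and this is an illustration only, not the claimed implementation:

```python
def jaccard_path_coefficient(path_a, path_b):
    """Jaccard coefficient of two taxonomy paths (root node omitted):
    common prefix length / (len_a + len_b - common prefix length)."""
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1
    return common / (len(path_a) + len(path_b) - common)

# Paths from the root of the Figure 3B taxonomy (the root itself omitted).
paris = ['Europe', 'France', 'Paris']
lyon = ['Europe', 'France', 'Lyon']
china = ['Asia', 'China']
germany = ['Europe', 'Germany']
print(jaccard_path_coefficient(paris, lyon))     # 0.5
print(jaccard_path_coefficient(paris, china))    # 0.0
print(jaccard_path_coefficient(paris, germany))  # 0.25
```

The printed values reproduce the worked examples in the text.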
Resnik
In order to determine the semantic similarity of terms in an 'is a' taxonomy, it has been proposed to use information content that arises from information theory. The information content of a term a is the negative log-likelihood, -log p(a). If the term is highly probable (frequent), its information content is low and vice versa.
In this context, the probabilities of, for example, location terms are simple relative frequencies

p(a) = f(a) / T

where f(a) is the frequency of the location term a in the background corpus, and T is the number of all location term occurrences. When calculating the probabilities, an occurrence of a location term is also an occurrence of all locations above it, all the way to the root. For example, on encountering 'Paris', we consider it also an occurrence of 'France' and 'Europe'. This way, the frequency of the root node ('World') equals T, and thus its information content is zero. Furthermore, the probability can only increase (and the information content decrease) as one moves up in the taxonomy tree.
As with COVER, the similarity of two terms is based on the common node furthest from the root, or equivalently, the common node with the lowest probability and thus the highest information content. If S(a,b) comprises the common path of location terms a and b, the term-term similarity is a function

δk RESNIK(a, b) = fA(a) · fB(b) · max c∈S(a,b) [-log p(c)]

where fA(a) and fB(b) are the raw term frequencies of a and b.
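The information-content computation described above can be sketched as follows. This is an illustration only, with hypothetical propagated frequencies and a frequency weighting reconstructed from the surrounding description:

```python
import math

# Hypothetical propagated frequencies: every occurrence of a city also
# counts as an occurrence of its country, continent, and the root.
counts = {'World': 1000, 'Europe': 400, 'France': 100, 'Paris': 60, 'Lyon': 20}
total = 1000

def information_content(term):
    """Negative log-likelihood of a term; because frequencies are
    propagated upwards, the root has probability 1 and content 0."""
    return -math.log(counts[term] / total)

def resnik_similarity(freq_a, freq_b, common_path):
    """Term-term similarity weighted by raw frequencies and the
    information content of the most informative common ancestor."""
    ic = max(information_content(c) for c in common_path)
    return freq_a * freq_b * ic

# 'Paris' and 'Lyon' share the path World -> Europe -> France.
print(resnik_similarity(2, 1, ['World', 'Europe', 'France']) > 0)  # True
print(information_content('World') == 0.0)                         # True
```

Note how the root contributes nothing, so the similarity is driven by the deepest shared node, consistent with the COVER intuition.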
As discussed above, the similarity function for a semantic class advantageously reflects the similarity in a way that is relevant for the purpose and/or for the subject of the terms in the semantic class. For example, instead of the tree-like taxonomy presented above, similarity of geographical names may be based on their absolute geospatial distance. Thus comparing 'New York' to 'Newark' would result in higher similarity than comparing 'New York' to 'York' or 'Toronto', because 'Newark' is located closer than the other two. Consequently, similarity could be based, for example, on the ontology of a particular field (for example, the taxonomy of bacteria or of music genres), on physical measures (for example, spatial distance, temporal distance), on statistical models (for example, bigram models, distributional similarity), or on product names (for example, EM64T could be a processor that is 'almost' similar to IA64). On the basis of the above, implementations using other types of similarity functions are obvious for a person skilled in the art.
In the embodied system, a data item may comprise recognizable terms in one or more semantic classes, and for the purposes of the analysis the terms may be written out as a combination of vectors, each distinct for a specific semantic class. This combination of vectors corresponding with a data item is hereinafter called a multivector. As each semantic class has a specific similarity measure, the comparison of two multivectors is carried out for each semantic class separately. While the comparison of two traditional document vectors produces one real valued similarity, the result of class-wise comparison of two multivectors is a similarity vector v.
The similarity Δk of two vectors Ak and Bk of a semantic class Sk may thus be determined from the equation:

Δk(Ak, Bk) = ( Σi Σj δk(ai, bj) ) / (‖Ak‖ · ‖Bk‖), i, j = 1, ..., m,

where m is the number of terms in the semantic class k and δk(ai, bj) is the class-specific pairwise similarity of terms ai and bj. If δk is based on identity, the function Δk is equivalent to the cosine of the two vectors.
Continuing the example of Figure 2, Figure 4 shows a simplified illustration of a multivector A of the target terms, and a multivector B of the comparison terms. The semantic environment comprises three semantic classes N (names), L (locations), and M (miscellaneous). According to the invention, the terms of the two multivectors are divided into these semantic classes, and the comparison of the two multivectors is performed class per class. In each semantic class the specific similarity function is used. For example, in the semantic class N one could employ TFIDF, in the semantic class L Cover or Resnik, and in the semantic class M TFIDF. Assuming the similarity of names results in 0.39, the similarity of locations in 0.35 and the similarity of miscellaneous terms in 0.73, the similarity vector would be v = (0.39, 0.35, 0.73).
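The class-per-class comparison can be sketched as follows. This is an illustration only; the similarity functions here are hypothetical stand-ins that simply return the example values from the text:

```python
CLASS_ORDER = ['N', 'L', 'M']  # names, locations, miscellaneous

def similarity_vector(mv_a, mv_b, sim_funcs):
    """Compare two multivectors class by class, applying the similarity
    function specific to each semantic class; the result is v."""
    return [sim_funcs[c](mv_a.get(c, []), mv_b.get(c, [])) for c in CLASS_ORDER]

# Hypothetical stand-ins returning the example similarities from the text;
# real implementations would be TFIDF, COVER, Resnik, etc.
funcs = {
    'N': lambda a, b: 0.39,
    'L': lambda a, b: 0.35,
    'M': lambda a, b: 0.73,
}
A = {'N': ['Makkonen'], 'L': ['Helsinki'], 'M': ['filtering']}
B = {'N': ['Makkonen'], 'L': ['Finland'], 'M': ['filter']}
print(similarity_vector(A, B, funcs))  # [0.39, 0.35, 0.73]
```

Each component of v is produced independently, which is what allows a different similarity measure per semantic class.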
For a decision based on comparisons in a plurality of semantic classes, the threshold is also substantially multidimensional. In an environment with a number of n semantic classes, an n-dimensional threshold, a hyperplane, is needed to separate similarity from dissimilarity. The assumption is that in a semantic environment, similar documents are fundamentally similar in the same way, and the similar and non-similar vectors form learnable clusters in the specific vector space. On the basis of an analyzed training material that comprises relevant and non-relevant pairs of multivectors, it is possible to determine a hyperplane to be utilized for subsequent classification of similarity vectors. The similarity vectors from a plurality of comparisons between two multivectors are mapped into the space of similarity vectors, and the decision on similarity is based on the relation between the hyperplane and the coordinates of the similarity vector. For example, as shown in Figure 4, the decision on similarity may be based on the sign of the distance to the hyperplane.
In the example of Figure 4, the plus signs are used to denote coordinates that correspond with a similarity vector computed on the basis of multivectors of target terms and comparison terms, and where the multivectors are considered appropriately similar for relevancy. Correspondingly, the minus signs denote coordinates corresponding with a similarity vector of multivectors that are considered dissimilar for relevancy. In the embodied case, the similarity space is divided into parts and the comparison with the hyperplane returns a result value R that indicates which part of the similarity space the similarity vector is associated with. In the embodied example, the hyperplane divides the similarity space into two parts, and one only needs to know which of the two parts the similarity vector belongs to. For a person skilled in the art it is clear that also some other methods of dividing the similarity space are possible, and that more than two parts may be employed. Generally, when one has a trained hyperplane w, one can compute the distance Ψ(v) of any new vector v and the hyperplane (w, b) by their inner product:

Ψ(v) = w0 + w1v1 + w2v2 + ... + wnvn

where w1, ..., wn are the weights of the semantic classes and w0 is the bias. The weight factor for a semantic class may be determined by running comparisons on pairs of documents in training data, and drawing stratified random samples of a plurality of comparisons. The samples may be used to train a perceptron, and the trained perceptron may then be used against all other samples. The cross-validation produces a number of candidate weight vectors that may be averaged for the weight vector w to be used in the evaluations between the similarity vector and the hyperplane.
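The inner-product decision rule can be sketched as follows. This is an illustration only; the weights and bias below are hypothetical, not trained values:

```python
def hyperplane_distance(v, weights, bias):
    """Signed score of a similarity vector against the hyperplane:
    the bias plus the inner product with the class weights."""
    return bias + sum(w_i * v_i for w_i, v_i in zip(weights, v))

def is_relevant(v, weights, bias):
    """The positive side of the hyperplane indicates relevancy."""
    return hyperplane_distance(v, weights, bias) > 0

# Hypothetical trained weights for classes (N, L, M) and a bias term.
w, b = [1.0, 0.8, 0.6], -0.9
print(is_relevant([0.39, 0.35, 0.73], w, b))  # True
print(is_relevant([0.05, 0.10, 0.05], w, b))  # False
```

Only the sign of the score is needed when the hyperplane divides the space into two parts, as in the embodied example.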
Figure 5 illustrates a procedural model of the operation of the embodied system. The process P1 denotes the stage of comparing the multivectors A and B in semantic classes with specific similarity functions in order to compute the similarity vector v, as described above. The process P2 denotes the stage of comparing the similarity vector v with the hyperplane in order to determine the part of the vector space the similarity vector v belongs to, as described above.
The process P3 denotes a subsequent stage of determining a function F to be performed by the system according to the derived result value R. For this purpose, the system is configured with a group of pre-programmed functions, each corresponding with a result value such that whenever a result value R is derived as a result of procedures P1 and P2, a corresponding function F is initiated by the system. For a person skilled in the art it is clear that the procedures of F may vary considerably according to the application.
In the embodied exemplary case the system serves to determine whether an incoming document is relevant in view of the group of comparison terms. The result value R may be determined from the inner product of the similarity vector and the weight vector, as described above; a positive value relates to relevant documents and a negative value relates to irrelevant documents. The system is thus configured with a function F1 that is initiated in response to positive values of R and a function F2 that is initiated in response to negative values of R. As an example, in the positive case, F1 may comprise a procedure for creating a record on the analyzed data item and storing the record in the list of relevant documents. The record comprises at least an address in which the data item is accessible for retrieval, and advantageously additional metadata applicable for the purpose of organizing the relevant documents according to a defined classification factor (date, source, etc.). Correspondingly, F2 may comprise a procedure for rejecting the analyzed data item. The retrieval address may subsequently be used as a routing address for the upload request initiated by a user clicking a hyperlink in a result view of relevant documents in the user interface. This enables a method of providing users with straightforward access to a continuously updated list of documents relevant in respect of the filter criteria formed by the comparison terms. As another example, F1 may comprise a procedure for routing the analyzed data item to a first address and F2 may comprise a procedure for routing the analyzed data item to a second address. This enables a method of routing data items according to the filtering done by the semantic meaning of their content. As discussed above, the comparison terms in the multivector B may correspond to any group of input terms.
For a person skilled in the art it is obvious that the comparison terms may be formed from, for example, a group of search terms input by the user of the system, or extracted from one or more documents referred to by the user of the system. The flow chart of Figure 6 illustrates a simplified embodiment of the method implemented in the embodied system according to the invention. A more detailed description of the steps may be found in the corresponding descriptions of earlier Figures 1 to 4. In the beginning, the system is configured (step 60) with a group of comparison terms. As discussed above, the group of comparison terms may, for example, correspond with a group of search terms input by a user of the system, or they may be extracted from a document or averaged from a group of documents. The analysis of a new data item begins with reception (step 61) of a data item DOCA. The data item is parsed (step 62) and the terms relevant in the selected semantic environment are extracted from the data. These target terms extracted from DOCA are compiled (step 63) into a multivector A and arranged into semantic classes applicable in the selected semantic environment. The group of predefined comparison terms, arranged into a multivector B applying the same semantic classes, is retrieved for processing. A comparison is performed in each applied semantic class according to the similarity function specific to the semantic class, and as a result a similarity vector v illustrating the semantic similarity between multivector A of the target terms and multivector B of the comparison terms is determined (step 64). For the semantic environment there exists a hyperplane w that has previously been generated from a selected sampling of teaching material.
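Steps 62 to 64 can be sketched as follows. The patent leaves the concrete similarity function of each semantic class open; here Jaccard overlap is assumed for every class purely for illustration, and the class labels N, L, M follow the example used later in the description.

```python
# Minimal sketch of steps 62-64: target terms of a data item, arranged into
# semantic classes (multivector A), are compared class by class against the
# predefined comparison terms (multivector B), yielding a similarity vector v.
# Jaccard overlap is an assumed stand-in for the per-class similarity function.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def similarity_vector(multi_a, multi_b, classes=("N", "L", "M")):
    # One element per semantic class, computed with that class's function.
    return [jaccard(multi_a.get(c, ()), multi_b.get(c, ())) for c in classes]

A = {"N": ["zidane", "materazzi"], "L": ["berlin"], "M": ["2006-07-09"]}
B = {"N": ["zidane"], "L": ["berlin", "rome"], "M": ["2006-07-09"]}
v = similarity_vector(A, B)   # -> [0.5, 0.5, 1.0]
```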
Parameters of the hyperplane w of the semantic environment are retrieved for processing, and the similarity vector v is compared with the hyperplane in order to determine (step 65) a comparison value R that indicates the part of the vector space the similarity vector v belongs to. Each applicable comparison value R corresponds with a predefined function of the system and a function F(R) corresponding with the derived result value R is initiated (step 66). The system then moves to a standby state where reception or input of a new data item is checked (step 67). When a new data item arrives (step 68), the procedure continues from step 61.
In a further embodiment of the invention, the filter is formed by the multivector of the comparison terms, and the multivector is computed as an average of terms from a group of documents that deal with a particular subject. For example, one may collect a group of documents recognizably dealing with a defined subject, for example 'Zidane headbutt'. From the documents one may extract and derive a group of terms that form the average vector of the documents relevant in respect of the defined incident. The average vector thus does not represent any single comparison document but forms a combination of terms that in an optimal way combine the semantic context around the subject. Use of the average vector also allows weighting the averaged terms in a selected manner. As a simple example, all terms in the group of documents considered relevant for a particular incident may be included in the group of comparison terms. The weight of each of the terms may then be set to the average of the times the term appears in the documents. For example, in case the term 'headbutt' appears in one of the documents once and in another one of the documents seven times, the weight of the term in the average vector would be the average of one and seven, i.e. four. Using the average vector as the vector of the comparison terms enables the system to detect and record new documents dealing with the incident. Use of average vectors facilitates a further advantage in the form of dynamic adaptivity of the system. Adaptivity in this context means that the semantic interpretation of the terms may be adapted through the results of the performed analysis. For example, a monitored subject typically evolves, and the amount of data accumulates during the monitoring period. In adaptation, after detecting a relevant document the average vector is recalculated using the new relevant document as one of the relevant documents.
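The weighting scheme above can be sketched directly, reproducing the 'headbutt' example in which a term appearing once in one document and seven times in another receives the weight (1 + 7) / 2 = 4. The function name is illustrative.

```python
# Sketch of forming the average vector: every term from a group of relevant
# documents enters the comparison vector with a weight equal to its average
# occurrence count over the documents.

from collections import Counter

def average_vector(documents):
    """documents: list of term lists. Returns {term: average count}."""
    counts = [Counter(doc) for doc in documents]
    terms = set().union(*counts)
    n = len(documents)
    return {t: sum(c[t] for c in counts) / n for t in terms}

docs = [["headbutt"], ["headbutt"] * 7]
avg = average_vector(docs)    # -> {'headbutt': 4.0}
```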
Through the average vector, the new terms and phrases used in describing the monitored instance are dynamically updated to the semantic interpretation of the data items. Through feedback from the numerous iterations, the semantic interpretation of the subject may be continuously focused, which further improves the operation of the system.
The dynamic adjustment of the filtering is illustrated by means of the flow chart of Figure 7. In the embodied case, the system is configured (step 70) with a group of comparison terms that represent an average of terms determined from a group of data items known to be relevant in respect of the defined subject. Steps 71 to 75 correspond substantially with steps 61 to 65 of Figure 6. When the result value R that indicates the possible similarity or dissimilarity of the data in respect of the filter criteria formed by the comparison terms is available, the system checks (step 76) whether the result is applicable for adjusting the average vector in use. The condition for the adjustment decision may vary according to the implementation. For example, the adjustment may be done on the basis of a result value that indicates great similarity with the average vector (adjustment according to a positive finding). Alternatively, the adjustment may be done on the basis of a result value that indicates very weak similarity with the average vector (adjustment according to a negative finding). For a person skilled in the art it is clear that any other criteria may be used within the scope of protection. If the system decides (step 77) that the data item could be used for focusing the average vector, it recalculates the average vector, whereafter the new updated average vector will be used in comparison with the target terms (step 78). If the update is considered unnecessary, the system will continue through steps 79 to 81 as in the corresponding steps 66 to 68 of Figure 6.
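The adjustment decision of steps 76 to 78 can be sketched as an online update. This sketch assumes adjustment on a positive finding (a positive result value); the class and method names are illustrative, not from the patent.

```python
# Sketch of the adaptation loop: when a new data item's result value
# satisfies the adjustment condition, the item is added to the group of
# relevant documents and the average vector is recalculated (step 78).

from collections import Counter

class AdaptiveFilter:
    def __init__(self, relevant_docs):
        self.relevant_docs = list(relevant_docs)   # lists of terms
        self.average = self._recompute()

    def _recompute(self):
        counts = [Counter(d) for d in self.relevant_docs]
        terms = set().union(*counts)
        n = len(self.relevant_docs)
        return {t: sum(c[t] for c in counts) / n for t in terms}

    def observe(self, doc_terms, result_value):
        if result_value > 0:                       # positive finding (step 76)
            self.relevant_docs.append(doc_terms)   # step 77
            self.average = self._recompute()       # step 78
```

Through repeated calls to `observe`, new terms used for the monitored subject flow into the average vector, focusing the semantic interpretation over time.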
In addition, the use of the average vector also allows adjustment of the semantic interpretation on the basis of computational criteria that are relatively easy to determine during the operation of the system, or are even available as a by-product of the computation of the similarity functions. For example, as discussed above, the informativeness of a term may be determined from its frequency in the series of consecutive data items. At times, a term that fairly seldom appears in the data items may suddenly start to appear very frequently, and at those times the term itself becomes far less informative.
For example, the term 'Lebanon' may for a long period be relatively rarely cited, and the informativeness of the term remains at a defined level. During a time of restlessness, the amount of data items including the term 'Lebanon' increases, but the data items may actually be relevant in several semantic subjects, for example 'oil damages', 'UN operations' or 'tourism'. In order to improve the accuracy of the dynamic filtering for detecting the relevant data items, the informativeness of the term 'Lebanon' may be decreased and increased in relation to the frequency of the appearance of the term in the consecutive data items. As another example, in case the average vector represents a compilation of common terms in the averaged relevant documents, it is possible to adjust the filtering by focusing the average on the most common terms. This reduces the possibility of erroneous positive interpretations ('noise') during the filtering. As a further example, it is also possible that as time progresses, the terms used in data items relevant in respect of the same subject change considerably. This may be alleviated by adjusting the weight of the terms in the average vector according to their frequency in the consecutive data items. For example, if a term has not appeared in a defined number of consecutive data items, or within a defined period of time, its weight may be decreased. Appearances, or consecutive appearances, of the term will, correspondingly, refresh the term by increasing its weight in the average vector.
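The decay-and-refresh adjustment can be sketched as follows. The gap threshold and the decay and refresh factors are illustrative assumptions; the patent only states that stale terms lose weight and appearing terms regain it.

```python
# Sketch of refreshing term weights by appearance in consecutive data items:
# a term absent for more than `max_gap` items has its weight in the average
# vector decayed, while an appearance refreshes (increases) it.

def update_weights(weights, last_seen, item_index, item_terms,
                   max_gap=10, decay=0.9, refresh=1.1):
    for term in weights:
        if term in item_terms:
            last_seen[term] = item_index
            weights[term] *= refresh          # appearance refreshes the term
        elif item_index - last_seen.get(term, item_index) > max_gap:
            weights[term] *= decay            # stale terms fade out
    return weights
```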
These methods of adjusting the average vector are exemplary only and should not be interpreted as restricting the scope of protection. For a person skilled in the art it is obvious that the average vector may be adjusted in various other ways.
The space of vectors v is seldom linearly separable, and the number of semantic classes may sometimes be quite low. In order to enhance the learnability, in a further embodiment of the invention, the vector v is expanded with a power set where
Φ : R^n → R^(2^n) maps the vector v into a space where the 2^n dimensions correspond with the subsets of the semantic classes.
For example, in the semantic environment applying semantic classes N, L, M, data items A and B may be processed into multivectors A(n1, l1, m1) and B(n2, l2, m2), as described above. Denoting the similarity functions of classes N, L, M with SN, SL, SM, respectively, the exemplary values for the similarity vector v could be:
v(N) = SN(n1, n2) = 0,2
v(M) = SM(m1, m2) = 0,3
v(L) = SL(l1, l2) = 0,5.
As described above, in order to be able to determine whether the similarity vector v indicates appropriate semantic similarity (in this context referred to as relevancy) between the data items A and B, a way to interpret the similarity of the vectors in the applied semantic environment is needed. Thus a number of data items whose relevancy is known are compared in the same manner, and a record of the computed similarity vectors v and a corresponding parameter value, such as 1 or -1, is stored in the teaching material. This generates a set of vector/value pairs
(<v1,1>, <v2,-1>, <v3,-1>, ..., <vn,1>).
On the basis of statistical analysis or machine learning it is possible to determine a weight vector w with which newly determined similarity vectors may be detected to represent relevancy or irrelevancy. The decision may be based on the distance from v to w, calculated as their inner product. A positive value of the inner product may be interpreted to indicate relevancy, a negative value irrelevancy.
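One concrete way to learn such a weight vector from the vector/value pairs is a perceptron. This is an illustrative choice: the patent names statistical analysis or machine learning in general without fixing the algorithm, and any linear classifier (e.g. a support vector machine) could be substituted.

```python
# Perceptron sketch of learning w from teaching pairs (<v1,1>, <v2,-1>, ...):
# afterwards the sign of the inner product <w, v> indicates relevancy (+)
# or irrelevancy (-).

def train_perceptron(pairs, epochs=100, lr=0.1):
    dim = len(pairs[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for v, label in pairs:
            # Misclassified (or on the boundary): nudge w toward the label.
            if label * sum(wi * vi for wi, vi in zip(w, v)) <= 0:
                w = [wi + lr * label * vi for wi, vi in zip(w, v)]
    return w

pairs = [([1.0, 0.9, 0.8], 1),    # known relevant similarity vector
         ([1.0, 0.1, 0.2], -1)]   # known irrelevant similarity vector
w = train_perceptron(pairs)
relevant = sum(wi * vi for wi, vi in zip(w, [1.0, 0.8, 0.9])) > 0
```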
It has been detected that three or four semantic classes, which would be well applicable for the semantic analysis, do not necessarily provide an adequate number of dimensions for machine learning. In order to improve the accuracy of the comparisons with the hyperplane, the vectors may be expanded. In this embodiment, the expansion is made with dimensions N*M, N*L, M*L and N*M*L. In several cases, determination of the distance between the similarity vector and the hyperplane requires that the offset b of the hyperplane w from the origin is taken into consideration. In order to streamline calculations, it is usually assumed that w[0]=b and a neutral element v[0]=1 is included in v. Thus the expanded vector becomes v' = (1; 0,2; 0,3; 0,5; 0,06; 0,1; 0,15; 0,03). It has been noticed that in an eight-dimensional space a weight vector w' that detects negative and positive cases is easier to find. The above example is provided merely to illustrate the present embodiment. Application of other expansion methods is obvious for a person skilled in the art.
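The power-set expansion described above can be sketched as follows. Each added dimension is assumed to be the product of the class similarities in the corresponding subset of classes, which reproduces the eight-dimensional example vector; the function name is illustrative.

```python
# Sketch of the power-set expansion: a 3-dimensional similarity vector is
# mapped into 2^3 = 8 dimensions, one per subset of the semantic classes.
# The empty set contributes the neutral element v[0] = 1, so the hyperplane
# offset can be carried as w[0] = b.

from itertools import combinations
from math import prod

def expand(v):
    out = [1.0]                                  # neutral element for the bias
    for size in range(1, len(v) + 1):
        for subset in combinations(range(len(v)), size):
            out.append(prod(v[i] for i in subset))
    return out

v = [0.2, 0.3, 0.5]
v_expanded = expand(v)
# -> [1.0, 0.2, 0.3, 0.5, 0.06, 0.1, 0.15, 0.03] (up to float rounding)
```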
In an aspect, the invention provides a computer program product encoding a computer program of instructions for executing a computer process.
In another aspect, the invention provides a computer program distribution medium readable by a computer and encoding a computer program of instructions for executing a computer process. The distribution medium may include a computer readable medium, a program storage medium, a record medium, a computer readable memory, a computer readable software distribution package, a computer readable signal, a computer readable telecommunications signal, and/or a computer readable compressed software package. Embodiments of the computer process are shown and described in conjunction with Figure 6. The computer program may be executed in the control unit of a computing node receiving input data items.
Even though the invention is described above with reference to an example according to the accompanying drawings, it is clear that the invention is not restricted thereto but it can be modified in several ways within the scope of the appended claims.

Claims

1. A method of filtering data items comprising a plurality of terms, a term being a separately identifiable data unit that is associated with a meaning in a semantic environment, the method comprising: dividing terms in the semantic environment into semantic classes; defining a group of one or more comparison terms; determining from a data item a group of one or more target terms; computing a relevancy vector between the comparison terms and the target terms, where an element of the relevancy vector is computed with a predefined relevancy function specific to a semantic class, the relevancy function being adjusted to return an element value that represents semantic similarity between comparison terms and target terms in the semantic class; comparing the position of the relevancy vector with a predefined relevancy hyperplane that divides the space of relevancy vectors into parts, each relevancy vector space part enclosing a subspace of the relevancy vector space and corresponding with a specific filtering function; and determining the filtering function for the data item according to the part in which the position of the vector lies.
2. A method according to claim 1, characterized by defining the comparison terms by extracting the comparison terms from a group of one or more search terms, input by a user of a system implementing the method.
3. A method according to claim 1, characterized by the terms corresponding with a character string and the method comprising defining the comparison terms by extracting the comparison terms from a defined document.
4. A method according to claim 1, characterized by the terms corresponding with a character string and the method comprising averaging the comparison terms from a group of data items.
5. A method according to claim 4, characterized by: selecting a part of the relevancy vector space to comprise positions relevant to averaging; in response to the relevancy vector of a data item having a position relevant to averaging, recomputing the comparison terms from a group of items including said data item.
6. A method according to claim 4, characterized by adjusting the weights according to statistical analysis on the appearance of the terms in a series of consecutive data items.
7. A method according to any of the preceding claims, characterized by the step of comparing comprising computing the distance of the relevancy vector and the hyperplane by their inner product, and considering positive distance values as indicating the vector being in the first part and negative distance values as indicating the vector being in the second part.
8. A method according to any of the preceding claims, characterized by the semantic classes comprising at least one of the following classes: locations, times, and names.
9. A method according to any of the preceding claims, characterized by the predefined semantic relevancy in the semantic class of locations being computed on the basis of a hierarchic matrix, where each level of the matrix corresponds with the type of hierarchic location definition.
10. A method according to claim 9, characterized by the predefined semantic relevancy being computed by dividing the length of the common path in the hierarchic matrix by the sum of the paths to the elements.
11. A method according to any of the preceding claims, characterized by the step of comparing comprising use of additional dimensions generated from combinations of semantic classes.
12. A system comprising an interface unit for receiving data items comprising a plurality of terms, a term being a separately identifiable data unit that is associated with a meaning in a semantic environment; a control unit that is electrically connected to the interface unit, the functions of the control unit being at least partially controlled by program code, said program code comprising program code configuring said system to divide terms in the semantic environment into semantic classes; program code configuring said system to define a group of one or more comparison terms; program code configuring said system to determine from a data item a group of one or more target terms; program code configuring said system to compute a relevancy vector between the comparison terms and the target terms, where an element of the relevancy vector is computed with a predefined relevancy function specific to a semantic class, the relevancy function being adjusted to return an element value that represents semantic similarity between comparison terms and target terms in the semantic class; program code configuring said system to compare the position of the relevancy vector with a predefined relevancy hyperplane that divides the relevancy vector space into parts, each relevancy vector space part enclosing a subspace of the relevancy vector space and corresponding with a specific filtering function; and program code configuring said system to determine the filtering function for the data item according to the part in which the position of the vector lies.
13. A system according to claim 12, characterized by the program code configuring said system further to define the comparison terms by extracting the comparison terms from a group of one or more search terms, input by a user of the system.
14. A system according to claim 12, characterized by the terms corresponding with a character string and the program code configuring said system further to define the comparison terms by extracting the comparison terms from a defined document.
15. A system according to claim 12, characterized by the terms corresponding with a character string and the program code configuring said system further to average the comparison terms from a group of data items.
16. A system according to claim 15, characterized by the program code configuring said system further to select a part of the relevancy vector space to comprise positions relevant to averaging; in response to the relevancy vector of a data item having a position relevant to averaging, recompute the comparison terms from a group of items including said data item.
17. A system according to claim 15 or 16, characterized by the program code configuring said system further to adjust the weights according to statistical analysis on the appearance of the terms in a series of consecutive data items.
18. A system according to any of the preceding claims, characterized by the step of comparing comprising computing the distance of the relevancy vector and the hyperplane by their inner product, and considering positive distance values as indicating the vector being in the first part and negative distance values as indicating the vector being in the second part.
19. A system according to any of the preceding claims, characterized by the program code configuring said system further to use additional dimensions generated from combinations of semantic classes.
20. A system according to any of the preceding claims, characterized by the system being a network server.
21. A computer program product encoding a computer program of instructions for executing a computer process for filtering data items comprising a plurality of terms, a term being a separately identifiable data unit that is associated with a meaning in a semantic environment, the process comprising: dividing terms in the semantic environment into semantic classes; defining a group of one or more comparison terms; determining from a data item a group of one or more target terms; computing a relevancy vector between the comparison terms and the target terms, where an element of the relevancy vector is computed with a predefined relevancy function specific to a semantic class, the relevancy function being adjusted to return an element value that represents semantic similarity between comparison terms and target terms in the semantic class; comparing the position of the relevancy vector with a predefined relevancy hyperplane that divides the relevancy vector space into parts, each relevancy vector space part enclosing a subspace of the relevancy vector space and corresponding with a specific filtering function; and determining the filtering function for the data item according to the part in which the position of the vector lies.
22. A computer program distribution medium readable by a computer and encoding a computer program of instructions for executing a computer process for filtering data items comprising a plurality of terms, a term being a separately identifiable data unit that is associated with a meaning in a semantic environment, the process comprising: dividing terms in the semantic environment into semantic classes; defining a group of one or more comparison terms; determining from a data item a group of one or more target terms; computing a relevancy vector between the comparison terms and the target terms, where an element of the relevancy vector is computed with a predefined relevancy function specific to a semantic class, the relevancy function being adjusted to return an element value that represents semantic similarity between comparison terms and target terms in the semantic class; comparing the position of the relevancy vector with a predefined relevancy hyperplane that divides the relevancy vector space into parts, each relevancy vector space part enclosing a subspace of the relevancy vector space and corresponding with a specific filtering function; and determining the filtering function for the data item according to the part in which the position of the vector lies.
23. The computer program distribution medium of claim 22, the distribution medium comprising a computer readable medium, a program storage medium, a record medium, a computer readable memory, a computer readable software distribution package, a computer readable signal, a computer readable telecommunications signal, and a computer readable compressed software package.
PCT/FI2007/050511 2006-09-26 2007-09-25 Filtering of data items WO2008037848A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP07823148A EP2080126A4 (en) 2006-09-26 2007-09-25 Filtering of data items

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20065591 2006-09-26
FI20065591A FI120807B (en) 2006-09-26 2006-09-26 Filtering of data units

Publications (1)

Publication Number Publication Date
WO2008037848A1 true WO2008037848A1 (en) 2008-04-03

Family

ID=37067242

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2007/050511 WO2008037848A1 (en) 2006-09-26 2007-09-25 Filtering of data items

Country Status (3)

Country Link
EP (1) EP2080126A4 (en)
FI (1) FI120807B (en)
WO (1) WO2008037848A1 (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5694592A (en) * 1993-11-05 1997-12-02 University Of Central Florida Process for determination of text relevancy
WO2005041063A1 (en) * 2003-09-30 2005-05-06 British Telecommunications Public Limited Company Information retrieval
US20050228783A1 (en) * 2004-04-12 2005-10-13 Shanahan James G Method and apparatus for adjusting the model threshold of a support vector machine for text classification and filtering
EP1587010A2 (en) * 2004-04-15 2005-10-19 Microsoft Corporation Verifying relevance between keywords and web site contents
US20060212413A1 (en) * 1999-04-28 2006-09-21 Pal Rujan Classification method and apparatus


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HAMAMOTO M. ET AL.: "A comparative study of feature vector-based topic detection schemes for text streams", PROC. 2005 INT. WORKSHOP ON CHALLENGES IN WEB INFORMATION RETRIEVAL AND INTEGRATION (WIRI'05), IEEE, XP010860567 *
MAKKONEN J. ET AL.: "Simple semantics in topic detection and tracking", INFORMATION RETRIEVAL, vol. 7, 2004, pages 347 - 368, XP019207463 *
NALLAPATI R.: "Semantic language models for topic detection and tracking", PROC. HTL-NAACL 2003, STUDENT RESEARCH WORKSHOP, EDMONTON, 2003, pages 1 - 6, XP008138104, Retrieved from the Internet <URL:http://www.web.archive.org/web> *
See also references of EP2080126A4 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10628504B2 (en) 2010-07-30 2020-04-21 Microsoft Technology Licensing, Llc System of providing suggestions based on accessible and contextual information
US9665643B2 (en) 2011-12-30 2017-05-30 Microsoft Technology Licensing, Llc Knowledge-based entity detection and disambiguation
US9864817B2 (en) 2012-01-28 2018-01-09 Microsoft Technology Licensing, Llc Determination of relationships between collections of disparate media types
CN107209760A (en) * 2014-12-10 2017-09-26 凯恩迪股份有限公司 The sub-symbol data coding of weighting
US11061952B2 (en) 2014-12-10 2021-07-13 Kyndi, Inc. Weighted subsymbolic data encoding
WO2020197636A1 (en) * 2019-03-28 2020-10-01 Microsoft Technology Licensing, Llc Encoder using machine-trained term frequency weighting factors that produces a dense embedding vector
US11669558B2 (en) 2019-03-28 2023-06-06 Microsoft Technology Licensing, Llc Encoder using machine-trained term frequency weighting factors that produces a dense embedding vector
CN110532345A (en) * 2019-07-15 2019-12-03 北京小米智能科技有限公司 A kind of processing method of unlabeled data, device and storage medium
EP3767488A1 (en) * 2019-07-15 2021-01-20 Beijing Xiaomi Intelligent Technology Co., Ltd. Method and device for processing untagged data, and storage medium
US11334723B2 (en) 2019-07-15 2022-05-17 Beijing Xiaomi Intelligent Technology Co., Ltd. Method and device for processing untagged data, and storage medium

Also Published As

Publication number Publication date
FI20065591A (en) 2008-03-27
EP2080126A1 (en) 2009-07-22
EP2080126A4 (en) 2010-12-01
FI120807B (en) 2010-03-15
FI20065591A0 (en) 2006-09-26


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07823148

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2007823148

Country of ref document: EP