WO2008029150A1 - Categorisation of data using a model - Google Patents


Info

Publication number
WO2008029150A1
Authority
WO
WIPO (PCT)
Prior art keywords
data object
category
patterns
model
input data
Prior art date
Application number
PCT/GB2007/003370
Other languages
French (fr)
Inventor
Yuriy Byurher
Original Assignee
Xploite Plc
Priority date
Filing date
Publication date
Priority claimed from GB0624665A external-priority patent/GB2442286A/en
Application filed by Xploite Plc filed Critical Xploite Plc
Publication of WO2008029150A1 publication Critical patent/WO2008029150A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data; database structures therefor; file system structures therefor
    • G06F 16/35: Clustering; Classification
    • G06F 16/353: Clustering; Classification into predefined classes

Definitions

  • If the score s_j meets a threshold th3, the document is categorised in the category c_j. The threshold th3 may be predetermined by empirical evidence. The threshold th3 may be 0.3.
  • In an alternative embodiment, the neural network trained on the categorisation model is used to categorise the input data object.
  • A neural network input set V1 is calculated for the input document 30.
  • The set V1 is comprised of elements v_1, v_2, ..., v_n where n is the number of categories in set C.
  • For each word w_i in the document and for each category c_j, the weighting w'_ij corresponding to the word w_i and the category c_j is extracted from the model 10. v_j is assigned the sum of the weightings corresponding to all the words in the document for category c_j, divided by the total number of words in the input document 30.
  • The input set V1 is provided to the trained neural network and an output set V2 is created.
  • The set V2 is comprised of elements v2_1, v2_2, ..., v2_n where n is the number of categories in set C. Each element is a value between zero and one.
  • The input document is categorised in category c_j if v2_j meets a threshold th4. The threshold value th4 may be predetermined and may be equal to 0.7.
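The final thresholding step, mapping the network's output set V2 to zero or more categories via th4, might be sketched as follows. This is an illustrative sketch only; the function name and argument layout are assumptions, not part of the patent.

```python
def categorise_from_outputs(v2, categories, th4=0.7):
    """Return every category c_j whose network output v2_j meets th4."""
    return [c for c, out in zip(categories, v2) if out >= th4]
```

A document can therefore end up in several categories, or in none, depending on how many outputs clear the threshold.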
  • In an example of model generation, the training set is comprised of four documents as follows:
  • A set of words w_i associated with categories c_j is extracted from the training set:
  • A weighting w'_ij is calculated for each unique pair of words and categories (w_i, c_j) in the set E:
  • The word, category and weighting (w_i, c_j, w'_ij) tuple is added to the model M if the weighting is over the predetermined threshold (0.7):
  • In a first categorisation example, the categorisation model M is used.
  • The document to be categorised is:
  • A set P is created comprising the words in the document that are also in the model:
  • A score s_j is calculated for each category c_j by summing the weights and dividing by the size of the set P:
  • In a second categorisation example, the categorisation model M is also used.
  • The document to be categorised is:
  • A set P is created comprising the words in the document that are also in the model:
  • A score s_j is calculated for each category c_j by summing the weights and dividing by the size of the set P:
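The patent's own four example documents are not reproduced in this text, so the worked example below uses a hypothetical stand-in training set to illustrate the same arithmetic: count word-category pairs, compute w'_ij = count(w_i, c_j) / count(w_i), and keep only tuples meeting the 0.7 threshold (the th1 and th2 filters are omitted here for brevity).

```python
from collections import Counter

# Hypothetical stand-in training set: each entry is (words, category).
training = [
    (["football", "goal", "goal"], "sport"),
    (["football", "match"], "sport"),
    (["election", "vote"], "news"),
    (["vote", "football"], "news"),
]

pair_counts, word_counts = Counter(), Counter()
for words, category in training:
    for w in words:
        pair_counts[(w, category)] += 1
        word_counts[w] += 1

# w'_ij = count(w_i, c_j) / count(w_i); keep tuples whose weighting meets 0.7.
model = {(w, c): n / word_counts[w]
         for (w, c), n in pair_counts.items()
         if n / word_counts[w] >= 0.7}

# "football" occurs in both categories (weightings 2/3 and 1/3), so both of
# its tuples fall below the threshold and are discarded; every other word
# occurs in a single category and receives a weighting of 1.0.
```

This also illustrates why the threshold filter tolerates noisy training data: a word spread across categories never enters the model.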
  • The method of the embodiment is referred to below as FWS (Fast Word Statistics).
  • Each of FWS, Bayesian and SVM was first trained on a training set comprised of web pages categorised into one of the four categories, and then each method was tested against a testing set of web pages for which the correct categorisation is known.
  • the training and testing sets contain raw HTML pages downloaded from the internet.
  • the distribution of the web pages across the sets and categories is as follows:
  • An embodiment provides a fast, linearly scalable learning method that permits fast construction of a categorisation model from raw, average-to-low-quality input documents.
  • An embodiment is immune to low-quality training sets.
  • an embodiment does not require pre-processing of the vocabularies or tuning to provide accurate categorisation.
  • An embodiment uses a statistical analysis approach and can be utilised in other fields such as data research and data mining.
  • An embodiment produces a different pattern of categorisation results to existing methods, in that the valid/invalid matches and uncategorised results for a set of documents are likely to be different to any other method. The consequence of this is that the embodiment is suited for combination with existing methods to produce an improved


Abstract

A method for generating a model for categorising input data objects, including the steps of: associating each data object of a plurality of data objects with one of a plurality of categories; extracting a plurality of patterns from the plurality of data objects; calculating a weighting for each pattern for at least one category based at least in part on the frequency of the pattern within data objects associated with that category compared to the frequency of the pattern within all data objects; and inserting each weighting into the model.

Description

CATEGORISATION OF DATA USING A MODEL
Field of Invention
The present invention relates to a method and system for generating a categorisation model, and for categorising data using the model.
Background
Categorisation of content such as web pages is useful for searching for information and for filtering information, such as filtering web pages from user internet requests so as to exclude inappropriate material.
Traditionally web pages have been categorised by collating categorisation suggestions from human users. An example of a system created by this method includes dmoz.org.
This method has several disadvantages. Firstly, the process overlooks some web pages or classes of web pages due to a non-systematic approach to categorisation. Secondly, multi-user input results in a lack of consistency of classification. And thirdly, the human time cost of classifying large sets of web pages, such as the majority of the internet, is very high.
Therefore this method is particularly unsuitable for producing a database which is capable of filtering web pages resulting from user internet requests.
Automated methods for categorising web pages have been explored. Two popular methods are the use of Bayesian algorithms and the use of support vector machines (SVM).
Each method extracts feature vectors from a training set of pre-categorised web pages. A feature vector is a vector of numeric features of objects within the training set. Feature vectors may include the occurrence of words or phrases, link information and image information. It should be noted that it is difficult to create a "perfect" training set where the categorised web pages contain no noise - features within a web page that contradict its categorisation.
A number of implementations of the automated methods use occurrence of words as their feature vectors and extract words from the web pages in the training set to build a vocabulary.
A large vocabulary will result in high dimensionality of feature vectors. High dimensionality of feature vectors can cause the automated methods to overfit. Overfitting is a phenomenon by which the classifier is tuned also to the contingent, rather than just the constitutive characteristics of the training set.
Classifiers which overfit the training data tend to be good at re-classifying the data they have been trained on, but much worse at classifying previously unseen data.
Overfitting is a significant problem for the Bayes algorithm method. This algorithm learns from the training set the conditional probability of each word for each category. A new web page is categorised within the category with the highest posterior probability computed according to the Bayes rule. When the number of training samples (web pages) is insufficient with respect to the number of features (words) used, the probabilities learnt may reflect noise in the training set and cannot be trusted to produce accurate categorisation.
The SVM method uses the feature vectors in the vocabulary to determine a hyperplane for each category. Each category hyperplane is defined by support vectors on the edge of the hyperplane. A category hyperplane is used to categorise a new web page as either within the category or not.
Some commentators in the literature suggest that SVMs are also susceptible to overfitting (Chen Lin et al., An Anti-Noise Text Categorization Method based on Support Vector Machines; Yoshua Bengio et al., The Curse of Dimensionality for Local Kernel Machines).
To prevent overfitting for both methods, the vocabulary (collection of feature vectors) will often need to be reduced in size.
The vocabulary is reduced by setting feature relevance thresholds. Selecting thresholds for relevance criteria is a complex task. It is dependent on the size and quality of the training set. As an example, thresholds can relate to the exclusion of common words (stop-words), replacement of words with their stems, and exclusion of very rare words.
Even with thresholds, the vocabulary will generally need to be tuned by an expert.
The consequence of these difficulties with the Bayesian and SVM methods is that they must be trained using large, high-quality training sets, and that significant user intervention is required to tune the methods for effective categorisation.
When the quality of the training set affects the quality of the categorisation method, extra effort must be expended by the human user to "clean" the training set. In addition, it is much more difficult to assist these methods to dynamically learn using new training data.
There is a desire for a method of categorising content, such as documents, web pages or any data object, which can utilise low quality training data.
It is an object of the present invention to provide a method for generating a categorisation model and categorising data which overcomes the disadvantages of above methods, or to at least provide a useful alternative.
Summary of the Invention
According to a first aspect of the invention there is provided a method for categorising an input data object using a model comprising a plurality of patterns associated with at least one weighting for a category, including the steps of: i) identifying patterns within the input data object that correspond to at least some of the patterns within the model; ii) for each identified pattern, determining a weighting for at least one category from the model; iii) calculating a score for the input data object for at least one category based at least in part on the weightings of the identified patterns and the frequency of the identified patterns within the input data object; and iv) categorising the input data object based at least in part on the calculated score.
Preferably at least some of the identified patterns are associated with a plurality of weightings for a plurality of categories.
A plurality of scores may be calculated for the input data object for a plurality of categories.
It is preferred that the input data object is categorised in dependence on the calculated score only when the calculated score meets a predefined threshold. The predefined threshold may be an empirically-derived threshold.
The score s_j for a category c_j may be calculated as follows:

s_j = score(c_j) / |P|

where |P| is the number of identified patterns in the input data object, and score(c_j) is the sum of weightings for all identified patterns associated with the category c_j.
The patterns may be words and the input data object may be a document or a web page.
It is preferred that the model is generated in accordance with the second aspect of the invention.
According to a second aspect of the invention there is provided a method for generating a model for categorising input data objects, including the steps of: i) associating each data object of a plurality of data objects with one of a plurality of categories; ii) extracting a plurality of patterns from the plurality of data objects; iii) calculating a weighting for each pattern for at least one category based at least in part on the frequency of the pattern within data objects associated with that category compared to the frequency of the pattern within all data objects; and iv) inserting each weighting into the model.
Preferably the weighting is only inserted into the model if the weighting meets a predefined threshold. The predefined threshold may be an empirically- derived threshold.
It is also preferred that each data object is only associated with one category.
A weighting may be calculated for one or more of the patterns for a plurality of categories.
The weighting w'_ij for a pattern w_i for a category c_j may be calculated as follows:

w'_ij = count(w_i, c_j) / count(w_i)

where count(w_i, c_j) is the frequency of the pattern in all data objects associated with the category c_j, and count(w_i) is the frequency of the pattern in all data objects.
The patterns may be words and the data objects may be documents or web pages.
Brief Description of the Drawings
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawing in which:
Figure 1: shows a schematic diagram illustrating a method of generating a categorisation model in accordance with an embodiment of the invention.
Figure 2: shows a schematic diagram illustrating a method of categorising a document using a model in accordance with an embodiment of the invention.
Detailed Description of the Preferred Embodiments
The present invention provides a method and system for generating a categorisation model and for categorising data using the model.
A set of training documents, where each document is associated with a category, are used to train the model. A weighting for each word in the documents associated with each category is calculated based on the frequency of the word within the category compared to the combined frequency of the word in all categories. The (word, category, weighting) tuple is inserted into the model if the weighting meets a threshold.
A new document is categorised by generating a weighting for each category by combining the weightings, extracted from the model, of each word in the document which is paired with that category. The document is categorised within a category if the generated weighting meets a threshold.
The present invention will be described with reference to the categorisation of web pages using word frequency as feature vectors. However, it will be appreciated that the invention may be used to categorise any type of document or data object, such as a Word document, an XML document, an image or a data stream. It will also be appreciated that feature vectors other than word frequency may be utilised, including frequency of phrases, structural elements, or any other pattern. The use of structural elements as feature vectors in categorisation is described in patent application CATEGORISATION OF DATA USING STRUCTURAL ANALYSIS.
Referring to Figure 1, an embodiment of the invention for creating a categorisation model 10 for a set of categories C will now be described.
A training set 11 is used to create the model 10 for categorisation.
The training set 11 can be an existing training set or can be compiled by manual or automatic means such as by querying a search engine with queries created for each category.
The training set 11 is comprised of a plurality 12 of documents D, where each document d_q ∈ D is associated 13 with only one category in the set C.
In some embodiments the documents may be associated with multiple categories, and/or may be associated with a category or categories in accordance with a defined weight. The weight may affect the weight given to words extracted from that document.
Each document d_q ∈ D appears as a sequence of words w_i ∈ W.
The words may be from any one language, from a mixture of languages, or from an invented language such as web script. It will be further appreciated that, in alternative embodiments, any patterns could be used in place of words, such as phrases, code portions, or structural information about the document.
In step 14 the words within the documents are extracted and each word w_i is associated with the category c_j that the corresponding document is associated with. The word-category association forms a pair (w_i, c_j) for each entry of the word w_i in the document. A set E is constructed comprising all pairs 15 from all the documents D.
In step 16 the frequency of each word within each category is determined 17 and the combined frequency of each word within all categories is determined 18.
In step 19 a weighting w'_ij is calculated 20 for each word for each category (each unique pair (w_i, c_j) in set E), equal to the ratio of the frequency of that word in that category to the combined frequency of the word in all categories.
The weighting w'_ij may be calculated in accordance with the following formula:

w'_ij = count(w_i, c_j) / count(w_i)

where count(w_i, c_j) is the number of entries of the pair (w_i, c_j) in the set E, and count(w_i) is the number of pairs in the set E in which the word w_i is used.
The weighting w'_ij is combined with the associated word and category (w_i, c_j) to form a word, category, weighting tuple (w_i, c_j, w'_ij).
If the weighting w'_ij meets (is greater than or equal to) a threshold th0 in step 21, then the corresponding tuple (w_i, c_j, w'_ij) is inserted 22 into the categorisation model 10; otherwise the tuple is discarded 23.
The threshold th0 may be predetermined by empirical methods. The threshold th0 may be 0.7.
In one embodiment the tuple for a weighting w'_ij is inserted into the categorisation model if count(w_i) meets a threshold th1 and the number of documents containing the word w_i meets a threshold th2.
The threshold th1 may be predetermined by empirical methods. The threshold th1 may be three.
The threshold th2 may also be predetermined by empirical methods. The threshold th2 may be ten.
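By way of illustration, the model-generation steps above (weighting calculation plus the th0, th1 and th2 filters) might be sketched in Python roughly as follows. The function name, its signature, and the list-of-(words, category) input layout are assumptions for the sketch, not part of the patent.

```python
from collections import Counter

def build_model(documents, th0=0.7, th1=3, th2=10):
    """Build a {(word, category): weighting} model from a training set.

    `documents` is a list of (words, category) pairs, where `words` is the
    sequence of words extracted from one training document.
    """
    pair_counts = Counter()   # count(w_i, c_j): entries of the pair in set E
    word_counts = Counter()   # count(w_i): pairs in E using the word w_i
    doc_counts = Counter()    # number of documents containing the word
    for words, category in documents:
        for word in words:
            pair_counts[(word, category)] += 1
            word_counts[word] += 1
        for word in set(words):
            doc_counts[word] += 1

    model = {}
    for (word, category), n in pair_counts.items():
        weight = n / word_counts[word]
        # Keep the tuple only if it meets all three empirical thresholds.
        if (weight >= th0 and word_counts[word] >= th1
                and doc_counts[word] >= th2):
            model[(word, category)] = weight
    return model
```

Since each pass over the training set only increments counters, the construction is linear in the total number of words, which is the linear scalability claimed later in the text.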
In one embodiment of the invention the categorisation model is used to train an artificial neural network (ANN). The ANN may be a standard feed-forward neural network with one input layer, one hidden layer and one output layer. The sizes of the input and output layers are equal to M (the number of categories). The size of the hidden layer is configurable (its minimal value may be equal to M).
A set of neural network patterns for training the neural network is first created using the model.
A neural network pattern is calculated for each document d_q in the training set D and consists of sets V1 and V2. V1 will form the input set for the neural network and V2 will form the output set for the neural network. Each document is a sequence of words w_i. The set V1 is comprised of elements v_1, v_2, ..., v_n where n is the number of categories in set C.
For each word wi in the document dq and for each category cj, the weighting w'ij corresponding to the word wi and the category cj is extracted from the model. vj is assigned the sum of the weightings corresponding to all the words in the document for category cj divided by the total number of words in the document dq.
The set V2 is comprised of elements v21, v22, ..., v2n where n is the number of categories in set C. If dq is categorised within the training set as belonging to category cj then v2j is given the value "1"; otherwise v2j is given the value "0".
After each neural network pattern is created the set of patterns may be used to train the neural network.
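The construction of one (V1, V2) pattern can be sketched as follows. This is an illustrative sketch only: it assumes the model is represented as a word-to-{category: weight} dictionary, and the function and variable names are hypothetical.

```python
def make_pattern(words, doc_category, model, categories):
    """Create one training pattern (V1, V2) for a document dq.

    words: the tokenised document; doc_category: its known category.
    model: assumed word -> {category: weight} dictionary.
    categories: ordered list of the n categories in set C.
    """
    n_words = len(words)
    # V1: for each category cj, the sum of the weightings w'ij over all
    # words in the document, divided by the total number of words.
    v1 = []
    for c in categories:
        total = sum(model.get(w, {}).get(c, 0.0) for w in words)
        v1.append(total / n_words if n_words else 0.0)
    # V2: "1" for the document's known category, "0" elsewhere.
    v2 = [1.0 if c == doc_category else 0.0 for c in categories]
    return v1, v2
```

Each pattern pairs a per-category evidence vector (V1) with a one-hot target (V2), which is the standard shape for supervised training of a feed-forward network.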
Referring to Figure 2, an embodiment of the invention for categorising a document 30 in accordance with a categorisation model 31 will be described.
The model 31 includes a set of categories C, a set of words W and a set of weights W', which associate words in the set W with categories in the set C. Each word in the set W can be associated with one or more categories in the set C.
Model = { wi : [ (cj, w'ij) ] }

where wi is a word in the set W, cj is a category in the set C, and w'ij ∈ [0, 1] is a weight of association between the word wi and the category cj.
The model may be generated as described in Figure 1. The words 33 wi within the document 30 which correspond to words in the model are extracted from the document in step 32. This ensures that only words for which weightings exist are considered.
In step 34 each word wi is replaced by a set of pairs (cj, w'ij) where cj ∈ C and w'ij are the weights of the word wi associated with the category cj within the model. The sets of pairs 35 form a set P.
In steps 36 and 37 a score sj for each category cj is then calculated based on the combined weightings for that category compared to the total number of considered words wi within the document 30.
The score sj may be calculated as follows:

sj = score(cj) / |P|

where |P| is the number of elements in the set P, and score(cj) is the sum of the weights w'ij for all pairs in the set P which contain the category cj.
If the score for a category meets (is greater than or equal to) a threshold th3 in step 38, then the document is categorised 39 within that category.
If none of the scores meets the threshold then the document cannot be categorised 40.
The threshold th3 may be predetermined by empirical evidence. The threshold th3 may be 0.3.
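Steps 32 to 40 can be sketched as a single scoring function. This is a minimal illustration under the same assumed model representation (a word-to-{category: weight} dictionary); the function name and th3 default are taken from the text, everything else is hypothetical.

```python
def categorise(words, model, th3=0.3):
    """Score each category for a document; return categories meeting th3.

    Only words present in the model are considered (the set P).  Each
    such word contributes its weight w'ij to score(cj), and the final
    score is sj = score(cj) / |P|.
    """
    p = [w for w in words if w in model]  # step 32: keep known words only
    if not p:
        return {}  # no scores possible: the document cannot be categorised
    scores = {}
    for w in p:  # step 34: replace each word by its (category, weight) pairs
        for c, weight in model[w].items():
            scores[c] = scores.get(c, 0.0) + weight
    # steps 36-38: normalise by |P| and keep categories meeting th3
    return {c: s / len(p) for c, s in scores.items() if s / len(p) >= th3}
```

With a model in which "a" maps to {C1: 0.8} and "b" maps to {C1: 0.2, C2: 0.86}, a document containing "a", "b" and one unknown word has |P| = 2 and yields scores of 0.5 for C1 and 0.43 for C2, matching the shape of the worked example below.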
In an alternative embodiment, the neural network trained on the categorisation model will be used to categorise the input data object.
A neural network input set V1 is calculated for the input document 30. The set V1 is comprised of elements v1, v2, ..., vn where n is the number of categories in set C.
For each word wi in the input document 30 and for each category cj, the weighting w'ij corresponding to the word wi and the category cj is extracted from the model 10. vj is assigned the sum of the weightings corresponding to all the words in the document for category cj divided by the total number of words in the input document 30.
The input set V1 is provided to the trained neural network and an output set V2 is created. The set V2 is comprised of elements v21, v22, ..., v2n where n is the number of categories in set C. Each element is a value between zero and one. The input document is categorised in category cj if v2j meets a threshold th4.
The threshold value th4 may be predetermined and may be equal to 0.7.
It will be appreciated that the methods and systems described could be implemented in hardware or in software. Where the method or systems of the invention are implemented in software, any suitable programming language such as C++ or Java may be used. It will further be appreciated that data and/or processing involved within the methods and systems may be distributed across more than one computer system.
An example of the creation of a categorisation model in accordance with an embodiment of the invention will now be described.
The training set is comprised of four documents as follows:
[Training document tables omitted; reproduced as images in the original.]
A set E of words wi associated with categories cj is extracted from the training set:
[Word and category table omitted; image in the original.]
A weighting w'ij is calculated for each unique pair of word and category (wi, cj) in the set E:
[Weighting table omitted; image in the original.]
The word, category and weighting tuple (wi, cj, w'ij) is added to the model M if the weighting meets the predetermined threshold (0.7):
[Model M tables omitted; images in the original.]
An example of the categorisation of a document using a categorisation model in accordance with an embodiment of the invention will now be described.
For the purposes of this example, the categorisation model M will be used.
The document to be categorised is:
[Document text omitted; image in the original.]
A set P is created comprising the words in the document that are also in the model:
[Set P table omitted; image in the original.]
As the word XXX is not in the model M, the number of words in P is 2 (|P| = 2).
A score sj is calculated for each category cj by summing the weights and dividing by the size of the set P:
[Score table omitted; image in the original.]
A result category set is constructed from scores that exceed the predetermined threshold (0.3): result = {(C1, 0.5), (C2, 0.43)}
Therefore the document is categorised within C1 and C2.

Another example of the categorisation of a document using a categorisation model in accordance with an embodiment of the invention will now be described.
For the purposes of this example, the categorisation model M will also be used.
The document to be categorised is:
[Document text omitted; image in the original.]
A set P is created comprising the words in the document that are also in the model:
[Set P table omitted; image in the original.]
As the words XXX, YYY, and ZZZ are not in the model M, the number of words in P is 1 (|P| = 1).
A score sj is calculated for each category cj by summing the weights and dividing by the size of the set P:
[Score table omitted; image in the original.]
A result category set is constructed from scores that exceed the predetermined threshold (0.3): result = {(C1, 0.86)}

One embodiment of the invention, Fast Word Statistics (FWS), has been used in the categorisation of actual web pages (HTML pages) to produce the following test results. Four categories (weapons, chat, nudity, and pornography), which are typically utilised for blocking or filtering internet content for minors, were used in the test.
For the purposes of comparison, results from the Bayesian categorisation algorithm and the Support Vector Machines (SVM) categorisation algorithm were also generated. In this test the Bayesian and SVM algorithms used single categorisation mode while the embodiment of the invention utilised multiple categorisation mode. Generally, single categorisation mode gives better results for the Bayesian and SVM algorithms than multiple categorisation mode.
To implement the test, each of FWS, Bayesian, and SVM was first trained on a training set comprised of web pages categorised into one of the four categories, and then each method was tested against a testing set of web pages for which the correct categorisation is known.
The Bayesian and SVM algorithms were first significantly tuned (optimised) in accordance with known methods. FWS used the raw data without any tuning.
The training and testing sets contain raw HTML pages downloaded from the internet. The distribution of the web pages across the sets and categories is as follows:
[Distribution table omitted; image in the original.]
The following table summarises the optimised performance results:
[Performance results table omitted; image in the original.]
The following table summarises the accuracy of the results of all three methods:
[Accuracy table omitted; image in the original.]
In other tests the FWS method was shown to be linearly scalable and was used for other languages, where it has shown similar or better performance and accuracy of results.
Embodiments of the present invention have the following potential advantages:
1) An embodiment provides a fast, linearly scalable learning method that permits fast construction of a categorisation model from raw, average-to-low-quality input documents.
2) In contrast to the Bayesian and SVM methods, an embodiment is immune to low quality training sets.
3) In contrast to the Bayesian and SVM methods, an embodiment does not require pre-processing of the vocabularies or tuning to provide accurate categorisation.
4) The performance and accuracy of an embodiment are similar to, and sometimes better than, the Bayesian and SVM algorithms.

5) An embodiment provides consistent and stable results on new categories and languages, while Bayesian and SVM require significant human intervention for preparation and tuning.
6) An embodiment uses a statistical analysis approach and can be utilised in other fields such as data research and data mining.
7) The complexity of implementation of an embodiment is minimal compared to other well-known text categorization techniques.
8) An embodiment produces a different pattern of categorisation results to existing methods, in that the valid/invalid matches and uncategorised results for a set of documents are likely to be different to any other method. The consequence of this is that the embodiment is suited for combination with existing methods to produce an improved categorisation for a new document. The combination of categorisation methods is described in the patent application CATEGORISATION OF DATA USING MULTIPLE CATEGORISATION ENGINES.
While the present invention has been illustrated by the description of the embodiments thereof, and while the embodiments have been described in considerable detail, it is not the intention of the applicant to restrict or in any way limit the scope of the appended claims to such detail. Additional advantages and modifications will readily appear to those skilled in the art. Therefore, the invention in its broader aspects is not limited to the specific details, representative apparatus and method, and illustrative examples shown and described. Accordingly, departures may be made from such details without departing from the spirit or scope of applicant's general inventive concept.

Claims

1. A method for categorising an input data object using a model comprising a plurality of patterns associated with at least one weighting for a category, including the steps of: i) identifying patterns within the input data object that correspond to at least some of the patterns within the model; ii) for each identified pattern, determining a weighting for at least one category from the model; iii) calculating a score for the input data object for at least one category based at least in part on the weightings of the identified patterns and the frequency of the identified patterns within the input data object; and iv) categorising the input data object based at least in part on the calculated score.
2. A method as claimed in claim 1 wherein at least some of the identified patterns are associated with a plurality of weightings for a plurality of categories.
3. A method as claimed in any one of the preceding claims wherein a plurality of scores is calculated for the input data object for a plurality of categories.
4. A method as claimed in any one of the preceding claims wherein the input data object is categorised in dependence on the calculated score only when the calculated score meets a predefined threshold.
5. A method as claimed in claim 4 wherein the predefined threshold is an empirically-derived threshold.
6. A method as claimed in any one of the preceding claims wherein the score sj for a category cj is calculated as follows:

sj = score(cj) / |P|

where |P| is the number of identified patterns in the input data object, and score(cj) is the sum of the weightings for all identified patterns associated with the category cj.
7. A method as claimed in any one of the preceding claims wherein the patterns are words.
8. A method as claimed in any one of the preceding claims wherein the input data object is a document.
9. A method as claimed in any one of the preceding claims wherein the input data object is a web page.
10. A method as claimed in any one of the preceding claims wherein the model is generated in accordance with the method of claim 14.
11. A method as claimed in any one of the preceding claims wherein an artificial neural network is used to categorise the input data object using the calculated score.
12. A method as claimed in claim 11 wherein an input set is used by the neural network to categorise the input data object, and wherein the input set comprises one or more of the calculated scores.
13. A method as claimed in claim 12 wherein each calculated score is calculated as the sum of the weightings for the identified patterns in the input data object divided by the number of identified patterns in the input data object.
14. A method for generating a model for categorising input data objects, including the steps of: i) associating each data object of a plurality of data objects with one of a plurality of categories; ii) extracting a plurality of patterns from the plurality of data objects; iii) calculating a weighting for each pattern for at least one category based at least in part on the frequency of the pattern within data objects associated with that category compared to the frequency of the pattern within all data objects; and iv) inserting each weighting into the model.
15. A method as claimed in claim 14 wherein the weighting is only inserted into the model if the weighting meets a predefined threshold.
16. A method as claimed in claim 15 wherein the predefined threshold is an empirically-derived threshold.
17. A method as claimed in any one of claims 14 to 16 wherein each data object is only associated with one category.
18. A method as claimed in any one of claims 14 to 17 wherein a weighting is calculated for one or more of the patterns for a plurality of categories.
19. A method as claimed in any one of claims 14 to 18 wherein the weighting w'ij for a pattern wi for a category cj is calculated as follows:

w'ij = count(wi, cj) / count(wi)

where count(wi, cj) is the frequency of the pattern in all data objects associated with the category cj, and count(wi) is the frequency of the pattern in all data objects.
20. A method as claimed in any one of claims 14 to 19 wherein the patterns are words.
21. A method as claimed in any one of claims 14 to 20 wherein the data objects are documents.
22. A method as claimed in any one of claims 14 to 21 wherein the data objects are web pages.
23. A method as claimed in any one of claims 14 to 22 including the step of training an artificial neural network using the model.
24. A method as claimed in claim 23 wherein the neural network is trained on a set of neural network patterns for each data object.
25. A method as claimed in claim 24 wherein each neural network pattern includes an input set and an output set.
26. A method as claimed in claim 25 wherein the input set includes a set of elements for each category, and wherein each element is based at least in part on the sum of the weightings for the patterns in the data object divided by the number of patterns in the data object.
27. A system for categorising an input data object using a model, including: a processor arranged for generating a model for categorising input data objects by associating each data object of a plurality of data objects with one of a plurality of categories; extracting a plurality of patterns from the plurality of data objects; calculating a weighting for each pattern for at least one category in dependence on the frequency of each pattern within the data objects associated with that category compared to the frequency of the pattern within all data objects; and inserting each weighting into the model; a processor arranged for categorising the input data object using the model by identifying patterns within the input data object that correspond to at least some of the patterns within the model; for each identified pattern, determining a weighting for at least one category from the model; calculating a score for the input data object for at least one category based on the weightings of the identified patterns and the frequency of the identified patterns within the input data object; and categorising the input data object in dependence on the calculated score; and a memory arranged for storing the model.
28. A system arranged for performing the method of any one of claims 1 to 26.
29. A computer program arranged for performing the method or system of any one of the preceding claims.
30. A storage media arranged for storing a computer program as claimed in claim 29.
PCT/GB2007/003370 2006-09-07 2007-09-07 Categorisation of data using a model WO2008029150A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
UA200609646 2006-09-07
UA200609646 2006-09-07
GB0624665.6 2006-12-11
GB0624665A GB2442286A (en) 2006-09-07 2006-12-11 Categorisation of data e.g. web pages using a model

Publications (1)

Publication Number Publication Date
WO2008029150A1 true WO2008029150A1 (en) 2008-03-13

Family

ID=38734927

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2007/003370 WO2008029150A1 (en) 2006-09-07 2007-09-07 Categorisation of data using a model

Country Status (1)

Country Link
WO (1) WO2008029150A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120323866A1 (en) * 2011-02-28 2012-12-20 International Machines Corporation Efficient development of a rule-based system using crowd-sourcing

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020022956A1 (en) * 2000-05-25 2002-02-21 Igor Ukrainczyk System and method for automatically classifying text
US6446061B1 (en) * 1998-07-31 2002-09-03 International Business Machines Corporation Taxonomy generation for document collections
US20030167267A1 (en) * 2002-03-01 2003-09-04 Takahiko Kawatani Document classification method and apparatus

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6446061B1 (en) * 1998-07-31 2002-09-03 International Business Machines Corporation Taxonomy generation for document collections
US20020022956A1 (en) * 2000-05-25 2002-02-21 Igor Ukrainczyk System and method for automatically classifying text
US20030167267A1 (en) * 2002-03-01 2003-09-04 Takahiko Kawatani Document classification method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ARUL PRAKASH ASIRVATHAM ET AL: "Web Page Classification based on Document Structure", INTERNET CITATION, 2001, XP002454563, Retrieved from the Internet <URL:http://citeseer.ist.psu.edu/cache/papers/cs/25574/http:zSzzSzgdit.iii t.netzSz~kranthizSzprofessionalzSzpaperszSzieeeIndia_wpcds.pdf/asirva tham01web.pdf> [retrieved on 20071011] *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120323866A1 (en) * 2011-02-28 2012-12-20 International Machines Corporation Efficient development of a rule-based system using crowd-sourcing
US8635197B2 (en) 2011-02-28 2014-01-21 International Business Machines Corporation Systems and methods for efficient development of a rule-based system using crowd-sourcing
US8949204B2 (en) * 2011-02-28 2015-02-03 International Business Machines Corporation Efficient development of a rule-based system using crowd-sourcing


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07804170

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07804170

Country of ref document: EP

Kind code of ref document: A1