WO2008029150A1 - Categorisation of data using a model - Google Patents


Info

Publication number
WO2008029150A1
Authority
WO
WIPO (PCT)
Prior art keywords
data object
category
patterns
model
input data
Prior art date
Application number
PCT/GB2007/003370
Other languages
French (fr)
Inventor
Yuriy Byurher
Original Assignee
Xploite Plc
Priority date
Filing date
Publication date
Priority claimed from GB0624665A external-priority patent/GB2442286A/en
Application filed by Xploite Plc filed Critical Xploite Plc
Publication of WO2008029150A1 publication Critical patent/WO2008029150A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data; database structures therefor; file system structures therefor
    • G06F 16/35: Clustering; Classification
    • G06F 16/353: Clustering; Classification into predefined classes

Definitions

  • If the score s_j meets a threshold th3, the document is categorised in the category c_j. The threshold th3 may be predetermined by empirical evidence. The threshold th3 may be 0.3.
  • In an alternative embodiment, the neural network trained on the categorisation model is used to categorise the input data object.
  • A neural network input set V1 is calculated for the input document 30.
  • The set V1 is comprised of elements v_1, v_2, ..., v_n where n is the number of categories in set C.
  • For each word w_i in the document and for each category c_j, the weighting w'_ij corresponding to the word w_i and the category c_j is extracted from the model 10. v_j is assigned the sum of the weightings corresponding to all the words in the document for category c_j, divided by the total number of words in the input document 30.
  • The input set V1 is provided to the trained neural network and an output set V2 is created.
  • The set V2 is comprised of elements v2_1, v2_2, ..., v2_n where n is the number of categories in set C. Each element is a value between zero and one.
  • The input document is categorised in category c_j if v2_j meets a threshold th4. The threshold value th4 may be predetermined and may be equal to 0.7.
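The final thresholding step, mapping the network's output set V2 to zero or more categories via th4, might be sketched as follows. This is an illustrative sketch only; the function name and argument layout are assumptions, not part of the patent.

```python
def categorise_from_outputs(v2, categories, th4=0.7):
    """Return every category c_j whose network output v2_j meets th4."""
    return [c for c, out in zip(categories, v2) if out >= th4]
```

A document can therefore end up in several categories, or in none, depending on how many outputs clear the threshold.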
  • In an example of model generation, the training set is comprised of four documents as follows:
  • A set of words w_i associated with categories c_j is extracted from the training set:
  • A weighting w'_ij is calculated for each unique pair of words and categories (w_i, c_j) in the set E:
  • The word, category and weighting (w_i, c_j, w'_ij) tuple is added to the model M if the weighting is over the predetermined threshold (0.7):
  • In a first categorisation example, the categorisation model M is used.
  • The document to be categorised is:
  • A set P is created comprising the words in the document that are also in the model:
  • A score s_j is calculated for each category c_j by summing the weights and dividing by the size of the set P:
  • In a second categorisation example, the categorisation model M is also used.
  • The document to be categorised is:
  • A set P is created comprising the words in the document that are also in the model:
  • A score s_j is calculated for each category c_j by summing the weights and dividing by the size of the set P:
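The patent's own four example documents are not reproduced in this text, so the worked example below uses a hypothetical stand-in training set to illustrate the same arithmetic: count word-category pairs, compute w'_ij = count(w_i, c_j) / count(w_i), and keep only tuples meeting the 0.7 threshold (the th1 and th2 filters are omitted here for brevity).

```python
from collections import Counter

# Hypothetical stand-in training set: each entry is (words, category).
training = [
    (["football", "goal", "goal"], "sport"),
    (["football", "match"], "sport"),
    (["election", "vote"], "news"),
    (["vote", "football"], "news"),
]

pair_counts, word_counts = Counter(), Counter()
for words, category in training:
    for w in words:
        pair_counts[(w, category)] += 1
        word_counts[w] += 1

# w'_ij = count(w_i, c_j) / count(w_i); keep tuples whose weighting meets 0.7.
model = {(w, c): n / word_counts[w]
         for (w, c), n in pair_counts.items()
         if n / word_counts[w] >= 0.7}

# "football" occurs in both categories (weightings 2/3 and 1/3), so both of
# its tuples fall below the threshold and are discarded; every other word
# occurs in a single category and receives a weighting of 1.0.
```

This also illustrates why the threshold filter tolerates noisy training data: a word spread across categories never enters the model.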
  • The method of the embodiment is referred to below as FWS (Fast Word Statistics).
  • Each of FWS, Bayesian and SVM was first trained on a training set comprised of web pages categorised into one of the four categories, and then each method was tested against a testing set of web pages for which the correct categorisation is known.
  • the training and testing sets contain raw HTML pages downloaded from the internet.
  • the distribution of the web pages across the sets and categories is as follows:
  • An embodiment provides a fast, linearly scalable learning method that permits fast construction of a categorisation model from raw, average-to-low-quality input documents.
  • An embodiment is immune to low-quality training sets.
  • an embodiment does not require pre-processing of the vocabularies or tuning to provide accurate categorisation.
  • An embodiment uses a statistical analysis approach and can be utilised in other fields such as data research and data mining.
  • An embodiment produces a different pattern of categorisation results to existing methods, in that the valid/invalid matches and uncategorised results for a set of documents are likely to be different to any other method. The consequence of this is that the embodiment is suited for combination with existing methods to produce an improved


Abstract

A method for generating a model for categorising input data objects, including the steps of: associating each data object of a plurality of data objects with one of a plurality of categories; extracting a plurality of patterns from the plurality of data objects; calculating a weighting for each pattern for at least one category based at least in part on the frequency of the pattern within data objects associated with that category compared to the frequency of the pattern within all data objects; and inserting each weighting into the model.

Description

CATEGORISATION OF DATA USING A MODEL
Field of Invention
The present invention relates to a method and system for generating a categorisation model, and for categorising data using the model.
Background
Categorisation of content such as web pages is useful for searching for information and for filtering information, such as filtering web pages from user internet requests so as to exclude inappropriate material.
Traditionally web pages have been categorised by collating categorisation suggestions from human users. An example of a system created by this method includes dmoz.org.
This method has several disadvantages. Firstly, the process overlooks some web pages or classes of web pages due to a non-systematic approach to categorisation. Secondly, multi-user input results in a lack of consistency of classification. And thirdly, the human time cost of classifying large sets of web pages, such as the majority of the internet, is very high.
Therefore this method is particularly unsuitable for producing a database which is capable of filtering web pages resulting from user internet requests.
Automated methods for categorising web pages have been explored. Two popular methods are the use of Bayesian algorithms and the use of support vector machines (SVM).
Each method extracts feature vectors from a training set of pre-categorised web pages. A feature vector is a vector of numeric features of objects within the training set. Feature vectors may include the occurrence of words or phrases, link information and image information. It should be noted that it is difficult to create a "perfect" training set where the categorised web pages contain no noise - features within a web page that contradict its categorisation.
A number of implementations of the automated methods use occurrence of words as their feature vectors and extract words from the web pages in the training set to build a vocabulary.
A large vocabulary will result in high dimensionality of feature vectors. High dimensionality of feature vectors can cause the automated methods to overfit. Overfitting is a phenomenon by which the classifier is tuned also to the contingent, rather than just the constitutive characteristics of the training set.
Classifiers which overfit the training data tend to be good at re-classifying the data they have been trained on, but much worse at classifying previously unseen data.
Overfitting is a significant problem for the Bayes algorithm method. This algorithm learns from the training set the conditional probability of each word for each category. A new web page is categorised within the category with the highest posterior probability computed according to the Bayes rule. When the number of training samples (web pages) is insufficient with respect to the number of features (words) used, the probabilities learnt may reflect noise in the training set and cannot be trusted to produce accurate categorisation.
The SVM method uses the feature vectors in the vocabulary to determine a hyperplane for each category. Each category hyperplane is defined by support vectors on the edge of the hyperplane. A category hyperplane is used to categorise a new web page as either within the category or not.
Some commentators in the literature suggest that SVMs are also susceptible to overfitting (Chen Lin et al., An Anti-Noise Text Categorization Method based on Support Vector Machines; Yoshua Bengio et al., The Curse of Dimensionality for Local Kernel Machines).
To prevent overfitting for both methods, the vocabulary (collection of feature vectors) will often need to be reduced in size.
The vocabulary is reduced by setting feature relevance thresholds. Selecting thresholds for relevance criteria is a complex task. It is dependent on the size and quality of the training set. As an example, thresholds can relate to the exclusion of common words (stop-words), replacement of words with their stems, and exclusion of very rare words.
Even with thresholds, the vocabulary will generally need to be tuned by an expert.
The consequence of these difficulties with the Bayesian and SVM methods is that they must be trained using large, high-quality training sets, and that significant user intervention is required to tune the methods for effective categorisation.
When the quality of the training set affects the quality of the categorisation method, extra effort must be expended by the human user to "clean" the training set. In addition, it is much more difficult to assist these methods to dynamically learn using new training data.
There is a desire for a method of categorising content, such as documents, web pages or any data object, which can utilise low quality training data.
It is an object of the present invention to provide a method for generating a categorisation model and categorising data which overcomes the disadvantages of above methods, or to at least provide a useful alternative.
Summary of the Invention
According to a first aspect of the invention there is provided a method for categorising an input data object using a model comprising a plurality of patterns associated with at least one weighting for a category, including the steps of: i) identifying patterns within the input data object that correspond to at least some of the patterns within the model; ii) for each identified pattern, determining a weighting for at least one category from the model; iii) calculating a score for the input data object for at least one category based at least in part on the weightings of the identified patterns and the frequency of the identified patterns within the input data object; and iv) categorising the input data object based at least in part on the calculated score.
Preferably at least some of the identified patterns are associated with a plurality of weightings for a plurality of categories.
A plurality of scores may be calculated for the input data object for a plurality of categories.
It is preferred that the input data object is categorised in dependence on the calculated score only when the calculated score meets a predefined threshold. The predefined threshold may be an empirically-derived threshold.
The score s_j for a category c_j may be calculated as follows:

s_j = score(c_j) / |P|

where |P| is the number of identified patterns in the input data object, and score(c_j) is the sum of weightings for all identified patterns associated with the category c_j.
The patterns may be words and the input data object may be a document or a web page.
It is preferred that the model is generated in accordance with the second aspect of the invention.
According to a second aspect of the invention there is provided a method for generating a model for categorising input data objects, including the steps of: i) associating each data object of a plurality of data objects with one of a plurality of categories; ii) extracting a plurality of patterns from the plurality of data objects; iii) calculating a weighting for each pattern for at least one category based at least in part on the frequency of the pattern within data objects associated with that category compared to the frequency of the pattern within all data objects; and iv) inserting each weighting into the model.
Preferably the weighting is only inserted into the model if the weighting meets a predefined threshold. The predefined threshold may be an empirically- derived threshold.
It is also preferred that each data object is only associated with one category.
A weighting may be calculated for one or more of the patterns for a plurality of categories.
The weighting w'_ij for a pattern w_i for a category c_j may be calculated as follows:

w'_ij = count(w_i, c_j) / count(w_i)

where count(w_i, c_j) is the frequency of the pattern in all data objects associated with the category c_j, and count(w_i) is the frequency of the pattern in all data objects.
The patterns may be words and the data objects may be documents or web pages.
Brief Description of the Drawings
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawing in which:
Figure 1: shows a schematic diagram illustrating a method of generating a categorisation model in accordance with an embodiment of the invention.
Figure 2: shows a schematic diagram illustrating a method of categorising a document using a model in accordance with an embodiment of the invention.
Detailed Description of the Preferred Embodiments
The present invention provides a method and system for generating a categorisation model and for categorising data using the model.
A set of training documents, where each document is associated with a category, are used to train the model. A weighting for each word in the documents associated with each category is calculated based on the frequency of the word within the category compared to the combined frequency of the word in all categories. The (word, category, weighting) tuple is inserted into the model if the weighting meets a threshold.
A new document is categorised by generating a weighting for each category by combining the weightings, extracted from the model, of each word in the document which is paired with that category. The document is categorised within a category if the generated weighting meets a threshold.
The present invention will be described with reference to the categorisation of web pages using word frequency as feature vectors. However, it will be appreciated that the invention may be used to categorise any type of document or data object, such as a Word document, an XML document, an image or a data stream. It will also be appreciated that feature vectors other than word frequency may be utilised, including frequency of phrases, structural elements, or any other pattern. The use of structural elements as feature vectors in categorisation is described in patent application CATEGORISATION OF DATA USING STRUCTURAL ANALYSIS.
Referring to Figure 1, an embodiment of the invention for creating a categorisation model 10 for a set of categories C will now be described.
A training set 11 is used to create the model 10 for categorisation.
The training set 11 can be an existing training set or can be compiled by manual or automatic means such as by querying a search engine with queries created for each category.
The training set 11 is comprised of a plurality 12 of documents D, where each document d_q ∈ D is associated 13 with only one category in the set C.
In some embodiments the documents may be associated with multiple categories, and/or may be associated with a category or categories in accordance with a defined weight. The weight may affect the weight given to words extracted from that document.
Each document d_q ∈ D appears as a sequence of words w_i ∈ W.
The words may be from any one language, from a mixture of languages, or from an invented language such as web script. It will be further appreciated that, in alternative embodiments, any patterns could be used in place of words, such as phrases, code portions, or structural information about the document.
In step 14 the words within the documents are extracted and each word w_i is associated with the category c_j that the corresponding document is associated with. The word-category association forms a pair (w_i, c_j) for each entry of the word w_i in the document. A set E is constructed comprising all pairs 15 from all the documents D.
In step 16 the frequency of each word within each category is determined 17 and the combined frequency of each word within all categories is determined 18.
In step 19 a weighting w'_ij is calculated 20 for each word for each category (each unique pair (w_i, c_j) in set E), equal to the ratio of the frequency of that word in that category to the combined frequency of the word in all categories.
The weighting w'_ij may be calculated in accordance with the following formula:

w'_ij = count(w_i, c_j) / count(w_i)

where count(w_i, c_j) is the number of entries of the pair (w_i, c_j) in the set E, and count(w_i) is the number of pairs in the set E in which the word w_i is used.
The weighting w'_ij is combined with the associated word and category (w_i, c_j) to form a word, category, weighting tuple (w_i, c_j, w'_ij).
If the weighting w'_ij meets (is greater than or equal to) a threshold th0 in step 21, then the corresponding tuple (w_i, c_j, w'_ij) is inserted 22 into the categorisation model 10; otherwise the tuple is discarded 23.
The threshold th0 may be predetermined by empirical methods. The threshold th0 may be 0.7.
In one embodiment the tuple for a weighting w'_ij is inserted into the categorisation model if count(w_i) meets a threshold th1 and the number of documents containing the word w_i meets a threshold th2.
The threshold th1 may be predetermined by empirical methods. The threshold th1 may be three.
The threshold th2 may also be predetermined by empirical methods. The threshold th2 may be ten.
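By way of illustration, the model-generation steps above (weighting calculation plus the th0, th1 and th2 filters) might be sketched in Python roughly as follows. The function name, its signature, and the list-of-(words, category) input layout are assumptions for the sketch, not part of the patent.

```python
from collections import Counter

def build_model(documents, th0=0.7, th1=3, th2=10):
    """Build a {(word, category): weighting} model from a training set.

    `documents` is a list of (words, category) pairs, where `words` is the
    sequence of words extracted from one training document.
    """
    pair_counts = Counter()   # count(w_i, c_j): entries of the pair in set E
    word_counts = Counter()   # count(w_i): pairs in E using the word w_i
    doc_counts = Counter()    # number of documents containing the word
    for words, category in documents:
        for word in words:
            pair_counts[(word, category)] += 1
            word_counts[word] += 1
        for word in set(words):
            doc_counts[word] += 1

    model = {}
    for (word, category), n in pair_counts.items():
        weight = n / word_counts[word]
        # Keep the tuple only if it meets all three empirical thresholds.
        if (weight >= th0 and word_counts[word] >= th1
                and doc_counts[word] >= th2):
            model[(word, category)] = weight
    return model
```

Since each pass over the training set only increments counters, the construction is linear in the total number of words, which is the linear scalability claimed later in the text.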
In one embodiment of the invention the categorisation model is used to train an artificial neural network (ANN). The ANN may be a standard feed-forward neural network with one input layer, one hidden layer and one output layer. The sizes of the input and output layers are equal to M (the number of categories). The size of the hidden layer is configurable (its minimal value may be equal to M).
A set of neural network patterns for training the neural network is first created using the model.
A neural network pattern is calculated for each document d_q in the training set D and consists of sets V1 and V2. V1 will form the input set for the neural network and V2 will form the output set for the neural network. Each document is a sequence of words w_i. The set V1 is comprised of elements v_1, v_2, ..., v_n where n is the number of categories in set C.
For each word wi in the document dq and for each category cj, the weighting w'ij corresponding to the word wi and the category cj is extracted from the model. vj is assigned the sum of the weightings corresponding to all the words in the document for category cj divided by the total number of words in the document dq.
The set V2 is comprised of elements v21, v22, ..., v2n where n is the number of categories in set C. If dq is categorised within the training set as belonging to category cj then v2j is given the value "1"; otherwise v2j is given the value "0".
After each neural network pattern is created the set of patterns may be used to train the neural network.
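The construction of one (V1, V2) pattern can be sketched as follows. This is an illustrative sketch only: it assumes the model is represented as a word-to-{category: weight} dictionary, and the function and variable names are hypothetical.

```python
def make_pattern(words, doc_category, model, categories):
    """Create one training pattern (V1, V2) for a document dq.

    words: the tokenised document; doc_category: its known category.
    model: assumed word -> {category: weight} dictionary.
    categories: ordered list of the n categories in set C.
    """
    n_words = len(words)
    # V1: for each category cj, the sum of the weightings w'ij over all
    # words in the document, divided by the total number of words.
    v1 = []
    for c in categories:
        total = sum(model.get(w, {}).get(c, 0.0) for w in words)
        v1.append(total / n_words if n_words else 0.0)
    # V2: "1" for the document's known category, "0" elsewhere.
    v2 = [1.0 if c == doc_category else 0.0 for c in categories]
    return v1, v2
```

Each pattern pairs a per-category evidence vector (V1) with a one-hot target (V2), which is the standard shape for supervised training of a feed-forward network.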
Referring to Figure 2, an embodiment of the invention for categorising a document 30 in accordance with a categorisation model 31 will be described.
The model 31 includes a set of categories C, a set of words W and a set of weights W', which associate words in the set W with categories in the set C. Each word in the set W can be associated with one or more categories in the set C.
Model = { wi : [ (cj, w'ij) ] }

where wi is a word in the set W, cj is a category in the set C, and w'ij ∈ [0, 1] is a weight of association between the word wi and the category cj.
The model may be generated as described in Figure 1. The words 33 wi within the document 30 which correspond to words in the model are extracted from the document in step 32. This ensures that only words for which weightings exist are considered.
In step 34 each word wi is replaced by a set of pairs (cj, w'ij) where cj ∈ C and w'ij are the weights of the word wi associated with the category cj within the model. The sets of pairs 35 form a set P.
In steps 36 and 37 a score sj for each category cj is then calculated based on the combined weightings for that category compared to the total number of considered words wi within the document 30.
The score sj may be calculated as follows:

sj = score(cj) / |P|

where |P| is the number of elements in the set P, and score(cj) is the sum of the weights w'ij for all pairs in the set P which contain the category cj.
If the score for a category meets (is greater than or equal to) a threshold th3 in step 38, then the document is categorised 39 within that category.
If none of the scores meets the threshold then the document cannot be categorised 40.
The threshold th3 may be predetermined by empirical evidence. The threshold th3 may be 0.3.
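Steps 32 to 40 can be sketched as a single scoring function. This is a minimal illustration under the same assumed model representation (a word-to-{category: weight} dictionary); the function name and th3 default are taken from the text, everything else is hypothetical.

```python
def categorise(words, model, th3=0.3):
    """Score each category for a document; return categories meeting th3.

    Only words present in the model are considered (the set P).  Each
    such word contributes its weight w'ij to score(cj), and the final
    score is sj = score(cj) / |P|.
    """
    p = [w for w in words if w in model]  # step 32: keep known words only
    if not p:
        return {}  # no scores possible: the document cannot be categorised
    scores = {}
    for w in p:  # step 34: replace each word by its (category, weight) pairs
        for c, weight in model[w].items():
            scores[c] = scores.get(c, 0.0) + weight
    # steps 36-38: normalise by |P| and keep categories meeting th3
    return {c: s / len(p) for c, s in scores.items() if s / len(p) >= th3}
```

With a model in which "a" maps to {C1: 0.8} and "b" maps to {C1: 0.2, C2: 0.86}, a document containing "a", "b" and one unknown word has |P| = 2 and yields scores of 0.5 for C1 and 0.43 for C2, matching the shape of the worked example below.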
In an alternative embodiment, the neural network trained on the categorisation model will be used to categorise the input data object.
A neural network input set V1 is calculated for the input document 30. The set V1 is comprised of elements v1, v2, ..., vn where n is the number of categories in set C.
For each word wi in the input document 30 and for each category cj, the weighting w'ij corresponding to the word wi and the category cj is extracted from the model 10. vj is assigned the sum of the weightings corresponding to all the words in the document for category cj divided by the total number of words in the input document 30.
The input set V1 is provided to the trained neural network and an output set V2 is created. The set V2 is comprised of elements v21, v22, ..., v2n where n is the number of categories in set C. Each element is a value between zero and one. The input document is categorised in category cj if v2j meets a threshold th4.
The threshold value th4 may be predetermined and may be equal to 0.7.
It will be appreciated that the methods and systems described could be implemented in hardware or in software. Where the method or systems of the invention are implemented in software, any suitable programming language such as C++ or Java may be used. It will further be appreciated that data and/or processing involved within the methods and systems may be distributed across more than one computer system.
An example of the creation of a categorisation model in accordance with an embodiment of the invention will now be described.
The training set is comprised of four documents as follows:
[Training document tables omitted; reproduced as images in the original.]
A set E of words wi associated with categories cj is extracted from the training set:
[Word and category table omitted; image in the original.]
A weighting w'ij is calculated for each unique pair of word and category (wi, cj) in the set E:
[Weighting table omitted; image in the original.]
The word, category and weighting tuple (wi, cj, w'ij) is added to the model M if the weighting meets the predetermined threshold (0.7):
[Model M tables omitted; images in the original.]
An example of the categorisation of a document using a categorisation model in accordance with an embodiment of the invention will now be described.
For the purposes of this example, the categorisation model M will be used.
The document to be categorised is:
[Document text omitted; image in the original.]
A set P is created comprising the words in the document that are also in the model:
[Set P table omitted; image in the original.]
As the word XXX is not in the model M, the number of words in P is 2 (|P| = 2).
A score sj is calculated for each category cj by summing the weights and dividing by the size of the set P:
[Score table omitted; image in the original.]
A result category set is constructed from scores that exceed the predetermined threshold (0.3): result = {(C1, 0.5), (C2, 0.43)}
Therefore the document is categorised within C1 and C2.

Another example of the categorisation of a document using a categorisation model in accordance with an embodiment of the invention will now be described.
For the purposes of this example, the categorisation model M will also be used.
The document to be categorised is:
[Document text omitted; image in the original.]
A set P is created comprising the words in the document that are also in the model:
[Set P table omitted; image in the original.]
As the words XXX, YYY, and ZZZ are not in the model M, the number of words in P is 1 (|P| = 1).
A score sj is calculated for each category cj by summing the weights and dividing by the size of the set P:
[Score table omitted; image in the original.]
A result category set is constructed from scores that exceed the predetermined threshold (0.3): result = {(C1, 0.86)}

One embodiment of the invention, Fast Word Statistics (FWS), has been used in the categorisation of actual web pages (HTML pages) to produce the following test results. Four categories (weapons, chat, nudity, and pornography), which are typically utilised for blocking or filtering internet content for minors, were used in the test.
For the purposes of comparison, results from the Bayesian categorisation algorithm and the Support Vector Machines (SVM) categorisation algorithm were also generated. In this test the Bayesian and SVM algorithms used single categorisation mode while the embodiment of the invention utilised multiple categorisation mode. Generally, single categorisation mode gives better results for the Bayesian and SVM algorithms than multiple categorisation mode.
To implement the test, each of FWS, Bayesian, and SVM was first trained on a training set comprised of web pages categorised into one of the four categories, and then each method was tested against a testing set of web pages for which the correct categorisation is known.
The Bayesian and SVM algorithms were first significantly tuned (optimised) in accordance with known methods. FWS used the raw data without any tuning.
The training and testing sets contain raw HTML pages downloaded from the internet. The distribution of the web pages across the sets and categories is as follows:
[Distribution table omitted; image in the original.]
The following table summarises the optimised performance results:
[Performance results table omitted; image in the original.]
The following table summarises the accuracy of the results of all three methods:
[Accuracy table omitted; image in the original.]
In other tests the FWS method was shown to be linearly scalable and was used for other languages, where it has shown similar or better performance and accuracy of results.
Embodiments of the present invention have the following potential advantages:
1) An embodiment provides a fast, linearly scalable learning method that permits fast construction of a categorisation model from raw, average-to-low-quality input documents.
2) In contrast to the Bayesian and SVM methods, an embodiment is immune to low quality training sets.
3) In contrast to the Bayesian and SVM methods, an embodiment does not require pre-processing of the vocabularies or tuning to provide accurate categorisation.
4) The performance and accuracy of an embodiment are similar to, and sometimes better than, the Bayesian and SVM algorithms.

5) An embodiment provides consistent and stable results on new categories and languages, while Bayesian and SVM require significant human intervention for preparation and tuning.
6) An embodiment uses a statistical analysis approach and can be utilised in other fields such as data research and data mining.
7) The complexity of implementation of an embodiment is minimal compared to other well-known text categorization techniques.
8) An embodiment produces a different pattern of categorisation results to existing methods, in that the valid/invalid matches and uncategorised results for a set of documents are likely to be different to any other method. The consequence of this is that the embodiment is suited for combination with existing methods to produce an improved categorisation for a new document. The combination of categorisation methods is described in the patent application CATEGORISATION OF DATA USING MULTIPLE CATEGORISATION ENGINES.
While the present invention has been illustrated by the description of the embodiments thereof, and while the embodiments have been described in considerable detail, it is not the intention of the applicant to restrict or in any way limit the scope of the appended claims to such detail. Additional advantages and modifications will readily appear to those skilled in the art. Therefore, the invention in its broader aspects is not limited to the specific details, representative apparatus and method, and illustrative examples shown and described. Accordingly, departures may be made from such details without departing from the spirit or scope of applicant's general inventive concept.

Claims

1. A method for categorising an input data object using a model comprising a plurality of patterns associated with at least one weighting for a category, including the steps of: i) identifying patterns within the input data object that correspond to at least some of the patterns within the model; ii) for each identified pattern, determining a weighting for at least one category from the model; iii) calculating a score for the input data object for at least one category based at least in part on the weightings of the identified patterns and the frequency of the identified patterns within the input data object; and iv) categorising the input data object based at least in part on the calculated score.
2. A method as claimed in claim 1 wherein at least some of the identified patterns are associated with a plurality of weightings for a plurality of categories.
3. A method as claimed in any one of the preceding claims wherein a plurality of scores is calculated for the input data object for a plurality of categories.
4. A method as claimed in any one of the preceding claims wherein the input data object is categorised in dependence on the calculated score only when the calculated score meets a predefined threshold.
5. A method as claimed in claim 4 wherein the predefined threshold is an empirically-derived threshold.
6. A method as claimed in any one of the preceding claims wherein the score sj for a category cj is calculated as follows:

sj = score(cj) / |P|

where |P| is the number of identified patterns in the input data object, and score(cj) is the sum of the weightings for all identified patterns associated with the category cj.
7. A method as claimed in any one of the preceding claims wherein the patterns are words.
8. A method as claimed in any one of the preceding claims wherein the input data object is a document.
9. A method as claimed in any one of the preceding claims wherein the input data object is a web page.
10. A method as claimed in any one of the preceding claims wherein the model is generated in accordance with the method of claim 14.
11. A method as claimed in any one of the preceding claims wherein an artificial neural network is used to categorise the input data object using the calculated score.
12. A method as claimed in claim 11 wherein an input set is used by the neural network to categorise the input data object, and wherein the input set comprises one or more of the calculated scores.
13. A method as claimed in claim 12 wherein each calculated score is calculated as the sum of the weightings for the identified patterns in the input data object divided by the number of identified patterns in the input data object.
14. A method for generating a model for categorising input data objects, including the steps of: i) associating each data object of a plurality of data objects with one of a plurality of categories; ii) extracting a plurality of patterns from the plurality of data objects; iii) calculating a weighting for each pattern for at least one category based at least in part on the frequency of the pattern within data objects associated with that category compared to the frequency of the pattern within all data objects; and iv) inserting each weighting into the model.
15. A method as claimed in claim 14 wherein the weighting is only inserted into the model if the weighting meets a predefined threshold.
16. A method as claimed in claim 15 wherein the predefined threshold is an empirically-derived threshold.
17. A method as claimed in any one of claims 14 to 16 wherein each data object is only associated with one category.
18. A method as claimed in any one of claims 14 to 17 wherein a weighting is calculated for one or more of the patterns for a plurality of categories.
19. A method as claimed in any one of claims 14 to 18 wherein the weighting w'ij for a pattern wi for a category cj is calculated as follows:

w'ij = count(wi, cj) / count(wi)

where count(wi, cj) is the frequency of the pattern in all data objects associated with the category cj, and count(wi) is the frequency of the pattern in all data objects.
20. A method as claimed in any one of claims 14 to 19 wherein the patterns are words.
21. A method as claimed in any one of claims 14 to 20 wherein the data objects are documents.
22. A method as claimed in any one of claims 14 to 21 wherein the data objects are web pages.
23. A method as claimed in any one of claims 14 to 22 including the step of training an artificial neural network using the model.
24. A method as claimed in claim 23 wherein the neural network is trained on a set of neural network patterns for each data object.
25. A method as claimed in claim 24 wherein each neural network pattern includes an input set and an output set.
26. A method as claimed in claim 25 wherein the input set includes a set of elements for each category, and wherein each element is based at least in part on the sum of the weightings for the patterns in the data object divided by the number of patterns in the data object.
27. A system for categorising an input data object using a model, including: a processor arranged for generating a model for categorising input data objects by associating each data object of a plurality of data objects with one of a plurality of categories; extracting a plurality of patterns from the plurality of data objects; calculating a weighting for each pattern for at least one category in dependence on the frequency of each pattern within the data objects associated with that category compared to the frequency of the pattern within all data objects; and inserting each weighting into the model; a processor arranged for categorising the input data object using the model by identifying patterns within the input data object that correspond to at least some of the patterns within the model; for each identified pattern, determining a weighting for at least one category from the model; calculating a score for the input data object for at least one category based on the weightings of the identified patterns and the frequency of the identified patterns within the input data object; and categorising the input data object in dependence on the calculated score; and a memory arranged for storing the model.
28. A system arranged for performing the method of any one of claims 1 to 26.
29. A computer program arranged for performing the method or system of any one of the preceding claims.
30. A storage media arranged for storing a computer program as claimed in claim 29.
PCT/GB2007/003370 2006-09-07 2007-09-07 Categorisation of data using a model WO2008029150A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
UA200609646 2006-09-07
UA200609646 2006-09-07
GB0624665.6 2006-12-11
GB0624665A GB2442286A (en) 2006-09-07 2006-12-11 Categorisation of data e.g. web pages using a model

Publications (1)

Publication Number Publication Date
WO2008029150A1 true WO2008029150A1 (en) 2008-03-13

Family

ID=38734927

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2007/003370 WO2008029150A1 (en) 2006-09-07 2007-09-07 Categorisation of data using a model

Country Status (1)

Country Link
WO (1) WO2008029150A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120323866A1 (en) * 2011-02-28 2012-12-20 International Machines Corporation Efficient development of a rule-based system using crowd-sourcing

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020022956A1 (en) * 2000-05-25 2002-02-21 Igor Ukrainczyk System and method for automatically classifying text
US6446061B1 (en) * 1998-07-31 2002-09-03 International Business Machines Corporation Taxonomy generation for document collections
US20030167267A1 (en) * 2002-03-01 2003-09-04 Takahiko Kawatani Document classification method and apparatus

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6446061B1 (en) * 1998-07-31 2002-09-03 International Business Machines Corporation Taxonomy generation for document collections
US20020022956A1 (en) * 2000-05-25 2002-02-21 Igor Ukrainczyk System and method for automatically classifying text
US20030167267A1 (en) * 2002-03-01 2003-09-04 Takahiko Kawatani Document classification method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ARUL PRAKASH ASIRVATHAM ET AL: "Web Page Classification based on Document Structure", INTERNET CITATION, 2001, XP002454563, Retrieved from the Internet <URL:http://citeseer.ist.psu.edu/cache/papers/cs/25574/http:zSzzSzgdit.iii t.netzSz~kranthizSzprofessionalzSzpaperszSzieeeIndia_wpcds.pdf/asirva tham01web.pdf> [retrieved on 20071011] *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120323866A1 (en) * 2011-02-28 2012-12-20 International Machines Corporation Efficient development of a rule-based system using crowd-sourcing
US8635197B2 (en) 2011-02-28 2014-01-21 International Business Machines Corporation Systems and methods for efficient development of a rule-based system using crowd-sourcing
US8949204B2 (en) * 2011-02-28 2015-02-03 International Business Machines Corporation Efficient development of a rule-based system using crowd-sourcing


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07804170

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07804170

Country of ref document: EP

Kind code of ref document: A1