CN103106262A - Method and device of file classification and generation of support vector machine model

Info

Publication number: CN103106262A
Authority: CN (China)
Prior art keywords: classification, document, training set, machine model, subclass
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN201310033125.XA
Other languages: Chinese (zh)
Other versions: CN103106262B (en)
Inventor: 戴明洋
Current assignee: Sina Technology China Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Sina Technology China Co Ltd
Events: application filed by Sina Technology China Co Ltd; priority to CN201310033125.XA (CN103106262B/en); publication of CN103106262A (CN103106262A/en); application granted; publication of CN103106262B (CN103106262B/en)

Abstract

The invention discloses a method and device for document classification and for generating a support vector machine (SVM) model. The method determines the category of a document to be classified from the document's feature vector and an SVM model generated from a training set that has undergone category flattening. Category flattening of the training set proceeds as follows: for each training sample, the preset categories are sorted by hierarchy level from top to bottom; then, starting from the higher-level categories, each category is checked for whether one of its subcategories also appears among the sample's categories; if so, that category is removed from the sample's categories. Because flattening is performed according to the hierarchical relationships between categories, the resulting SVM model is suitable for classifying documents with multi-level categories and yields better classification accuracy.

Description

Method and apparatus for document classification and support vector machine model generation
Technical field
The present invention relates to computer processing technology, and in particular to methods and apparatus for document classification and for generating support vector machine models.
Background technology
In recent years, the rapid development of the Internet has led to explosive growth of document resources on the Web. These documents carry large volumes of data with complex and varied content. Compared with structured information in databases, unstructured or semi-structured web documents are richer but also more heterogeneous. To make full and effective use of these resources, so that users can quickly and effectively find the information they need and extract the potentially valuable information within it, the documents must be classified.
At present, automatic document classification usually relies on a support vector machine (SVM) model; such a method comprises a training stage and a classification stage. Several SVM-based automatic document classification methods exist in the prior art; one of them is described in detail below.
In the training stage, the support vector machine model is obtained as follows: from the documents in the training set whose categories have already been assigned, category feature vectors are computed; from the set of category feature vectors, the SVM model and an effective word set (also called a dictionary) are obtained. For ease of description, a sample in the training set is called a training sample herein.
One concrete method of obtaining the category feature vectors from the pre-categorized training samples, shown in Figure 1, comprises the following steps:
S101: Segment each training sample in the training set into words to obtain its word set, and delete the stop words therein.
The training set collects documents of various kinds whose categories have already been assigned; usually, a manually categorized corpus is used. To guarantee the stability and convergence of the SVM model obtained in the training stage, the number of documents in the training set usually exceeds a certain threshold.
A document (training sample) consists of a continuous sequence of characters, and words are the basic units of a document. Word segmentation is the process of dividing the continuous character sequence of a document into individual words; the words so obtained constitute the word set of the document.
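Step S101 can be sketched as follows. Whitespace splitting stands in for a real Chinese word segmenter (which the patent assumes but does not specify), and the sample sentence and stop-word list are illustrative only:

```python
def tokenize(document, stop_words):
    """Step S101: split a document into words and drop stop words.

    Whitespace splitting is a stand-in for a real Chinese word
    segmenter, which the patent assumes but does not specify.
    """
    words = document.lower().split()
    return [w for w in words if w not in stop_words]

print(tokenize("The internet giant launched a new social network",
               {"the", "a"}))
# ['internet', 'giant', 'launched', 'new', 'social', 'network']
```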
S102: For each category, count the frequency with which each word occurs in the word sets of that category's training samples.
For example, suppose the training samples in the training set fall into q categories, denoted c_1, c_2, ..., c_q, where q is a natural number greater than 2. Suppose the word sets of all training samples together contain n distinct words, denoted t_1, t_2, ..., t_n, where n is a natural number greater than 2. For the i-th category, the frequency (number of occurrences) of the j-th word in the word sets of that category's training samples is counted and denoted m_ij.
S103: Build the category-word matrix.

From the per-category word frequencies obtained by counting, the word-frequency vector of each category is formed; for example, the word-frequency vector of the i-th category is c_i = (m_i1, m_i2, ..., m_in). The q x n category-word matrix is then built as C_{q x n} = [c_1, c_2, ..., c_q]^T, that is:

    C_{q x n} = | m_11  m_12  ...  m_1n |
                | m_21  m_22  ...  m_2n |
                |  ...   ...  ...   ... |
                | m_q1  m_q2  ...  m_qn |
One concrete method of obtaining the support vector machine model from the category-word matrix, shown in Figure 2, comprises the following steps:
S201: From the category-word matrix, calculate the inverse document frequency of each word.
Specifically, the inverse category frequency ICF_k of the k-th of the n words is computed as in formula 1:

    ICF_k = log(q / CF_k + 0.01)    (formula 1)

where CF_k is the category frequency of the k-th word, i.e. the number of categories in whose word sets the k-th word occurs. The inverse category frequency ICF_k serves here as the inverse document frequency (IDF) IDF_k of the k-th word. The larger the value of ICF_k (IDF_k), the stronger the category-discriminating power of the k-th word.
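Formula 1 can be computed directly from the category-word matrix. The sketch below assumes CF_k counts the categories whose word sets contain word k, a reading of "category frequency" that the patent does not spell out:

```python
import math

def inverse_category_frequency(matrix):
    """Formula 1: ICF_k = log(q / CF_k + 0.01), with q the number
    of categories and CF_k (assumed here) the number of categories
    whose word sets contain word k."""
    q = len(matrix)
    n = len(matrix[0])
    icf = []
    for k in range(n):
        cf_k = sum(1 for i in range(q) if matrix[i][k] > 0)
        icf.append(math.log(q / cf_k + 0.01))
    return icf

icf = inverse_category_frequency([[2, 1, 0], [0, 3, 1]])
# words 0 and 2 occur in one category each, word 1 in both,
# so icf[0] == icf[2] > icf[1]
```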
S202: Sort the words by inverse document frequency and obtain the effective word set (also called the dictionary) from the ranking.
The n words are sorted by inverse document frequency (i.e. inverse category frequency), and according to a preset effective-word parameter, the top-ranked words are extracted to form the effective word set. Specifically, if the preset parameter is an effective word count g, the g top-ranked words together with their inverse document frequencies form the effective word set; if the preset parameter is an effective word percentage h, the top n x h words together with their inverse document frequencies form the effective word set.
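With a count parameter g, step S202 is a simple top-g cut over the ICF ranking; the vocabulary and scores below are made up for illustration:

```python
def effective_word_set(vocabulary, icf, g):
    """Step S202: keep the g words with the highest inverse
    document frequency, together with those frequencies."""
    ranked = sorted(zip(vocabulary, icf), key=lambda p: p[1], reverse=True)
    return ranked[:g]

print(effective_word_set(["apple", "fruit", "car"], [0.5, 2.0, 1.0], 2))
# [('fruit', 2.0), ('car', 1.0)]
```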
S203: Rebuild the category-word matrix according to the effective word set.

The elements of the original category-word matrix whose words are not contained in the effective word set are removed, forming a new category-word matrix. If the effective word set contains p words, then for the i-th of the q categories the rebuilt word-frequency vector is

    c_i' = (m_i1, m_i2, ..., m_ir, ..., m_ip)

where m_ir is the frequency of the r-th of the p words in the i-th category. From the rebuilt word-frequency vectors of the categories, the rebuilt category-word matrix is C'_{q x p} = [c_1', c_2', ..., c_q']^T.
S204: From the rebuilt category-word matrix, calculate the term frequency (TF) of each word to obtain the term-frequency vector of each category.
The term frequency tf_ij of the j-th word in the word sets of the i-th category's training samples is calculated as in formula 2:

    tf_ij = m_ij / max(m_i1, m_i2, ..., m_ir, ..., m_ip)    (formula 2)

This yields the term-frequency vector of the i-th category:

    tf_i = (tf_i1, tf_i2, ..., tf_ir, ..., tf_ip)

where tf_ir is the term frequency of the r-th of the p words in the i-th category.
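Formula 2 normalizes each category's counts by the row maximum; a one-function sketch over one row of the rebuilt matrix:

```python
def term_frequencies(row):
    """Formula 2: tf_ij = m_ij / max(m_i1, ..., m_ip), applied to
    one category's row of the rebuilt category-word matrix."""
    peak = max(row)
    return [m / peak for m in row]

print(term_frequencies([4, 2, 1]))
# [1.0, 0.5, 0.25]
```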
S205: Build the support vector machine model from the TF of each word and the IDF of each word.
Specifically, for each category, the category's feature vector is calculated from its rebuilt term-frequency vector and the IDF of each of the p words. The feature vector of the i-th category is

    v_i = (tfidf_i1, tfidf_i2, ..., tfidf_ir, ..., tfidf_ip)

where tfidf_ir is the product of the term frequency tf_ir of the r-th of the p words in the i-th category and the inverse document frequency IDF_r of that word.
From the feature vectors of the categories, the support vector machine model can be built: according to the category feature vectors, the hyperplanes corresponding to the categories in the support vector model are determined. Specifically, for every pair of categories, the optimal separating hyperplane is computed under the principle of margin maximization, and the support vectors found in this way serve as the key parameters of the final support vector model.
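Combining formula 2 with the per-word IDF gives the category feature vectors of step S205. The sketch stops at the tf-idf vectors; the margin-maximizing hyperplane computation itself would be handled by an off-the-shelf SVM trainer and is not reproduced here:

```python
def tfidf_vectors(matrix, idf):
    """Step S205: the feature vector of category i has entries
    tfidf_ir = tf_ir * IDF_r, with tf_ir from formula 2."""
    vectors = []
    for row in matrix:
        peak = max(row)
        vectors.append([(m / peak) * w for m, w in zip(row, idf)])
    return vectors

print(tfidf_vectors([[4, 2]], [1.0, 2.0]))
# [[1.0, 1.0]]
```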
After the support vector machine model has been obtained, documents can be classified automatically according to the model; this is the classification stage. The classification-stage flow, shown in Figure 3, comprises the following steps:
S301: Segment the document to be classified into words to obtain its word set.
S302: Calculate the feature vector of the document to be classified.
Specifically, the feature vector of the document to be classified is z = (z_1, z_2, ..., z_r, ..., z_p), where z_r is the product of the frequency with which the r-th of the p words in the effective word set occurs in the document and the inverse document frequency of that word.
S303: Determine the category of the document to be classified from its feature vector and the support vector machine model.
Specifically, the distances between the feature vector of the document and the hyperplanes corresponding to the categories in the SVM model are calculated, and the document's categories are determined from these distances. The distance to a hyperplane serves as the confidence that the document belongs to the corresponding category: the nearer the document's feature vector is to a category's hyperplane, the higher the confidence that the document belongs to that category. The top K categories are taken as the categories of the document, where K is a preset value; for example, with K set to 5, the first 5 categories are taken as the document's categories. In fact, the distance between the document's feature vector and a hyperplane reflects the similarity between the document's feature vector and the feature vector of the category corresponding to that hyperplane; the nearer the distance, the higher the similarity, and the higher the confidence that the document belongs to that category.
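Since the text equates hyperplane distance with similarity between feature vectors, step S303 can be approximated by a similarity ranking. The sketch below uses cosine similarity against the category feature vectors in place of the hyperplane-distance computation, with invented category names:

```python
import math

def top_k_categories(doc_vec, category_vectors, k):
    """Step S303 approximated by similarity ranking: score each
    category feature vector against the document's feature vector
    by cosine similarity and return the names of the K best."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = lambda v: math.sqrt(sum(x * x for x in v))
        return dot / (norm(a) * norm(b))
    ranked = sorted(category_vectors.items(),
                    key=lambda item: cos(doc_vec, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

cats = {"internet": [1.0, 0.2], "industry": [0.1, 1.0]}
print(top_k_categories([0.9, 0.1], cats, 1))
# ['internet']
```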
The present inventors have found that the prior-art automatic classification methods can classify documents with a single category level. However, they are not suitable for classifying documents with multi-level categories: the classification results are inaccurate and unsatisfactory. For this reason, documents with multi-level categories, such as news documents, are at present still classified manually, which makes the staff's workload heavy and the efficiency low.
Summary of the invention
Embodiments of the invention provide a document classification method and device based on multi-level categories, applicable to automatically classifying documents with multi-level categories.
According to one aspect of the invention, a document classification method is provided, comprising:

after segmenting a document to be classified into words, determining the feature vector of the document;

determining the category of the document from its feature vector and a support vector machine model generated from a training set processed by category flattening, wherein

the category flattening of said training set comprises: for each training sample in said training set, sorting the categories preset for the sample by hierarchy level; for each of the sample's categories, starting from the higher-level categories, judging whether a subcategory of that category also appears among the sample's categories; and, if so, removing that category from the sample's categories.
Preferably, each category is assigned a unique identifier, and the identifier of a category contains the category's hierarchy path information.

Preferably, the identifier of a category below the top level is composed of the identifier of its parent category and the category's subcategory code, where the subcategory code is a code that is unique to each subcategory within the group of subcategories belonging to the same parent.
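One way to realize such path-encoding identifiers is to concatenate a per-sibling code onto the parent's identifier. The two-digit codes and the recursive assignment below are an illustrative choice, not mandated by the patent:

```python
def assign_ids(tree, parent_id=""):
    """Give each category an identifier made of its parent's
    identifier plus a two-digit sibling code, so the identifier
    encodes the full hierarchy path (the code width is an
    illustrative choice)."""
    ids = {}
    for n, (name, children) in enumerate(sorted(tree.items()), start=1):
        ids[name] = parent_id + "%02d" % n
        ids.update(assign_ids(children, ids[name]))
    return ids

taxonomy = {"science and technology": {"internet": {"internet giant": {}},
                                       "industry": {}}}
ids = assign_ids(taxonomy)
# every identifier starts with its parent's identifier, so the
# ancestors of a category can be read off its own identifier
```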
Wherein generating said support vector machine model from the training set specifically comprises:

building a category-word matrix from said training set; and

generating the feature vector of each category from said category-word matrix, and building said support vector machine model from the category feature vectors; and

said determining the category of the document from its feature vector and the support vector machine model specifically comprises:

calculating the distances between the feature vector of the document and the hyperplanes respectively corresponding to the categories in said support vector machine model; and

determining the document's categories from the calculated distances.
According to another aspect of the invention, a support vector machine model generation method is also provided, comprising:

performing category flattening on a training set: for each training sample in said training set, sorting the categories preset for the sample by hierarchy level; for each of the sample's categories, starting from the higher-level categories, judging whether a subcategory of that category also appears among the sample's categories; and, if so, removing that category from the sample's categories; and

generating said support vector machine model from the training set processed by category flattening.
Preferably, each category is assigned a unique identifier, and the identifier of a category contains the category's hierarchy path information.

Preferably, the identifier of a category below the top level is composed of the identifier of its parent category and the category's subcategory code, where the subcategory code is a code that is unique to each subcategory within the group of subcategories belonging to the same parent.
According to another aspect of the invention, a support vector machine model generation device is also provided, comprising:

a training-set flattening module, configured to perform category flattening on a training set, namely: for each training sample in said training set, sorting the categories preset for the sample by hierarchy level; for each of the sample's categories, starting from the higher-level categories, judging whether a subcategory of that category also appears among the sample's categories; if so, removing that category from the sample's categories; and outputting the flattened training set; and

a support vector machine model generation module, configured to receive the training set output by said flattening module and generate said support vector machine model from the received training set.

Preferably, each category is assigned a unique identifier, and the identifier of a category contains the category's hierarchy path information.

The identifier of a category below the top level is composed of the identifier of its parent category and the category's subcategory code, where the subcategory code is a code that is unique to each subcategory within the group of subcategories belonging to the same parent.
Because the embodiments of the invention first flatten the categories of the training set according to the hierarchical relationships between categories, the flattened training set takes those relationships into account; the resulting support vector machine model is therefore suitable for classifying documents with multi-level categories and yields better classification accuracy.
Further, because a category's identifier contains its hierarchy path information, the category's parent categories can be traced back from the identifiers in a document's classification result, yielding more detailed category attribute information for the document.
Description of drawings
Fig. 1 is a flowchart of the prior-art method of obtaining the category-word matrix from the training set;
Fig. 2 is a flowchart of the prior-art method of obtaining the support vector machine model from the category-word matrix;
Fig. 3 is a flowchart of the prior-art method of automatically classifying documents with the support vector machine model;
Fig. 4 is a flowchart of the category flattening of the training set according to an embodiment of the invention;
Fig. 5 is a flowchart of generating the support vector machine model according to an embodiment of the invention;
Fig. 6 is a block diagram of the internal structure of the support vector machine model generation device according to an embodiment of the invention.
Embodiment
To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in more detail below with reference to the accompanying drawings and preferred embodiments. It should be noted that the many details listed in the specification are only intended to give the reader a thorough understanding of one or more aspects of the invention; these aspects of the invention can be realized even without these specific details.
Terms such as "module" and "system" used in this application are intended to cover computer-related entities, such as, but not limited to, hardware, firmware, combinations thereof, software, or software in execution. For example, a module may be, but is not limited to: a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. For instance, both an application running on a computing device and the computing device itself may be modules. One or more modules may reside within one process and/or thread of execution, and a module may be localized on one computer and/or distributed between two or more computers.
The present inventors analyzed the prior-art automatic document classification methods and found that, when a prior-art method classifies documents with multi-level categories, the hierarchical relationships between categories (in other words, the parent-child membership relations between categories) are not considered, which causes the documents' classifications to become chaotic. Consider, for example, the multi-level categories (which may also be called tree-structured categories) shown in Table 1 below:
Table 1

Level one: Science and technology
    Level two: Internet
        Level three: Internet form
            Level four: Social networks; Community; Venture capital investment; Microblogging; China concept stocks
        Level three: Internet giant
            Level four: Baidu; Tencent; Facebook; Alibaba; Google; Twitter
        Level three: Internet famous person
            Level four: Ma Yun; Lei Jun; Zuckerberg; Zhou Hongyi; Li Yanhong; Li Kaifu; Ma Huateng; Liu Qiangdong
        Level three: Mobile Internet
        Level three: E-commerce
    Level two: Industry
        Level three: Personage
            Level four: Ballmer; Tim Cook; Liu Chuanzhi; Yang Yuanqing
        Level three: Company
            Level four: Lenovo; Microsoft; Apple; Intel; Foxconn; Samsung
        Level three: Key concepts
            Level four: Cloud storage; Big data; Windows
Here the categories are divided into four levels, from high to low: level one, level two, level three, and level four. Level-one category "science and technology" contains two level-two categories, "internet" and "industry"; that is, "internet" and "industry" belong to "science and technology" and have a hierarchical membership relation with it: "internet" and "industry" are subcategories of "science and technology", and "science and technology" is the parent category of "internet" and "industry".

Level-two category "internet" contains several level-three categories such as "internet form", "internet giant", and "internet famous person"; these level-three categories belong to "internet" and have a hierarchical membership relation with it: they are subcategories of "internet", and "internet" is their parent category.
Suppose a support vector machine model is generated with the prior-art method from the feature vectors of the categories in Table 1, and a document to be classified is judged by its distances to the hyperplanes of the categories in the model — that is, its feature vector is compared for similarity with the feature vectors of the categories. Suppose the resulting categories, ordered by similarity from high to low, are: science and technology, internet, internet giant, internet famous person, Alibaba, Ma Yun. The top five — science and technology, internet, internet giant, internet famous person, Alibaba — are selected as the final classification result. The attribute that the document belongs to "Ma Yun" under "internet famous person" is then simply ignored; the classification is inaccurate and ineffective, and the classifications of many documents may become chaotic.
The present inventors therefore considered taking the hierarchical relationships between categories into account in the training stage, so that the trained support vector machine model becomes suitable for automatically classifying documents with multi-level categories: before the model is trained on the training set, the training set is first flattened according to the hierarchical relationships between categories; the model is then trained on the flattened training set, and the resulting model is suitable for classifying documents with multi-level categories.
Usually, each document in the training set is manually assigned at least one category in advance. For a document with multi-level categories, its categories may include categories with hierarchical membership relations. For example, the categories of a document A in the training set may include: science and technology, internet, internet giant, internet famous person, Alibaba, Ma Yun. Here "science and technology" and "internet" have a hierarchical membership relation, as do "internet" and "internet giant", "internet" and "internet famous person", "internet giant" and "Alibaba", and "internet famous person" and "Ma Yun".
The flow of flattening the training set according to the hierarchical relationships between categories, shown in Figure 4, comprises the following steps:
S401: For each training sample in the training set, sort the categories preset for the sample by hierarchy level.

For example, the sorted categories of document A above are: science and technology, internet, internet giant, internet famous person, Alibaba, Ma Yun.
S402: For each training sample, examine each of its categories in turn, starting from the higher levels, and judge whether a subcategory of that category also appears among the sample's categories; if so, remove that category from the sample's categories.
For example, for document A, "science and technology" is examined first: its subcategory "internet" also appears among document A's categories — i.e. there is a category, "internet", that has a hierarchical membership relation with it — so "science and technology" is removed from document A's categories. Likewise, "internet", "internet giant", and "internet famous person" are removed in turn.

Finally, document A's categories retain only "Alibaba" and "Ma Yun", two subcategories with no hierarchical membership relation between them.
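Steps S401–S402 can be sketched as follows. The parent_of mapping is an assumed encoding of the category hierarchy, and the example reproduces the document A case from the text:

```python
def flatten_categories(sample_categories, parent_of):
    """Category flattening (steps S401-S402): drop every category
    that has a descendant among the sample's own categories,
    keeping only the deepest labels.

    parent_of maps each category to its parent (absent at the top
    level) -- an assumed encoding of the hierarchy."""
    def ancestors(c):
        out = set()
        p = parent_of.get(c)
        while p is not None:
            out.add(p)
            p = parent_of.get(p)
        return out

    covered = set()   # categories that have a descendant present
    for c in sample_categories:
        covered |= ancestors(c)
    return [c for c in sample_categories if c not in covered]

parent_of = {"internet": "science and technology",
             "internet giant": "internet",
             "internet famous person": "internet",
             "Alibaba": "internet giant",
             "Ma Yun": "internet famous person"}
doc_a = ["science and technology", "internet", "internet giant",
         "internet famous person", "Alibaba", "Ma Yun"]
print(flatten_categories(doc_a, parent_of))
# ['Alibaba', 'Ma Yun']
```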
Accordingly, the flow of generating the support vector machine model in the training stage, as provided by the embodiment of the invention and shown in Figure 5, comprises the following steps:

S501: Flatten the training set according to the hierarchical relationships between categories.

The concrete flattening method is described in detail in the steps of Figure 4 above and is not repeated here.

S502: Generate the support vector machine model from the flattened training set.

In this step, the category-word matrix is built from the flattened training set; the feature vector of each category is generated from the category-word matrix; and the support vector machine model is built from the category feature vectors. The method of generating the model from the flattened training set is the same as in the prior art and is described in detail in the steps of Figures 1 and 2 above; it is not repeated here.
After the support vector machine model has been obtained according to the technical solution of the invention, a document to be classified is classified with the model: the document is segmented into words to obtain its word set; the frequency with which each of the p words in the effective word set occurs in the document is counted; and the document's feature vector is obtained from these frequencies and the inverse document frequency of each of the p words. The distances between the document's feature vector and the hyperplanes respectively corresponding to the categories in the model are then calculated, and the document's categories are determined from the calculated distances. The detailed process is the same as the prior-art document classification method described in the steps of Figure 3 above and is not repeated here.
In fact, if the categories of the training samples are not flattened and the support vector machine model is computed directly from these samples, the model is unsuitable for classifying documents with multi-level categories. When, as in the invention, the flattened training set is used to generate the model, no two of a document's remaining categories have a hierarchical membership relation, and the categories it retains are the lower-level ones. Consequently, when the category-word matrix is built from the flattened training set, the word frequencies of the lower-level categories increase; and when the model is built, the feature-vector space of the lower-level categories grows larger. When documents are classified with this model, the results lean toward the hyperplanes of the lower-level categories — in other words, the similarity to the lower-level categories is higher — so the lower-level categories can be selected preferentially. The prior-art phenomenon in which some lower-level categories are ignored when classifying multi-level documents, causing poor classification results and chaotic document classifications, thus no longer occurs.
For example, suppose the support vector machine model of the present invention is used to classify the above document A. After the training set has been flattened, the word frequencies of subcategories such as "Ma Yun" and "Alibaba" in the category-word matrix are increased, while the word frequencies of their parent categories and higher-level categories such as "science and technology", "internet" and "internet giant" are reduced. When document A is classified with the resulting support vector machine model, its feature vector will be closer to the hyperplanes of subcategories such as "Ma Yun" and "Alibaba"; that is, the similarity between the feature vector of document A and the feature vectors of "Ma Yun" and "Alibaba" will be higher than its similarity to the feature vectors of "science and technology", "internet" and "internet giant". Therefore, after document A is classified with the support vector machine model, the resulting similarity ranking will be: "Ma Yun", "Alibaba", "internet giant", "internet celebrity", "internet", "science and technology". Selecting the top five categories, namely "Ma Yun", "Alibaba", "internet giant", "internet celebrity" and "internet", as the final classification result of the document is obviously more accurate, and works better, than the classification result of the prior-art method.
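The ranking behaviour in this example can be made concrete with a small sketch. The term weights below are invented for illustration, and cosine similarity to per-category term vectors stands in for the patent's distance-to-hyperplane scoring; the point it demonstrates is that boosted weights on the low-level categories push them to the top of the ranking.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_categories(doc_vec, category_vecs, top_k=5):
    """Rank categories by similarity to the document and keep the top k."""
    scored = sorted(category_vecs,
                    key=lambda c: cosine(doc_vec, category_vecs[c]),
                    reverse=True)
    return scored[:top_k]

# Hypothetical category vectors: flattening has boosted the weights of the
# low-level categories ("Ma Yun", "Alibaba") relative to their parents.
category_vecs = {
    "Ma Yun":        {"ma": 3.0, "yun": 3.0, "alibaba": 1.0},
    "Alibaba":       {"alibaba": 3.0, "ecommerce": 1.0, "ma": 1.0},
    "internet giant": {"alibaba": 1.0, "giant": 1.0},
    "internet":      {"internet": 1.0, "portal": 1.0},
    "science and technology": {"science": 1.0, "technology": 1.0},
}
doc_a = {"ma": 2.0, "yun": 2.0, "alibaba": 2.0, "giant": 1.0}
print(rank_categories(doc_a, category_vecs, top_k=3))
# → ['Ma Yun', 'Alibaba', 'internet giant']
```

The low-level categories win the ranking because their vectors share the most weighted terms with the document, which is exactly the effect the flattened training set is said to produce.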
In practical applications, each category is assigned a unique identifier. More preferably, in the solution of the present invention the identifier of each category contains that category's hierarchical path information. Thus, after a document to be classified has been classified according to the support vector machine model of the present invention and its classification result obtained, the parent categories of each category in the result can be traced back from the identifiers of the categories in the result, yielding more detailed category attribute information for the document.
Specifically, a category identifier containing hierarchical path information can be represented in numeric or alphabetic form, where the identifier of a category below the highest level is composed of the identifier of its parent category followed by that category's subcategory code. For a group of subcategories belonging to the same parent, each subcategory is assigned a code unique within the group; that is to say, the subcategory code is an identification code that is unique to each subcategory within a group of subcategories belonging to the same parent.
For example, consider the categories with the hierarchical relationships shown in Table 1 above. The highest-level category is the first-level category "science and technology", whose identifier may be "01".
The identifiers of the second-level categories "internet" and "industry" below it may be "0101" and "0102" respectively. As can be seen, the first two digits of the identifiers of "internet" and "industry" equal the identifier "01" of their parent category "science and technology", and the following two digits "01" and "02" are the in-group codes of "internet" and "industry" respectively.
The identifiers of the third-level categories "internet portal", "internet giant", "internet celebrity", "mobile Internet" and "e-commerce" may be "010101", "010102", "010103", "010104" and "010105" respectively. The first four digits of these identifiers equal the identifier "0101" of their parent category "internet", and the following two digits "01", "02", "03", "04" and "05" are the respective in-group codes of these five categories.
Thus, after the classification result of a document has been obtained, the parent category of each category in the result can easily be determined, and then the parent of that parent category in turn.
Obviously, a category identifier containing hierarchical path information can similarly be represented in alphabetic form; the method and principle are the same as for the numeric form and are not repeated here.
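Tracing parent categories from such an identifier amounts to repeatedly stripping the trailing subcategory code. A minimal sketch, assuming fixed-width two-digit codes as in the numeric example above:

```python
def ancestor_ids(category_id, code_len=2):
    """Return the identifiers of all parent categories encoded in a
    hierarchical category identifier, nearest parent first.

    Assumes fixed-width subcategory codes (two digits in the example),
    so the parent identifier is simply the identifier with the last
    code stripped off.
    """
    parents = []
    while len(category_id) > code_len:
        category_id = category_id[:-code_len]
        parents.append(category_id)
    return parents

print(ancestor_ids("010102"))  # → ['0101', '01']
```

For instance, the identifier "010102" ("internet giant") yields its parent "0101" ("internet") and then "01" ("science and technology"), without any lookup table.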
The internal structure of the support vector machine model generation apparatus provided by an embodiment of the present invention, shown in Figure 6, comprises: a training set flattening module 601 and a support vector machine model generation module 602.
The training set flattening module 601 is used to flatten the categories of the training set: for each training sample in the training set, sort the categories preset for the sample by category level; for each category of the sample, starting from the higher-level categories, judge whether any of the sample's categories is a subcategory of that category; if so, remove that category from the sample's categories; then output the flattened training set. The identifier of a category contains that category's hierarchical path information and can be represented in numeric or alphabetic form, where the identifier of a category below the highest level is composed of the identifier of its parent category followed by that category's subcategory code; the subcategory code is an identification code that is unique to each subcategory within a group of subcategories belonging to the same parent.
The support vector machine model generation module 602 is used to receive the training set output by the training set flattening module 601 and generate the support vector machine model from the received training set. Module 602 may generate the model from the training set by the same method as in the prior art, which is not repeated here.
Because the embodiments of the present invention first flatten the categories of the training set according to the hierarchical relationships among categories, the flattened training set takes those hierarchical relationships into account, so the resulting support vector machine model is suitable for classifying documents with multi-level categories and yields more accurate classification results.
Further, since the identifier of a category contains that category's hierarchical path information, the parent categories of the categories in a document's classification result can be traced back from their identifiers, yielding more detailed category attribute information for the document.
Those of ordinary skill in the art will appreciate that all or part of the steps of the methods in the above embodiments can be implemented by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention. It should be noted that those skilled in the art can make improvements and modifications without departing from the principle of the invention, and such improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A document classification method, characterized in that it comprises:
after performing word segmentation on a document to be classified, determining the feature vector of the document to be classified;
determining the category of the document to be classified according to its feature vector and a support vector machine model generated from a training set whose categories have been flattened, wherein
the category flattening process for the training set comprises: for each training sample in the training set, sorting the categories preset for the sample by category level; and, for each category of the sample, starting from the higher-level categories, judging whether any of the sample's categories is a subcategory of that category, and if so, removing that category from the sample's categories.
2. The method of claim 1, characterized in that each category is assigned a unique identifier, and the identifier of a category contains that category's hierarchical path information.
3. The method of claim 2, characterized in that the identifier of a category below the highest level is composed of the identifier of its parent category followed by that category's subcategory code, wherein the subcategory code is an identification code that is unique to each subcategory within a group of subcategories belonging to the same parent.
4. The method of any one of claims 1 to 3, characterized in that generating the support vector machine model from the training set specifically comprises:
building a category-word matrix from the training set; and
generating the feature vector of each category from the category-word matrix, and building the support vector machine model from the feature vectors of the categories; and
determining the category of the document to be classified according to its feature vector and the support vector machine model specifically comprises:
calculating the distances between the feature vector of the document to be classified and the hyperplanes corresponding to each category in the support vector machine model; and
determining the category of the document to be classified according to the calculated distances.
5. A support vector machine model generation method, characterized in that it comprises:
flattening the categories of a training set: for each training sample in the training set, sorting the categories preset for the sample by category level; and, for each category of the sample, starting from the higher-level categories, judging whether any of the sample's categories is a subcategory of that category, and if so, removing that category from the sample's categories; and
generating the support vector machine model from the flattened training set.
6. The method of claim 5, characterized in that each category is assigned a unique identifier, and the identifier of a category contains that category's hierarchical path information.
7. The method of claim 6, characterized in that the identifier of a category below the highest level is composed of the identifier of its parent category followed by that category's subcategory code, wherein the subcategory code is an identification code that is unique to each subcategory within a group of subcategories belonging to the same parent.
8. A support vector machine model generation apparatus, characterized in that it comprises:
a training set flattening module, used to flatten the categories of a training set: for each training sample in the training set, sorting the categories preset for the sample by category level; for each category of the sample, starting from the higher-level categories, judging whether any of the sample's categories is a subcategory of that category, and if so, removing that category from the sample's categories; and outputting the flattened training set; and
a support vector machine model generation module, used to receive the training set output by the training set flattening module and generate the support vector machine model from the received training set.
9. The apparatus of claim 8, characterized in that each category is assigned a unique identifier, and the identifier of a category contains that category's hierarchical path information.
10. The apparatus of claim 9, characterized in that the identifier of a category below the highest level is composed of the identifier of its parent category followed by that category's subcategory code, wherein the subcategory code is an identification code that is unique to each subcategory within a group of subcategories belonging to the same parent.
CN201310033125.XA 2013-01-28 2013-01-28 Method and apparatus for document classification and support vector machine model generation Active CN103106262B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310033125.XA CN103106262B (en) 2013-01-28 2013-01-28 Method and apparatus for document classification and support vector machine model generation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310033125.XA CN103106262B (en) 2013-01-28 2013-01-28 Method and apparatus for document classification and support vector machine model generation

Publications (2)

Publication Number Publication Date
CN103106262A true CN103106262A (en) 2013-05-15
CN103106262B CN103106262B (en) 2016-05-11

Family

ID=48314117

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310033125.XA Active CN103106262B (en) 2013-01-28 Method and apparatus for document classification and support vector machine model generation

Country Status (1)

Country Link
CN (1) CN103106262B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104680192A (en) * 2015-02-05 2015-06-03 国家电网公司 Electric power image classification method based on deep learning
CN104850592A (en) * 2015-04-27 2015-08-19 小米科技有限责任公司 Method and device for generating model file
CN104978328A (en) * 2014-04-03 2015-10-14 北京奇虎科技有限公司 Hierarchical classifier obtaining method, text classification method, hierarchical classifier obtaining device and text classification device
WO2015180622A1 (en) * 2014-05-26 2015-12-03 北京奇虎科技有限公司 Method and apparatus for determining categorical attribute of queried word in search
CN105512145A (en) * 2014-09-26 2016-04-20 阿里巴巴集团控股有限公司 Method and device for information classification
CN106022599A (en) * 2016-05-18 2016-10-12 德稻全球创新网络(北京)有限公司 Industrial design talent level evaluation method and system
CN106126734A (en) * 2016-07-04 2016-11-16 北京奇艺世纪科技有限公司 The sorting technique of document and device
CN107194260A (en) * 2017-04-20 2017-09-22 中国科学院软件研究所 A kind of Linux Kernel association CVE intelligent Forecastings based on machine learning
CN107894986A (en) * 2017-09-26 2018-04-10 北京纳人网络科技有限公司 A kind of business connection division methods, server and client based on vectorization
CN109033478A (en) * 2018-09-12 2018-12-18 重庆工业职业技术学院 A kind of text information law analytical method and system for search engine
CN110808968A (en) * 2019-10-25 2020-02-18 新华三信息安全技术有限公司 Network attack detection method and device, electronic equipment and readable storage medium
CN111199170A (en) * 2018-11-16 2020-05-26 长鑫存储技术有限公司 Formula file identification method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6446061B1 (en) * 1998-07-31 2002-09-03 International Business Machines Corporation Taxonomy generation for document collections
CN1725213A (en) * 2004-07-22 2006-01-25 国际商业机器公司 Method and system for structuring, maintaining personal sort tree, sort display file
CN102243645A (en) * 2010-05-11 2011-11-16 微软公司 Hierarchical content classification into deep taxonomies

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6446061B1 (en) * 1998-07-31 2002-09-03 International Business Machines Corporation Taxonomy generation for document collections
CN1725213A (en) * 2004-07-22 2006-01-25 国际商业机器公司 Method and system for structuring, maintaining personal sort tree, sort display file
CN102243645A (en) * 2010-05-11 2011-11-16 微软公司 Hierarchical content classification into deep taxonomies

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HU Xuegang et al., "Chinese text classification method based on a word vector space model", Journal of Hefei University of Technology (Natural Science Edition) *
MA Le et al., "A hierarchical web page classification algorithm based on SVM", Journal of Beijing Normal University (Natural Science Edition) *
MA Le et al., "A hierarchical web page classification algorithm based on SVM", Journal of Beijing Normal University (Natural Science Edition), vol. 45, no. 3, 30 June 2009 (2009-06-30) *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104978328A (en) * 2014-04-03 2015-10-14 北京奇虎科技有限公司 Hierarchical classifier obtaining method, text classification method, hierarchical classifier obtaining device and text classification device
WO2015180622A1 (en) * 2014-05-26 2015-12-03 北京奇虎科技有限公司 Method and apparatus for determining categorical attribute of queried word in search
CN105512145A (en) * 2014-09-26 2016-04-20 阿里巴巴集团控股有限公司 Method and device for information classification
CN104680192B (en) * 2015-02-05 2017-12-12 国家电网公司 A kind of electric power image classification method based on deep learning
CN104680192A (en) * 2015-02-05 2015-06-03 国家电网公司 Electric power image classification method based on deep learning
CN104850592B (en) * 2015-04-27 2018-09-18 小米科技有限责任公司 The method and apparatus for generating model file
CN104850592A (en) * 2015-04-27 2015-08-19 小米科技有限责任公司 Method and device for generating model file
CN106022599A (en) * 2016-05-18 2016-10-12 德稻全球创新网络(北京)有限公司 Industrial design talent level evaluation method and system
CN106126734A (en) * 2016-07-04 2016-11-16 北京奇艺世纪科技有限公司 The sorting technique of document and device
CN106126734B (en) * 2016-07-04 2019-06-28 北京奇艺世纪科技有限公司 The classification method and device of document
CN107194260A (en) * 2017-04-20 2017-09-22 中国科学院软件研究所 A kind of Linux Kernel association CVE intelligent Forecastings based on machine learning
CN107894986A (en) * 2017-09-26 2018-04-10 北京纳人网络科技有限公司 A kind of business connection division methods, server and client based on vectorization
CN107894986B (en) * 2017-09-26 2021-03-30 北京纳人网络科技有限公司 Enterprise relation division method based on vectorization, server and client
CN109033478A (en) * 2018-09-12 2018-12-18 重庆工业职业技术学院 A kind of text information law analytical method and system for search engine
CN111199170A (en) * 2018-11-16 2020-05-26 长鑫存储技术有限公司 Formula file identification method and device, electronic equipment and storage medium
CN111199170B (en) * 2018-11-16 2022-04-01 长鑫存储技术有限公司 Formula file identification method and device, electronic equipment and storage medium
CN110808968A (en) * 2019-10-25 2020-02-18 新华三信息安全技术有限公司 Network attack detection method and device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN103106262B (en) 2016-05-11

Similar Documents

Publication Publication Date Title
CN103106262A (en) Method and device of file classification and generation of support vector machine model
Rathi et al. Sentiment analysis of tweets using machine learning approach
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN105893609B (en) A kind of mobile APP recommended method based on weighted blend
WO2022126971A1 (en) Density-based text clustering method and apparatus, device, and storage medium
Venugopalan et al. Exploring sentiment analysis on twitter data
CN103049433A (en) Automatic question answering method, automatic question answering system and method for constructing question answering case base
CN106844407B (en) Tag network generation method and system based on data set correlation
CN107871144A (en) Invoice trade name sorting technique, system, equipment and computer-readable recording medium
Albadarneh et al. Using big data analytics for authorship authentication of arabic tweets
Bai et al. Constructing sentiment lexicons in Norwegian from a large text corpus
CN109063147A (en) Online course forum content recommendation method and system based on text similarity
CN116911312B (en) Task type dialogue system and implementation method thereof
CN106600213B (en) Intelligent management system and method for personal resume
CN112256842A (en) Method, electronic device and storage medium for text clustering
Arasteh et al. ARAZ: A software modules clustering method using the combination of particle swarm optimization and genetic algorithms
Vishwakarma et al. A comparative study of K-means and K-medoid clustering for social media text mining
CN103412880A (en) Method and device for determining implicit associated information between multimedia resources
CN106708829A (en) Data recommendation method and data recommendation system
CN105550292B (en) A kind of Web page classification method based on von Mises-Fisher probabilistic models
CN104331507B (en) Machine data classification is found automatically and the method and device of classification
CN104991920A (en) Label generation method and apparatus
CN111026940A (en) Network public opinion and risk information monitoring system and electronic equipment for power grid electromagnetic environment
CN115329076A (en) Bank data screening processing method, device, system and medium
CN112784040B (en) Vertical industry text classification method based on corpus

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20230417

Address after: Room 501-502, 5/F, Sina Headquarters Scientific Research Building, Block N-1 and N-2, Zhongguancun Software Park, Dongbei Wangxi Road, Haidian District, Beijing, 100193

Patentee after: Sina Technology (China) Co.,Ltd.

Address before: 100080, International Building, No. 58 West Fourth Ring Road, Haidian District, Beijing, 20 floor

Patentee before: Sina.com Technology (China) Co.,Ltd.