CN103106262B - Method and apparatus for document classification and support vector machine model generation - Google Patents

Method and apparatus for document classification and support vector machine model generation


Publication number: CN103106262B
Authority: CN (China)
Prior art keywords: classification, document, training, machine model, subclass
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN201310033125.XA
Other languages: Chinese (zh)
Other versions: CN103106262A (en)
Inventor: 戴明洋
Current assignee: Sina Technology China Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Sina Technology China Co Ltd
Application filed by Sina Technology China Co Ltd; priority to CN201310033125.XA
Publication of application CN103106262A; application granted; publication of grant CN103106262B


Abstract

The invention discloses a method and apparatus for document classification and support vector machine model generation. The method comprises: determining the category of a document to be classified according to the feature vector of that document and a support vector machine model generated from a training set that has undergone category flattening. The category flattening of the training set comprises: for each training sample in the training set, sorting the categories pre-assigned to that sample by hierarchy level; then, for each of those categories, starting from the highest level, judging whether any subcategory of that category is also among the sample's categories; if so, removing that category from the sample's categories. Because the training set is first flattened according to the hierarchical relationships between categories, the resulting support vector machine model is suitable for classifying documents with multi-level categories, and the classification results achieve good accuracy.

Description

Method and apparatus for document classification and support vector machine model generation
Technical field
The present invention relates to computer processing technology, and in particular to a method and apparatus for document classification and support vector machine model generation.
Background technology
In recent years, the rapid development of the Internet has led to explosive growth of document resources on the Web. These documents carry large volumes of data with complex, varied content. Compared with the structured information in databases, unstructured or semi-structured web documents are richer and more heterogeneous. To make full use of these document resources, so that users can quickly and effectively find the information they need and extract the potentially valuable information within it, the documents need to be classified.
At present, automatic document classification usually adopts methods based on support vector machine (SVM) models. Such a method comprises a training stage and a classification stage. Several SVM-based automatic document classification methods exist in the prior art; one of them is introduced in detail below.
In the training stage, the SVM model is trained as follows: category feature vectors are obtained from the pre-classified documents in the training set; from the set of category feature vectors, the SVM model and an effective word set (also called a dictionary) can be obtained. For ease of description, the samples in the training set are called training samples herein.
A concrete method for obtaining the category feature vectors from the pre-classified training samples, whose flow is shown in Fig. 1, comprises the following steps:
S101: Segment each training sample in the training set into words to obtain the word set of each training sample, and delete the stop words therein.

The training set collects various pre-classified documents; usually a manually classified corpus is adopted. To guarantee the stability and convergence of the SVM model obtained in the training stage, the number of documents in the training set is usually greater than a certain threshold.

A document (training sample) consists of a continuous sequence of characters, and the word is the basic unit of a document. Word segmentation is the process of dividing the continuous character sequence of a document into individual words; the words thus obtained form the word set of the document.
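As an illustration of the segmentation step, the following Python sketch uses forward maximum matching against a small vocabulary. The patent does not name a segmentation algorithm, so the method, function names, and toy vocabulary here are all assumptions; a real system would use a trained segmenter.

```python
def segment(text, vocab, max_len=4):
    """Split a string into words by greedily matching the longest vocabulary entry
    at each position; unmatched single characters are emitted as-is."""
    words, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + l] in vocab or l == 1:
                words.append(text[i:i + l])
                i += l
                break
    return words

vocab = {"support", "vector", "machine", "model"}
print(segment("supportvectormachine", vocab, max_len=7))
# ['support', 'vector', 'machine']
```

The same greedy scheme applies to character sequences in any language once a vocabulary is available.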
S102: For each category, count the frequency with which each word occurs in the word sets of that category's training samples.

For example, suppose the training samples in the training set fall into q categories, denoted $c_1, c_2, \dots, c_q$, where q is a natural number greater than 2;

and the word sets of all training samples together contain n words, denoted $t_1, t_2, \dots, t_n$, where n is a natural number greater than 2.

For the i-th category, the frequency (number of occurrences) of the j-th word in the word sets of the i-th category's training samples is counted and denoted $m_{ij}$.
S103: Build the category-word matrix.

From the per-category word frequencies counted above, the word frequency vector of each category is obtained; for example, the word frequency vector of the i-th category is $\vec{c_i} = (m_{i1}, m_{i2}, \dots, m_{in})$.

The q × n category-word matrix is then $C_{q \times n} = [\vec{c_1}, \vec{c_2}, \dots, \vec{c_q}]^T$.
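A minimal Python sketch of steps S101–S103 (the function name and data layout are my own; the patent specifies only the counting itself):

```python
from collections import Counter

def build_category_word_matrix(training_set):
    """training_set: list of (word_list, category) pairs, one per training sample.
    Returns (categories, words, matrix) where matrix[i][j] is the count m_ij of
    word j over all samples of category i."""
    counts = {}                      # category -> Counter of word frequencies
    for words, cat in training_set:
        counts.setdefault(cat, Counter()).update(words)
    categories = sorted(counts)
    words = sorted({w for c in counts.values() for w in c})
    matrix = [[counts[c][w] for w in words] for c in categories]
    return categories, words, matrix

samples = [(["big", "data", "cloud"], "tech"),
           (["cloud", "cloud"], "tech"),
           (["goal", "match"], "sport")]
cats, words, M = build_category_word_matrix(samples)
print(cats)    # ['sport', 'tech']
print(words)   # ['big', 'cloud', 'data', 'goal', 'match']
print(M)       # [[0, 0, 0, 1, 1], [1, 3, 1, 0, 0]]
```

Each row of `M` is one word frequency vector $\vec{c_i}$; stacking the rows gives $C_{q \times n}$.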
A concrete method for obtaining the SVM model from the category-word matrix, whose flow is shown in Fig. 2, comprises the following steps:
S201: From the category-word matrix, calculate the inverse document frequency of each word.

Specifically, the inverse category frequency $ICF_k$ of the k-th of the n words is computed as formula 1:

$ICF_k = \log\left(\dfrac{q}{CF_k} + 0.01\right)$ (formula 1)

where $CF_k$ is the number of categories whose samples contain the k-th word. Here the inverse category frequency $ICF_k$ serves as the inverse document frequency (IDF) $IDF_k$ of the k-th word; the larger the value of $ICF_k$ ($IDF_k$), the stronger the category discrimination of the k-th word.
S202: Sort the words by inverse document frequency, and obtain the effective word set (also called the dictionary) from the ranking.

The n words are sorted by their inverse document frequencies (inverse category frequencies), and, according to a preset effective-word parameter, the top-ranked words are extracted to form the effective word set. Specifically, for example, if the preset parameter is an effective word count g, the top g words and their inverse document frequencies form the effective word set; or, if the preset parameter is an effective word percentage h, the top n × h words and their inverse document frequencies form the effective word set.
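Steps S201–S202 can be sketched as follows, reading formula 1 as $ICF_k = \log(q/CF_k + 0.01)$ with $CF_k$ the number of categories containing word k — an assumed reconstruction of the patent's garbled formula, and function names of my own:

```python
import math

def inverse_category_frequency(matrix, q):
    """ICF_k = log(q / CF_k + 0.01), where CF_k is the number of categories
    whose samples contain word k at least once."""
    icf = []
    for k in range(len(matrix[0])):
        cf = sum(1 for i in range(q) if matrix[i][k] > 0)
        icf.append(math.log(q / cf + 0.01))
    return icf

def effective_words(words, icf, g):
    """Keep the g words with the highest ICF (the 'effective word set')."""
    ranked = sorted(zip(words, icf), key=lambda t: t[1], reverse=True)
    return ranked[:g]

matrix = [[2, 1, 0],     # category "sport": goal=2, cloud=1, data=0
          [0, 1, 3]]     # category "tech":  goal=0, cloud=1, data=3
icf = inverse_category_frequency(matrix, q=2)
print(effective_words(["goal", "cloud", "data"], icf, g=2))
```

"cloud", appearing in both categories, gets a near-zero ICF and is dropped; "goal" and "data", each confined to one category, are retained as discriminative.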
S203: Rebuild the category-word matrix from the effective word set.

The matrix elements of the original category-word matrix whose words are not contained in the effective word set are removed, and the remaining elements form the new category-word matrix.

If the effective word set contains p words, then for the i-th of the q categories the rebuilt word frequency vector is $\vec{c_i}' = (m_{i1}, m_{i2}, \dots, m_{ip})$, where $m_{ir}$ is the frequency of the r-th of the p words in the i-th category.

From the rebuilt word frequency vectors of the categories, the rebuilt category-word matrix is $C'_{q \times p} = [\vec{c_1}', \vec{c_2}', \dots, \vec{c_q}']^T$.
S204: From the rebuilt category-word matrix, calculate the term frequency (TF) of each word and obtain the word frequency vector of each category.

The word frequency $tf_{ij}$ of the j-th word in the word sets of the i-th category's training samples is computed as formula 2:

$tf_{ij} = \dfrac{m_{ij}}{\max(m_{i1}, m_{i2}, \dots, m_{ir}, \dots, m_{ip})}$ (formula 2)

Thus the word frequency vector of the i-th category is $\vec{tf_i} = (tf_{i1}, tf_{i2}, \dots, tf_{ip})$, where $tf_{ir}$ is the word frequency of the r-th of the p words in the i-th category.
S205: Build the SVM model from the TF of each word and the IDF of each word.

Specifically, for each category, the feature vector of the category is calculated from its rebuilt word frequency vector and the IDF of each of the p words. The feature vector of the i-th category is $\vec{v_i} = (tfidf_{i1}, tfidf_{i2}, \dots, tfidf_{ip})$, where $tfidf_{ir}$ is the product of the word frequency $tf_{ir}$ of the r-th of the p words in the i-th category and that word's inverse document frequency $IDF_r$.

The SVM model can then be built from the feature vectors of the categories: according to the feature vectors, the hyperplanes corresponding to the categories in the support vector model are determined. Specifically, for every two categories, the optimal separating hyperplane is calculated under the principle of margin maximization, and the support vectors found thereby serve as the important parameters of the final support vector model.
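Steps S204–S205 up to the category feature vectors can be sketched as below; the margin-maximization step itself, which would require a quadratic-programming solver, is omitted, and the names are my own:

```python
def category_feature_vectors(matrix, icf):
    """For each category i, tf_ij = m_ij / max_r(m_ir) (formula 2), and the
    feature component is tfidf_ij = tf_ij * ICF_j."""
    vectors = []
    for row in matrix:
        peak = max(row) or 1          # guard against an all-zero row
        vectors.append([(m / peak) * w for m, w in zip(row, icf)])
    return vectors

matrix = [[2, 1, 0],                  # rebuilt C' over p = 3 effective words
          [0, 1, 3]]
icf = [0.7, 0.01, 0.7]                # illustrative ICF values, not computed here
vectors = category_feature_vectors(matrix, icf)
print(vectors[0])   # [0.7, 0.005, 0.0]
```

Each $\vec{v_i}$ produced this way is one row of input to the pairwise hyperplane training.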
After the SVM model is obtained, documents can be automatically classified according to it; this is the classification stage. The method flow of the classification stage, shown in Fig. 3, comprises the following steps:
S301: Segment the document to be classified into words to obtain its word set.
S302: Calculate the feature vector of the document to be classified.

Specifically, the feature vector of the document is $\vec{z} = (z_1, z_2, \dots, z_p)$, where $z_r$ is the product of the frequency with which the r-th of the p words in the effective word set occurs in the document and that word's inverse document frequency.
S303: Determine the category of the document to be classified according to its feature vector and the SVM model.

Specifically, the distances between the feature vector of the document and the hyperplanes corresponding to the categories in the SVM model are calculated, and the category of the document is determined from these distances. The distance from a hyperplane serves as the confidence that the document belongs to the corresponding category: the nearer a hyperplane is to the document's feature vector, the higher the confidence that the document belongs to the category corresponding to that hyperplane. The top K categories are taken as the categories of the document, where K is a preset value; for example, with K = 5, the top 5 categories are taken as the document's categories. In fact, the distance between the document's feature vector and a hyperplane reflects the similarity between the document's feature vector and the feature vector of the category corresponding to that hyperplane: the nearer the distance, the higher the similarity, and the higher the confidence that the document belongs to that category.
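The top-K selection of step S303 can be sketched with cosine similarity standing in for the hyperplane-distance computation — a simplification: the patent ranks by distance to the SVM hyperplanes, while this sketch ranks directly by similarity to the category feature vectors, which is what that distance reflects. Names are my own:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def classify(doc_vec, category_vectors, k=5):
    """Rank categories by similarity to the document vector; return the top K."""
    scored = sorted(((cosine(doc_vec, v), c)
                     for c, v in category_vectors.items()), reverse=True)
    return [c for _, c in scored[:k]]

cats = {"tech": [1.0, 0.0], "sport": [0.0, 1.0]}
print(classify([0.9, 0.1], cats, k=1))   # ['tech']
```

A full implementation would replace `cosine` with the signed distance to each trained hyperplane.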
The present inventor has found that the automatic document classification method of the prior art can classify documents with a single category level, but is not suitable for documents with multi-level categories: the classification results are inaccurate and unsatisfactory. Therefore, documents with multi-level categories, such as news documents, are still classified manually at present, which imposes a heavy workload on staff and is inefficient.
Summary of the invention
Embodiments of the invention provide a document classification method and apparatus based on multi-level categories, suitable for automatically classifying documents with multi-level categories.
According to one aspect of the invention, a document classification method is provided, comprising:

segmenting a document to be classified into words, and then determining the feature vector of the document;

determining the category of the document to be classified according to its feature vector and a support vector machine model generated from a training set that has undergone category flattening, wherein

the category flattening of the training set comprises: for each training sample in the training set, sorting the categories pre-assigned to that sample by hierarchy level; then, for each of those categories, starting from the highest level, judging whether any subcategory of that category is also among the sample's categories; if so, removing that category from the sample's categories.
Preferably, each category has been assigned a unique identifier, and the identifier of a category includes the category's hierarchy path information.

Preferably, the identifier of a category below the highest level is composed of the identifier of its parent category and the category's subcategory code, where the subcategory code is an identification code that is unique within the group of subcategories belonging to the same parent.
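One way to realize such identifiers is sketched below. The dot-separated codes are my own choice; the patent only requires that a child's identifier be composed of its parent's identifier plus a sibling-unique subcategory code, so the path can be read back from the identifier alone:

```python
def make_id(parent_id, sub_code):
    """A child category's ID is its parent's ID plus a code unique among siblings."""
    return f"{parent_id}.{sub_code}" if parent_id else sub_code

def ancestors(cat_id):
    """Recover every parent ID from a category ID (the 'hierarchy path information')."""
    parts = cat_id.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts))]

tech = make_id("", "01")            # "01"        (a level-1 category)
internet = make_id(tech, "02")      # "01.02"     (level 2, under "01")
giants = make_id(internet, "01")    # "01.02.01"  (level 3, under "01.02")
print(ancestors(giants))            # ['01', '01.02']
```

Tracing the classification result back to parent categories, as described below in the embodiments, then reduces to calling `ancestors` on each returned identifier.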
The support vector machine model is generated from the training set specifically by:

building a category-word matrix from the training set;

generating the feature vector of each category from the category-word matrix, and building the support vector machine model from the feature vectors of the categories.

Determining the category of the document to be classified according to its feature vector and the support vector machine model specifically comprises:

calculating the distances between the feature vector of the document and the hyperplanes corresponding to the categories in the support vector machine model;

determining the category of the document from the calculated distances.
According to another aspect of the invention, a support vector machine model generation method is also provided, comprising:

performing category flattening on a training set: for each training sample in the training set, sorting the categories pre-assigned to that sample by hierarchy level; then, for each of those categories, starting from the highest level, judging whether any subcategory of that category is also among the sample's categories, and if so, removing that category from the sample's categories;

generating the support vector machine model from the training set that has undergone category flattening.
Preferably, each category has been assigned a unique identifier, and the identifier of a category includes the category's hierarchy path information.

Preferably, the identifier of a category below the highest level is composed of the identifier of its parent category and the category's subcategory code, where the subcategory code is an identification code that is unique within the group of subcategories belonging to the same parent.
According to another aspect of the invention, a support vector machine model generating apparatus is also provided, comprising:

a training set flattening module, configured to perform category flattening on a training set: for each training sample in the training set, sorting the categories pre-assigned to that sample by hierarchy level; for each of those categories, starting from the highest level, judging whether any subcategory of that category is also among the sample's categories, and if so, removing that category from the sample's categories; and outputting the flattened training set;

a support vector machine model generation module, configured to receive the training set output by the training set flattening module and to generate the support vector machine model from the received training set.
Preferably, each category has been assigned a unique identifier, and the identifier of a category includes the category's hierarchy path information.

The identifier of a category below the highest level is composed of the identifier of its parent category and the category's subcategory code, where the subcategory code is an identification code that is unique within the group of subcategories belonging to the same parent.
In the embodiments of the invention, because the training set is first flattened according to the hierarchical relationships between categories, the flattened training set takes those relationships into account; the resulting support vector machine model is therefore suitable for classifying documents with multi-level categories, and the classification results achieve good accuracy.
Further, the identifier of a category includes the category's hierarchy path information, so that from the category identifiers in a document's classification result the parent categories can be traced back, yielding more detailed category attribute information for the document.
Brief description of the drawings
Fig. 1 is a flow chart of the prior-art method for obtaining the category-word matrix from the training set;
Fig. 2 is a flow chart of the prior-art method for obtaining the support vector machine model from the category-word matrix;
Fig. 3 is a flow chart of the prior-art method for automatically classifying a document according to the support vector machine model;
Fig. 4 is a flow chart of the category flattening of the training set according to an embodiment of the invention;
Fig. 5 is a flow chart of generating the support vector machine model according to an embodiment of the invention;
Fig. 6 is a block diagram of the internal structure of the support vector machine model generating apparatus according to an embodiment of the invention.
Detailed description of the invention
To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in more detail below with reference to the accompanying drawings and preferred embodiments. Note, however, that many of the details listed in the description are provided only to give the reader a thorough understanding of one or more aspects of the invention; these aspects of the invention can be realized even without these specific details.
Terms such as "module" and "system" used in this application are intended to include computer-related entities, for example, but not limited to, hardware, firmware, combinations thereof, software, or software in execution. For example, a module may be, but is not limited to: a process running on a processor, a processor, an object, an executable program, a thread of execution, a program, and/or a computer. For instance, both an application running on a computing device and the computing device itself may be modules. One or more modules may reside within a process and/or thread of execution, and a module may be localized on one computer and/or distributed between two or more computers.
The present inventor has analyzed the automatic document classification method of the prior art and found that, when the prior art is used to classify documents with multi-level categories, the hierarchical relationships (in other words, the membership relations) between categories are not considered, which can cause confusion in document classification. Consider, for example, the multi-level (tree-structured) category system shown in Table 1:

Table 1
Level 1: Science and technology
  Level 2: Internet
    Level 3: Internet forms
      Level 4: Social networks; Community; Venture capital; Microblog; China concept stocks
    Level 3: Internet giants
      Level 4: Baidu; Tencent; Facebook; Alibaba; Google; Twitter
    Level 3: Internet celebrities
      Level 4: Ma Yun; Lei Jun; Zuckerberg; Zhou Hongyi; Li Yanhong; Li Kaifu; Ma Huateng; Liu Qiangdong
    Level 3: Mobile Internet
    Level 3: E-commerce
  Level 2: Industry
    Level 3: Personages
      Level 4: Ballmer; Tim Cook; Liu Chuanzhi; Yang Yuanqing
    Level 3: Companies
      Level 4: Lenovo; Microsoft; Apple; Intel; Foxconn; Samsung
    Level 3: Key concepts
      Level 4: Cloud storage; Big data; Windows
The categories are divided into four levels, from high to low: level 1, level 2, level 3, and level 4. Under the level-1 category "Science and technology" there are two level-2 categories, "Internet" and "Industry"; the "Internet" and "Industry" categories belong to the level-1 category "Science and technology", i.e. they have a hierarchical membership relation with it. "Internet" and "Industry" are subcategories of the "Science and technology" category, and "Science and technology" is the parent category of "Internet" and "Industry".

Under the level-2 category "Internet" there are several level-3 categories, such as "Internet forms", "Internet giants", and "Internet celebrities". These level-3 categories belong to the level-2 category "Internet", i.e. they have a hierarchical membership relation with it; they are subcategories of "Internet", and "Internet" is their parent category.
Suppose an SVM model has been generated with the prior art from the feature vectors of the categories in Table 1, and suppose a document to be classified, after its feature vector is compared for distance with the hyperplanes of the categories in the model (i.e., for similarity with the category feature vectors), yields the following categories in decreasing order of similarity: Science and technology, Internet, Internet giants, Internet celebrities, Alibaba, Ma Yun. The top five are selected as the final classification result: Science and technology, Internet, Internet giants, Internet celebrities, Alibaba. The attribute that the document belongs to "Ma Yun" under "Internet celebrities" is thereby ignored; the classification is inaccurate and ineffective, and may cause confusion in the classification of many documents.
The present inventor therefore considered taking the hierarchical relationships between categories into account during the training stage, so that the trained SVM model is applicable to the automatic classification of documents with multi-level categories: before the SVM model is trained from the training set, the training set is first flattened according to the hierarchical relationships between categories; training on the flattened training set yields an SVM model suitable for classifying documents with multi-level categories.
Usually, each document in the training set is manually pre-assigned at least one category. For a document with multi-level categories, the assigned categories may include categories with hierarchical membership relations. For example, the categories of document A in the training set may include: Science and technology, Internet, Internet giants, Internet celebrities, Alibaba, Ma Yun. Here "Science and technology" and "Internet" are categories with a hierarchical membership relation, as are "Internet" and "Internet giants", "Internet" and "Internet celebrities", "Internet giants" and "Alibaba", and "Internet celebrities" and "Ma Yun".
The method flow of category flattening of the training set according to the hierarchical relationships between categories, shown in Fig. 4, comprises the following steps:

S401: For each training sample in the training set, sort the categories pre-assigned to the sample by hierarchy level.

For example, the sorted categories of document A above are: Science and technology, Internet, Internet giants, Internet celebrities, Alibaba, Ma Yun.
S402: For each training sample, for each of its categories in turn, starting from the highest level, judge whether any subcategory of that category is also among the sample's categories; if so, remove that category from the sample's categories.
For example, for document A above, it is judged that the categories of document A include "Internet", a subcategory of "Science and technology" (that is, a category that has a hierarchical membership relation with it), so "Science and technology" is removed from document A's categories. Likewise, "Internet", "Internet giants", and "Internet celebrities" are subsequently removed from document A's categories.

Finally, document A's categories retain only "Alibaba" and "Ma Yun", two subcategories with no hierarchical membership relation between them.
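The flattening of steps S401–S402, applied to document A, can be sketched as follows; the ASCII category names and the parent map are illustrative stand-ins:

```python
def flatten_categories(sample_cats, parent_of):
    """Sort a sample's categories from highest to lowest level, then drop any
    category that has one of its own descendants also present in the list.
    parent_of maps each category to its parent (None for a top-level category)."""
    def level(c):
        n = 0
        while parent_of.get(c) is not None:
            c = parent_of[c]; n += 1
        return n
    def is_descendant(child, anc):
        c = parent_of.get(child)
        while c is not None:
            if c == anc:
                return True
            c = parent_of.get(c)
        return False
    cats = sorted(sample_cats, key=level)        # highest level (root) first
    kept = list(cats)
    for c in cats:                               # S402: start from highest level
        if any(is_descendant(other, c) for other in kept if other != c):
            kept.remove(c)
    return kept

parent = {"science_tech": None, "internet": "science_tech",
          "internet_giant": "internet", "internet_celebrity": "internet",
          "alibaba": "internet_giant", "ma_yun": "internet_celebrity"}
doc_a = ["science_tech", "internet", "internet_giant",
         "internet_celebrity", "alibaba", "ma_yun"]
print(flatten_categories(doc_a, parent))         # ['alibaba', 'ma_yun']
```

As in the worked example above, only the two leaf categories without a membership relation between them survive.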
The method flow of generating the SVM model in the training stage according to an embodiment of the invention, shown in Fig. 5, comprises the following steps:

S501: Flatten the categories of the training set according to the hierarchical relationships between categories.

The concrete flattening method has been introduced in detail in the steps of Fig. 4 above and is not repeated here.

S502: Generate the SVM model from the flattened training set.

In this step, the category-word matrix is built from the flattened training set; the feature vector of each category is generated from the category-word matrix, and the SVM model is built from those feature vectors. The method of generating the SVM model from the flattened training set is the same as in the prior art and has been described in detail in the steps of Figs. 1 and 2 above; it is not repeated here.
After the SVM model is obtained according to the technical solution of the invention, a document to be classified is classified with it: the document is segmented into words to obtain its word set; the frequency with which each of the p words in the effective word set occurs in the document is counted; from these frequencies and the inverse document frequency of each word, the feature vector $\vec{z} = (z_1, z_2, \dots, z_p)$ of the document is obtained; the distances between this feature vector and the hyperplanes corresponding to the categories in the SVM model are calculated; and the category of the document is determined from the calculated distances. The detailed process is the same as the prior-art document classification method described in the steps of Fig. 3 above and is not repeated here.
In fact, if the categories of the training samples are not flattened and the SVM model is computed directly from those samples, the model will not be suitable for classifying documents with multi-level categories. In the invention, by contrast, when the flattened training set is used to generate the SVM model, no two of a document's categories have a hierarchical membership relation, and the categories retained are the lower-level ones. Therefore, when the category-word matrix is built from the flattened training set, the word frequencies of the lower-level categories increase; when the SVM model is then built, the feature vector space of the lower-level categories is larger. When documents are classified with this model, the hyperplanes of the lower-level categories are favored — in other words, the similarity to the lower-level categories tends to be higher — so the lower-level categories can be preferentially selected. The phenomenon of the prior art when classifying multi-level documents, in which some lower-level categories are ignored, the classification is poor, and the document categories become confused, therefore does not occur.
For example, if the support vector machine model of the present invention is used to classify the above document A: after the training set has undergone category flattening, the word frequencies of subcategories such as "Ma Yun" and "Alibaba" in the category-word matrix are increased, while the word frequencies of their parent categories and of the categories further up, such as "science and technology", "internet" and "internet giant", are reduced. When document A is classified with the support vector machine model obtained in this way, the feature vector of document A lies nearer to the hyperplanes of the subcategories "Ma Yun" and "Alibaba", i.e. it tends toward the feature vectors of those subcategories, and the similarity between the feature vector of document A and the feature vectors of "Ma Yun" and "Alibaba" is higher than its similarity to the feature vectors of "science and technology", "internet" and "internet giant". Therefore, after document A is classified with the support vector machine model, the order of similarity obtained is: "Ma Yun", "Alibaba", "internet giant", "internet celebrity", "internet", "science and technology", and the top five categories — "Ma Yun", "Alibaba", "internet giant", "internet celebrity" and "internet" — are selected as the final classification result of the document. Clearly, this classification result is more accurate, and the effect better, than that of the prior-art classification method.
In practical applications, each category has been assigned a unique identifier. More preferably, in the solution of the present invention the identifier of each category contains the category's hierarchy-path information. Thus, after a document to be classified has been classified by the support vector machine model of the present invention and its classification result obtained, the parent category of each category in the result can be traced back from the category's identifier, yielding more detailed category-attribute information for the document.
Specifically, a category identifier containing hierarchy-path information can be represented in numeric or alphabetic form, where the identifier of a category below the highest level is composed of the identifier of its parent category followed by the category's subcategory code. For a group of subcategories belonging to the same parent, each subcategory is assigned a code unique within that group; that is, the subcategory code is, for a group of subcategories belonging to the same parent, the unique code assigned to each subcategory within the group.
For example, take the categories with the hierarchical membership shown in Table 1 above. The highest-level category is the first-level category "science and technology", whose identifier can be "01".
For the second-level categories below it, the identifiers of "internet" and "industry" can be "0101" and "0102" respectively. It can be seen that the first two digits of the identifiers of "internet" and "industry" equal the identifier "01" of their parent category "science and technology", while the following two digits, "01" and "02", are the in-group codes of "internet" and "industry" respectively.
For the third-level categories — "internet form", "internet giant", "internet celebrity", "mobile internet" and "e-commerce" — the identifiers can be "010101", "010102", "010103", "010104" and "010105" respectively. The first four digits of these identifiers equal the identifier "0101" of their parent category "internet", while the following two digits, "01", "02", "03", "04" and "05", are the respective in-group codes of these five categories.
Thus, after the classification result of a document has been obtained, the parent category of each category in the result can easily be determined, and in turn the parent category of that parent category, and so on.
Obviously, and similarly, a category identifier containing hierarchy-path information can also be represented in alphabetic form; the method and principle are the same as in the numeric case and are not repeated here.
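Under this fixed-width numeric scheme (two digits per level, with the parent identifier as a prefix), building a subcategory identifier and tracing a category back to its ancestors reduce to simple string operations; a sketch, using the example identifiers above:

```python
def child_id(parent_id, code):
    """Identifier of a subcategory: parent identifier followed by the
    subcategory's in-group code."""
    return parent_id + code

def parents(cat_id, width=2):
    """All ancestor identifiers of a fixed-width path identifier,
    e.g. '010102' -> ['0101', '01']."""
    return [cat_id[:i] for i in range(len(cat_id) - width, 0, -width)]

giant = child_id("0101", "02")  # 'internet giant' under 'internet'
lineage = parents("010102")     # back to 'internet', then 'science and technology'
```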
The support vector machine model generating apparatus provided by the embodiment of the present invention, whose internal structure block diagram is shown in Fig. 6, comprises: a training-set flattening processing module 601 and a support vector machine model generation module 602.
The training-set flattening processing module 601 is configured to perform category flattening on the training set: for each training sample in the training set, the categories preset for the sample are sorted by hierarchy level; for each category of the sample, starting from the higher-level categories, it is judged whether a subcategory of that category exists among the sample's categories; if so, the category is removed from the sample's categories. The module then outputs the flattened training set. The identifier of each category contains the category's hierarchy-path information and can be represented in numeric or alphabetic form, where the identifier of a category below the highest level is composed of the identifier of its parent category and the category's subcategory code; the subcategory code is, for a group of subcategories belonging to the same parent, the unique code assigned to each subcategory within the group.
The support vector machine model generation module 602 is configured to receive the training set output by the training-set flattening processing module 601 and to generate the support vector machine model from the received training set. Module 602 can generate the support vector machine model from the training set by the same method as the prior art, which is not repeated here.
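Since the patent leaves the concrete SVM training method to the prior art, a sketch of module 602 can stand on any standard implementation; here scikit-learn is assumed purely for illustration, with a one-vs-rest linear SVM fitted over TF-IDF features on a tiny invented flattened training set:

```python
# Sketch of module 602 with scikit-learn standing in for the unspecified
# prior-art SVM implementation: a one-vs-rest linear SVM over TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny invented flattened training set: each sample keeps only its
# lowest-level category after category flattening.
docs = [
    "ma yun founded alibaba",
    "alibaba e-commerce platform",
    "mobile internet traffic grows",
    "smartphone mobile internet apps",
]
labels = ["Alibaba", "Alibaba", "mobile internet", "mobile internet"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)
pred = model.predict(["alibaba ma yun news"])[0]
```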
In the embodiment of the present invention, because the training set is first flattened according to the hierarchical relationships between categories, the flattened training set takes those relationships into account; the resulting support vector machine model is therefore suitable for classifying documents with multi-level categories, and the classification results have good accuracy.
Further, the identifier of each category contains the category's hierarchy-path information, so that the parent category of each category in a document's classification result can be traced back from the category identifiers, yielding more detailed category-attribute information for the document.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above-described embodiment methods can be completed by hardware instructed by a program, and the program can be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk or an optical disc.
The above are only preferred embodiments of the present invention. It should be pointed out that those of ordinary skill in the art can also make improvements and modifications without departing from the principles of the invention, and these improvements and modifications should also be regarded as falling within the scope of protection of the present invention.

Claims (10)

1. A document classification method, characterized by comprising:
segmenting a document to be classified into words, and then determining the feature vector of the document;
determining the category of the document to be classified according to its feature vector and a support vector machine model generated from a training set that has undergone category flattening, wherein
the category flattening of the training set comprises: for each training sample in the training set, sorting the categories preset for the training sample by hierarchy level; for each category of the training sample, starting from the higher-level categories, judging whether a subcategory of that category exists among the training sample's categories; and, if so, removing that category from the training sample's categories.
2. The method of claim 1, characterized in that each category has been assigned a unique identifier, and the identifier of a category contains the category's hierarchy-path information.
3. The method of claim 2, characterized in that the identifier of a category below the highest level is composed of the identifier of its parent category and the category's subcategory code, wherein the subcategory code is, for a group of subcategories belonging to the same parent, the unique code assigned to each subcategory within the group.
4. The method of any one of claims 1-3, characterized in that generating the support vector machine model from the training set specifically comprises:
building a category-word matrix from the training set; and
generating the feature vector of each category from the category-word matrix, and building the support vector machine model from the feature vectors of the categories; and
determining the category of the document to be classified according to its feature vector and the support vector machine model specifically comprises:
calculating the distances between the feature vector of the document and the hyperplanes respectively corresponding to the categories in the support vector machine model; and
determining the category of the document according to the calculated distances.
5. A support vector machine model generation method, characterized by comprising:
performing category flattening on a training set: for each training sample in the training set, sorting the categories preset for the training sample by hierarchy level; for each category of the training sample, starting from the higher-level categories, judging whether a subcategory of that category exists among the training sample's categories; and, if so, removing that category from the training sample's categories; and
generating the support vector machine model from the training set that has undergone category flattening.
6. The method of claim 5, characterized in that each category has been assigned a unique identifier, and the identifier of a category contains the category's hierarchy-path information.
7. The method of claim 6, characterized in that the identifier of a category below the highest level is composed of the identifier of its parent category and the category's subcategory code, wherein the subcategory code is, for a group of subcategories belonging to the same parent, the unique code assigned to each subcategory within the group.
8. A support vector machine model generating apparatus, characterized by comprising:
a training-set flattening processing module, configured to perform category flattening on a training set: for each training sample in the training set, sorting the categories preset for the training sample by hierarchy level; for each category of the training sample, starting from the higher-level categories, judging whether a subcategory of that category exists among the training sample's categories; if so, removing that category from the training sample's categories; and outputting the training set that has undergone category flattening; and
a support vector machine model generation module, configured to receive the training set output by the training-set flattening processing module and to generate the support vector machine model from the received training set.
9. The apparatus of claim 8, characterized in that each category has been assigned a unique identifier, and the identifier of a category contains the category's hierarchy-path information.
10. The apparatus of claim 9, characterized in that the identifier of a category below the highest level is composed of the identifier of its parent category and the category's subcategory code, wherein the subcategory code is, for a group of subcategories belonging to the same parent, the unique code assigned to each subcategory within the group.
CN201310033125.XA 2013-01-28 2013-01-28 The method and apparatus that document classification, supporting vector machine model generate Active CN103106262B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310033125.XA CN103106262B (en) 2013-01-28 2013-01-28 The method and apparatus that document classification, supporting vector machine model generate


Publications (2)

Publication Number Publication Date
CN103106262A CN103106262A (en) 2013-05-15
CN103106262B true CN103106262B (en) 2016-05-11

Family

ID=48314117


Country Status (1)

Country Link
CN (1) CN103106262B (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6446061B1 (en) * 1998-07-31 2002-09-03 International Business Machines Corporation Taxonomy generation for document collections
CN1725213A (en) * 2004-07-22 2006-01-25 国际商业机器公司 Method and system for structuring, maintaining personal sort tree, sort display file
CN102243645A (en) * 2010-05-11 2011-11-16 微软公司 Hierarchical content classification into deep taxonomies


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Hu Xuegang et al.; A Chinese text classification method based on the word vector space model; Journal of Hefei University of Technology (Natural Science Edition); 31 Oct. 2007; Vol. 30, No. 10; pp. 1262-1263 *
Ma Le et al.; A hierarchical web page classification algorithm based on SVM; Journal of Beijing Normal University (Natural Science); 2009; Vol. 45, No. 3 *

Also Published As

Publication number Publication date
CN103106262A (en) 2013-05-15


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230417

Address after: Room 501-502, 5/F, Sina Headquarters Scientific Research Building, Block N-1 and N-2, Zhongguancun Software Park, Dongbei Wangxi Road, Haidian District, Beijing, 100193

Patentee after: Sina Technology (China) Co.,Ltd.

Address before: 100080, International Building, No. 58 West Fourth Ring Road, Haidian District, Beijing, 20 floor

Patentee before: Sina.com Technology (China) Co.,Ltd.

TR01 Transfer of patent right