CN103106262B - Method and apparatus for document classification and support vector machine model generation - Google Patents

Method and apparatus for document classification and support vector machine model generation


Publication number: CN103106262B
Authority: CN (China)
Prior art keywords: classification, document, training, machine model, subclass
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN201310033125.XA
Other languages: Chinese (zh)
Other versions: CN103106262A (en)
Inventor: 戴明洋
Current assignee: Sina Technology China Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Sina Technology China Co Ltd
Application filed by Sina Technology China Co Ltd; priority to CN201310033125.XA
Publication of application CN103106262A; application granted; publication of grant CN103106262B


Abstract

The invention discloses a method and apparatus for document classification and support vector machine model generation. The method comprises: determining the category of a document to be classified according to the feature vector of that document and a support vector machine model generated from a training set that has undergone category flattening. The category flattening of the training set comprises: for each training sample in the training set, sorting the categories pre-assigned to that sample by hierarchy level; then, for each of those categories, starting from the highest level, judging whether any subcategory of that category is also among the sample's categories; if so, removing that category from the sample's categories. Because the training set is first flattened according to the hierarchical relationships between categories, the resulting support vector machine model is suitable for classifying documents with multi-level categories, and the classification results achieve good accuracy.

Description

Method and apparatus for document classification and support vector machine model generation
Technical field
The present invention relates to computer processing technology, and in particular to a method and apparatus for document classification and support vector machine model generation.
Background technology
In recent years, the rapid development of the Internet has led to explosive growth of document resources on the Web. These documents carry large volumes of data with complex, varied content. Compared with the structured information in databases, unstructured or semi-structured web documents are richer and more heterogeneous. To make full use of these document resources, so that users can quickly and effectively find the information they need and extract the potentially valuable information within it, the documents need to be classified.
At present, automatic document classification usually adopts methods based on support vector machine (SVM) models. Such a method comprises a training stage and a classification stage. Several SVM-based automatic document classification methods exist in the prior art; one of them is introduced in detail below.
In the training stage, the SVM model is trained as follows: category feature vectors are obtained from the pre-classified documents in the training set; from the set of category feature vectors, the SVM model and an effective word set (also called a dictionary) can be obtained. For ease of description, the samples in the training set are called training samples herein.
A concrete method for obtaining the category feature vectors from the pre-classified training samples, whose flow is shown in Fig. 1, comprises the following steps:
S101: Segment each training sample in the training set into words to obtain the word set of each training sample, and delete the stop words therein.

The training set collects various pre-classified documents; usually a manually classified corpus is adopted. To guarantee the stability and convergence of the SVM model obtained in the training stage, the number of documents in the training set is usually greater than a certain threshold.

A document (training sample) consists of a continuous sequence of characters, and the word is the basic unit of a document. Word segmentation is the process of dividing the continuous character sequence of a document into individual words; the words thus obtained form the word set of the document.
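As an illustration of the segmentation step, the following Python sketch uses forward maximum matching against a small vocabulary. The patent does not name a segmentation algorithm, so the method, function names, and toy vocabulary here are all assumptions; a real system would use a trained segmenter.

```python
def segment(text, vocab, max_len=4):
    """Split a string into words by greedily matching the longest vocabulary entry
    at each position; unmatched single characters are emitted as-is."""
    words, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + l] in vocab or l == 1:
                words.append(text[i:i + l])
                i += l
                break
    return words

vocab = {"support", "vector", "machine", "model"}
print(segment("supportvectormachine", vocab, max_len=7))
# ['support', 'vector', 'machine']
```

The same greedy scheme applies to character sequences in any language once a vocabulary is available.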
S102: For each category, count the frequency with which each word occurs in the word sets of that category's training samples.

For example, suppose the training samples in the training set fall into q categories, denoted $c_1, c_2, \dots, c_q$, where q is a natural number greater than 2;

and the word sets of all training samples together contain n words, denoted $t_1, t_2, \dots, t_n$, where n is a natural number greater than 2.

For the i-th category, the frequency (number of occurrences) of the j-th word in the word sets of the i-th category's training samples is counted and denoted $m_{ij}$.
S103: Build the category-word matrix.

From the per-category word frequencies counted above, the word frequency vector of each category is obtained; for example, the word frequency vector of the i-th category is $\vec{c_i} = (m_{i1}, m_{i2}, \dots, m_{in})$.

The q × n category-word matrix is then $C_{q \times n} = [\vec{c_1}, \vec{c_2}, \dots, \vec{c_q}]^T$.
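A minimal Python sketch of steps S101–S103 (the function name and data layout are my own; the patent specifies only the counting itself):

```python
from collections import Counter

def build_category_word_matrix(training_set):
    """training_set: list of (word_list, category) pairs, one per training sample.
    Returns (categories, words, matrix) where matrix[i][j] is the count m_ij of
    word j over all samples of category i."""
    counts = {}                      # category -> Counter of word frequencies
    for words, cat in training_set:
        counts.setdefault(cat, Counter()).update(words)
    categories = sorted(counts)
    words = sorted({w for c in counts.values() for w in c})
    matrix = [[counts[c][w] for w in words] for c in categories]
    return categories, words, matrix

samples = [(["big", "data", "cloud"], "tech"),
           (["cloud", "cloud"], "tech"),
           (["goal", "match"], "sport")]
cats, words, M = build_category_word_matrix(samples)
print(cats)    # ['sport', 'tech']
print(words)   # ['big', 'cloud', 'data', 'goal', 'match']
print(M)       # [[0, 0, 0, 1, 1], [1, 3, 1, 0, 0]]
```

Each row of `M` is one word frequency vector $\vec{c_i}$; stacking the rows gives $C_{q \times n}$.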
A concrete method for obtaining the SVM model from the category-word matrix, whose flow is shown in Fig. 2, comprises the following steps:
S201: From the category-word matrix, calculate the inverse document frequency of each word.

Specifically, the inverse category frequency $ICF_k$ of the k-th of the n words is computed as formula 1:

$ICF_k = \log\left(\dfrac{q}{CF_k} + 0.01\right)$ (formula 1)

where $CF_k$ is the number of categories whose samples contain the k-th word. Here the inverse category frequency $ICF_k$ serves as the inverse document frequency (IDF) $IDF_k$ of the k-th word; the larger the value of $ICF_k$ ($IDF_k$), the stronger the category discrimination of the k-th word.
S202: Sort the words by inverse document frequency, and obtain the effective word set (also called the dictionary) from the ranking.

The n words are sorted by their inverse document frequencies (inverse category frequencies), and, according to a preset effective-word parameter, the top-ranked words are extracted to form the effective word set. Specifically, for example, if the preset parameter is an effective word count g, the top g words and their inverse document frequencies form the effective word set; or, if the preset parameter is an effective word percentage h, the top n × h words and their inverse document frequencies form the effective word set.
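Steps S201–S202 can be sketched as follows, reading formula 1 as $ICF_k = \log(q/CF_k + 0.01)$ with $CF_k$ the number of categories containing word k — an assumed reconstruction of the patent's garbled formula, and function names of my own:

```python
import math

def inverse_category_frequency(matrix, q):
    """ICF_k = log(q / CF_k + 0.01), where CF_k is the number of categories
    whose samples contain word k at least once."""
    icf = []
    for k in range(len(matrix[0])):
        cf = sum(1 for i in range(q) if matrix[i][k] > 0)
        icf.append(math.log(q / cf + 0.01))
    return icf

def effective_words(words, icf, g):
    """Keep the g words with the highest ICF (the 'effective word set')."""
    ranked = sorted(zip(words, icf), key=lambda t: t[1], reverse=True)
    return ranked[:g]

matrix = [[2, 1, 0],     # category "sport": goal=2, cloud=1, data=0
          [0, 1, 3]]     # category "tech":  goal=0, cloud=1, data=3
icf = inverse_category_frequency(matrix, q=2)
print(effective_words(["goal", "cloud", "data"], icf, g=2))
```

"cloud", appearing in both categories, gets a near-zero ICF and is dropped; "goal" and "data", each confined to one category, are retained as discriminative.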
S203: Rebuild the category-word matrix from the effective word set.

The matrix elements of the original category-word matrix whose words are not contained in the effective word set are removed, and the remaining elements form the new category-word matrix.

If the effective word set contains p words, then for the i-th of the q categories the rebuilt word frequency vector is $\vec{c_i}' = (m_{i1}, m_{i2}, \dots, m_{ip})$, where $m_{ir}$ is the frequency of the r-th of the p words in the i-th category.

From the rebuilt word frequency vectors of the categories, the rebuilt category-word matrix is $C'_{q \times p} = [\vec{c_1}', \vec{c_2}', \dots, \vec{c_q}']^T$.
S204: From the rebuilt category-word matrix, calculate the term frequency (TF) of each word and obtain the word frequency vector of each category.

The word frequency $tf_{ij}$ of the j-th word in the word sets of the i-th category's training samples is computed as formula 2:

$tf_{ij} = \dfrac{m_{ij}}{\max(m_{i1}, m_{i2}, \dots, m_{ir}, \dots, m_{ip})}$ (formula 2)

Thus the word frequency vector of the i-th category is $\vec{tf_i} = (tf_{i1}, tf_{i2}, \dots, tf_{ip})$, where $tf_{ir}$ is the word frequency of the r-th of the p words in the i-th category.
S205: Build the SVM model from the TF of each word and the IDF of each word.

Specifically, for each category, the feature vector of the category is calculated from its rebuilt word frequency vector and the IDF of each of the p words. The feature vector of the i-th category is $\vec{v_i} = (tfidf_{i1}, tfidf_{i2}, \dots, tfidf_{ip})$, where $tfidf_{ir}$ is the product of the word frequency $tf_{ir}$ of the r-th of the p words in the i-th category and that word's inverse document frequency $IDF_r$.

The SVM model can then be built from the feature vectors of the categories: according to the feature vectors, the hyperplanes corresponding to the categories in the support vector model are determined. Specifically, for every two categories, the optimal separating hyperplane is calculated under the principle of margin maximization, and the support vectors found thereby serve as the important parameters of the final support vector model.
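Steps S204–S205 up to the category feature vectors can be sketched as below; the margin-maximization step itself, which would require a quadratic-programming solver, is omitted, and the names are my own:

```python
def category_feature_vectors(matrix, icf):
    """For each category i, tf_ij = m_ij / max_r(m_ir) (formula 2), and the
    feature component is tfidf_ij = tf_ij * ICF_j."""
    vectors = []
    for row in matrix:
        peak = max(row) or 1          # guard against an all-zero row
        vectors.append([(m / peak) * w for m, w in zip(row, icf)])
    return vectors

matrix = [[2, 1, 0],                  # rebuilt C' over p = 3 effective words
          [0, 1, 3]]
icf = [0.7, 0.01, 0.7]                # illustrative ICF values, not computed here
vectors = category_feature_vectors(matrix, icf)
print(vectors[0])   # [0.7, 0.005, 0.0]
```

Each $\vec{v_i}$ produced this way is one row of input to the pairwise hyperplane training.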
After the SVM model is obtained, documents can be automatically classified according to it; this is the classification stage. The method flow of the classification stage, shown in Fig. 3, comprises the following steps:
S301: Segment the document to be classified into words to obtain its word set.
S302: Calculate the feature vector of the document to be classified.

Specifically, the feature vector of the document is $\vec{z} = (z_1, z_2, \dots, z_p)$, where $z_r$ is the product of the frequency with which the r-th of the p words in the effective word set occurs in the document and that word's inverse document frequency.
S303: Determine the category of the document to be classified according to its feature vector and the SVM model.

Specifically, the distances between the feature vector of the document and the hyperplanes corresponding to the categories in the SVM model are calculated, and the category of the document is determined from these distances. The distance from a hyperplane serves as the confidence that the document belongs to the corresponding category: the nearer a hyperplane is to the document's feature vector, the higher the confidence that the document belongs to the category corresponding to that hyperplane. The top K categories are taken as the categories of the document, where K is a preset value; for example, with K = 5, the top 5 categories are taken as the document's categories. In fact, the distance between the document's feature vector and a hyperplane reflects the similarity between the document's feature vector and the feature vector of the category corresponding to that hyperplane: the nearer the distance, the higher the similarity, and the higher the confidence that the document belongs to that category.
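The top-K selection of step S303 can be sketched with cosine similarity standing in for the hyperplane-distance computation — a simplification: the patent ranks by distance to the SVM hyperplanes, while this sketch ranks directly by similarity to the category feature vectors, which is what that distance reflects. Names are my own:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def classify(doc_vec, category_vectors, k=5):
    """Rank categories by similarity to the document vector; return the top K."""
    scored = sorted(((cosine(doc_vec, v), c)
                     for c, v in category_vectors.items()), reverse=True)
    return [c for _, c in scored[:k]]

cats = {"tech": [1.0, 0.0], "sport": [0.0, 1.0]}
print(classify([0.9, 0.1], cats, k=1))   # ['tech']
```

A full implementation would replace `cosine` with the signed distance to each trained hyperplane.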
The present inventor has found that the automatic document classification method of the prior art can classify documents with a single category level, but is not suitable for documents with multi-level categories: the classification results are inaccurate and unsatisfactory. Therefore, documents with multi-level categories, such as news documents, are still classified manually at present, which imposes a heavy workload on staff and is inefficient.
Summary of the invention
Embodiments of the invention provide a document classification method and apparatus based on multi-level categories, suitable for automatically classifying documents with multi-level categories.
According to one aspect of the invention, a document classification method is provided, comprising:

segmenting a document to be classified into words, and then determining the feature vector of the document;

determining the category of the document to be classified according to its feature vector and a support vector machine model generated from a training set that has undergone category flattening, wherein

the category flattening of the training set comprises: for each training sample in the training set, sorting the categories pre-assigned to that sample by hierarchy level; then, for each of those categories, starting from the highest level, judging whether any subcategory of that category is also among the sample's categories; if so, removing that category from the sample's categories.
Preferably, each category has been assigned a unique identifier, and the identifier of a category includes the category's hierarchy path information.

Preferably, the identifier of a category below the highest level is composed of the identifier of its parent category and the category's subcategory code, where the subcategory code is an identification code that is unique within the group of subcategories belonging to the same parent.
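One way to realize such identifiers is sketched below. The dot-separated codes are my own choice; the patent only requires that a child's identifier be composed of its parent's identifier plus a sibling-unique subcategory code, so the path can be read back from the identifier alone:

```python
def make_id(parent_id, sub_code):
    """A child category's ID is its parent's ID plus a code unique among siblings."""
    return f"{parent_id}.{sub_code}" if parent_id else sub_code

def ancestors(cat_id):
    """Recover every parent ID from a category ID (the 'hierarchy path information')."""
    parts = cat_id.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts))]

tech = make_id("", "01")            # "01"        (a level-1 category)
internet = make_id(tech, "02")      # "01.02"     (level 2, under "01")
giants = make_id(internet, "01")    # "01.02.01"  (level 3, under "01.02")
print(ancestors(giants))            # ['01', '01.02']
```

Tracing the classification result back to parent categories, as described below in the embodiments, then reduces to calling `ancestors` on each returned identifier.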
The support vector machine model is generated from the training set specifically by:

building a category-word matrix from the training set;

generating the feature vector of each category from the category-word matrix, and building the support vector machine model from the feature vectors of the categories.

Determining the category of the document to be classified according to its feature vector and the support vector machine model specifically comprises:

calculating the distances between the feature vector of the document and the hyperplanes corresponding to the categories in the support vector machine model;

determining the category of the document from the calculated distances.
According to another aspect of the invention, a support vector machine model generation method is also provided, comprising:

performing category flattening on a training set: for each training sample in the training set, sorting the categories pre-assigned to that sample by hierarchy level; then, for each of those categories, starting from the highest level, judging whether any subcategory of that category is also among the sample's categories, and if so, removing that category from the sample's categories;

generating the support vector machine model from the training set that has undergone category flattening.
Preferably, each category has been assigned a unique identifier, and the identifier of a category includes the category's hierarchy path information.

Preferably, the identifier of a category below the highest level is composed of the identifier of its parent category and the category's subcategory code, where the subcategory code is an identification code that is unique within the group of subcategories belonging to the same parent.
According to another aspect of the invention, a support vector machine model generating apparatus is also provided, comprising:

a training set flattening module, configured to perform category flattening on a training set: for each training sample in the training set, sorting the categories pre-assigned to that sample by hierarchy level; for each of those categories, starting from the highest level, judging whether any subcategory of that category is also among the sample's categories, and if so, removing that category from the sample's categories; and outputting the flattened training set;

a support vector machine model generation module, configured to receive the training set output by the training set flattening module and to generate the support vector machine model from the received training set.
Preferably, each category has been assigned a unique identifier, and the identifier of a category includes the category's hierarchy path information.

The identifier of a category below the highest level is composed of the identifier of its parent category and the category's subcategory code, where the subcategory code is an identification code that is unique within the group of subcategories belonging to the same parent.
In the embodiments of the invention, because the training set is first flattened according to the hierarchical relationships between categories, the flattened training set takes those relationships into account; the resulting support vector machine model is therefore suitable for classifying documents with multi-level categories, and the classification results achieve good accuracy.
Further, the identifier of a category includes the category's hierarchy path information, so that from the category identifiers in a document's classification result the parent categories can be traced back, yielding more detailed category attribute information for the document.
Brief description of the drawings
Fig. 1 is a flow chart of the prior-art method for obtaining the category-word matrix from the training set;
Fig. 2 is a flow chart of the prior-art method for obtaining the support vector machine model from the category-word matrix;
Fig. 3 is a flow chart of the prior-art method for automatically classifying a document according to the support vector machine model;
Fig. 4 is a flow chart of the category flattening of the training set according to an embodiment of the invention;
Fig. 5 is a flow chart of generating the support vector machine model according to an embodiment of the invention;
Fig. 6 is a block diagram of the internal structure of the support vector machine model generating apparatus according to an embodiment of the invention.
Detailed description of the invention
To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in more detail below with reference to the accompanying drawings and preferred embodiments. Note, however, that many of the details listed in the description are provided only to give the reader a thorough understanding of one or more aspects of the invention; these aspects of the invention can be realized even without these specific details.
Terms such as "module" and "system" used in this application are intended to include computer-related entities, for example, but not limited to, hardware, firmware, combinations thereof, software, or software in execution. For example, a module may be, but is not limited to: a process running on a processor, a processor, an object, an executable program, a thread of execution, a program, and/or a computer. For instance, both an application running on a computing device and the computing device itself may be modules. One or more modules may reside within a process and/or thread of execution, and a module may be localized on one computer and/or distributed between two or more computers.
The present inventor has analyzed the automatic document classification method of the prior art and found that, when the prior art is used to classify documents with multi-level categories, the hierarchical relationships (in other words, the membership relations) between categories are not considered, which can cause confusion in document classification. Consider, for example, the multi-level (tree-structured) category system shown in Table 1:

Table 1
Level 1: Science and technology
  Level 2: Internet
    Level 3: Internet forms
      Level 4: Social networks; Community; Venture capital; Microblog; China concept stocks
    Level 3: Internet giants
      Level 4: Baidu; Tencent; Facebook; Alibaba; Google; Twitter
    Level 3: Internet celebrities
      Level 4: Ma Yun; Lei Jun; Zuckerberg; Zhou Hongyi; Li Yanhong; Li Kaifu; Ma Huateng; Liu Qiangdong
    Level 3: Mobile Internet
    Level 3: E-commerce
  Level 2: Industry
    Level 3: Personages
      Level 4: Ballmer; Tim Cook; Liu Chuanzhi; Yang Yuanqing
    Level 3: Companies
      Level 4: Lenovo; Microsoft; Apple; Intel; Foxconn; Samsung
    Level 3: Key concepts
      Level 4: Cloud storage; Big data; Windows
The categories are divided into four levels, from high to low: level 1, level 2, level 3, and level 4. Under the level-1 category "Science and technology" there are two level-2 categories, "Internet" and "Industry"; the "Internet" and "Industry" categories belong to the level-1 category "Science and technology", i.e. they have a hierarchical membership relation with it. "Internet" and "Industry" are subcategories of the "Science and technology" category, and "Science and technology" is the parent category of "Internet" and "Industry".

Under the level-2 category "Internet" there are several level-3 categories, such as "Internet forms", "Internet giants", and "Internet celebrities". These level-3 categories belong to the level-2 category "Internet", i.e. they have a hierarchical membership relation with it; they are subcategories of "Internet", and "Internet" is their parent category.
Suppose an SVM model has been generated with the prior art from the feature vectors of the categories in Table 1, and suppose a document to be classified, after its feature vector is compared for distance with the hyperplanes of the categories in the model (i.e., for similarity with the category feature vectors), yields the following categories in decreasing order of similarity: Science and technology, Internet, Internet giants, Internet celebrities, Alibaba, Ma Yun. The top five are selected as the final classification result: Science and technology, Internet, Internet giants, Internet celebrities, Alibaba. The attribute that the document belongs to "Ma Yun" under "Internet celebrities" is thereby ignored; the classification is inaccurate and ineffective, and may cause confusion in the classification of many documents.
The present inventor therefore considered taking the hierarchical relationships between categories into account during the training stage, so that the trained SVM model is applicable to the automatic classification of documents with multi-level categories: before the SVM model is trained from the training set, the training set is first flattened according to the hierarchical relationships between categories; training on the flattened training set yields an SVM model suitable for classifying documents with multi-level categories.
Usually, each document in the training set is manually pre-assigned at least one category. For a document with multi-level categories, the assigned categories may include categories with hierarchical membership relations. For example, the categories of document A in the training set may include: Science and technology, Internet, Internet giants, Internet celebrities, Alibaba, Ma Yun. Here "Science and technology" and "Internet" are categories with a hierarchical membership relation, as are "Internet" and "Internet giants", "Internet" and "Internet celebrities", "Internet giants" and "Alibaba", and "Internet celebrities" and "Ma Yun".
The method flow of category flattening of the training set according to the hierarchical relationships between categories, shown in Fig. 4, comprises the following steps:

S401: For each training sample in the training set, sort the categories pre-assigned to the sample by hierarchy level.

For example, the sorted categories of document A above are: Science and technology, Internet, Internet giants, Internet celebrities, Alibaba, Ma Yun.
S402: For each training sample, for each of its categories in turn, starting from the highest level, judge whether any subcategory of that category is also among the sample's categories; if so, remove that category from the sample's categories.
For example, for document A above, it is judged that the categories of document A include "Internet", a subcategory of "Science and technology" (that is, a category that has a hierarchical membership relation with it), so "Science and technology" is removed from document A's categories. Likewise, "Internet", "Internet giants", and "Internet celebrities" are subsequently removed from document A's categories.

Finally, document A's categories retain only "Alibaba" and "Ma Yun", two subcategories with no hierarchical membership relation between them.
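The flattening of steps S401–S402, applied to document A, can be sketched as follows; the ASCII category names and the parent map are illustrative stand-ins:

```python
def flatten_categories(sample_cats, parent_of):
    """Sort a sample's categories from highest to lowest level, then drop any
    category that has one of its own descendants also present in the list.
    parent_of maps each category to its parent (None for a top-level category)."""
    def level(c):
        n = 0
        while parent_of.get(c) is not None:
            c = parent_of[c]; n += 1
        return n
    def is_descendant(child, anc):
        c = parent_of.get(child)
        while c is not None:
            if c == anc:
                return True
            c = parent_of.get(c)
        return False
    cats = sorted(sample_cats, key=level)        # highest level (root) first
    kept = list(cats)
    for c in cats:                               # S402: start from highest level
        if any(is_descendant(other, c) for other in kept if other != c):
            kept.remove(c)
    return kept

parent = {"science_tech": None, "internet": "science_tech",
          "internet_giant": "internet", "internet_celebrity": "internet",
          "alibaba": "internet_giant", "ma_yun": "internet_celebrity"}
doc_a = ["science_tech", "internet", "internet_giant",
         "internet_celebrity", "alibaba", "ma_yun"]
print(flatten_categories(doc_a, parent))         # ['alibaba', 'ma_yun']
```

As in the worked example above, only the two leaf categories without a membership relation between them survive.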
The method flow of generating the SVM model in the training stage according to an embodiment of the invention, shown in Fig. 5, comprises the following steps:

S501: Flatten the categories of the training set according to the hierarchical relationships between categories.

The concrete flattening method has been introduced in detail in the steps of Fig. 4 above and is not repeated here.

S502: Generate the SVM model from the flattened training set.

In this step, the category-word matrix is built from the flattened training set; the feature vector of each category is generated from the category-word matrix, and the SVM model is built from those feature vectors. The method of generating the SVM model from the flattened training set is the same as in the prior art and has been described in detail in the steps of Figs. 1 and 2 above; it is not repeated here.
After the SVM model is obtained according to the technical solution of the invention, a document to be classified is classified with it: the document is segmented into words to obtain its word set; the frequency with which each of the p words in the effective word set occurs in the document is counted; from these frequencies and the inverse document frequency of each word, the feature vector $\vec{z} = (z_1, z_2, \dots, z_p)$ of the document is obtained; the distances between this feature vector and the hyperplanes corresponding to the categories in the SVM model are calculated; and the category of the document is determined from the calculated distances. The detailed process is the same as the prior-art document classification method described in the steps of Fig. 3 above and is not repeated here.
In fact, if the categories of the training samples are not flattened and the SVM model is computed directly from those samples, the model will not be suitable for classifying documents with multi-level categories. In the invention, by contrast, when the flattened training set is used to generate the SVM model, no two of a document's categories have a hierarchical membership relation, and the categories retained are the lower-level ones. Therefore, when the category-word matrix is built from the flattened training set, the word frequencies of the lower-level categories increase; when the SVM model is then built, the feature vector space of the lower-level categories is larger. When documents are classified with this model, the hyperplanes of the lower-level categories are favored — in other words, the similarity to the lower-level categories tends to be higher — so the lower-level categories can be preferentially selected. The phenomenon of the prior art when classifying multi-level documents, in which some lower-level categories are ignored, the classification is poor, and the document categories become confused, therefore does not occur.
For example, if the support vector machine model of the present invention is used to classify the above document A: after the training set has undergone category flattening, the word frequencies of subcategories such as "Ma Yun" and "Alibaba" in the category-word matrix are increased, while the word frequencies of their parent categories and of the categories further up, such as "science and technology", "internet" and "internet giant", are reduced. When document A is classified with the support vector machine model obtained in this way, the feature vector of document A lies nearer to the hyperplanes of the subcategories "Ma Yun" and "Alibaba", i.e. it tends toward the feature vectors of those subcategories, and the similarity between the feature vector of document A and the feature vectors of "Ma Yun" and "Alibaba" is higher than its similarity to the feature vectors of "science and technology", "internet" and "internet giant". Therefore, after document A is classified with the support vector machine model, the order of similarity obtained is: "Ma Yun", "Alibaba", "internet giant", "internet celebrity", "internet", "science and technology", and the top five categories — "Ma Yun", "Alibaba", "internet giant", "internet celebrity" and "internet" — are selected as the final classification result of the document. Clearly, this classification result is more accurate, and the effect better, than that of the prior-art classification method.
In practical applications, each category has been assigned a unique identifier. More preferably, in the solution of the present invention the identifier of each category contains the category's hierarchy-path information. Thus, after a document to be classified has been classified by the support vector machine model of the present invention and its classification result obtained, the parent category of each category in the result can be traced back from the category's identifier, yielding more detailed category-attribute information for the document.
Specifically, a category identifier containing hierarchy-path information can be represented in numeric or alphabetic form, where the identifier of a category below the highest level is composed of the identifier of its parent category followed by the category's subcategory code. For a group of subcategories belonging to the same parent, each subcategory is assigned a code unique within that group; that is, the subcategory code is, for a group of subcategories belonging to the same parent, the unique code assigned to each subcategory within the group.
For example, take the categories with the hierarchical membership shown in Table 1 above. The highest-level category is the first-level category "science and technology", whose identifier can be "01".
For the second-level categories below it, the identifiers of "internet" and "industry" can be "0101" and "0102" respectively. It can be seen that the first two digits of the identifiers of "internet" and "industry" equal the identifier "01" of their parent category "science and technology", while the following two digits, "01" and "02", are the in-group codes of "internet" and "industry" respectively.
For the third-level categories — "internet form", "internet giant", "internet celebrity", "mobile internet" and "e-commerce" — the identifiers can be "010101", "010102", "010103", "010104" and "010105" respectively. The first four digits of these identifiers equal the identifier "0101" of their parent category "internet", while the following two digits, "01", "02", "03", "04" and "05", are the respective in-group codes of these five categories.
Thus, after the classification result of a document has been obtained, the parent category of each category in the result can easily be determined, and in turn the parent category of that parent category, and so on.
Obviously, and similarly, a category identifier containing hierarchy-path information can also be represented in alphabetic form; the method and principle are the same as in the numeric case and are not repeated here.
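Under this fixed-width numeric scheme (two digits per level, with the parent identifier as a prefix), building a subcategory identifier and tracing a category back to its ancestors reduce to simple string operations; a sketch, using the example identifiers above:

```python
def child_id(parent_id, code):
    """Identifier of a subcategory: parent identifier followed by the
    subcategory's in-group code."""
    return parent_id + code

def parents(cat_id, width=2):
    """All ancestor identifiers of a fixed-width path identifier,
    e.g. '010102' -> ['0101', '01']."""
    return [cat_id[:i] for i in range(len(cat_id) - width, 0, -width)]

giant = child_id("0101", "02")  # 'internet giant' under 'internet'
lineage = parents("010102")     # back to 'internet', then 'science and technology'
```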
The support vector machine model generating apparatus provided by the embodiment of the present invention, whose internal structure block diagram is shown in Fig. 6, comprises: a training-set flattening processing module 601 and a support vector machine model generation module 602.
The training-set flattening processing module 601 is configured to perform category flattening on the training set: for each training sample in the training set, the categories preset for the sample are sorted by hierarchy level; for each category of the sample, starting from the higher-level categories, it is judged whether a subcategory of that category exists among the sample's categories; if so, the category is removed from the sample's categories. The module then outputs the flattened training set. The identifier of each category contains the category's hierarchy-path information and can be represented in numeric or alphabetic form, where the identifier of a category below the highest level is composed of the identifier of its parent category and the category's subcategory code; the subcategory code is, for a group of subcategories belonging to the same parent, the unique code assigned to each subcategory within the group.
The support vector machine model generation module 602 is configured to receive the training set output by the training-set flattening processing module 601 and to generate the support vector machine model from the received training set. Module 602 can generate the support vector machine model from the training set by the same method as the prior art, which is not repeated here.
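Since the patent leaves the concrete SVM training method to the prior art, a sketch of module 602 can stand on any standard implementation; here scikit-learn is assumed purely for illustration, with a one-vs-rest linear SVM fitted over TF-IDF features on a tiny invented flattened training set:

```python
# Sketch of module 602 with scikit-learn standing in for the unspecified
# prior-art SVM implementation: a one-vs-rest linear SVM over TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny invented flattened training set: each sample keeps only its
# lowest-level category after category flattening.
docs = [
    "ma yun founded alibaba",
    "alibaba e-commerce platform",
    "mobile internet traffic grows",
    "smartphone mobile internet apps",
]
labels = ["Alibaba", "Alibaba", "mobile internet", "mobile internet"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)
pred = model.predict(["alibaba ma yun news"])[0]
```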
In the embodiment of the present invention, because the training set is first flattened according to the hierarchical relationships between categories, the flattened training set takes those relationships into account; the resulting support vector machine model is therefore suitable for classifying documents with multi-level categories, and the classification results have good accuracy.
Further, the identifier of each category contains the category's hierarchy-path information, so that the parent category of each category in a document's classification result can be traced back from the category identifiers, yielding more detailed category-attribute information for the document.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above-described embodiment methods can be completed by hardware instructed by a program, and the program can be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk or an optical disc.
The above are only preferred embodiments of the present invention. It should be pointed out that those of ordinary skill in the art can also make improvements and modifications without departing from the principles of the invention, and these improvements and modifications should also be regarded as falling within the scope of protection of the present invention.

Claims (10)

1. A document classification method, characterized by comprising:
segmenting a document to be classified into words, and then determining the feature vector of the document;
determining the category of the document to be classified according to its feature vector and a support vector machine model generated from a training set that has undergone category flattening, wherein
the category flattening of the training set comprises: for each training sample in the training set, sorting the categories preset for the training sample by hierarchy level; for each category of the training sample, starting from the higher-level categories, judging whether a subcategory of that category exists among the training sample's categories; and, if so, removing that category from the training sample's categories.
2. The method of claim 1, characterized in that each category has been assigned a unique identifier, and the identifier of a category contains the category's hierarchy-path information.
3. The method of claim 2, characterized in that the identifier of a category below the highest level is composed of the identifier of its parent category and the category's subcategory code, wherein the subcategory code is, for a group of subcategories belonging to the same parent, the unique code assigned to each subcategory within the group.
4. The method of any one of claims 1-3, characterized in that generating the support vector machine model from the training set specifically comprises:
building a category-word matrix from the training set; and
generating the feature vector of each category from the category-word matrix, and building the support vector machine model from the feature vectors of the categories; and
determining the category of the document to be classified according to its feature vector and the support vector machine model specifically comprises:
calculating the distances between the feature vector of the document and the hyperplanes respectively corresponding to the categories in the support vector machine model; and
determining the category of the document according to the calculated distances.
5. A support vector machine model generation method, characterized by comprising:
performing category flattening on a training set: for each training sample in the training set, sorting the categories preset for the training sample by hierarchy level; for each category of the training sample, starting from the higher-level categories, judging whether a subcategory of that category exists among the training sample's categories; and, if so, removing that category from the training sample's categories; and
generating the support vector machine model from the training set that has undergone category flattening.
6. The method of claim 5, characterized in that each category has been assigned a unique identifier, and the identifier of a category contains the category's hierarchy-path information.
7. The method of claim 6, characterized in that the identifier of a category below the highest level is composed of the identifier of its parent category and the category's subcategory code, wherein the subcategory code is, for a group of subcategories belonging to the same parent, the unique code assigned to each subcategory within the group.
8. A support vector machine model generating apparatus, characterized by comprising:
a training-set flattening processing module, configured to perform category flattening on a training set: for each training sample in the training set, sorting the categories preset for the training sample by hierarchy level; for each category of the training sample, starting from the higher-level categories, judging whether a subcategory of that category exists among the training sample's categories; if so, removing that category from the training sample's categories; and outputting the training set that has undergone category flattening; and
a support vector machine model generation module, configured to receive the training set output by the training-set flattening processing module and to generate the support vector machine model from the received training set.
9. The apparatus of claim 8, characterized in that each category has been assigned a unique identifier, and the identifier of a category contains the category's hierarchy-path information.
10. The apparatus of claim 9, characterized in that the identifier of a category below the highest level is composed of the identifier of its parent category and the category's subcategory code, wherein the subcategory code is, for a group of subcategories belonging to the same parent, the unique code assigned to each subcategory within the group.
CN201310033125.XA 2013-01-28 2013-01-28 The method and apparatus that document classification, supporting vector machine model generate Active CN103106262B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310033125.XA CN103106262B (en) 2013-01-28 2013-01-28 The method and apparatus that document classification, supporting vector machine model generate


Publications (2)

Publication Number Publication Date
CN103106262A CN103106262A (en) 2013-05-15
CN103106262B true CN103106262B (en) 2016-05-11

Family

ID=48314117


Country Status (1)

Country Link
CN (1) CN103106262B (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6446061B1 (en) * 1998-07-31 2002-09-03 International Business Machines Corporation Taxonomy generation for document collections
CN1725213A (en) * 2004-07-22 2006-01-25 国际商业机器公司 Method and system for structuring, maintaining personal sort tree, sort display file
CN102243645A (en) * 2010-05-11 2011-11-16 微软公司 Hierarchical content classification into deep taxonomies


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Hu Xuegang et al.; A Chinese text classification method based on the word vector space model; Journal of Hefei University of Technology (Natural Science Edition); 31 Oct. 2007; Vol. 30, No. 10; pp. 1262-1263 *
Ma Le et al.; A hierarchical web page classification algorithm based on SVM; Journal of Beijing Normal University (Natural Science); 2009; Vol. 45, No. 3 *

Also Published As

Publication number Publication date
CN103106262A (en) 2013-05-15


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230417

Address after: Room 501-502, 5/F, Sina Headquarters Scientific Research Building, Block N-1 and N-2, Zhongguancun Software Park, Dongbei Wangxi Road, Haidian District, Beijing, 100193

Patentee after: Sina Technology (China) Co.,Ltd.

Address before: 100080, International Building, No. 58 West Fourth Ring Road, Haidian District, Beijing, 20 floor

Patentee before: Sina.com Technology (China) Co.,Ltd.

TR01 Transfer of patent right