CN103106262A - Method and device of file classification and generation of support vector machine model

Info

Publication number: CN103106262A
Authority: CN (China)
Prior art keywords: classification, document, training set, machine model, subclass
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN201310033125.XA
Other languages: Chinese (zh)
Other versions: CN103106262B (en)
Inventor: 戴明洋
Current assignee: Sina Technology China Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Sina Technology China Co Ltd
Events: application filed by Sina Technology China Co Ltd; priority to CN201310033125.XA (CN103106262B/en); publication of CN103106262A (CN103106262A/en); application granted; publication of CN103106262B (CN103106262B/en)

Abstract

The invention discloses a method and device for document classification and for generating a support vector machine (SVM) model. The method determines the category of a document to be classified from the document's feature vector and an SVM model generated from a training set that has undergone category flattening. Category flattening of the training set proceeds as follows: for each training sample, the preset categories are sorted by hierarchy level from top to bottom; then, starting from the higher-level categories, each category is checked for whether one of its subcategories also appears among the sample's categories; if so, that category is removed from the sample's categories. Because flattening is performed according to the hierarchical relationships between categories, the resulting SVM model is suitable for classifying documents with multi-level categories and yields better classification accuracy.

Description

Method and apparatus for document classification and support vector machine model generation
Technical field
The present invention relates to computer processing technology, and in particular to methods and apparatus for document classification and for generating support vector machine models.
Background technology
In recent years, the rapid development of the Internet has led to explosive growth of document resources on the Web. These documents carry large volumes of data with complex and varied content. Compared with structured information in databases, unstructured or semi-structured web documents are richer but also more heterogeneous. To make full and effective use of these resources, so that users can quickly and effectively find the information they need and extract the potentially valuable information within it, the documents must be classified.
At present, automatic document classification usually relies on a support vector machine (SVM) model; such a method comprises a training stage and a classification stage. Several SVM-based automatic document classification methods exist in the prior art; one of them is described in detail below.
In the training stage, the support vector machine model is obtained as follows: from the documents in the training set whose categories have already been assigned, category feature vectors are computed; from the set of category feature vectors, the SVM model and an effective word set (also called a dictionary) are obtained. For ease of description, a sample in the training set is called a training sample herein.
One concrete method of obtaining the category feature vectors from the pre-categorized training samples, shown in Figure 1, comprises the following steps:
S101: Segment each training sample in the training set into words to obtain its word set, and delete the stop words therein.
The training set collects documents of various kinds whose categories have already been assigned; usually, a manually categorized corpus is used. To guarantee the stability and convergence of the SVM model obtained in the training stage, the number of documents in the training set usually exceeds a certain threshold.
A document (training sample) consists of a continuous sequence of characters, and words are the basic units of a document. Word segmentation is the process of dividing the continuous character sequence of a document into individual words; the words so obtained constitute the word set of the document.
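Step S101 can be sketched as follows. Whitespace splitting stands in for a real Chinese word segmenter (which the patent assumes but does not specify), and the sample sentence and stop-word list are illustrative only:

```python
def tokenize(document, stop_words):
    """Step S101: split a document into words and drop stop words.

    Whitespace splitting is a stand-in for a real Chinese word
    segmenter, which the patent assumes but does not specify.
    """
    words = document.lower().split()
    return [w for w in words if w not in stop_words]

print(tokenize("The internet giant launched a new social network",
               {"the", "a"}))
# ['internet', 'giant', 'launched', 'new', 'social', 'network']
```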
S102: For each category, count the frequency with which each word occurs in the word sets of that category's training samples.
For example, suppose the training samples in the training set fall into q categories, denoted c_1, c_2, ..., c_q, where q is a natural number greater than 2. Suppose the word sets of all training samples together contain n distinct words, denoted t_1, t_2, ..., t_n, where n is a natural number greater than 2. For the i-th category, the frequency (number of occurrences) of the j-th word in the word sets of that category's training samples is counted and denoted m_ij.
S103: Build the category-word matrix.

From the per-category word frequencies obtained by counting, the word-frequency vector of each category is formed; for example, the word-frequency vector of the i-th category is c_i = (m_i1, m_i2, ..., m_in). The q x n category-word matrix is then built as C_{q x n} = [c_1, c_2, ..., c_q]^T, that is:

    C_{q x n} = | m_11  m_12  ...  m_1n |
                | m_21  m_22  ...  m_2n |
                |  ...   ...  ...   ... |
                | m_q1  m_q2  ...  m_qn |
One concrete method of obtaining the support vector machine model from the category-word matrix, shown in Figure 2, comprises the following steps:
S201: From the category-word matrix, calculate the inverse document frequency of each word.
Specifically, the inverse category frequency ICF_k of the k-th of the n words is computed as in formula 1:

    ICF_k = log(q / CF_k + 0.01)    (formula 1)

where CF_k is the category frequency of the k-th word, i.e. the number of categories in whose word sets the k-th word occurs. The inverse category frequency ICF_k serves here as the inverse document frequency (IDF) IDF_k of the k-th word. The larger the value of ICF_k (IDF_k), the stronger the category-discriminating power of the k-th word.
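Formula 1 can be computed directly from the category-word matrix. The sketch below assumes CF_k counts the categories whose word sets contain word k, a reading of "category frequency" that the patent does not spell out:

```python
import math

def inverse_category_frequency(matrix):
    """Formula 1: ICF_k = log(q / CF_k + 0.01), with q the number
    of categories and CF_k (assumed here) the number of categories
    whose word sets contain word k."""
    q = len(matrix)
    n = len(matrix[0])
    icf = []
    for k in range(n):
        cf_k = sum(1 for i in range(q) if matrix[i][k] > 0)
        icf.append(math.log(q / cf_k + 0.01))
    return icf

icf = inverse_category_frequency([[2, 1, 0], [0, 3, 1]])
# words 0 and 2 occur in one category each, word 1 in both,
# so icf[0] == icf[2] > icf[1]
```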
S202: Sort the words by inverse document frequency and obtain the effective word set (also called the dictionary) from the ranking.
The n words are sorted by inverse document frequency (i.e. inverse category frequency), and according to a preset effective-word parameter, the top-ranked words are extracted to form the effective word set. Specifically, if the preset parameter is an effective word count g, the g top-ranked words together with their inverse document frequencies form the effective word set; if the preset parameter is an effective word percentage h, the top n x h words together with their inverse document frequencies form the effective word set.
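With a count parameter g, step S202 is a simple top-g cut over the ICF ranking; the vocabulary and scores below are made up for illustration:

```python
def effective_word_set(vocabulary, icf, g):
    """Step S202: keep the g words with the highest inverse
    document frequency, together with those frequencies."""
    ranked = sorted(zip(vocabulary, icf), key=lambda p: p[1], reverse=True)
    return ranked[:g]

print(effective_word_set(["apple", "fruit", "car"], [0.5, 2.0, 1.0], 2))
# [('fruit', 2.0), ('car', 1.0)]
```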
S203: Rebuild the category-word matrix according to the effective word set.

The elements of the original category-word matrix whose words are not contained in the effective word set are removed, forming a new category-word matrix. If the effective word set contains p words, then for the i-th of the q categories the rebuilt word-frequency vector is

    c_i' = (m_i1, m_i2, ..., m_ir, ..., m_ip)

where m_ir is the frequency of the r-th of the p words in the i-th category. From the rebuilt word-frequency vectors of the categories, the rebuilt category-word matrix is C'_{q x p} = [c_1', c_2', ..., c_q']^T.
S204: From the rebuilt category-word matrix, calculate the term frequency (TF) of each word to obtain the term-frequency vector of each category.
The term frequency tf_ij of the j-th word in the word sets of the i-th category's training samples is calculated as in formula 2:

    tf_ij = m_ij / max(m_i1, m_i2, ..., m_ir, ..., m_ip)    (formula 2)

This yields the term-frequency vector of the i-th category:

    tf_i = (tf_i1, tf_i2, ..., tf_ir, ..., tf_ip)

where tf_ir is the term frequency of the r-th of the p words in the i-th category.
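Formula 2 normalizes each category's counts by the row maximum; a one-function sketch over one row of the rebuilt matrix:

```python
def term_frequencies(row):
    """Formula 2: tf_ij = m_ij / max(m_i1, ..., m_ip), applied to
    one category's row of the rebuilt category-word matrix."""
    peak = max(row)
    return [m / peak for m in row]

print(term_frequencies([4, 2, 1]))
# [1.0, 0.5, 0.25]
```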
S205: Build the support vector machine model from the TF of each word and the IDF of each word.
Specifically, for each category, the category's feature vector is calculated from its rebuilt term-frequency vector and the IDF of each of the p words. The feature vector of the i-th category is

    v_i = (tfidf_i1, tfidf_i2, ..., tfidf_ir, ..., tfidf_ip)

where tfidf_ir is the product of the term frequency tf_ir of the r-th of the p words in the i-th category and the inverse document frequency IDF_r of that word.
From the feature vectors of the categories, the support vector machine model can be built: according to the category feature vectors, the hyperplanes corresponding to the categories in the support vector model are determined. Specifically, for every pair of categories, the optimal separating hyperplane is computed under the principle of margin maximization, and the support vectors found in this way serve as the key parameters of the final support vector model.
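Combining formula 2 with the per-word IDF gives the category feature vectors of step S205. The sketch stops at the tf-idf vectors; the margin-maximizing hyperplane computation itself would be handled by an off-the-shelf SVM trainer and is not reproduced here:

```python
def tfidf_vectors(matrix, idf):
    """Step S205: the feature vector of category i has entries
    tfidf_ir = tf_ir * IDF_r, with tf_ir from formula 2."""
    vectors = []
    for row in matrix:
        peak = max(row)
        vectors.append([(m / peak) * w for m, w in zip(row, idf)])
    return vectors

print(tfidf_vectors([[4, 2]], [1.0, 2.0]))
# [[1.0, 1.0]]
```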
After the support vector machine model has been obtained, documents can be classified automatically according to the model; this is the classification stage. The classification-stage flow, shown in Figure 3, comprises the following steps:
S301: Segment the document to be classified into words to obtain its word set.
S302: Calculate the feature vector of the document to be classified.
Specifically, the feature vector of the document to be classified is z = (z_1, z_2, ..., z_r, ..., z_p), where z_r is the product of the frequency with which the r-th of the p words in the effective word set occurs in the document and the inverse document frequency of that word.
S303: Determine the category of the document to be classified from its feature vector and the support vector machine model.
Specifically, the distances between the feature vector of the document and the hyperplanes corresponding to the categories in the SVM model are calculated, and the document's categories are determined from these distances. The distance to a hyperplane serves as the confidence that the document belongs to the corresponding category: the nearer the document's feature vector is to a category's hyperplane, the higher the confidence that the document belongs to that category. The top K categories are taken as the categories of the document, where K is a preset value; for example, with K set to 5, the first 5 categories are taken as the document's categories. In fact, the distance between the document's feature vector and a hyperplane reflects the similarity between the document's feature vector and the feature vector of the category corresponding to that hyperplane; the nearer the distance, the higher the similarity, and the higher the confidence that the document belongs to that category.
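Since the text equates hyperplane distance with similarity between feature vectors, step S303 can be approximated by a similarity ranking. The sketch below uses cosine similarity against the category feature vectors in place of the hyperplane-distance computation, with invented category names:

```python
import math

def top_k_categories(doc_vec, category_vectors, k):
    """Step S303 approximated by similarity ranking: score each
    category feature vector against the document's feature vector
    by cosine similarity and return the names of the K best."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = lambda v: math.sqrt(sum(x * x for x in v))
        return dot / (norm(a) * norm(b))
    ranked = sorted(category_vectors.items(),
                    key=lambda item: cos(doc_vec, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

cats = {"internet": [1.0, 0.2], "industry": [0.1, 1.0]}
print(top_k_categories([0.9, 0.1], cats, 1))
# ['internet']
```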
The present inventors have found that the prior-art automatic classification methods can classify documents with a single category level. However, they are not suitable for classifying documents with multi-level categories: the classification results are inaccurate and unsatisfactory. For this reason, documents with multi-level categories, such as news documents, are at present still classified manually, which makes the staff's workload heavy and the efficiency low.
Summary of the invention
Embodiments of the invention provide a document classification method and device based on multi-level categories, applicable to automatically classifying documents with multi-level categories.
According to one aspect of the invention, a document classification method is provided, comprising:

after segmenting a document to be classified into words, determining the feature vector of the document;

determining the category of the document from its feature vector and a support vector machine model generated from a training set processed by category flattening, wherein

the category flattening of said training set comprises: for each training sample in said training set, sorting the categories preset for the sample by hierarchy level; for each of the sample's categories, starting from the higher-level categories, judging whether a subcategory of that category also appears among the sample's categories; and, if so, removing that category from the sample's categories.
Preferably, each category is assigned a unique identifier, and the identifier of a category contains the category's hierarchy path information.

Preferably, the identifier of a category below the top level is composed of the identifier of its parent category and the category's subcategory code, where the subcategory code is a code that is unique to each subcategory within the group of subcategories belonging to the same parent.
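One way to realize such path-encoding identifiers is to concatenate a per-sibling code onto the parent's identifier. The two-digit codes and the recursive assignment below are an illustrative choice, not mandated by the patent:

```python
def assign_ids(tree, parent_id=""):
    """Give each category an identifier made of its parent's
    identifier plus a two-digit sibling code, so the identifier
    encodes the full hierarchy path (the code width is an
    illustrative choice)."""
    ids = {}
    for n, (name, children) in enumerate(sorted(tree.items()), start=1):
        ids[name] = parent_id + "%02d" % n
        ids.update(assign_ids(children, ids[name]))
    return ids

taxonomy = {"science and technology": {"internet": {"internet giant": {}},
                                       "industry": {}}}
ids = assign_ids(taxonomy)
# every identifier starts with its parent's identifier, so the
# ancestors of a category can be read off its own identifier
```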
Wherein generating said support vector machine model from the training set specifically comprises:

building a category-word matrix from said training set; and

generating the feature vector of each category from said category-word matrix, and building said support vector machine model from the category feature vectors; and

said determining the category of the document from its feature vector and the support vector machine model specifically comprises:

calculating the distances between the feature vector of the document and the hyperplanes respectively corresponding to the categories in said support vector machine model; and

determining the document's categories from the calculated distances.
According to another aspect of the invention, a support vector machine model generation method is also provided, comprising:

performing category flattening on a training set: for each training sample in said training set, sorting the categories preset for the sample by hierarchy level; for each of the sample's categories, starting from the higher-level categories, judging whether a subcategory of that category also appears among the sample's categories; and, if so, removing that category from the sample's categories; and

generating said support vector machine model from the training set processed by category flattening.
Preferably, each category is assigned a unique identifier, and the identifier of a category contains the category's hierarchy path information.

Preferably, the identifier of a category below the top level is composed of the identifier of its parent category and the category's subcategory code, where the subcategory code is a code that is unique to each subcategory within the group of subcategories belonging to the same parent.
According to another aspect of the invention, a support vector machine model generation device is also provided, comprising:

a training-set flattening module, configured to perform category flattening on a training set, namely: for each training sample in said training set, sorting the categories preset for the sample by hierarchy level; for each of the sample's categories, starting from the higher-level categories, judging whether a subcategory of that category also appears among the sample's categories; if so, removing that category from the sample's categories; and outputting the flattened training set; and

a support vector machine model generation module, configured to receive the training set output by said flattening module and generate said support vector machine model from the received training set.

Preferably, each category is assigned a unique identifier, and the identifier of a category contains the category's hierarchy path information.

The identifier of a category below the top level is composed of the identifier of its parent category and the category's subcategory code, where the subcategory code is a code that is unique to each subcategory within the group of subcategories belonging to the same parent.
Because the embodiments of the invention first flatten the categories of the training set according to the hierarchical relationships between categories, the flattened training set takes those relationships into account; the resulting support vector machine model is therefore suitable for classifying documents with multi-level categories and yields better classification accuracy.
Further, because a category's identifier contains its hierarchy path information, the category's parent categories can be traced back from the identifiers in a document's classification result, yielding more detailed category attribute information for the document.
Description of drawings
Fig. 1 is a flowchart of the prior-art method of obtaining the category-word matrix from the training set;
Fig. 2 is a flowchart of the prior-art method of obtaining the support vector machine model from the category-word matrix;
Fig. 3 is a flowchart of the prior-art method of automatically classifying documents with the support vector machine model;
Fig. 4 is a flowchart of the category flattening of the training set according to an embodiment of the invention;
Fig. 5 is a flowchart of generating the support vector machine model according to an embodiment of the invention;
Fig. 6 is a block diagram of the internal structure of the support vector machine model generation device according to an embodiment of the invention.
Embodiment
To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in more detail below with reference to the accompanying drawings and preferred embodiments. It should be noted that the many details listed in the specification are only intended to give the reader a thorough understanding of one or more aspects of the invention; these aspects of the invention can be realized even without these specific details.
Terms such as "module" and "system" used in this application are intended to cover computer-related entities, such as, but not limited to, hardware, firmware, combinations thereof, software, or software in execution. For example, a module may be, but is not limited to: a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. For instance, both an application running on a computing device and the computing device itself may be modules. One or more modules may reside within one process and/or thread of execution, and a module may be localized on one computer and/or distributed between two or more computers.
The present inventors analyzed the prior-art automatic document classification methods and found that, when a prior-art method classifies documents with multi-level categories, the hierarchical relationships between categories (in other words, the parent-child membership relations between categories) are not considered, which causes the documents' classifications to become chaotic. Consider, for example, the multi-level categories (which may also be called tree-structured categories) shown in Table 1 below:
Table 1

Level one: Science and technology
    Level two: Internet
        Level three: Internet form
            Level four: Social networks; Community; Venture capital investment; Microblogging; China concept stocks
        Level three: Internet giant
            Level four: Baidu; Tencent; Facebook; Alibaba; Google; Twitter
        Level three: Internet famous person
            Level four: Ma Yun; Lei Jun; Zuckerberg; Zhou Hongyi; Li Yanhong; Li Kaifu; Ma Huateng; Liu Qiangdong
        Level three: Mobile Internet
        Level three: E-commerce
    Level two: Industry
        Level three: Personage
            Level four: Ballmer; Tim Cook; Liu Chuanzhi; Yang Yuanqing
        Level three: Company
            Level four: Lenovo; Microsoft; Apple; Intel; Foxconn; Samsung
        Level three: Key concepts
            Level four: Cloud storage; Big data; Windows
Here the categories are divided into four levels, from high to low: level one, level two, level three, and level four. Level-one category "science and technology" contains two level-two categories, "internet" and "industry"; that is, "internet" and "industry" belong to "science and technology" and have a hierarchical membership relation with it: "internet" and "industry" are subcategories of "science and technology", and "science and technology" is the parent category of "internet" and "industry".

Level-two category "internet" contains several level-three categories such as "internet form", "internet giant", and "internet famous person"; these level-three categories belong to "internet" and have a hierarchical membership relation with it: they are subcategories of "internet", and "internet" is their parent category.
Suppose a support vector machine model is generated with the prior-art method from the feature vectors of the categories in Table 1, and a document to be classified is judged by its distances to the hyperplanes of the categories in the model — that is, its feature vector is compared for similarity with the feature vectors of the categories. Suppose the resulting categories, ordered by similarity from high to low, are: science and technology, internet, internet giant, internet famous person, Alibaba, Ma Yun. The top five — science and technology, internet, internet giant, internet famous person, Alibaba — are selected as the final classification result. The attribute that the document belongs to "Ma Yun" under "internet famous person" is then simply ignored; the classification is inaccurate and ineffective, and the classifications of many documents may become chaotic.
The present inventors therefore considered taking the hierarchical relationships between categories into account in the training stage, so that the trained support vector machine model becomes suitable for automatically classifying documents with multi-level categories: before the model is trained on the training set, the training set is first flattened according to the hierarchical relationships between categories; the model is then trained on the flattened training set, and the resulting model is suitable for classifying documents with multi-level categories.
Usually, each document in the training set is manually assigned at least one category in advance. For a document with multi-level categories, its categories may include categories with hierarchical membership relations. For example, the categories of a document A in the training set may include: science and technology, internet, internet giant, internet famous person, Alibaba, Ma Yun. Here "science and technology" and "internet" have a hierarchical membership relation, as do "internet" and "internet giant", "internet" and "internet famous person", "internet giant" and "Alibaba", and "internet famous person" and "Ma Yun".
The flow of flattening the training set according to the hierarchical relationships between categories, shown in Figure 4, comprises the following steps:
S401: For each training sample in the training set, sort the categories preset for the sample by hierarchy level.

For example, the sorted categories of document A above are: science and technology, internet, internet giant, internet famous person, Alibaba, Ma Yun.
S402: For each training sample, examine each of its categories in turn, starting from the higher levels, and judge whether a subcategory of that category also appears among the sample's categories; if so, remove that category from the sample's categories.
For example, for document A, "science and technology" is examined first: its subcategory "internet" also appears among document A's categories — i.e. there is a category, "internet", that has a hierarchical membership relation with it — so "science and technology" is removed from document A's categories. Likewise, "internet", "internet giant", and "internet famous person" are removed in turn.

Finally, document A's categories retain only "Alibaba" and "Ma Yun", two subcategories with no hierarchical membership relation between them.
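Steps S401–S402 can be sketched as follows. The parent_of mapping is an assumed encoding of the category hierarchy, and the example reproduces the document A case from the text:

```python
def flatten_categories(sample_categories, parent_of):
    """Category flattening (steps S401-S402): drop every category
    that has a descendant among the sample's own categories,
    keeping only the deepest labels.

    parent_of maps each category to its parent (absent at the top
    level) -- an assumed encoding of the hierarchy."""
    def ancestors(c):
        out = set()
        p = parent_of.get(c)
        while p is not None:
            out.add(p)
            p = parent_of.get(p)
        return out

    covered = set()   # categories that have a descendant present
    for c in sample_categories:
        covered |= ancestors(c)
    return [c for c in sample_categories if c not in covered]

parent_of = {"internet": "science and technology",
             "internet giant": "internet",
             "internet famous person": "internet",
             "Alibaba": "internet giant",
             "Ma Yun": "internet famous person"}
doc_a = ["science and technology", "internet", "internet giant",
         "internet famous person", "Alibaba", "Ma Yun"]
print(flatten_categories(doc_a, parent_of))
# ['Alibaba', 'Ma Yun']
```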
Accordingly, the flow of generating the support vector machine model in the training stage, as provided by the embodiment of the invention and shown in Figure 5, comprises the following steps:

S501: Flatten the training set according to the hierarchical relationships between categories.

The concrete flattening method is described in detail in the steps of Figure 4 above and is not repeated here.

S502: Generate the support vector machine model from the flattened training set.

In this step, the category-word matrix is built from the flattened training set; the feature vector of each category is generated from the category-word matrix; and the support vector machine model is built from the category feature vectors. The method of generating the model from the flattened training set is the same as in the prior art and is described in detail in the steps of Figures 1 and 2 above; it is not repeated here.
After the support vector machine model has been obtained according to the technical solution of the invention, a document to be classified is classified with the model: the document is segmented into words to obtain its word set; the frequency with which each of the p words in the effective word set occurs in the document is counted; and the document's feature vector is obtained from these frequencies and the inverse document frequency of each of the p words. The distances between the document's feature vector and the hyperplanes respectively corresponding to the categories in the model are then calculated, and the document's categories are determined from the calculated distances. The detailed process is the same as the prior-art document classification method described in the steps of Figure 3 above and is not repeated here.
In fact, if the categories of the training samples are not flattened and the support vector machine model is computed directly from these samples, the model is unsuitable for classifying documents with multi-level categories. When, as in the invention, the flattened training set is used to generate the model, no two of a document's remaining categories have a hierarchical membership relation, and the categories it retains are the lower-level ones. Consequently, when the category-word matrix is built from the flattened training set, the word frequencies of the lower-level categories increase; and when the model is built, the feature-vector space of the lower-level categories grows larger. When documents are classified with this model, the results lean toward the hyperplanes of the lower-level categories — in other words, the similarity to the lower-level categories is higher — so the lower-level categories can be selected preferentially. The prior-art phenomenon in which some lower-level categories are ignored when classifying multi-level documents, causing poor classification results and chaotic document classifications, thus no longer occurs.
For example, suppose the support vector machine model of the present invention is used to classify the above document A. After the training set has been flattened, the word frequencies of subcategories such as "Ma Yun" and "Alibaba" in the category-word matrix are increased, while the word frequencies of their parent categories and higher-level categories such as "science and technology", "internet" and "internet giant" are reduced. When document A is classified with the resulting support vector machine model, its feature vector will be closer to the hyperplanes of subcategories such as "Ma Yun" and "Alibaba"; that is, the similarity between the feature vector of document A and the feature vectors of "Ma Yun" and "Alibaba" will be higher than its similarity to the feature vectors of "science and technology", "internet" and "internet giant". Therefore, after document A is classified with the support vector machine model, the resulting similarity ranking will be: "Ma Yun", "Alibaba", "internet giant", "internet celebrity", "internet", "science and technology". Selecting the top five categories, namely "Ma Yun", "Alibaba", "internet giant", "internet celebrity" and "internet", as the final classification result of the document is obviously more accurate, and works better, than the classification result of the prior-art method.
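The ranking behaviour in this example can be made concrete with a small sketch. The term weights below are invented for illustration, and cosine similarity to per-category term vectors stands in for the patent's distance-to-hyperplane scoring; the point it demonstrates is that boosted weights on the low-level categories push them to the top of the ranking.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_categories(doc_vec, category_vecs, top_k=5):
    """Rank categories by similarity to the document and keep the top k."""
    scored = sorted(category_vecs,
                    key=lambda c: cosine(doc_vec, category_vecs[c]),
                    reverse=True)
    return scored[:top_k]

# Hypothetical category vectors: flattening has boosted the weights of the
# low-level categories ("Ma Yun", "Alibaba") relative to their parents.
category_vecs = {
    "Ma Yun":        {"ma": 3.0, "yun": 3.0, "alibaba": 1.0},
    "Alibaba":       {"alibaba": 3.0, "ecommerce": 1.0, "ma": 1.0},
    "internet giant": {"alibaba": 1.0, "giant": 1.0},
    "internet":      {"internet": 1.0, "portal": 1.0},
    "science and technology": {"science": 1.0, "technology": 1.0},
}
doc_a = {"ma": 2.0, "yun": 2.0, "alibaba": 2.0, "giant": 1.0}
print(rank_categories(doc_a, category_vecs, top_k=3))
# → ['Ma Yun', 'Alibaba', 'internet giant']
```

The low-level categories win the ranking because their vectors share the most weighted terms with the document, which is exactly the effect the flattened training set is said to produce.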
In practical applications, each category is assigned a unique identifier. More preferably, in the solution of the present invention the identifier of each category contains that category's hierarchical path information. Thus, after a document to be classified has been classified according to the support vector machine model of the present invention and its classification result obtained, the parent categories of each category in the result can be traced back from the identifiers of the categories in the result, yielding more detailed category attribute information for the document.
Specifically, a category identifier containing hierarchical path information can be represented in numeric or alphabetic form, where the identifier of a category below the highest level is composed of the identifier of its parent category followed by that category's subcategory code. For a group of subcategories belonging to the same parent, each subcategory is assigned a code unique within the group; that is to say, the subcategory code is an identification code that is unique to each subcategory within a group of subcategories belonging to the same parent.
For example, consider the categories with the hierarchical relationships shown in Table 1 above. The highest-level category is the first-level category "science and technology", whose identifier may be "01".
The identifiers of the second-level categories "internet" and "industry" below it may be "0101" and "0102" respectively. As can be seen, the first two digits of the identifiers of "internet" and "industry" equal the identifier "01" of their parent category "science and technology", and the following two digits "01" and "02" are the in-group codes of "internet" and "industry" respectively.
The identifiers of the third-level categories "internet portal", "internet giant", "internet celebrity", "mobile Internet" and "e-commerce" may be "010101", "010102", "010103", "010104" and "010105" respectively. The first four digits of these identifiers equal the identifier "0101" of their parent category "internet", and the following two digits "01", "02", "03", "04" and "05" are the respective in-group codes of these five categories.
Thus, after the classification result of a document has been obtained, the parent category of each category in the result can easily be determined, and then the parent of that parent category in turn.
Obviously, a category identifier containing hierarchical path information can similarly be represented in alphabetic form; the method and principle are the same as for the numeric form and are not repeated here.
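Tracing parent categories from such an identifier amounts to repeatedly stripping the trailing subcategory code. A minimal sketch, assuming fixed-width two-digit codes as in the numeric example above:

```python
def ancestor_ids(category_id, code_len=2):
    """Return the identifiers of all parent categories encoded in a
    hierarchical category identifier, nearest parent first.

    Assumes fixed-width subcategory codes (two digits in the example),
    so the parent identifier is simply the identifier with the last
    code stripped off.
    """
    parents = []
    while len(category_id) > code_len:
        category_id = category_id[:-code_len]
        parents.append(category_id)
    return parents

print(ancestor_ids("010102"))  # → ['0101', '01']
```

For instance, the identifier "010102" ("internet giant") yields its parent "0101" ("internet") and then "01" ("science and technology"), without any lookup table.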
The internal structure of the support vector machine model generation apparatus provided by an embodiment of the present invention, shown in Figure 6, comprises: a training set flattening module 601 and a support vector machine model generation module 602.
The training set flattening module 601 is used to flatten the categories of the training set: for each training sample in the training set, sort the categories preset for the sample by category level; for each category of the sample, starting from the higher-level categories, judge whether any of the sample's categories is a subcategory of that category; if so, remove that category from the sample's categories; then output the flattened training set. The identifier of a category contains that category's hierarchical path information and can be represented in numeric or alphabetic form, where the identifier of a category below the highest level is composed of the identifier of its parent category followed by that category's subcategory code; the subcategory code is an identification code that is unique to each subcategory within a group of subcategories belonging to the same parent.
The support vector machine model generation module 602 is used to receive the training set output by the training set flattening module 601 and generate the support vector machine model from the received training set. Module 602 may generate the model from the training set by the same method as in the prior art, which is not repeated here.
Because the embodiments of the present invention first flatten the categories of the training set according to the hierarchical relationships among categories, the flattened training set takes those hierarchical relationships into account, so the resulting support vector machine model is suitable for classifying documents with multi-level categories and yields more accurate classification results.
Further, since the identifier of a category contains that category's hierarchical path information, the parent categories of the categories in a document's classification result can be traced back from their identifiers, yielding more detailed category attribute information for the document.
Those of ordinary skill in the art will appreciate that all or part of the steps of the methods in the above embodiments can be implemented by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention. It should be noted that those skilled in the art can make improvements and modifications without departing from the principle of the invention, and such improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A document classification method, characterized in that it comprises:
after performing word segmentation on a document to be classified, determining the feature vector of the document to be classified;
determining the category of the document to be classified according to its feature vector and a support vector machine model generated from a training set whose categories have been flattened, wherein
the category flattening process for the training set comprises: for each training sample in the training set, sorting the categories preset for the sample by category level; and, for each category of the sample, starting from the higher-level categories, judging whether any of the sample's categories is a subcategory of that category, and if so, removing that category from the sample's categories.
2. The method of claim 1, characterized in that each category is assigned a unique identifier, and the identifier of a category contains that category's hierarchical path information.
3. The method of claim 2, characterized in that the identifier of a category below the highest level is composed of the identifier of its parent category followed by that category's subcategory code, wherein the subcategory code is an identification code that is unique to each subcategory within a group of subcategories belonging to the same parent.
4. The method of any one of claims 1 to 3, characterized in that generating the support vector machine model from the training set specifically comprises:
building a category-word matrix from the training set; and
generating the feature vector of each category from the category-word matrix, and building the support vector machine model from the feature vectors of the categories; and
determining the category of the document to be classified according to its feature vector and the support vector machine model specifically comprises:
calculating the distances between the feature vector of the document to be classified and the hyperplanes corresponding to each category in the support vector machine model; and
determining the category of the document to be classified according to the calculated distances.
5. A support vector machine model generation method, characterized in that it comprises:
flattening the categories of a training set: for each training sample in the training set, sorting the categories preset for the sample by category level; and, for each category of the sample, starting from the higher-level categories, judging whether any of the sample's categories is a subcategory of that category, and if so, removing that category from the sample's categories; and
generating the support vector machine model from the flattened training set.
6. The method of claim 5, characterized in that each category is assigned a unique identifier, and the identifier of a category contains that category's hierarchical path information.
7. The method of claim 6, characterized in that the identifier of a category below the highest level is composed of the identifier of its parent category followed by that category's subcategory code, wherein the subcategory code is an identification code that is unique to each subcategory within a group of subcategories belonging to the same parent.
8. A support vector machine model generation apparatus, characterized in that it comprises:
a training set flattening module, used to flatten the categories of a training set: for each training sample in the training set, sorting the categories preset for the sample by category level; for each category of the sample, starting from the higher-level categories, judging whether any of the sample's categories is a subcategory of that category, and if so, removing that category from the sample's categories; and outputting the flattened training set; and
a support vector machine model generation module, used to receive the training set output by the training set flattening module and generate the support vector machine model from the received training set.
9. The apparatus of claim 8, characterized in that each category is assigned a unique identifier, and the identifier of a category contains that category's hierarchical path information.
10. The apparatus of claim 9, characterized in that the identifier of a category below the highest level is composed of the identifier of its parent category followed by that category's subcategory code, wherein the subcategory code is an identification code that is unique to each subcategory within a group of subcategories belonging to the same parent.
CN201310033125.XA 2013-01-28 2013-01-28 Method and apparatus for document classification and support vector machine model generation Active CN103106262B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310033125.XA CN103106262B (en) 2013-01-28 2013-01-28 Method and apparatus for document classification and support vector machine model generation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310033125.XA CN103106262B (en) 2013-01-28 2013-01-28 Method and apparatus for document classification and support vector machine model generation

Publications (2)

Publication Number Publication Date
CN103106262A true CN103106262A (en) 2013-05-15
CN103106262B CN103106262B (en) 2016-05-11

Family

ID=48314117

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310033125.XA Active CN103106262B (en) 2013-01-28 Method and apparatus for document classification and support vector machine model generation

Country Status (1)

Country Link
CN (1) CN103106262B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104680192A (en) * 2015-02-05 2015-06-03 国家电网公司 Electric power image classification method based on deep learning
CN104850592A (en) * 2015-04-27 2015-08-19 小米科技有限责任公司 Method and device for generating model file
CN104978328A (en) * 2014-04-03 2015-10-14 北京奇虎科技有限公司 Hierarchical classifier obtaining method, text classification method, hierarchical classifier obtaining device and text classification device
WO2015180622A1 (en) * 2014-05-26 2015-12-03 北京奇虎科技有限公司 Method and apparatus for determining categorical attribute of queried word in search
CN105512145A (en) * 2014-09-26 2016-04-20 阿里巴巴集团控股有限公司 Method and device for information classification
CN106022599A (en) * 2016-05-18 2016-10-12 德稻全球创新网络(北京)有限公司 Industrial design talent level evaluation method and system
CN106126734A (en) * 2016-07-04 2016-11-16 北京奇艺世纪科技有限公司 The sorting technique of document and device
CN107194260A (en) * 2017-04-20 2017-09-22 中国科学院软件研究所 A kind of Linux Kernel association CVE intelligent Forecastings based on machine learning
CN107894986A (en) * 2017-09-26 2018-04-10 北京纳人网络科技有限公司 A kind of business connection division methods, server and client based on vectorization
CN109033478A (en) * 2018-09-12 2018-12-18 重庆工业职业技术学院 A kind of text information law analytical method and system for search engine
CN110808968A (en) * 2019-10-25 2020-02-18 新华三信息安全技术有限公司 Network attack detection method and device, electronic equipment and readable storage medium
CN111199170A (en) * 2018-11-16 2020-05-26 长鑫存储技术有限公司 Formula file identification method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6446061B1 (en) * 1998-07-31 2002-09-03 International Business Machines Corporation Taxonomy generation for document collections
CN1725213A (en) * 2004-07-22 2006-01-25 国际商业机器公司 Method and system for structuring, maintaining personal sort tree, sort display file
CN102243645A (en) * 2010-05-11 2011-11-16 微软公司 Hierarchical content classification into deep taxonomies

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6446061B1 (en) * 1998-07-31 2002-09-03 International Business Machines Corporation Taxonomy generation for document collections
CN1725213A (en) * 2004-07-22 2006-01-25 国际商业机器公司 Method and system for structuring, maintaining personal sort tree, sort display file
CN102243645A (en) * 2010-05-11 2011-11-16 微软公司 Hierarchical content classification into deep taxonomies

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HU Xuegang et al., "Chinese text classification method based on a word vector space model", Journal of Hefei University of Technology (Natural Science Edition) *
MA Le et al., "A hierarchical web page classification algorithm based on SVM", Journal of Beijing Normal University (Natural Science Edition) *
MA Le et al., "A hierarchical web page classification algorithm based on SVM", Journal of Beijing Normal University (Natural Science Edition), vol. 45, no. 3, 30 June 2009 (2009-06-30) *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104978328A (en) * 2014-04-03 2015-10-14 北京奇虎科技有限公司 Hierarchical classifier obtaining method, text classification method, hierarchical classifier obtaining device and text classification device
WO2015180622A1 (en) * 2014-05-26 2015-12-03 北京奇虎科技有限公司 Method and apparatus for determining categorical attribute of queried word in search
CN105512145A (en) * 2014-09-26 2016-04-20 阿里巴巴集团控股有限公司 Method and device for information classification
CN104680192B (en) * 2015-02-05 2017-12-12 国家电网公司 A kind of electric power image classification method based on deep learning
CN104680192A (en) * 2015-02-05 2015-06-03 国家电网公司 Electric power image classification method based on deep learning
CN104850592B (en) * 2015-04-27 2018-09-18 小米科技有限责任公司 The method and apparatus for generating model file
CN104850592A (en) * 2015-04-27 2015-08-19 小米科技有限责任公司 Method and device for generating model file
CN106022599A (en) * 2016-05-18 2016-10-12 德稻全球创新网络(北京)有限公司 Industrial design talent level evaluation method and system
CN106126734A (en) * 2016-07-04 2016-11-16 北京奇艺世纪科技有限公司 The sorting technique of document and device
CN106126734B (en) * 2016-07-04 2019-06-28 北京奇艺世纪科技有限公司 The classification method and device of document
CN107194260A (en) * 2017-04-20 2017-09-22 中国科学院软件研究所 A kind of Linux Kernel association CVE intelligent Forecastings based on machine learning
CN107894986A (en) * 2017-09-26 2018-04-10 北京纳人网络科技有限公司 A kind of business connection division methods, server and client based on vectorization
CN107894986B (en) * 2017-09-26 2021-03-30 北京纳人网络科技有限公司 Enterprise relation division method based on vectorization, server and client
CN109033478A (en) * 2018-09-12 2018-12-18 重庆工业职业技术学院 A kind of text information law analytical method and system for search engine
CN111199170A (en) * 2018-11-16 2020-05-26 长鑫存储技术有限公司 Formula file identification method and device, electronic equipment and storage medium
CN111199170B (en) * 2018-11-16 2022-04-01 长鑫存储技术有限公司 Formula file identification method and device, electronic equipment and storage medium
CN110808968A (en) * 2019-10-25 2020-02-18 新华三信息安全技术有限公司 Network attack detection method and device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN103106262B (en) 2016-05-11

Similar Documents

Publication Publication Date Title
CN103106262A (en) Method and device of file classification and generation of support vector machine model
Rathi et al. Sentiment analysis of tweets using machine learning approach
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN105893609B (en) A kind of mobile APP recommended method based on weighted blend
WO2022126971A1 (en) Density-based text clustering method and apparatus, device, and storage medium
Venugopalan et al. Exploring sentiment analysis on twitter data
CN103049433A (en) Automatic question answering method, automatic question answering system and method for constructing question answering case base
CN106844407B (en) Tag network generation method and system based on data set correlation
CN107871144A (en) Invoice trade name sorting technique, system, equipment and computer-readable recording medium
Albadarneh et al. Using big data analytics for authorship authentication of arabic tweets
Bai et al. Constructing sentiment lexicons in Norwegian from a large text corpus
CN109063147A (en) Online course forum content recommendation method and system based on text similarity
CN116911312B (en) Task type dialogue system and implementation method thereof
CN106600213B (en) Intelligent management system and method for personal resume
CN112256842A (en) Method, electronic device and storage medium for text clustering
Arasteh et al. ARAZ: A software modules clustering method using the combination of particle swarm optimization and genetic algorithms
Vishwakarma et al. A comparative study of K-means and K-medoid clustering for social media text mining
CN103412880A (en) Method and device for determining implicit associated information between multimedia resources
CN106708829A (en) Data recommendation method and data recommendation system
CN105550292B (en) A kind of Web page classification method based on von Mises-Fisher probabilistic models
CN104331507B (en) Machine data classification is found automatically and the method and device of classification
CN104991920A (en) Label generation method and apparatus
CN111026940A (en) Network public opinion and risk information monitoring system and electronic equipment for power grid electromagnetic environment
CN115329076A (en) Bank data screening processing method, device, system and medium
CN112784040B (en) Vertical industry text classification method based on corpus

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20230417

Address after: Room 501-502, 5/F, Sina Headquarters Scientific Research Building, Block N-1 and N-2, Zhongguancun Software Park, Dongbei Wangxi Road, Haidian District, Beijing, 100193

Patentee after: Sina Technology (China) Co.,Ltd.

Address before: 100080, International Building, No. 58 West Fourth Ring Road, Haidian District, Beijing, 20 floor

Patentee before: Sina.com Technology (China) Co.,Ltd.