CN101944099A - Method for automatically classifying text documents using an ontology


Info

Publication number
CN101944099A
Authority
CN
China
Prior art keywords
word
meaning
notion
text document
expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 201010210107
Other languages
Chinese (zh)
Other versions
CN101944099B (en)
Inventor
郭雷
方俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Tianying Environmental Protection Energy Co ltd
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Application filed by Northwestern Polytechnical University
Priority to CN2010102101070A
Publication of CN101944099A
Application granted
Publication of CN101944099B
Legal status: Expired - Fee Related


Abstract

The invention relates to a method for automatically classifying text documents using an ontology, comprising the following steps: first, the feature information of a text document is represented by a weighted keyword set; next, the feature information of a classification category is represented by an ontology that has undergone ontology disambiguation and ontology expansion, and the ontology is converted into a weighted word-sense set by analyzing its structural features; finally, the semantic similarity value between the keyword set of the text document and the weighted word-sense set of the ontology is calculated with the Earth Mover's Distance method, the similarity between the text document and the category is derived from it, and the text documents are classified and ranked according to that similarity. With this method, text documents can be classified automatically and the accuracy of text document classification can be improved.

Description

A method for automatically classifying text documents using an ontology
Technical field
The present invention relates to a method for automatically classifying text documents using an ontology, and belongs to fields such as computer information processing and information retrieval. It is suitable for fast and accurate automatic classification of massive collections of network text documents.
Background art
Text document classification has long been a focus of attention because it makes the organization of text documents more efficient and better supports users in browsing and searching for information. Initially, text documents were classified by hand, but as the volume of text document resources keeps growing, manual classification has become infeasible, so automatic text document classification has become a focus of research.
Text document classification is generally divided into three phases: first, the feature information of the text documents and of the categories is extracted; then, a classifier calculates the similarity between each text document and each category; finally, each text document is assigned to categories according to these similarities.
Traditional machine learning methods have been applied to automatic text document classification, including neural networks, Bayesian methods, support vector machines and the k-nearest-neighbor method. These methods first collect a set of already classified text documents by hand, then use this set to train a classifier, and finally use the trained classifier to assign text documents to categories. These machine learning classification methods have the following shortcomings:
1) Training a classifier with a traditional machine learning method requires manually collecting a large set of classified text documents, which is very tedious, and a different set has to be collected by hand for each different set of categories;
2) Traditional machine learning methods do not consider the semantic relations between words, so it is difficult to improve classification accuracy.
To overcome these shortcomings of machine learning methods, the present invention proposes a method for automatically classifying text documents using an ontology.
Summary of the invention
Technical problem to be solved
To overcome the shortcomings of current machine-learning-based methods, the present invention proposes using an ontology to classify text documents automatically, so that text documents can be classified and ranked quickly and accurately.
Technical solution
The idea of the present invention is as follows. An ontology is used to represent the feature information of a classification category, and the semantic similarity between a text document and the ontology is used to classify in real time. This removes the training process, and the precision and recall of the ontology-based classification method keep improving as the ontology is updated and evolves. In addition, the ontology-based method takes the semantic relations between words into account when calculating the similarity between a text document and the ontology, which improves classification accuracy.
The invention is characterized in that it proposes that an ontology can effectively represent the feature information of a classification category: the ontology, after disambiguation and expansion, represents the feature information of the category, and classification uses the semantic similarity between the text document to be classified and the ontology.
The basic process of the present invention is as follows. First, a weighted keyword set is used to represent the feature information of a text document. Then, an ontology that has undergone disambiguation and expansion is used to represent the feature information of a category, and the ontology is converted into a weighted word-sense set by analyzing its structural features. Finally, the Earth Mover's Distance method is used to calculate the semantic similarity value between the keyword set of the text document and the weighted word-sense set of the ontology, where the similarity between a single word sense and a word is measured with a method based on WordNet glosses. This semantic similarity value is used to calculate the similarity between the text document and the category, and the text documents are classified and ranked according to that similarity.
A method for automatically classifying text documents using an ontology, characterized by the following steps:
(1) Use the KEA algorithm to extract keywords from every text document in the set of text documents to be classified, obtaining the weighted keyword set of each text document. For each category in the given category set, search the Swoogle ontology search engine using the category name as the query term, and take the top-ranked ontology in the search results as the ontology representing that category. Apply ontology disambiguation and ontology expansion to the ontology representing each category to obtain a new ontology representing that category;
The ontology disambiguation process is:
First, for each concept word in the ontology, select the words within distance L of it as the context of that concept word; L takes values in [3, 5];
Then, use the semantic relatedness formula
$$\mathrm{relateness}(s_i, con_j) = \frac{\mathrm{NumOfOverlaps}_{s_i\,con_j}}{\left(\mathrm{wordNumInGlossOf}_{s_i} + \mathrm{wordNumInGlossOf}_{con_j}\right)/2}$$
to calculate the semantic relatedness $\mathrm{relateness}(s_i, con_j)$ between the $i$-th possible word sense $s_i$ of each concept word and the $j$-th context $con_j$ of that concept word, and use
$$\mathrm{Rel}(s_i) = \frac{\sum_{j=1}^{J} \mathrm{relateness}(s_i, con_j)}{J}$$
to calculate the average semantic relatedness $\mathrm{Rel}(s_i)$ of the $i$-th possible word sense $s_i$ of each concept word;
where $i = 1, 2, \dots, I$, $I$ being the number of possible word senses of the concept word, and $j = 1, 2, \dots, J$, $J$ being the number of contexts of the concept word; $\mathrm{wordNumInGlossOf}_{s_i}$ is the number of words in the WordNet gloss of $s_i$, $\mathrm{wordNumInGlossOf}_{con_j}$ is the number of words in the WordNet gloss of $con_j$, and $\mathrm{NumOfOverlaps}_{s_i\,con_j}$ is the number of words that the WordNet gloss of $s_i$ and the WordNet gloss of $con_j$ have in common; the possible word senses are the word senses defined in the lexical database WordNet;
Finally, select the possible word sense with the largest average semantic relatedness Rel as the concept sense of the concept word;
The ontology expansion process is:
Use the semantic relatedness formula
$$\mathrm{relateness}(s_p, s'_{pq}) = \frac{\mathrm{NumOfOverlaps}_{s_p\,s'_{pq}}}{\left(\mathrm{wordNumInGlossOf}_{s_p} + \mathrm{wordNumInGlossOf}_{s'_{pq}}\right)/2}$$
to calculate, for each concept sense of the disambiguated ontology, the semantic relatedness between that concept sense and each word sense in its WordNet hypernym sense set and hyponym sense set, and decide as follows: for each word sense in the hypernym sense set, if its semantic relatedness to the concept sense is greater than given threshold one, add it to the parent-class set of the concept sense; for each word sense in the hyponym sense set, if its semantic relatedness to the concept sense is greater than given threshold two, add it to the subclass set of the concept sense; add all word senses in the WordNet synonym sense set of each concept sense to the equivalent-class set of the concept sense;
where $s_p$ denotes the $p$-th concept sense of the disambiguated ontology, $p = 1, 2, \dots, P$, $P$ being the number of concept senses of the disambiguated ontology; $s'_{pq}$ denotes the $q$-th word sense in the hypernym sense set or hyponym sense set of $s_p$, $q = 1, 2, \dots, Q$, $Q$ being the number of word senses in that set; $\mathrm{wordNumInGlossOf}_{s_p}$ is the number of words in the WordNet gloss of $s_p$, $\mathrm{wordNumInGlossOf}_{s'_{pq}}$ is the number of words in the WordNet gloss of $s'_{pq}$, and $\mathrm{NumOfOverlaps}_{s_p\,s'_{pq}}$ is the number of words that the WordNet gloss of $s_p$ and the WordNet gloss of $s'_{pq}$ have in common;
Threshold one and threshold two both take values in [0.6, 1];
(2) Calculate the weighted word-sense set of the new ontology representing each category, as follows:
First, convert the ontology into a directed graph consisting of a vertex set and a directed edge set: each vertex of the directed graph is a concept sense in the ontology, each directed edge is an inclusion relation between two concept senses, and each edge points from the child concept sense to the parent concept sense;
Then, use
$$weight = \frac{1}{layer^{1/4}}$$
to calculate the weight of each concept sense;
where $weight$ is the weight of the concept sense and $layer$ is the layer number of the vertex corresponding to that concept sense;
The layer number of a vertex is the shortest-path distance from the concept sense of that vertex to the root of the ontology;
(3) Calculate the similarity Sim(d, o) between the text document and the category as Sim(d, o) = 1 - EMD(d, o). If Sim(d, o) is greater than the given threshold δ, classify the text document into this category; otherwise do not classify it into this category;
where d is the weighted keyword set of the text document, o is the weighted word-sense set of the ontology, and EMD(d, o) is the semantic similarity value between the text document and the ontology calculated with the Earth Mover's Distance method; the given threshold δ takes values in [0.5, 0.6];
(4) Sort all text documents under each category in descending order of the similarity Sim(d, o).
Beneficial effects
The method of the present invention uses an ontology to represent the feature information of a category and classifies in real time by calculating the semantic similarity between a text document and the ontology, which removes the training process and improves classification accuracy. In addition, the present invention uses disambiguation to turn the words in the ontology into word senses. This solves the problem of inaccurate similarity values caused by polysemy, improves the precision of the semantic similarity calculation, and further improves the precision of classification. On the basis of ontology disambiguation, the present invention expands the ontology automatically using WordNet, which enriches the concept content of the ontology, improves the accuracy of the subsequent similarity calculation, and avoids the laborious manual creation of ontologies.
Description of drawings
Fig. 1: basic flowchart of the method of the invention
Embodiment
The present invention is now further described with reference to the accompanying drawing:
The ontology-based text document classification method proposed by the present invention was implemented in Java and Perl; the implementation proceeds as follows:
The ontology-based text document classification method is divided into the following four steps:
Step 1: construction of the text document keyword set. The KEA algorithm is used to extract the weighted keyword set of each text document in the set of text documents to be classified. Specifically, for each text document $d_i$ in the set $D = \{d_1, d_2, \dots, d_{|D|}\}$ ($|D|$ is the number of text documents in $D$), naive Bayes estimation is applied first. Three feature attributes are considered: the frequency tf×idf with which a word occurs in the text document, the average position Occurrence at which the word occurs in the text document, and the number of letters Length in the word. For each word in $d_i$, the probability Pr that it is a topic word is calculated as:
$$\Pr = \Pr[T \mid yes] \times \Pr[O \mid yes] \times \Pr[L \mid yes] \times \Pr[yes] \tag{1}$$
where $\Pr[T \mid yes]$, $\Pr[O \mid yes]$ and $\Pr[L \mid yes]$ are the probabilities that the word is a topic word given the current values of the three feature attributes tf×idf, Occurrence and Length, respectively; $\Pr[yes]$ is the ratio of the number of text documents in the set that contain topic words to the number of text documents that do not.
Then, the n words with the largest Pr values (n is usually 4 to 6) are selected as the keywords of text document $d_i$, giving its weighted keyword set, and $d_i$ is represented by this weighted keyword set, i.e. $d_i = \{URL_i, (t_{i1}, tw_{i1}), \dots, (t_{ij}, tw_{ij}), \dots\}$, where $t_{ij}$ is a keyword extracted as described above and $tw_{ij}$ is the weight of keyword $t_{ij}$, namely its Pr value calculated with formula (1).
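The scoring and selection in step 1 can be sketched as follows. This is a minimal illustration rather than the KEA implementation itself: it assumes the three conditional probabilities for each candidate word have already been estimated by a trained naive Bayes model, and the function names and the sample values in the usage note are hypothetical.

```python
def keyphrase_score(p_t, p_o, p_l, p_yes):
    """Formula (1): Pr = Pr[T|yes] * Pr[O|yes] * Pr[L|yes] * Pr[yes]."""
    return p_t * p_o * p_l * p_yes


def top_keywords(candidates, p_yes, n=5):
    """Return the n highest-scoring (word, weight) pairs of one document.

    candidates maps each word to its three conditional probabilities
    (Pr[T|yes], Pr[O|yes], Pr[L|yes]); these are assumed inputs that a
    trained naive Bayes model would supply.
    """
    scored = [(word, keyphrase_score(p_t, p_o, p_l, p_yes))
              for word, (p_t, p_o, p_l) in candidates.items()]
    # Largest Pr first; the Pr value doubles as the keyword weight tw.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]
```

For instance, with the made-up candidates `{"opera": (0.8, 0.9, 0.5), "the": (0.1, 0.2, 0.3), "theatre": (0.7, 0.6, 0.5)}`, `p_yes=0.5` and `n=2`, the function keeps "opera" and "theatre" in that order.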
Step 2: ontology preprocessing. First, for each category in the given category set, the Swoogle ontology search engine is searched using the category name as the query term, and the category is represented by the top-ranked ontology in the search results. In this way, the category set $CA = \{ca_1, ca_2, \dots, ca_{|CA|}\}$ is represented by the ontology set $O = \{o_1, o_2, \dots, o_{|O|}\}$, where $|O|$ is the number of ontologies in $O$, $|CA|$ is the number of categories in $CA$, and $|O| = |CA|$. Each category corresponds to one ontology, i.e. ontology $o_m$ represents the feature information of category $ca_m$: $ca_m := o_m$.
Next, each ontology $o_m$ undergoes the ontology disambiguation of step 2.1 and the ontology expansion of step 2.2. The present invention uses the word senses defined in the lexical database WordNet as the lexical representation of the ontology, and sets the path distance between any two concept words within the same piece of knowledge to 1.
Step 2.1: ontology disambiguation. A word may correspond to several word senses, and this reduces the precision of the semantic similarity calculation. To eliminate the ambiguity of the words in the ontology, the ontology is disambiguated: the context of each word in the ontology is used to determine its correct word sense. Specifically:
First, the words within distance L of a concept word s in the ontology are chosen as the context of s, giving the context set $Con = \{con_1, \dots, con_j, \dots\}$ of s, where $con_j$ is the $j$-th context of s; L takes values in [3, 5];
Then, formula (2) is used to calculate the average semantic relatedness $\mathrm{Rel}(s_i)$ between each WordNet word sense $s_i$ of concept word s ($i = 1, \dots, N_i$, where $N_i$ is the number of word senses of s in WordNet) and all contexts in its context set Con:
$$\mathrm{Rel}(s_i) = \frac{\sum_{j=1}^{|Con|} \mathrm{relateness}(s_i, con_j)}{|Con|} \tag{2}$$
where $|Con|$ is the number of contexts of concept word s, i.e. the number of words in the context set Con; $\mathrm{relateness}(s_i, con_j)$ is the semantic relatedness between the $i$-th word sense $s_i$ and its $j$-th context $con_j$, calculated as:
$$\mathrm{relateness}(s_i, con_j) = \frac{\mathrm{NumOfOverlaps}_{s_i\,con_j}}{\left(\mathrm{wordNumInGlossOf}_{s_i} + \mathrm{wordNumInGlossOf}_{con_j}\right)/2} \tag{3}$$
where $\mathrm{wordNumInGlossOf}_{s_i}$ is the number of words in the WordNet gloss of $s_i$, $\mathrm{wordNumInGlossOf}_{con_j}$ is the number of words in the WordNet gloss of $con_j$, and $\mathrm{NumOfOverlaps}_{s_i\,con_j}$ is the number of words that the WordNet gloss of $s_i$ and the WordNet gloss of $con_j$ have in common;
Finally, the word sense with the largest average semantic relatedness Rel is selected as the correct word sense of concept word s, i.e. the concept sense of s. Because concept words that appear together in the same ontology are semantically related, the correct word sense is the one with the largest semantic relatedness to the neighboring concept words.
Every concept word in the ontology is disambiguated by the process above.
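The gloss-overlap relatedness and the average-relatedness selection of step 2.1 can be sketched as below. This is a simplified illustration under stated assumptions: glosses are passed in as plain strings instead of being read from WordNet, words are split on whitespace, an overlap is counted once per distinct shared word, and each context word contributes a single gloss. The function names are illustrative.

```python
def relatedness(gloss_a, gloss_b):
    """Shared gloss words divided by the mean gloss length."""
    words_a, words_b = gloss_a.lower().split(), gloss_b.lower().split()
    mean_length = (len(words_a) + len(words_b)) / 2
    if mean_length == 0:
        return 0.0
    # Assumption: each distinct shared word counts as one overlap.
    return len(set(words_a) & set(words_b)) / mean_length


def choose_sense(sense_glosses, context_glosses):
    """Index of the sense with the largest average relatedness, as in formula (2)."""
    def average_relatedness(gloss):
        if not context_glosses:
            return 0.0
        total = sum(relatedness(gloss, context) for context in context_glosses)
        return total / len(context_glosses)

    return max(range(len(sense_glosses)),
               key=lambda i: average_relatedness(sense_glosses[i]))
```

With two hand-written candidate glosses for "bank" (a river-side sense and a financial sense) and context glosses about streams and land beside water, `choose_sense` returns the index of the river-side sense, because its gloss shares more words with the contexts.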
Step 2.2: ontology expansion. After the ontology disambiguation of step 2.1, the ontology is represented by concept senses. Ontology expansion adds associated word senses to the concept senses of the disambiguated ontology, thereby enriching the ontology.
First, for each concept sense $cs_k$ in the disambiguated ontology ($k = 1, \dots, N_k$, where $N_k$ is the number of concept senses in the ontology), its hypernym sense set (hypernymy), hyponym sense set (hyponymy) and synonym sense set (synonym) in WordNet are obtained. These three relation sense sets of $cs_k$ are denoted $hypernym(cs_k) = \{a_{k1}, a_{k2}, \dots\}$, $hyponym(cs_k) = \{b_{k1}, b_{k2}, \dots\}$ and $synonym(cs_k) = \{c_{k1}, c_{k2}, \dots\}$, respectively. Two thresholds, hypernym_value and hyponym_value, are set; both take values in [0.6, 1].
Then, formula (4) is used to calculate the semantic relatedness between concept sense $cs_k$ and each word sense $a_{kp}$ in the hypernym sense set $hypernym(cs_k)$ ($p = 1, 2, \dots, P$, where $P$ is the number of word senses in the hypernym sense set); if the semantic relatedness of $cs_k$ and $a_{kp}$ is greater than the threshold hypernym_value, $a_{kp}$ is added to the parent-class set of $cs_k$. Formula (5) is used to calculate the semantic relatedness between $cs_k$ and each word sense $b_{kq}$ in the hyponym sense set $hyponym(cs_k)$ ($q = 1, 2, \dots, Q$, where $Q$ is the number of word senses in the hyponym sense set); if the semantic relatedness of $cs_k$ and $b_{kq}$ is greater than the threshold hyponym_value, $b_{kq}$ is added to the subclass set of $cs_k$. All word senses $c_{kl}$ in the synonym sense set $synonym(cs_k)$ ($l = 1, 2, \dots, L$, where $L$ is the number of word senses in the synonym sense set) are added to the equivalent-class set of $cs_k$.
$$\mathrm{relateness}(cs_k, a_{kp}) = \frac{\mathrm{NumOfOverlaps}_{cs_k\,a_{kp}}}{\left(\mathrm{wordNumInGlossOf}_{cs_k} + \mathrm{wordNumInGlossOf}_{a_{kp}}\right)/2} \tag{4}$$
$$\mathrm{relateness}(cs_k, b_{kq}) = \frac{\mathrm{NumOfOverlaps}_{cs_k\,b_{kq}}}{\left(\mathrm{wordNumInGlossOf}_{cs_k} + \mathrm{wordNumInGlossOf}_{b_{kq}}\right)/2} \tag{5}$$
where $\mathrm{wordNumInGlossOf}_{cs_k}$, $\mathrm{wordNumInGlossOf}_{a_{kp}}$ and $\mathrm{wordNumInGlossOf}_{b_{kq}}$ are the numbers of words in the WordNet glosses of $cs_k$, $a_{kp}$ and $b_{kq}$, respectively; $\mathrm{NumOfOverlaps}_{cs_k\,a_{kp}}$ is the number of words that the WordNet glosses of $cs_k$ and $a_{kp}$ have in common; $\mathrm{NumOfOverlaps}_{cs_k\,b_{kq}}$ is the number of words that the WordNet glosses of $cs_k$ and $b_{kq}$ have in common.
Every concept sense in the ontology is expanded as described above.
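The threshold test of step 2.2 can be sketched as follows. This is an illustration only: the hypernym, hyponym and synonym senses and their glosses are assumed to have been fetched from WordNet beforehand and are passed in as plain dictionaries and lists, the overlap count uses distinct shared words, and the default thresholds use the lower bound 0.6 of the stated range. All names are hypothetical.

```python
def relatedness(gloss_a, gloss_b):
    """Shared gloss words divided by the mean gloss length."""
    words_a, words_b = gloss_a.lower().split(), gloss_b.lower().split()
    mean_length = (len(words_a) + len(words_b)) / 2
    if mean_length == 0:
        return 0.0
    return len(set(words_a) & set(words_b)) / mean_length


def expand_concept(concept_gloss, hypernym_glosses, hyponym_glosses, synonyms,
                   hypernym_value=0.6, hyponym_value=0.6):
    """Select the senses to attach to one concept sense.

    hypernym_glosses / hyponym_glosses map a sense name to its gloss;
    synonyms is a list of sense names and is always added in full.
    """
    parents = [name for name, gloss in hypernym_glosses.items()
               if relatedness(concept_gloss, gloss) > hypernym_value]
    children = [name for name, gloss in hyponym_glosses.items()
                if relatedness(concept_gloss, gloss) > hyponym_value]
    return {"parents": parents, "children": children,
            "equivalents": list(synonyms)}
```

Raising both thresholds toward 1 admits fewer hypernyms and hyponyms, so the expanded ontology stays closer to the original one.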
After step 2, a new ontology set representing the feature information of the categories is obtained, in which each ontology (m = 1, 2, ..., |O|) is the result of ontology disambiguation and ontology expansion.
In this embodiment, the Jena package is used to operate on the ontologies, and the JAWS package is used to access WordNet.
Step 3: construction of the ontology word-sense set. For each ontology obtained after ontology disambiguation and ontology expansion (m = 1, 2, ..., |O|):
First, the ontology is converted into a directed graph G, i.e. $G = (V, E)$, where $V$ is the vertex set, $V = \{v_1, v_2, \dots, v_{|V|}\}$ ($|V|$ is the number of vertices in $V$), and $E$ is the directed edge set, $E = \{e_1, e_2, \dots, e_{|E|}\}$ ($|E|$ is the number of directed edges in $E$). Each vertex of the directed graph is a concept sense in the ontology, each directed edge is an inclusion relation between two concept senses, and each edge points from the child concept sense to the parent concept sense. In the directed graph G, the layer number of a vertex is the shortest-path distance from its concept sense to the root of the ontology. Following the principle that a concept sense at a higher level contributes more, the present invention uses formula (6) to calculate the weight $sw_k$ of concept sense $cs_k$:
$$sw_k = \frac{1}{\left(layer(v_k)\right)^{1/4}} \tag{6}$$
where $layer(v_k)$ is the layer number of the vertex $v_k$ corresponding to concept sense $cs_k$.
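The layer and weight computation of step 3 can be sketched with a breadth-first search. One assumption is made explicit here: the root is assigned layer 1, because a layer of 0 would make formula (6) undefined; the text does not state how the root itself is numbered. The edge list format and the function name are illustrative.

```python
from collections import deque


def concept_weights(edges, root):
    """Weight each concept sense by 1 / layer ** (1/4).

    edges is a list of (child, parent) pairs, matching the child-to-parent
    direction of the directed graph; root is the top concept sense.
    """
    children = {}
    for child, parent in edges:
        children.setdefault(parent, []).append(child)

    # Assumption: the root counts as layer 1 so the formula stays defined.
    layer = {root: 1}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in layer:  # first visit gives the shortest path
                layer[child] = layer[node] + 1
                queue.append(child)

    return {node: 1 / depth ** 0.25 for node, depth in layer.items()}
```

Because of the fourth root, weights fall off slowly with depth: under this numbering a concept three layers down still keeps a weight of about 0.76.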
After this processing, the ontology can be expressed as the weighted ontology word-sense set $\{(cs_1, sw_1), (cs_2, sw_2), \dots, (cs_{|o|}, sw_{|o|})\}$, where $cs_k$ is a concept sense in the ontology, $sw_k$ is its corresponding weight, and $|o|$ is the number of concept senses in the ontology.
Every ontology in the ontology set is processed as described above.
Step 4: classification and ranking. After the three steps above, each text document is represented by a weighted keyword set and each category by a weighted ontology word-sense set. Whether a text document is assigned to a category is now decided by the similarity between the text document and the category: the larger the similarity, the closer the relation between them and the more likely the text document belongs to the category.
First, the Earth Mover's Distance measure is used to calculate the semantic similarity value between the text document and the ontology. Specifically:
For a text document $d = \{(t_1, tw_1), (t_2, tw_2), \dots, (t_{|d|}, tw_{|d|})\}$ ($t$ denotes a keyword of the text document and $tw$ the weight of the keyword) and an ontology $o = \{(cs_1, sw_1), (cs_2, sw_2), \dots, (cs_{|o|}, sw_{|o|})\}$ ($cs$ denotes a concept sense in the ontology and $sw$ the weight of the concept sense), a weighted graph can be built from them, where $W$ is the distance matrix whose element $w_{ij}$ is the semantic similarity value between keyword $t_i$ of the text document ($i = 1, 2, \dots, |d|$, where $|d|$ is the number of keywords of $d$) and concept sense $cs_j$ of the ontology ($j = 1, 2, \dots, |o|$, where $|o|$ is the number of concept senses of $o$).
The vertex set of the weighted graph consists of the keywords of $d$ and the concept senses of $o$, and $W$ is its edge set. For this weighted graph, the goal of the semantic similarity calculation is to find a flow $F = \{f_{ij},\ i = 1, \dots, p,\ j = 1, \dots, q\}$ ($f_{ij}$ is the flow on the edge between $t_i$ and $cs_j$; here $p = |d|$ and $q = |o|$) that minimizes the value of EMD(d, o) below:
$$EMD(d, o) = \frac{\sum_{i=1}^{p} \sum_{j=1}^{q} f_{ij}\, w_{ij}}{\sum_{i=1}^{p} \sum_{j=1}^{q} f_{ij}} \tag{7}$$
EMD(d, o) is the semantic similarity value between the text document and the ontology.
The similarity between the text document and the category is then:
$$Sim(d, o) = 1 - EMD(d, o) \tag{8}$$
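Formulas (7) and (8) can be sketched in pure Python as below. Two assumptions are made that the text does not spell out: the entries $w_{ij}$ are treated as ground distances in [0, 1], and the flow has to move min(total keyword weight, total sense weight), as in the usual Earth Mover's Distance formulation. The optimal flow is found here with a successive-shortest-path min-cost flow; the function names are illustrative, and this is one possible way to solve the transportation problem, not necessarily the one used in the original implementation.

```python
def emd(doc_weights, onto_weights, dist):
    """Earth Mover's Distance; dist[i][j] is the cost between keyword i and sense j."""
    n, m = len(doc_weights), len(onto_weights)
    src, snk, size = n + m, n + m + 1, n + m + 2
    graph = [[] for _ in range(size)]  # edge: [to, capacity, cost, reverse index]

    def add_edge(u, v, cap, cost):
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0.0, -cost, len(graph[u]) - 1])

    for i, w in enumerate(doc_weights):
        add_edge(src, i, w, 0.0)
    for j, w in enumerate(onto_weights):
        add_edge(n + j, snk, w, 0.0)
    for i in range(n):
        for j in range(m):
            add_edge(i, n + j, float("inf"), dist[i][j])

    remaining = min(sum(doc_weights), sum(onto_weights))
    total_flow = total_cost = 0.0
    eps = 1e-12
    while remaining > eps:
        # Bellman-Ford: cheapest augmenting path in the residual graph.
        best = [float("inf")] * size
        prev = [None] * size
        best[src] = 0.0
        for _ in range(size - 1):
            changed = False
            for u in range(size):
                if best[u] == float("inf"):
                    continue
                for k, (v, cap, cost, _rev) in enumerate(graph[u]):
                    if cap > eps and best[u] + cost < best[v] - eps:
                        best[v], prev[v], changed = best[u] + cost, (u, k), True
            if not changed:
                break
        if prev[snk] is None:
            break
        # Find the bottleneck capacity, then push that much flow.
        push, v = remaining, snk
        while v != src:
            u, k = prev[v]
            push, v = min(push, graph[u][k][1]), u
        v = snk
        while v != src:
            u, k = prev[v]
            graph[u][k][1] -= push
            graph[v][graph[u][k][3]][1] += push
            v = u
        total_flow += push
        total_cost += push * best[snk]
        remaining -= push
    return total_cost / total_flow if total_flow > 0 else 0.0


def similarity(doc_weights, onto_weights, dist):
    """Formula (8): Sim(d, o) = 1 - EMD(d, o)."""
    return 1.0 - emd(doc_weights, onto_weights, dist)
```

If every keyword can be matched to a concept sense at zero cost, the EMD is 0 and the similarity is 1; if every pairing costs 1, the similarity drops to 0.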
After the similarity Sim between the text document and the category is obtained, a threshold δ is set to decide whether the text document is classified into this category. If the similarity Sim between the text document and the category is greater than the threshold δ, the text document is classified into the category; otherwise it is not. Because Sim expresses how closely the text document and the category are related, it is also used to rank the text documents under each category: the larger the similarity Sim, the higher the text document is ranked. The threshold δ takes values in [0.5, 0.6].
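The thresholding and ranking just described can be sketched as follows, assuming the similarities have already been computed and are supplied as a dictionary keyed by (document, category); that input layout and the function name are illustrative choices.

```python
def classify_and_rank(similarities, delta=0.5):
    """Group documents by category and rank them by similarity.

    similarities maps (document, category) to Sim(d, o). A document is
    kept for a category only when its similarity is strictly greater
    than delta; each category list is sorted in descending similarity.
    """
    ranked = {}
    for (document, category), sim in similarities.items():
        if sim > delta:
            ranked.setdefault(category, []).append((document, sim))
    for documents in ranked.values():
        documents.sort(key=lambda pair: pair[1], reverse=True)
    return ranked
```

Note that one document can appear under several categories when more than one of its similarities exceeds δ, and a category with no qualifying document is simply absent from the result.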
Example experiment: an experiment was carried out on the implemented program to evaluate the present invention. In the experiment, the threshold δ was 0.5 and the context distance range L was 3. Selection of text documents: 26 web page text documents were chosen from the website (http://dmoz.org) of the Open Directory Project (the largest human-edited web directory). Of these pages, 11 belong to the Arts category, 8 to the Sports category and 7 to the Games category. The 26 pages and their URLs are listed in Table 1. The prefixes "a", "s" and "g" mark text documents belonging to the Arts, Sports and Games categories, respectively.
Table 1: Web page text documents collected from http://dmoz.org/
[Table 1 image]
Selection of ontologies: first, the Arts, Sports and Games ontologies were extracted from the RDF file (structure.rdf.u8.gz) that represents the directory structure of the Open Directory Project. The Arts ontology contains 521 concepts, the Sports ontology 602 concepts and the Games ontology 558 concepts. These ontologies then underwent ontology disambiguation and ontology expansion, with hypernym_value and hyponym_value both set to 0.9. After processing, the Arts, Sports and Games ontologies contain 1557, 1809 and 1719 concepts, respectively.
The 26 web page text documents were classified with the method of the present invention; the results are shown in the following table:
Table 2: Classification and ranking results
[Table 2 image]
Table 2 lists the text documents assigned to the Arts, Sports and Games categories together with their similarities; the text documents in each category are ranked by similarity. The precision and recall of the classification method as a whole were calculated from these results and are shown in Table 3:
Table 3: Classification performance
Recall    Precision
96.2%     83.9%
To evaluate the performance of the ranking method, the ranked lists produced in Table 2 are compared with manually generated ranked lists, which are shown in the following table:
Table 4: Manually generated ranked lists
Let $\tau_i$ be the list of classified text documents in category $c_i$ after ranking by the ranking algorithm, and let $\tau_i^*$ be the manually produced reference ranked list. The closeness of the two lists is then calculated as:
$$S = \frac{\sum_{i=1}^{|C|} S'(\tau_i, \tau_i^*)}{|C|} \tag{9}$$
where $S'(\tau_i, \tau_i^*)$ is the number of elements that occupy the same rank position in list $\tau_i$ and list $\tau_i^*$, and $|C|$ is the total number of elements in list $\tau_i$ or list $\tau_i^*$. The closeness S of each category list is calculated with the formula above and the values are averaged; the average closeness between the ranking results of the present invention and the reference ranked lists of Table 4 is 79.1%.
Table 5: Ranking performance
S_Arts (%)    S_Sports (%)    S_Games (%)    Average closeness (%)
78.5          75.0            83.7           79.1
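The closeness measure can be sketched as below. This follows one reading of formula (9), which the text leaves somewhat open: for each category, count the positions at which the produced list and the reference list hold the same document, divide by the length of the reference list, and then average the per-category values. The function names and the sample lists in the usage note are hypothetical.

```python
def list_closeness(ranked, reference):
    """Share of rank positions at which both lists hold the same document."""
    if not reference:
        return 0.0
    matches = sum(1 for ours, theirs in zip(ranked, reference) if ours == theirs)
    return matches / len(reference)


def average_closeness(ranked_lists, reference_lists):
    """Mean per-category closeness; both arguments map category to a list."""
    scores = [list_closeness(ranked_lists.get(category, []), reference)
              for category, reference in reference_lists.items()]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, the lists `["a1", "a2", "a3", "a4"]` and `["a1", "a3", "a2", "a4"]` agree at the first and last positions, so their closeness under this reading is 0.5.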
The preliminary evaluation shows that the classification and ranking of the present invention perform well: precision and recall are both fairly high, and the ranking is also effective. This is partly because the ontologies produced by the ontology construction algorithm are fairly complete, and partly because semantic information is taken into account when calculating the similarity between a text document and an ontology. In addition, because the present invention removes the need to collect training sets by hand, and because its classification performance can improve as the ontologies evolve, using ontologies to classify and rank text documents is a promising approach.

Claims (1)

1. A method for automatically classifying text documents using an ontology, characterized in that the steps are as follows:
(1) extract a keyword set from every text document in the collection of text documents to be classified using the KEA algorithm, obtaining a weighted keyword set for each text document; for each classification catalogue in the given catalogue set, search the Swoogle ontology search engine using the catalogue name as the search term, and take the first-ranked ontology in the search results as the ontology representing that classification catalogue; perform ontology disambiguation and ontology expansion on the ontology representing each classification catalogue to obtain a new ontology representing that classification catalogue;
The ontology disambiguation process is:
First, for each concept word in the ontology, select the words within distance L of it in the ontology as the context of that concept word;
The value range of L is [3, 5];
Then, using the semantic relatedness formula
Figure FSA00000175171000011
compute the semantic relatedness relateness(s_i, con_j) between the i-th possible word sense s_i of each concept word and the j-th context con_j of that concept word, and using
Figure FSA00000175171000012
compute the average semantic relatedness Rel(s_i) of the i-th possible word sense s_i of each concept word;
Wherein i = 1, 2, ..., I, where I is the number of possible word senses of the concept word; j = 1, 2, ..., J, where J is the number of contexts of the concept word; wordNumInGlossOfs_i is the number of words in the WordNet gloss of s_i; wordNumInGlossOfcon_j is the number of words in the WordNet gloss of con_j; NumOfOverlaps_s_i con_j is the number of identical words contained in both the WordNet gloss of s_i and the WordNet gloss of con_j; the possible word senses are the word senses defined in the lexical database WordNet;
Finally, select the possible word sense with the largest average semantic relatedness Rel as the concept sense of the concept word;
The ontology expansion process is:
Using the semantic relatedness formula
Figure FSA00000175171000021
compute, for each concept sense of the disambiguated ontology, the semantic relatedness between that concept sense and each word sense in its hypernym sense set and in its hyponym sense set in WordNet, and decide as follows: for each word sense in the hypernym sense set, if the semantic relatedness between it and the concept sense is greater than given threshold one, add this word sense to the parent-class set of the concept sense; for each word sense in the hyponym sense set, if the semantic relatedness between it and the concept sense is greater than given threshold two, add this word sense to the subclass set of the concept sense; add all word senses in the synonym sense set of each concept sense in WordNet to the similar-class set of that concept sense;
Wherein p indexes the concept senses of the disambiguated ontology, p = 1, 2, ..., P, where P is the number of concept senses of the disambiguated ontology; s'_pq denotes the q-th word sense in the hypernym sense set or hyponym sense set of the p-th concept sense, q = 1, 2, ..., Q, where Q is the number of word senses in that hypernym sense set or hyponym sense set; the three quantities in the formula denote, respectively, the number of words in the WordNet gloss of the p-th concept sense, the number of words in the WordNet gloss of s'_pq, and the number of identical words contained in both glosses;
The value range of given threshold one and of given threshold two is [0.6, 1];
(2) compute the weighted word-sense set of the new ontology representing each classification catalogue, specifically:
First, convert the ontology into a directed graph consisting of a vertex set and a directed-edge set: each vertex of the directed graph is a concept sense in the ontology, each directed edge is an inclusion relation between two concept senses, and each directed edge points from the child concept sense to the parent concept sense;
Then, using
Figure FSA00000175171000031
compute the weight of each concept sense;
Wherein weight denotes the weight of the concept sense, and layer denotes the layer number of the vertex corresponding to this concept sense;
The layer number of a vertex is the shortest-path distance from the concept sense corresponding to the vertex to the ontology root;
(3) compute the similarity value Sim(d, o) between a text document and a classification catalogue according to Sim(d, o) = 1 - EMD(d, o); if the similarity value Sim(d, o) between the text document and the classification catalogue is greater than a given threshold δ, classify the text document into this classification catalogue; otherwise do not classify the text document into this classification catalogue;
Wherein d is the weighted keyword set of the text document, and o is the weighted word-sense set of the ontology;
EMD(d, o) is the semantic similarity value between the text document and the ontology calculated using the Earth Mover's Distance method; the value range of the given threshold δ is [0.5, 0.6];
(4) sort all text documents under each classification catalogue in descending order of the similarity value Sim(d, o).
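The computational core of the claimed steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation. The formulas referenced in the claim are images that are not reproduced in this text, so the normalisation inside `relatedness` (shared-word count divided by the mean gloss length) is an assumption. The layer-to-weight formula is left out for the same reason, and the Earth Mover's Distance itself is passed in as a caller-supplied function `emd`. All function and parameter names are hypothetical.

```python
from collections import deque


def relatedness(gloss_a, gloss_b):
    """Gloss-overlap relatedness between two WordNet-style glosses.

    ASSUMPTION: the number of shared words is normalised by the mean
    gloss length; the claim's own formula is not available in the text.
    """
    words_a, words_b = gloss_a.lower().split(), gloss_b.lower().split()
    if not words_a or not words_b:
        return 0.0
    overlaps = len(set(words_a) & set(words_b))
    return 2.0 * overlaps / (len(words_a) + len(words_b))


def disambiguate(sense_glosses, context_glosses):
    """Pick the candidate sense with the largest average relatedness.

    sense_glosses: dict sense id -> gloss; context_glosses: non-empty list.
    """
    def average_rel(gloss):
        total = sum(relatedness(gloss, c) for c in context_glosses)
        return total / len(context_glosses)
    return max(sense_glosses, key=lambda s: average_rel(sense_glosses[s]))


def layers(edges, root):
    """Shortest-path distance of every concept sense from the ontology root.

    edges: (child, parent) pairs, matching the edge direction in step (2).
    """
    children = {}
    for child, parent in edges:
        children.setdefault(parent, []).append(child)
    depth = {root: 0}
    queue = deque([root])
    while queue:  # breadth-first search from the root
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in depth:
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth


def classify_and_rank(documents, catalogues, emd, delta=0.5):
    """Steps (3) and (4): keep pairs with Sim = 1 - EMD > delta, then
    sort each catalogue's documents by descending similarity.

    documents: dict doc id -> weighted keyword set
    catalogues: dict catalogue name -> weighted word-sense set
    emd: caller-supplied function returning a distance in [0, 1]
    """
    result = {name: [] for name in catalogues}
    for doc_id, keywords in documents.items():
        for name, senses in catalogues.items():
            sim = 1.0 - emd(keywords, senses)
            if sim > delta:
                result[name].append((doc_id, sim))
    for ranked in result.values():
        ranked.sort(key=lambda pair: pair[1], reverse=True)
    return result
```

Passing `emd` in as a parameter keeps the sketch independent of any particular solver for the underlying transportation problem. The ground distance between a keyword and a word sense, which the Earth Mover's Distance needs, is likewise left to the caller.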
CN2010102101070A 2010-06-24 2010-06-24 Method for automatically classifying text documents by utilizing body Expired - Fee Related CN101944099B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010102101070A CN101944099B (en) 2010-06-24 2010-06-24 Method for automatically classifying text documents by utilizing body

Publications (2)

Publication Number Publication Date
CN101944099A true CN101944099A (en) 2011-01-12
CN101944099B CN101944099B (en) 2012-05-30

Family

ID=43436091

Country Status (1)

Country Link
CN (1) CN101944099B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C53 Correction of patent of invention or patent application
CB03 Change of inventor or designer information

Inventor after: Fang Jun

Inventor after: Guo Lei

Inventor after: Yang Ning

Inventor before: Guo Lei

Inventor before: Fang Jun

COR Change of bibliographic data

Free format text: CORRECT: INVENTOR; FROM: GUO LEI FANG JUN TO: FANG JUN GUO LEI YANG NING

ASS Succession or assignment of patent right

Owner name: NORTHWESTERN POLYTECHNICAL UNIVERSITY

Effective date: 20140814

Owner name: JIANGSU T.Y. ENVIRONMENTAL ENERGY CO., LTD.

Free format text: FORMER OWNER: NORTHWESTERN POLYTECHNICAL UNIVERSITY

Effective date: 20140814

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 710072 XI'AN, SHAANXI PROVINCE TO: 226600 NANTONG, JIANGSU PROVINCE

TR01 Transfer of patent right

Effective date of registration: 20140814

Address after: 226600 the Yellow Sea Avenue, Haian, Jiangsu province (West), No. 268, No.

Patentee after: JIANGSU TIANYING ENVIRONMENTAL PROTECTION ENERGY Co.,Ltd.

Patentee after: Northwestern Polytechnical University

Address before: 710072 Xi'an friendship West Road, Shaanxi, No. 127

Patentee before: Northwestern Polytechnical University

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120530
