CN101944099A

CN101944099A - Method for automatically classifying text documents by using an ontology

Info

Publication number: CN101944099A
Application number: CN 201010210107
Authority: CN
Inventors: 郭雷; 方俊
Original assignee: Northwestern Polytechnical University
Current assignee: Jiangsu Tianying Environmental Protection Energy Co ltd; Northwestern Polytechnical University
Priority date: 2010-06-24
Filing date: 2010-06-24
Publication date: 2011-01-12
Anticipated expiration: 2030-06-24
Also published as: CN101944099B

Abstract

The invention relates to a method for automatically classifying text documents by using an ontology, comprising the following steps: first, the feature information of a text document is represented by a weighted keyword set; then, the feature information of a classification category is represented by an ontology that has undergone ontology disambiguation and ontology expansion, and the ontology is converted into a weighted word sense set by analyzing its structural features; finally, the semantic similarity value between the keyword set of the text document and the weighted word sense set of the ontology is calculated with the Earth Mover's Distance method, the similarity value between the text document and the classification category is calculated from it, and the text documents are classified and ranked according to that similarity value. With the method of the invention, text documents can be classified automatically and the accuracy of text document classification can be improved.

Description

A method for automatically classifying text documents using an ontology

Technical field

The present invention relates to a method for automatically classifying text documents using an ontology, and belongs to fields such as computer information processing and information retrieval. It is suitable for classifying massive numbers of network text documents automatically, quickly, and accurately.

Background technology

Text document classification has long been a problem of central interest, because it makes the organization of text documents more efficient and better supports users in browsing and searching for information. At first, text documents were classified manually, but as text document resources have grown, manual classification has become infeasible, and automatic text document classification has become a focus of research.

Text document classification is generally divided into three phases: first, the feature information of the text documents and of the classification categories is extracted; then, a classifier calculates the similarity value between each text document and each category; finally, each text document is assigned to categories according to the similarity values.

Traditional machine learning methods have been applied to automatic text document classification, including neural networks, Bayesian methods, support vector machines, and the k-nearest-neighbor method. These methods first collect a set of already classified text documents by hand, then use this set to train a classifier, and finally use the trained classifier to assign text documents to the categories. These machine learning classification methods have the following shortcomings:

1) Training a classifier with a traditional machine learning method requires manually collecting a large set of classified text documents, which is very tedious, and a different document set must be collected by hand to train a classifier for each different set of categories;

2) Traditional machine learning methods do not consider the semantic relations between words, so it is difficult to improve classification accuracy.

To overcome these shortcomings of machine learning methods, the present invention proposes a method that uses an ontology to classify text documents automatically.

Summary of the invention

Technical problem to be solved

To overcome the shortcomings of current machine-learning-based methods, the present invention proposes using an ontology to classify text documents automatically, so that text documents can be classified and ranked automatically, quickly, and accurately.

Technical scheme

The idea of the invention is as follows. An ontology is used to represent the feature information of a classification category, and the semantic similarity value between a text document and the ontology is used to classify in real time. This eliminates the training process, and as the ontology is continually updated and evolves, the precision and recall of the ontology-based classification method will continue to improve. In addition, when the similarity value between a text document and the ontology is calculated, the ontology-based method takes the semantic relations between words into account, which improves classification accuracy.

The invention is characterized in that it proposes that an ontology can effectively represent the feature information of a classification category, represents that feature information with an ontology that has undergone disambiguation and expansion, and classifies using the semantic similarity value between the text document to be classified and the ontology.

The basic process of the invention is as follows. First, a weighted keyword set is used to represent the feature information of a text document. Then, an ontology that has undergone disambiguation and expansion is used to represent the feature information of a classification category, and the ontology is converted into a weighted word sense set by analyzing its structural features. Finally, the Earth Mover's Distance method is used to calculate the semantic similarity value between the keyword set of the text document and the weighted word sense set of the ontology, where the similarity value between an individual word sense and a word is measured with a method based on WordNet glosses. This semantic similarity value is used to calculate the similarity value between the text document and the category, and text documents are classified and ranked according to that similarity value.

A method for automatically classifying text documents using an ontology, characterized by the following steps:

(1) The KEA algorithm is used to extract the keywords of every text document in the set of text documents to be classified, giving the weighted keyword set of each text document. The name of each classification category in the given category set is used as a query term in the Swoogle ontology search engine, and the top-ranked ontology in the search results is taken as the ontology representing that category. Ontology disambiguation and ontology expansion are applied to the ontology representing each category, giving a new ontology that represents the category;

The ontology disambiguation process is:

First, for each concept word in the ontology, the words within a distance of L from it are selected as the context of that concept word; the value range of L is [3, 5];

Then, the semantic relatedness formula

relateness (s_{i}, con_{j}) = \frac{NumOfOverlaps\_s_{i}con_{j}}{(wordNumInGlossOfs_{i} + wordNumInGlossOfcon_{j}) / 2}

is used to calculate the semantic relatedness relateness (s_i, con_j) between the i-th possible sense s_i of each concept word and the j-th context con_j of that concept word, and

Rel (s_{i}) = \frac{\sum_{j = 1}^{J} relateness (s_{i}, con_{j})}{J}

is used to calculate the average semantic relatedness Rel (s_i) of the i-th possible sense s_i of each concept word;

where i = 1, 2, ..., I, and I is the number of possible senses of the concept word; j = 1, 2, ..., J, and J is the number of contexts of the concept word; wordNumInGlossOfs_i is the number of words in the WordNet gloss of s_i; wordNumInGlossOfcon_j is the number of words in the WordNet gloss of con_j; NumOfOverlaps_s_icon_j is the number of words that the WordNet glosses of s_i and con_j have in common; the possible senses are the senses defined in the lexical database WordNet;

Finally, the possible sense with the largest average semantic relatedness Rel is selected as the concept sense of the concept word;

The ontology expansion process is:

The semantic relatedness formula

relateness (s_{p}, s'_{pq}) = \frac{NumOfOverlaps\_s_{p}s'_{pq}}{(wordNumInGlossOfs_{p} + wordNumInGlossOfs'_{pq}) / 2}

is used to calculate, for each concept sense of the disambiguated ontology, the semantic relatedness between that concept sense and each sense in its WordNet hypernym sense set and hyponym sense set, and the following decisions are made: each sense in the hypernym sense set whose semantic relatedness with the concept sense is greater than given threshold one is added to the parent-class set of the concept sense; each sense in the hyponym sense set whose semantic relatedness with the concept sense is greater than given threshold two is added to the subclass set of the concept sense; and all senses in the WordNet synonym sense set of each concept sense are added to the same-class set of the concept sense;

where s_p denotes the p-th concept sense of the disambiguated ontology, p = 1, 2, ..., P, and P is the number of concept senses of the disambiguated ontology; s'_pq denotes the q-th sense in the hypernym sense set or hyponym sense set of s_p, q = 1, 2, ..., Q, and Q is the number of senses in that hypernym or hyponym sense set; wordNumInGlossOfs_p is the number of words in the WordNet gloss of s_p; wordNumInGlossOfs'_pq is the number of words in the WordNet gloss of s'_pq; NumOfOverlaps_s_ps'_pq is the number of words that the WordNet glosses of s_p and s'_pq have in common;

The value range of given threshold one and threshold two is [0.6, 1];

(2) The weighted word sense set of the new ontology representing each category is calculated, specifically:

First, the ontology is converted into a directed graph consisting of a vertex set and a directed edge set: each vertex of the directed graph is a concept sense in the ontology, each directed edge is an inclusion relation between two concept senses, and each directed edge points from the child concept sense to the parent concept sense;

Then, the weight of each concept sense is calculated by

weight = \frac{1}{layer^{1 / 4}}

where weight is the weight of the concept sense and layer is the layer number of the vertex corresponding to the concept sense;

The layer number of a vertex is the shortest path distance from the concept sense corresponding to the vertex to the root of the ontology;

(3) The similarity value Sim (d, o) between a text document and a category is calculated by Sim (d, o) = 1 - EMD (d, o). If Sim (d, o) is greater than a given threshold δ, the text document is classified into that category; otherwise it is not classified into that category;

where d is the weighted keyword set of the text document and o is the weighted word sense set of the ontology; EMD (d, o) is the semantic similarity value between the text document and the ontology calculated with the Earth Mover's Distance method; the value range of the given threshold δ is [0.5, 0.6];

(4) All text documents classified under each category are sorted in descending order of their similarity value Sim (d, o).
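Steps (3) and (4) can be illustrated with a minimal Python sketch. It assumes that the similarity values Sim (d, o) have already been calculated; the function name classify_and_rank, the document identifiers, and the similarity values are hypothetical and serve only as an illustration, not as the implementation used by the inventors.

```python
from collections import defaultdict


def classify_and_rank(similarities, delta=0.5):
    """similarities maps (doc_id, category) to Sim(d, o) in [0, 1]."""
    ranked = defaultdict(list)
    for (doc_id, category), sim in similarities.items():
        if sim > delta:  # step (3): keep only pairs above the threshold
            ranked[category].append((doc_id, sim))
    for category in ranked:
        # step (4): sort each category's documents by similarity, descending
        ranked[category].sort(key=lambda pair: pair[1], reverse=True)
    return dict(ranked)


# Hypothetical similarity values, for illustration only.
sims = {
    ("a1", "Arts"): 0.72, ("a1", "Games"): 0.41,
    ("g1", "Games"): 0.66, ("g1", "Arts"): 0.55,
    ("s1", "Sports"): 0.48,
}
result = classify_and_rank(sims, delta=0.5)
```

Because the decision is made separately for each category, one text document may be assigned to several categories, and a document whose similarity values all fall at or below δ is assigned to none.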

Beneficial effect

The method of the invention uses an ontology to represent the feature information of a category and classifies in real time by calculating the semantic similarity value between a text document and the ontology, which eliminates the training process and improves classification accuracy. In addition, the invention uses disambiguation to turn the words in the ontology into word senses, which solves the problem of inaccurate similarity values caused by polysemy, improves the precision of the semantic similarity calculation, and further improves the precision of classification. On the basis of ontology disambiguation, the invention uses WordNet to expand the ontology automatically, which enriches the concept content of the ontology, improves the accuracy of the subsequent similarity calculation, and avoids the laborious task of creating an ontology by hand.

Description of drawings

Fig. 1: basic flowchart of the method of the invention

Embodiment

The invention is now further described with reference to the accompanying drawing:

The ontology-based text document classification method proposed by the invention was implemented in Java and Perl. The implementation proceeds as follows:

The ontology-based text document classification method is divided into the following four steps:

Step 1: construction of the text document keyword sets. The KEA algorithm is used to extract the weighted keyword set of each text document in the set of text documents to be classified. Specifically, for each text document d_i in the set of text documents to be classified D = {d_1, d_2, ..., d_{|D|}} (where |D| is the number of text documents in D), a naive Bayes estimate is first applied. It considers three feature attributes: the frequency tf × idf with which a word (an existing word) occurs in the text document, the average position Occurrence at which the word occurs in the text document, and the number of letters Length in the word. For each word in d_i, the probability Pr that it is a topic word is calculated by the following formula:

Pr＝Pr[T|yes]×Pr[O|yes]×Pr[L|yes]×Pr[yes] (1)

where Pr[T|yes], Pr[O|yes], and Pr[L|yes] are the probabilities that the word is a topic word given the current values of the three feature attributes tf × idf, Occurrence, and Length, respectively; Pr[yes] is the ratio of the number of text documents in the text document set that contain topic words to the number of text documents that do not.

Then, the n words with the largest Pr values (n is usually 4 to 6) are selected as the keywords of text document d_i, giving the weighted keyword set of d_i, and d_i is represented by this weighted keyword set, that is, d_i = {URL_i, (t_{i1}, tw_{i1}), ..., (t_{ij}, tw_{ij}), ...}, where t_{ij} is a keyword extracted as described above and tw_{ij} is the weight of keyword t_{ij}, namely its Pr value calculated by formula (1).
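The scoring and selection in step 1 can be sketched in Python as follows. This is only an illustration of formula (1) and of the top-n selection; it does not reproduce the KEA implementation, and the function names, the per-word probabilities, and the prior are hypothetical values chosen for the example.

```python
from math import prod  # available from Python 3.8


def keyphrase_probability(feature_likelihoods, prior_yes):
    """Formula (1): Pr = Pr[T|yes] * Pr[O|yes] * Pr[L|yes] * Pr[yes]."""
    return prod(feature_likelihoods) * prior_yes


def top_keywords(candidates, prior_yes, n=5):
    """Return the n candidate words with the largest Pr, with Pr as weight."""
    scored = [(word, keyphrase_probability(likelihoods, prior_yes))
              for word, likelihoods in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Hypothetical likelihoods (Pr[T|yes], Pr[O|yes], Pr[L|yes]) per word.
candidates = {
    "ontology": (0.8, 0.5, 0.5),
    "document": (0.4, 0.5, 0.5),
    "the": (0.1, 0.2, 0.1),
}
keywords = top_keywords(candidates, prior_yes=0.5, n=2)
```

In this example the list keywords plays the role of the pairs (t_{ij}, tw_{ij}) in the weighted keyword set of one text document.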

Step 2: ontology preprocessing. First, the name of each classification category in the given category set is used as a query term in the Swoogle ontology search engine, and the top-ranked ontology in the search results is used to represent that category. In this way, the category set CA = {ca_1, ca_2, ..., ca_{|CA|}} is represented by the ontology set O = {o_1, o_2, ..., o_{|O|}}, where |O| is the number of ontologies in O, |CA| is the number of categories in CA, and |O| = |CA|. Each category corresponds to one ontology; that is, an ontology o_m represents the feature information of a category ca_m, written ca_m := o_m.

Next, the ontology disambiguation of step 2.1 and the ontology expansion of step 2.2 are applied to each ontology o_m. The invention uses the word senses defined in the lexical database WordNet as the lexical representation of the ontology, and sets the path distance between any two concept words within the same piece of knowledge to 1.

Step 2.1: ontology disambiguation. A word may correspond to several senses, and this reduces the precision of the semantic similarity calculation. To eliminate the ambiguity of the words in the ontology, the ontology is disambiguated; that is, the context of each word in the ontology is used to determine its correct sense. Specifically:

First, the words within a distance of L from a concept word s in the ontology are chosen as the context of s, giving the context set Con = {con_1, ..., con_j, ...} of s, where con_j is the j-th context of s; the value range of L is [3, 5];

Then, formula (2) is used to calculate the average semantic relatedness Rel (s_i) between each WordNet sense s_i of concept word s (i = 1, ..., N_i, where N_i is the number of senses of s in WordNet) and all contexts in its context set Con:

Rel (s_{i}) = \frac{\sum_{j = 1}^{|Con|} relateness (s_{i}, con_{j})}{|Con|}    (2)

where |Con| is the number of contexts of concept word s, that is, the number of words in the context set Con, and relateness (s_i, con_j) is the semantic relatedness between the i-th sense s_i and the j-th context con_j, calculated as follows:

relateness (s_{i}, con_{j}) = \frac{NumOfOverlaps\_s_{i}con_{j}}{(wordNumInGlossOfs_{i} + wordNumInGlossOfcon_{j}) / 2}    (3)

where wordNumInGlossOfs_i is the number of words in the WordNet gloss of s_i, wordNumInGlossOfcon_j is the number of words in the WordNet gloss of con_j, and NumOfOverlaps_s_icon_j is the number of words that the WordNet glosses of s_i and con_j have in common;

Finally, the sense with the largest average semantic relatedness Rel is selected as the correct sense of concept word s, that is, as the concept sense of s. Because concept words that occur together in the same ontology are semantically related, the correct sense is the one with the greatest semantic relatedness to the neighboring concept words.

Every concept word in the ontology is disambiguated by the process above.
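The disambiguation procedure above can be sketched in Python. To stay self-contained, the sketch takes glosses as plain strings instead of querying WordNet; the sense labels and gloss texts are invented for illustration and are not actual WordNet entries. It also assumes that "words in common" means distinct shared words after lower-casing and whitespace splitting, which the text does not specify.

```python
def relatedness(gloss_a, gloss_b):
    """Gloss-overlap relatedness: shared words over the mean gloss length."""
    words_a, words_b = gloss_a.lower().split(), gloss_b.lower().split()
    overlaps = len(set(words_a) & set(words_b))  # assumption: distinct words
    mean_length = (len(words_a) + len(words_b)) / 2
    return overlaps / mean_length if mean_length else 0.0


def disambiguate(sense_glosses, context_glosses):
    """Pick the sense with the largest average relatedness to the contexts."""
    def average(gloss):
        total = sum(relatedness(gloss, c) for c in context_glosses)
        return total / len(context_glosses)
    return max(sense_glosses, key=lambda sense: average(sense_glosses[sense]))


# Invented glosses for two senses of "bank" and for two context words.
senses = {
    "bank#finance": "an institution that accepts deposits of money",
    "bank#river": "sloping land beside a body of water",
}
contexts = ["money in the form of deposits", "a loan of money"]
chosen = disambiguate(senses, contexts)
```

Function words such as "a" and "of" also count as overlaps in this sketch, so a practical implementation would probably filter stop words before counting.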

Step 2.2: ontology expansion. After the ontology disambiguation of step 2.1, the ontology is represented by concept senses. Ontology expansion adds related senses to the concept senses of the disambiguated ontology, thereby enriching the ontology.

First, for each concept sense cs_k in the disambiguated ontology (k = 1, ..., N_k, where N_k is the number of concept senses in the ontology), its hypernym sense set, hyponym sense set, and synonym sense set in WordNet are obtained. These three relation sense sets of cs_k are written hypernym (cs_k) = {a_{k1}, a_{k2}, ...}, hyponym (cs_k) = {b_{k1}, b_{k2}, ...}, and synonym (cs_k) = {c_{k1}, c_{k2}, ...}. Two thresholds, hypernym_value and hyponym_value, are set, each with value range [0.6, 1].

Then, formula (4) is used to calculate the semantic relatedness between cs_k and each sense a_{kp} in hypernym (cs_k) (p = 1, 2, ..., P, where P is the number of senses in the hypernym sense set); if the semantic relatedness between cs_k and a_{kp} is greater than the threshold hypernym_value, a_{kp} is added to the parent-class set of cs_k. Formula (5) is used to calculate the semantic relatedness between cs_k and each sense b_{kq} in hyponym (cs_k) (q = 1, 2, ..., Q, where Q is the number of senses in the hyponym sense set); if the semantic relatedness between cs_k and b_{kq} is greater than the threshold hyponym_value, b_{kq} is added to the subclass set of cs_k. All senses c_{kl} in synonym (cs_k) (l = 1, 2, ..., L, where L is the number of senses in the synonym sense set) are added to the same-class set of cs_k.

relateness (cs_{k}, a_{kp}) = \frac{NumOfOverlaps\_cs_{k}a_{kp}}{(wordNumInGlossOfcs_{k} + wordNumInGlossOfa_{kp}) / 2}    (4)

relateness (cs_{k}, b_{kq}) = \frac{NumOfOverlaps\_cs_{k}b_{kq}}{(wordNumInGlossOfcs_{k} + wordNumInGlossOfb_{kq}) / 2}    (5)

where wordNumInGlossOfcs_k, wordNumInGlossOfa_{kp}, and wordNumInGlossOfb_{kq} are the numbers of words in the WordNet glosses of cs_k, a_{kp}, and b_{kq}, respectively; NumOfOverlaps_cs_ka_{kp} is the number of words that the WordNet glosses of cs_k and a_{kp} have in common; and NumOfOverlaps_cs_kb_{kq} is the number of words that the WordNet glosses of cs_k and b_{kq} have in common.

Every concept sense in the ontology is expanded as described above.
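The expansion of a single concept sense can be sketched in Python as below. The sketch repeats a small gloss-overlap function so that it runs on its own, and it takes the hypernym, hyponym, and synonym senses as plain dictionaries and lists rather than reading them from WordNet. All sense names and gloss texts are invented for illustration, and counting distinct shared words is an assumption.

```python
def relatedness(gloss_a, gloss_b):
    """Gloss-overlap relatedness: shared words over the mean gloss length."""
    words_a, words_b = gloss_a.lower().split(), gloss_b.lower().split()
    shared = len(set(words_a) & set(words_b))  # assumption: distinct words
    mean_length = (len(words_a) + len(words_b)) / 2
    return shared / mean_length if mean_length else 0.0


def expand_concept(concept_gloss, hypernyms, hyponyms, synonyms,
                   hypernym_value=0.6, hyponym_value=0.6):
    """Return the senses added to the parent, subclass, and same-class sets."""
    return {
        "parents": [sense for sense, gloss in hypernyms.items()
                    if relatedness(concept_gloss, gloss) > hypernym_value],
        "children": [sense for sense, gloss in hyponyms.items()
                     if relatedness(concept_gloss, gloss) > hyponym_value],
        "similar": list(synonyms),  # synonyms are added without a threshold
    }


# Invented glosses, for illustration only.
expanded = expand_concept(
    "a game played with a ball",
    hypernyms={"contest": "a game played with rules",
               "activity": "any specific behavior"},
    hyponyms={"soccer": "a game played with a round ball",
              "chess": "a board game for two players"},
    synonyms=["ballgame"],
)
```

Raising the two thresholds toward 1 admits fewer hypernym and hyponym senses, so the expanded ontology stays closer to the original one.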

After step 2, a new ontology set representing the feature information of the classification categories is obtained, in which each ontology (m = 1, 2, ..., |O|) has undergone ontology disambiguation and ontology expansion.

In the present embodiment, the Jena package is used to operate on the ontologies, and the JAWS package is used to access WordNet.

Step 3: construction of the ontology word sense sets. For each ontology that has undergone ontology disambiguation and ontology expansion (m = 1, 2, ..., |O|):

First, the ontology is converted into a directed graph G = (V, E), where V is the vertex set, V = {v_1, v_2, ..., v_{|V|}} (|V| is the number of vertices in V), and E is the directed edge set, E = {e_1, e_2, ..., e_{|E|}} (|E| is the number of directed edges in E). Each vertex of the directed graph is a concept sense in the ontology, each directed edge is an inclusion relation between two concept senses, and each directed edge points from the child concept sense to the parent concept sense. In the directed graph G, the layer number of a vertex is the shortest path distance from its corresponding concept sense to the root of the ontology. Following the principle that the higher the level of a concept sense, the greater its contribution, the invention uses formula (6) to calculate the weight sw_k of concept sense cs_k:

sw_{k} = \frac{1}{(layer (v_{k}))^{1 / 4}}    (6)

where layer (v_k) is the layer number of the vertex v_k corresponding to concept sense cs_k.
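The layer numbers and weights of step 3 can be sketched in Python with a breadth-first search from the ontology root. The graph is given as a hypothetical dictionary that maps each child concept to its parent concepts, matching the child-to-parent edge direction described above; the concept names are invented. Formula (6) is undefined for a vertex at distance 0, so this sketch simply leaves the root unweighted; how the root is treated in the original implementation is not stated in the text.

```python
from collections import deque


def concept_weights(child_to_parents, root):
    """Weight each concept by 1 / layer^(1/4), layer = distance from root."""
    # Invert the child -> parent edges so the graph can be walked from the root.
    children = {}
    for child, parents in child_to_parents.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)
    layer = {root: 0}
    queue = deque([root])
    while queue:  # breadth-first search yields shortest path distances
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in layer:
                layer[child] = layer[node] + 1
                queue.append(child)
    # Formula (6); the root (layer 0) is skipped, which is an assumption.
    return {c: 1 / layer[c] ** 0.25 for c in layer if layer[c] > 0}


# Hypothetical ontology fragment, for illustration only.
edges = {
    "Sports": ["Top"],
    "Soccer": ["Sports"],
    "Tennis": ["Sports"],
    "Wimbledon": ["Tennis", "Sports"],
}
weights = concept_weights(edges, root="Top")
```

The fourth root makes the weight fall slowly with depth: a concept at layer 2 keeps a weight of about 0.84, and one at layer 16 still has a weight of 0.5.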

After this processing, the ontology can be expressed as a weighted ontology word sense set o = {(cs_1, sw_1), (cs_2, sw_2), ..., (cs_{|o|}, sw_{|o|})}, where cs_k is a concept sense in the ontology, sw_k is its corresponding weight, and |o| is the number of concept senses in the ontology.

Every ontology in the ontology set is processed as described above.

Step 4: classification and ranking. After the three steps above, each text document is represented by a weighted keyword set and each category is represented by a weighted ontology word sense set. Whether a text document is assigned to a given category is then decided from the similarity value between the text document and the category: the larger the similarity value, the closer the relation between them and the more likely the text document belongs to the category.

First, the Earth Mover's Distance measure is used to calculate the semantic similarity value between a text document and an ontology. Specifically:

For a text document d = {(t_1, tw_1), (t_2, tw_2), ..., (t_{|d|}, tw_{|d|})} (t is a keyword of the text document and tw is the weight of the keyword) and an ontology o = {(cs_1, sw_1), (cs_2, sw_2), ..., (cs_{|o|}, sw_{|o|})} (cs is a concept sense in the ontology and sw is the weight of the concept sense), a weighted graph can be obtained from them. W is its distance matrix, whose element w_{ij} is the semantic similarity value between keyword t_i of the text document (i = 1, 2, ..., |d|, where |d| is the number of keywords of d) and concept sense cs_j of the ontology (j = 1, 2, ..., |o|, where |o| is the number of concept senses of o).

The vertex set of the weighted graph consists of the keywords of d and the concept senses of o, and W is its edge set. For this weighted graph, the goal of the semantic similarity calculation is to find a flow F = {f_{ij}, i = 1, ..., p, j = 1, ..., q} (f_{ij} is the flow on the edge between t_i and cs_j, with p = |d| and q = |o|) that minimizes the value of EMD (d, o) in the following formula:

EMD (d, o) = \frac{\sum_{i = 1}^{p} \sum_{j = 1}^{q} f_{ij} w_{ij}}{\sum_{i = 1}^{p} \sum_{j = 1}^{q} f_{ij}}    (7)

EMD (d, o) is the semantic similarity value between the text document and the ontology.

The similarity value between the text document and the category is then:

Sim (d, o) = 1 - EMD (d, o)    (8)

After the similarity value Sim between a text document and a category has been obtained, a threshold δ is set to decide whether the text document is classified into that category. If the similarity value Sim between the text document and the category is greater than the threshold δ, the text document is classified into the category; otherwise it is not. Because Sim expresses how closely a text document is related to a category, it is also used to rank the text documents under each category: the larger the similarity value Sim, the higher the text document is ranked. The value range of the threshold δ is [0.5, 0.6].
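The calculation of formulas (7) and (8) can be sketched in pure Python. The text does not state the constraints on the flow F, so this sketch assumes the usual Earth Mover's Distance formulation: the flow leaving keyword t_i is at most tw_i, the flow entering concept sense cs_j is at most sw_j, and the total flow equals the smaller of the two weight sums. It treats w_ij as a cost in [0, 1] and finds the minimum-cost flow by successive shortest augmenting paths. The function names and the numbers in the example are hypothetical.

```python
def emd(doc_weights, onto_weights, cost):
    """Formula (7): minimum total cost of the flow divided by the total flow."""
    p, q = len(doc_weights), len(onto_weights)
    source, sink = p + q, p + q + 1
    # graph[u] holds edges [target, capacity, cost, index_of_reverse_edge].
    graph = [[] for _ in range(p + q + 2)]

    def add_edge(u, v, capacity, c):
        graph[u].append([v, capacity, c, len(graph[v])])
        graph[v].append([u, 0.0, -c, len(graph[u]) - 1])

    for i, tw in enumerate(doc_weights):
        add_edge(source, i, tw, 0.0)
        for j in range(q):
            add_edge(i, p + j, float("inf"), cost[i][j])
    for j, sw in enumerate(onto_weights):
        add_edge(p + j, sink, sw, 0.0)

    eps, total_flow, total_cost = 1e-12, 0.0, 0.0
    while True:
        # Bellman-Ford search for the cheapest path in the residual graph.
        dist = [float("inf")] * len(graph)
        prev = [None] * len(graph)
        dist[source] = 0.0
        for _ in range(len(graph) - 1):
            changed = False
            for u in range(len(graph)):
                if dist[u] == float("inf"):
                    continue
                for k, (v, capacity, c, _rev) in enumerate(graph[u]):
                    if capacity > eps and dist[u] + c < dist[v] - eps:
                        dist[v], prev[v], changed = dist[u] + c, (u, k), True
            if not changed:
                break
        if prev[sink] is None:
            break  # no augmenting path left: the flow is maximal
        # Find the bottleneck capacity on the path, then push that much flow.
        push, v = float("inf"), sink
        while v != source:
            u, k = prev[v]
            push = min(push, graph[u][k][1])
            v = u
        v = sink
        while v != source:
            u, k = prev[v]
            graph[u][k][1] -= push
            graph[v][graph[u][k][3]][1] += push
            v = u
        total_flow += push
        total_cost += push * dist[sink]
    return total_cost / total_flow if total_flow else 0.0


def similarity(doc_weights, onto_weights, cost):
    """Formula (8): Sim(d, o) = 1 - EMD(d, o)."""
    return 1.0 - emd(doc_weights, onto_weights, cost)


# Two keywords, two concept senses, hypothetical costs w_ij.
example = similarity([0.5, 0.5], [0.5, 0.5], [[0.2, 0.8], [0.6, 0.4]])
```

In the example the cheapest flow pairs the first keyword with the first concept sense and the second keyword with the second, so EMD is 0.3 and the similarity value is 0.7, which would exceed a threshold δ of 0.5.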

Example experiment: a group of experiments was carried out on the implemented program to evaluate the invention. In the experiments, the threshold δ was 0.5 and the context distance range L was 3. Selection of text documents: 26 web page text documents were chosen from the website of the Open Directory Project (http://dmoz.org), the largest directory project in which web pages are classified by hand. Of these web pages, 11 belong to the Arts category, 8 to the Sports category, and 7 to the Games category. The 26 web pages and their URLs are shown in Table 1. The prefixes "a", "s", and "g" mark text documents belonging to the Arts, Sports, and Games categories, respectively.

Table 1: web page text documents collected from http://dmoz.org/

Selection of ontologies: the Arts, Sports, and Games ontologies were first extracted from the RDF file (structure.rdf.u8.gz) that represents the Open Directory Project directory structure. The Arts ontology contains 521 concepts, the Sports ontology 602 concepts, and the Games ontology 558 concepts. These ontologies then underwent ontology disambiguation and ontology expansion, during which hypernym_value and hyponym_value were both set to 0.9. After processing, the Arts, Sports, and Games ontologies contained 1557, 1809, and 1719 concepts, respectively.

The method of the invention was used to classify these 26 web page text documents; the results are shown in the following table:

Table 2: results of classification and ranking

Table 2 gives the text documents assigned to the Arts, Sports, and Games categories together with their similarity values; the text documents in each category are ranked by similarity value. The overall precision and recall of the classification method were calculated from these results and are shown in Table 3:

Table 3: classification algorithm performance

Recall	Precision
96.2%	83.9%

To evaluate the performance of the ranking method, the ranked lists produced in Table 2 were compared with manually generated ranked lists, which are shown in the following table:

Table 4: manually produced ranked lists

Suppose τ_i is the list obtained by using the ranking algorithm to rank the classified text documents in category c_i, and τ_i^* is the manual standard ranked list. The closeness of the two lists is then calculated with the following formula:

S = \frac{\sum_{i = 1}^{|C|} S' (\tau_{i}, \tau_{i}^{*})}{|C|}    (9)

where S' (τ_i, τ_i^*) is the number of identical elements at the same rank positions in list τ_i and list τ_i^*, and |C| is the total number of elements in list τ_i or list τ_i^*. The closeness S of each category list was calculated with the formula above and averaged; the average closeness between the ranking results of the invention and the standard ranked lists of Table 4 is 79.1%.
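The comparison of ranked lists can be sketched in Python as follows. The sketch follows one reading of the description above: for each category, the closeness is the share of rank positions at which the automatically ranked list and the manual list hold the same document, and the per-category values are then averaged. The function names and the two example lists are hypothetical and are not taken from Table 2 or Table 4.

```python
def position_agreement(ranked, reference):
    """Share of rank positions at which both lists hold the same element."""
    matches = sum(1 for a, b in zip(ranked, reference) if a == b)
    return matches / len(reference) if reference else 0.0


def mean_agreement(list_pairs):
    """Average the per-category closeness values."""
    scores = [position_agreement(ranked, reference)
              for ranked, reference in list_pairs]
    return sum(scores) / len(scores)


# Hypothetical ranked lists for two categories, for illustration only.
pairs = [
    (["a1", "a2", "a3", "a4"], ["a1", "a3", "a2", "a4"]),
    (["g1", "g2"], ["g1", "g2"]),
]
closeness = mean_agreement(pairs)
```

This measure is strict: two documents that swap adjacent positions count as two mismatches, even though the two rankings are otherwise identical.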

Table 5: ranking method performance evaluation

S_Arts (%)	S_Sports (%)	S_Games (%)	Average closeness (%)
78.5	75.0	83.7	79.1

This preliminary evaluation shows that the classification and ranking of the invention perform well: precision and recall are both fairly high, and the ranking is also effective. This is partly because the ontologies produced by the ontology construction algorithm are fairly complete, and partly because semantic information is taken into account when the similarity value between a text document and an ontology is calculated. In addition, because the invention removes the need to collect a training set by hand, and its classification performance can improve further as the ontologies evolve, the method of using an ontology to classify and rank text documents has good prospects.

Claims

1. A method for automatically classifying text documents using an ontology, characterized by the following steps:

The ontology disambiguation process is:

First, for each concept word in the ontology, the words within a distance of L from it are selected as the context of that concept word;

The value range of L is [3, 5];

Then, by the semantic relatedness formula

The ontology expansion process is:

Using the semantic relatedness formula

where s_p denotes the p-th concept sense of the disambiguated ontology, p = 1, 2, ..., P, and P is the number of concept senses of the disambiguated ontology; s'_pq denotes

wordNumInGlossOfs_p denotes the number of words in the WordNet gloss of s_p,

wordNumInGlossOfs'_pq denotes the number of words in the WordNet gloss of s'_pq,

denotes

Then, by

the weight of each concept sense is calculated;

where d is the weighted keyword set of the text document and o is the weighted word sense set of the ontology;

EMD (d, o) is the semantic similarity value between the text document and the ontology calculated with the Earth Mover's Distance method; the value range of the given threshold δ is [0.5, 0.6];