CN102750315A - Rapid discovering method of conceptual relations based on sovereignty iterative search - Google Patents

Rapid discovering method of conceptual relations based on sovereignty iterative search

Info

Publication number
CN102750315A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201210125040XA
Other languages
Chinese (zh)
Other versions
CN102750315B (en)
Inventor
张辉
陈勇
胡红萍
马永星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN201210125040.XA
Publication of CN102750315A
Application granted
Publication of CN102750315B
Legal status: Expired - Fee Related
Anticipated expiration

Abstract

The invention discloses a rapid method for discovering conceptual relations based on sovereignty iterative search. The method uses Boolean search to eliminate concept pairs that share no nonzero elements, or only very few, and thereby shrinks the candidate concept set by orders of magnitude; it then computes concept relevance in the vector space by enumeration within the reduced set and obtains the most relevant concepts by sorting. The method combines the strengths of conceptual relation discovery under the Boolean model and of enumeration-based relation discovery under the vector space model while avoiding their weaknesses: its time efficiency is close to that of the former, and its precision and recall are close to those of the latter.

Description

Rapid discovering method of conceptual relations based on sovereignty iterative search
Technical field
The present invention relates to a rapid method for discovering conceptual relations, and in particular to a rapid method for discovering conceptual relations based on sovereignty iterative search. It belongs to the technical field of semantic networks.
Background art
In the world of natural language, a concept is an abstract description of an objective entity, that is, the set of the entity's attributes and characteristics. Because objective entities interact, countless ties also arise between concepts; we call these ties conceptual relations. Concepts and conceptual relations together form the foundation of the natural-language world: if that world is viewed as a semantic network, concepts are the carriers of semantics and conceptual relations are the ties between those carriers. Studying conceptual relations reveals the content and nature of the associations between entities in the objective world, which in turn serves human work and life.
In today's information society, the Internet is without doubt the largest carrier of data. Hypertext information connected by hyperlinks keeps growing and has formed an information network world that has thoroughly changed how people work and live. However, precisely because information is growing explosively, managing and using it is becoming ever more complex and difficult. To meet the demands of semantic reasoning and intelligent services, the next-generation information Internet represented by the Semantic Web attempts to build connections between arbitrarily small pieces of data, and conceptual relations are exactly the basis for building a semantic network. Conceptual relation extraction is therefore a foundation of the second information revolution.
Search engines and text mining are core technologies of portal platform systems, and computing the "semantic relevance" of concepts and texts is the key foundation of both. Under a purely statistical model, which lacks the support of knowledge and intelligence, "similarity" can only stand in for "semantic relevance" as the basis for solving a series of complex technical problems in search engines and text mining.
However, with the explosive growth of information, increasingly complex information structures, and the socialization of the Internet, people demand stronger search intelligence and more personalized services from search engines and text mining. The problem of computing semantic relevance has thus once again been placed before search engine and text mining technology. "Similarity" computation cannot meet this demand in either accuracy or physical meaning, and "semantic relevance" computation has become a fundamental problem that the era of intelligent search must solve.
At present, both the Boolean model and the vector space model are widely and effectively applied in text classification. Boolean search can discover conceptual relations quickly, but neither accuracy nor recall can be guaranteed, and the number of feature elements used to construct a Boolean query is hard to determine: too few query features lower search efficiency and return too many results, while too many query features lower recall. Extended Boolean search can assign weights to the feature elements matched by a query, which improves recall to some extent but does not solve the problem of constructing the query logic. The vector space model also has shortcomings, chiefly that the dimensionality of the vector space is often very high, which makes computation expensive and slows the system down. In addition, determining the feature weights in a vector is another difficult issue.
Summary of the invention
In view of the deficiencies of the prior art, the technical problem to be solved by the present invention is to provide a rapid method for discovering conceptual relations. The method guarantees search efficiency as well as accuracy and recall.
To achieve the above object, the present invention adopts the following technical scheme:
A rapid method for discovering conceptual relations based on sovereignty iterative search comprises:
Using Boolean search to eliminate concept pairs that share no nonzero elements or only very few nonzero elements;
Using enumeration to compute concept relevance in the vector space;
Sorting to obtain the related concepts.
Further, said Boolean search comprises:
Converting the semantic feature vectors of concepts into Boolean expressions and building a feature vector forward index and a feature inverted index; constructing a logical query from the semantic features of the target concept; and searching the set of logical expressions to obtain the related concept set of the target concept.
Further, said step of using enumeration to compute concept relevance in the vector space comprises:
a. the searcher retrieves the feature vector through the feature vector forward index;
b. sovereign features are obtained from the retrieved feature vector;
c. the searcher performs an association search, selects candidate related concepts, and puts them into the candidate related concept set;
d. the candidate concept selection strategy obtains candidate concepts;
e. the candidate concept set saturation control strategy is applied once the inverted posting list of the associated feature has been searched and candidate concepts have been selected;
f. whether the candidate concept set is saturated is judged; if it is not saturated, return to b and repeat b to f until the candidate concept set is saturated;
g. if the candidate concept set is saturated, the iterative search stops, and the relevance between the target concept and each candidate related concept is computed exactly.
Further, said candidate concept selection strategy comprises:
a. the searcher uses the associated feature as the query key and executes the query in the feature inverted index to obtain a posting list of candidate concepts;
b. a sliding window of size M is placed at the head of the posting list; if the list has no more than M elements, all of them are selected as candidate concepts;
c. if the list has more than M elements, the monitor checks the monitoring condition, that is, it compares the weight drop from the head to the tail of the window with the amplitude threshold;
d. if the head-to-tail weight drop is less than or equal to the amplitude threshold, the window moves by one position and the procedure returns to c, repeating c to d until the head-to-tail weight drop is greater than the amplitude threshold;
e. if the head-to-tail weight drop is greater than the amplitude threshold, the elements after the window are cut off, and the remaining elements are selected as candidate concepts.
Further, said candidate concept set saturation control strategy comprises:
a. the candidate concept is looked up; if it already exists in the candidate related concept set, its "pseudo-relevance" is updated incrementally with the associated feature; if it is a newly added concept, its "pseudo-relevance" is computed from the associated feature;
b. the order of the candidate concept list is readjusted;
c. the sliding window is placed at the head of the list and the number of elements is compared with the window size; if the number of elements is not greater than the window size, the candidate concept selection strategy is executed again; if the number of elements is not less than the window size, the "pseudo-relevance" drop between the head and tail candidate concepts is compared with the amplitude threshold;
d. if the "pseudo-relevance" drop is less than the amplitude threshold, the window is moved toward the tail of the list; if the tail of the list is reached, the candidate concept selection strategy is executed again; otherwise the procedure returns to c, repeating c to d until the "pseudo-relevance" drop is not less than the amplitude threshold;
e. if the "pseudo-relevance" drop is not less than the amplitude threshold, the candidate related concept set is saturated and the iterative search stops.
Further, said building of the feature vector forward index comprises:
Numbering all concepts, sorting the semantic features of every concept in descending order of weight, and normalizing each feature vector to a unit vector;
Building the feature vector forward index from the key-value relation "concept number - feature vector".
Further, said building of the feature inverted index comprises:
Treating each unit feature vector as a document and each vector feature as an inverted-index term, with the concept numbers in each posting list sorted in descending order of feature weight, to build the feature inverted index.
The present invention builds on the conceptual relation discovery method under the Boolean model and the enumeration-based relation discovery method under the vector space model. It merges the advantages of both while avoiding their weaknesses, providing a fast method whose time efficiency is close to that of the former and whose accuracy and recall are close to those of the latter.
Description of drawings
The present invention is described in further detail below with reference to the drawings and embodiments.
Fig. 1 is a schematic diagram of feature vector unit normalization and the index structure;
Fig. 2 is a schematic diagram of the feature inverted index construction and the index structure;
Fig. 3 is a schematic diagram of the operation flow of a single iterative search;
Fig. 4 is a schematic flow diagram of relation discovery based on sovereignty iterative search;
Fig. 5 is a schematic flow diagram of the execution of the candidate concept selection strategy;
Fig. 6 is a schematic flow diagram of the execution of the candidate concept set saturation control strategy;
Fig. 7 is a schematic diagram of the overall flow of the comparison experiment on recall and time efficiency;
Fig. 8 is a schematic diagram of the average number of concepts selected in a single search;
Fig. 9 is a schematic diagram of the average recall under regulation of the saturation control strategy parameters;
Fig. 10 is a schematic diagram of the average recall under regulation of the saturation control strategy parameters;
Fig. 11 is a schematic diagram comparing time efficiency and recall.
Embodiment
Conceptual relation discovery means finding pairs of concepts whose semantic relation is close; it is the key step of conceptual relation extraction. Conceptual relation discovery generally searches only for the Top-M concepts most closely related to a given concept. M is generally a small natural number, within two digits; too large an M makes the notion of a related concept meaningless.
For one target concept, finding the Top-M (M < 100) related concepts among 3,000,000 candidate concepts gives an effective ratio of only about 1/10000. If enumeration were used to find the related concepts, 99.99% of the computing resources would be spent on computing relevance with unrelated concepts, which is extremely wasteful. The average number of elements in a concept's semantic feature vector is only about 300; relative to a concept semantic space of 3,000,000 dimensions, a feature vector has only 1/10000 nonzero elements and is extremely sparse. Therefore most concept pairs share no nonzero elements, and even among the pairs that do, most share only a very small number.
The present invention uses Boolean search to eliminate concept pairs that share no nonzero elements or only very few, shrinking the candidate concept set by orders of magnitude. It then uses enumeration within the reduced concept set to compute concept relevance in the vector space, and sorts the results to obtain the Top-M related concepts. In this way both search efficiency and accuracy and recall can be guaranteed.
The Boolean model represents a text by a Boolean expression. It is widely used in traditional information retrieval, where a text is retrieved by logically comparing it with the retrieval expression supplied by the user; it is a form of keyword-based matching. In the standard Boolean model, a text is represented as in formula (1):
$d_i = (w_{i1}, w_{i2}, \ldots, w_{ik}, \ldots, w_{in}), \quad k = 1, 2, \ldots, n \qquad (1)$
Here n is the number of feature items and $w_{ik}$ is 0 or 1, indicating whether the k-th feature item occurs in text i. The Boolean model is easy to implement, but its accuracy and recall in text classification are poor.
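As an illustration of formula (1), the following minimal Python sketch binarizes weighted feature dictionaries over a fixed vocabulary and counts the features two texts share. The function names, the vocabulary, and the toy weights are hypothetical and are not taken from the patent.

```python
def boolean_vector(feature_weights, vocabulary):
    """Map a {feature: weight} dict to a 0/1 vector over a fixed vocabulary."""
    return [1 if feature_weights.get(term, 0.0) > 0 else 0 for term in vocabulary]


def shared_features(bool_a, bool_b):
    """Count the positions where both Boolean vectors are 1 (logical AND)."""
    return sum(a & b for a, b in zip(bool_a, bool_b))


# Hypothetical toy data, chosen only for illustration.
vocabulary = ["river", "bank", "money", "loan"]
d1 = boolean_vector({"river": 0.7, "bank": 0.3}, vocabulary)
d2 = boolean_vector({"bank": 0.5, "money": 0.4, "loan": 0.1}, vocabulary)
print(d1, d2, shared_features(d1, d2))
```

The weights 0.7 and 0.3 collapse to the same value 1, which is exactly the loss of information that the description attributes to the Boolean model.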
If sharing one or several particular features were a sufficient condition for two concepts to have a close semantic relation, conceptual relation discovery could be converted into a search process under the Boolean model: convert the semantic feature vectors of concepts into Boolean expressions, build an inverted index, construct a logical query from the semantic features of the target concept, and search the set of logical expressions to obtain the related concept set of the target concept.
The Boolean model discretizes continuous values into 0 or 1 and only distinguishes presence from absence. This one-size-fits-all approach is not very accurate, but it is highly efficient. In a high-dimensional semantic space, however, the semantic nature of a concept is determined jointly by all semantic features in its feature vector, and the decisive power of those features differs greatly. A search method based on the Boolean model therefore cannot be used directly for relation discovery among massive numbers of concepts.
Compared with pure Boolean search, computing relevance in the vector space by enumeration after the candidate concept set has been reduced improves accuracy. To guarantee good recall, however, as many related concepts as possible must be kept in the reduced candidate set, while for efficiency the candidate set must not be too large. How to use Boolean logic to shrink the candidate concept set as much as possible while losing as few related concepts as possible is therefore the core problem of this method.
The vector space model was proposed in 1969 by Gerard Salton and McGill. Because the idea is simple, it is widely used in information retrieval (IR), and its use in automatic text classification was largely introduced from IR. The basic idea of the vector space model is to treat each term as one dimension of the feature space coordinate system, to regard a text as a vector in the feature space, and to measure the similarity between two texts by the angle between their vectors. In the vector space model, each text is represented as a point in the vector space spanned by a set of normalized orthogonal term vectors, that is, it is formalized as a vector in n-dimensional space, so a text can be abstracted as in formula (2):
$V(d_j) = ((t_1, w_{1j}), \ldots, (t_i, w_{ij}), \ldots, (t_n, w_{nj})), \quad i = 1, 2, \ldots, n \qquad (2)$
Here $t_i$ is feature item i and $w_{ij}$ is the weight of $t_i$ in text $d_j$. For a set of training texts we thus obtain a vector space W, as shown in Table 1:
Table 1. Text representation in the vector space
         d_1     d_2     ...    d_m
t_1      w_11    w_12    ...    w_1m
t_2      w_21    w_22    ...    w_2m
...      ...     ...     ...    ...
t_n      w_n1    w_n2    ...    w_nm
W is usually a sparse matrix. Both training texts and texts to be classified can be represented in the vector space model in the same way. The closer the vector of a text to be classified is to the vector of a training text, the greater their similarity, and the more likely the two texts belong to the same class.
The vector space model is currently the most widely used text representation model and has the following advantages:
(1) the content of a text is formalized as a point in a multidimensional space and given in vector form, so the text is defined quantitatively over the real numbers, which improves the computability and operability of natural-language text;
(2) weights are computed for words, and adjusting the weight of a word reflects how relevant the word is to the text it belongs to, which overcomes a defect of the traditional Boolean model;
(3) by computing the similarity between texts, texts with similar attributes are grouped together as far as possible, which improves matching efficiency.
Under the vector space model, related concepts are computed through the cosine of the angle between vectors, which differs in essence from the Boolean model. In the vector space model the dimensions are discrete with respect to one another, but each dimension is continuous internally. It is exactly this combination of discrete and continuous that lets the vector space model realize the joint action of features: it does not rely absolutely on any one feature, yet it discards none.
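The cosine measure described above can be sketched for sparse vectors as follows. This is a minimal illustration that assumes each vector is stored as a Python dict from feature to weight; the storage format, function name, and toy values are assumptions, not part of the patent.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse vectors given as {feature: weight}."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors.
a = {"pet": 3.0, "fur": 4.0}
b = {"pet": 6.0, "fur": 8.0}      # same direction as a
c = {"wheel": 1.0, "engine": 2.0}  # no shared feature with a
print(cosine(a, b), cosine(a, c))
```

Only features present in both vectors contribute to the dot product, which is why concept pairs without shared nonzero elements can be skipped without loss.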
For each concept in the corpus, a semantic feature vector is built with the explanatory-semantics construction method, weighted with TF-ODF features, and the weights are adjusted according to Chinese pragmatics; after feature dimensionality reduction, this yields the explanatory semantic feature vector representation of the concept.
If the average vector length is n, the cosine of the angle can be computed in $O(n)$ time once the features are sorted; sorting with quicksort takes $O(n \log n)$, so the overall time complexity is $O(n \log n)$.
After the concept semantic feature vectors have been constructed from explanation relations and the semantic features have been weighted, computing the relevance of every pair of concepts and sorting the results gives the Top-M related concepts of any concept. Suppose the open encyclopedia corpus contains N concepts in total; the pairwise computation then has time complexity $O(N^2)$. For each concept, a heap is used to keep the M currently largest relevance values and the corresponding concepts; once the relevance of all concept pairs has been computed, the Top-M related concepts of every concept are available, and each heap adjustment costs $\log M$. Since the average complexity of one concept relevance computation is $O(n \log n)$, the time complexity of Top-M related concept discovery by pairwise relevance computation is $O(N^2 \cdot n \log n \cdot \log M)$. Here n is the average number of features per vector, about 300 after dimensionality reduction, and N is the total number of concepts in the open encyclopedia corpus, about 3,000,000. By a rough estimate, a single computer performing $10^9$ operations per second would need about 3 years to finish, which is clearly unacceptable.
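The heap-based Top-M bookkeeping and the rough cost estimate can be sketched as follows. The function names are hypothetical, and the estimate depends on assumptions the text does not state: base-2 logarithms, M = 10, and a constant factor of 1. Under those assumptions the result is of the same order as the "about 3 years" quoted above.

```python
import heapq
import math


def top_m(scored, m):
    """Keep the m highest-scoring (concept, score) pairs using a size-m min-heap."""
    heap = []  # heap[0] is always the smallest score currently kept
    for concept, score in scored:
        if len(heap) < m:
            heapq.heappush(heap, (score, concept))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, concept))  # one O(log m) adjustment
    return [(concept, score) for score, concept in sorted(heap, reverse=True)]


def brute_force_years(N=3_000_000, n=300, M=10, ops_per_second=1e9):
    """Rough wall-clock estimate for N^2 * n*log(n) * log(M) operations."""
    ops = N ** 2 * n * math.log2(n) * math.log2(M)
    return ops / ops_per_second / (365 * 24 * 3600)


print(top_m([("a", 0.1), ("b", 0.9), ("c", 0.5)], 2))
print(round(brute_force_years(), 2), "years under the stated assumptions")
```

Changing M or the logarithm base shifts the estimate by a small constant factor but not its order of magnitude, which is the point of the argument.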
In the vector space model, the value associated with each feature dimension is called the feature weight, or weight for short. Sovereign features are the features that play the leading role in a semantic feature vector. Statistics show that in 80% of the feature vectors, 80% of the total weight is concentrated in the first 20% of the features. Iterative search means that the candidate concept set is continually adjusted and maintained during the search until a termination condition is met, at which point the candidate concept set is fixed.
The rapid conceptual relation discovery method based on sovereignty iterative search combines the Boolean model with the vector space model so that each offsets the other's weaknesses. The efficiency of the Boolean model is used for a coarse computation that limits the candidate related concepts to a small order of magnitude, and the accuracy of the vector space model is then used for a fine computation that finally yields the Top-M related concepts.
As shown in Fig. 1, all concepts are first numbered, and the numbers are stored in place of the concept words to save space. The semantic features of every concept are then sorted in descending order of weight, and each feature vector is normalized to a unit vector. A unit vector has norm 1, which makes it easy to compute the cosine of the angle between vectors and to compare the weight of the same feature across different vectors. In addition, an index is built from the key-value relation "concept number - feature vector" so that the feature vector of any concept number can be fetched quickly by random access. To distinguish it from the feature inverted index built later, we call this index the feature vector forward index.
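The forward index just described can be sketched as below. This is a minimal illustration under assumed data structures: concept names are mapped to integer numbers, and each feature vector is stored as a list of (feature, weight) pairs sorted by descending weight. The names build_forward_index and concept_ids are hypothetical.

```python
import math


def build_forward_index(concept_features):
    """Number the concepts, unit-normalize each vector, sort features by weight."""
    concept_ids = {}  # concept word -> concept number
    forward = {}      # concept number -> [(feature, unit weight), ...] descending
    for concept_id, (name, feats) in enumerate(concept_features.items()):
        norm = math.sqrt(sum(w * w for w in feats.values()))
        if norm == 0.0:
            continue  # a concept without features cannot be normalized
        unit = [(f, w / norm) for f, w in feats.items()]
        unit.sort(key=lambda fw: fw[1], reverse=True)
        concept_ids[name] = concept_id
        forward[concept_id] = unit
    return concept_ids, forward


# Hypothetical toy input.
ids, forward = build_forward_index({"cat": {"fur": 3.0, "pet": 4.0}})
print(ids, forward)
```

Because every stored vector has norm 1, the cosine of two concepts reduces to a plain dot product over their shared features.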
Next, each unit feature vector is treated as a document and each vector feature as an inverted-index term, and an inverted index is built in which the concept numbers in each posting list are sorted in descending order of feature weight. As shown in Fig. 2, this structure gives fast random access to the set of concepts associated with any semantic feature, and because the concepts are ordered by weight, associated concepts can be accessed sequentially from strong to weak.
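A minimal sketch of such a feature inverted index follows, assuming the forward index is a dict from concept number to a list of (feature, unit weight) pairs. The function name and the toy data are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(forward):
    """Map each feature to its posting list [(concept number, weight), ...],
    sorted by descending weight."""
    inverted = defaultdict(list)
    for concept_id, vector in forward.items():
        for feature, weight in vector:
            inverted[feature].append((concept_id, weight))
    for postings in inverted.values():
        postings.sort(key=lambda cw: cw[1], reverse=True)
    return dict(inverted)


# Hypothetical forward index with already normalized weights.
forward = {
    0: [("x", 0.8), ("y", 0.6)],
    1: [("z", 0.8), ("x", 0.6)],
    2: [("x", 1.0)],
}
inverted = build_inverted_index(forward)
print(inverted["x"])
```

Reading a posting list from the head therefore visits the concepts for which the feature matters most first.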
The feature vector forward index and the feature inverted index are the structural foundation and the efficiency guarantee of the subsequent search iterations. Once the indexes have been built, conceptual relation discovery can proceed according to the search iteration method. For ease of description, we call the concept whose related concepts are to be computed the target concept.
The search iteration method comprises two parts, the "searcher" and the "candidate related concept set". The candidate related concept set consists of candidate related concepts and is initially empty. The searcher takes the high-weight features of the target concept as associated features, performs association searches, selects candidate related concepts, and puts them into the candidate related concept set. A candidate concept shares associated features with the target concept; the relevance between a candidate concept and the target concept computed from these dynamically changing associated features is called the "pseudo-relevance". It is called "pseudo-relevance" because it is computed only from the features matched so far between the target concept and the candidate concept. However, because the associated features are the dominant-weight features of the target concept, and because the candidate concepts are taken from the head of each associated feature's posting list, the "pseudo-relevance" makes up the major part of the "true relevance"; as the associated features approach the full feature set of the target concept, the "pseudo-relevance" equals the "true relevance". Because all feature vectors are unit-normalized, with norm 1, the "pseudo-relevance" can be computed without considering vector norms, so it can be computed incrementally, which improves efficiency. The framework of the iterative search flow is shown in Fig. 3, in which the first two sovereign-feature searches are shown combined. Fig. 4 is a schematic flow diagram of relation discovery based on sovereignty iterative search.
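The incremental computation of the "pseudo-relevance" can be sketched as below. The sketch assumes unit-normalized vectors, so that each associated feature simply adds the product of the target's weight and the candidate's weight to a running sum. The function name and the toy posting lists are hypothetical.

```python
def update_pseudo_relevance(pseudo, target_weight, postings):
    """Add one associated feature's contribution to each candidate's running score.

    pseudo: {concept number: pseudo-relevance so far}
    target_weight: weight of the associated feature in the target concept
    postings: [(concept number, weight of the same feature), ...]
    """
    for concept_id, weight in postings:
        pseudo[concept_id] = pseudo.get(concept_id, 0.0) + target_weight * weight
    return pseudo


# Hypothetical target concept with unit vector {"x": 0.8, "y": 0.6}.
pseudo = {}
update_pseudo_relevance(pseudo, 0.8, [(2, 1.0), (1, 0.6)])  # feature "x"
update_pseudo_relevance(pseudo, 0.6, [(1, 0.8)])            # feature "y"
print(pseudo)
```

Once every feature of the target has been processed for a candidate, the running sum is the full dot product, which for unit vectors is the cosine.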
The iterative search process has two key strategy steps. The first is deciding how many candidate concepts to select once the inverted posting list of an associated feature has been searched; we call this the "candidate concept selection strategy". The second is deciding when the candidate related concept set is saturated and the association search should stop; we call this the "candidate concept set saturation control strategy". Both strategy steps use a sliding window with a monitor. Through the monitor, the window guarantees to a certain extent that the pruned elements will, with high probability, not appear in the target result set. As with feature dimensionality reduction, the parameters of the monitor are determined by experiment.
The sliding window with a monitor requires a window size winLen and an amplitude threshold $\delta$. Because we want the Top-M related concepts, the window size of the candidate concept selection strategy is set to $winLen_1 = M$, and its amplitude threshold $\delta_1$ is determined by experiment. The window size of the candidate concept set saturation control strategy is denoted $winLen_2$, and its amplitude threshold is denoted $\delta_2$.
As shown in Fig. 5, the candidate concept selection strategy is executed as follows:
Step 401: the searcher uses the associated feature as the query key and executes the query in the feature inverted index to obtain a posting list of candidate concepts; go to step 402.
Step 402: a sliding window of size M is placed at the head of the posting list. If the list has no more than M elements, all of them are selected as candidate concepts; if it has more than M elements, go to step 404.
Step 404: the monitor checks the monitoring condition, that is, it compares the weight drop from the head to the tail of the window with the amplitude threshold $\delta_1$; go to step 405.
Step 405: if the head-to-tail weight drop is less than or equal to the amplitude threshold, go to step 407; if it is greater than the amplitude threshold, go to step 406.
Step 406: the elements after the window are cut off, and the remaining elements are selected as candidate concepts.
Step 407: the sliding window moves by one position; return to step 404.
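Steps 401 to 407 can be sketched as the following function. The text does not say whether the "weight drop" is absolute or relative, nor what happens if the window reaches the end of the posting list without exceeding the threshold. This sketch assumes an absolute difference and keeps the whole list in that case; both choices, and the function name, are assumptions.

```python
def select_candidates(postings, window, threshold):
    """Sliding-window pruning of a posting list sorted by descending weight.

    postings: [(concept number, weight), ...]
    window: window size (M in the text)
    threshold: amplitude threshold (delta_1 in the text)
    """
    if len(postings) <= window:
        return list(postings)                # step 402: keep everything
    start = 0
    while start + window <= len(postings):
        head_weight = postings[start][1]
        tail_weight = postings[start + window - 1][1]
        if head_weight - tail_weight > threshold:
            return postings[:start + window]  # step 406: cut after the window
        start += 1                            # step 407: slide by one
    return list(postings)                     # assumption: no cut point found


# Hypothetical posting list.
postings = [(7, 0.90), (3, 0.88), (5, 0.86), (1, 0.50), (9, 0.10)]
print(select_candidates(postings, window=3, threshold=0.1))
```

In the toy list the first window (0.90 to 0.86) is flat, the second (0.88 to 0.50) shows a sharp drop, and everything after that window is pruned.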
As shown in Fig. 6, the candidate concept set saturation control strategy is executed as follows:
Step 501: the candidate concept is looked up. If it already exists in the candidate related concept set, go to step 502; if it is a newly added concept, go to step 503.
Step 502: the "pseudo-relevance" is updated incrementally with the associated feature; go to step 504.
Step 503: the "pseudo-relevance" of the candidate concept is computed from the associated feature; go to step 504.
Step 504: the order of the candidate concept list is readjusted according to "pseudo-relevance"; go to step 505.
Step 505: the sliding window is placed at the head of the list; go to step 506.
Step 506: the number of elements is compared with $winLen_2$. If the number of elements is not greater than $winLen_2$, the candidate concept selection strategy is executed again by entering sub-process 4; if the number of elements is not less than $winLen_2$, go to step 507.
Step 507: the monitor compares the "pseudo-relevance" drop between the head and tail candidate concepts of the window with the amplitude threshold $\delta_2$; go to step 508.
Step 508: if the "pseudo-relevance" drop is not less than $\delta_2$, go to step 509; if it is less than $\delta_2$, go to step 510.
Step 509: the candidate related concept set is saturated, the iterative search stops, and the saturation control strategy flow ends.
Step 510: the sliding window is moved toward the tail of the list; go to step 511.
Step 511: if the tail of the list has been reached, the candidate concept selection strategy is executed again by entering sub-process 4; otherwise return to step 507.
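The saturation test of steps 504 to 511 can be sketched as a predicate over the current pseudo-relevance scores. Step 506 as written leaves the case of exactly $winLen_2$ elements ambiguous; this sketch treats it as "not yet saturated". That choice, the absolute-difference reading of the drop, and the function name are assumptions.

```python
def is_saturated(pseudo, window, threshold):
    """Return True if some window over the ranked pseudo-relevance values shows a
    head-to-tail drop of at least the threshold (delta_2 in the text)."""
    ranked = sorted(pseudo.values(), reverse=True)  # step 504: reorder
    if len(ranked) <= window:                       # step 506 (boundary assumed)
        return False
    for start in range(len(ranked) - window + 1):   # steps 505, 510, 511
        drop = ranked[start] - ranked[start + window - 1]
        if drop >= threshold:                       # steps 507 to 509
            return True
    return False                                    # tail reached: keep searching


# Hypothetical pseudo-relevance scores keyed by concept number.
scores = {1: 0.90, 2: 0.85, 3: 0.80, 4: 0.20}
print(is_saturated(scores, window=3, threshold=0.3))
```

A False result corresponds to returning to the candidate concept selection strategy with the next associated feature; a True result ends the iterative search.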
Once the candidate concept set reaches saturation, the iterative search stops and the candidate related-concept set is fixed; its size is far smaller than the set of all concepts in the open encyclopedia. The relevance between the target concept and each candidate related concept is then computed exactly, and sorting yields the Top-M related concepts. The overall data flow of the concept-relation discovery method based on sovereign-feature iterative search is shown in Fig. 6; the specific steps are as follows:
Step 601: The target concept search begins; proceed to step 603.
Step 602: Number all concepts and build the feature-vector forward index.
Step 603: The searcher looks up the feature vector in the feature-vector forward index and proceeds to step 604.
Step 604: Obtain the sovereign features from the retrieved feature vector and proceed to step 606.
Step 605: Using the vector features as index terms, build the feature inverted index.
Step 606: The searcher uses the high-weight features of the target concept as association features to perform an association search, selects candidate related concepts, and puts them into the candidate related-concept set.
Step 4: The candidate concept selection sub-flow obtains candidate concepts; proceed to step 5.
Step 5: The candidate-set saturation control sub-flow, after the list of the association feature has been searched, selects candidate concepts and proceeds to step 607.
Step 607: Determine whether the candidate concept set has reached saturation. If it has, the iterative search stops; proceed to step 608. If it has not, return to step 603.
Step 608: Compute exactly the relevance between the target concept and each candidate related concept; sorting yields the Top-M related concepts. Proceed to step 609.
Step 609: The flow of the concept-relation discovery method based on sovereign-feature iterative search ends.
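The two index-building steps (602 and 605) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the implementation of the invention: the function names, the use of dictionaries and lists as index structures, and the input layout (concept number mapped to raw feature weights) are hypothetical.

```python
import math
from collections import defaultdict


def build_forward_index(raw_vectors):
    """Forward index: concept number -> unit feature vector, features by descending weight.

    raw_vectors: dict mapping concept number -> {feature: weight} (assumed layout).
    """
    forward = {}
    for concept_id, features in raw_vectors.items():
        norm = math.sqrt(sum(w * w for w in features.values()))
        if norm == 0:
            continue  # a concept without features cannot be normalized
        unit = [(f, w / norm) for f, w in features.items()]
        unit.sort(key=lambda fw: fw[1], reverse=True)
        forward[concept_id] = unit
    return forward


def build_inverted_index(forward):
    """Inverted index: feature -> list of (concept number, weight), by descending weight."""
    inverted = defaultdict(list)
    for concept_id, features in forward.items():
        for feature, weight in features:
            inverted[feature].append((concept_id, weight))
    for feature in inverted:
        inverted[feature].sort(key=lambda cw: cw[1], reverse=True)
    return dict(inverted)
```

With this layout, a lookup in the forward index returns the features of a concept with the highest weights first, and a lookup in the inverted index returns the concepts in which a feature weighs most first.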
The feature-vector forward index, the vector-feature inverted index, the searcher, the candidate related-concept set, the candidate concept selection strategy, and the candidate-set saturation control strategy together make up the relation discovery system based on sovereign-feature iterative search.
On this basis, the sovereign-feature iterative search is further verified experimentally: the optimal sliding-window parameters of the candidate concept selection strategy and of the candidate-set saturation control strategy are determined, and the method is compared in efficiency and recall with "relation discovery under the Boolean model" and "exhaustive relation discovery under the vector space model".
Efficiency is the key problem that concept-relation discovery must solve, and it is the reason the sovereign-feature iterative search method was proposed; time efficiency is therefore the main metric examined in this experiment. The time needed to compute the Top-M related concepts of one concept is denoted $Time_{RelM}$, and we use its average as the time-efficiency metric of a concept-relation discovery method.
The average time of concept-relation discovery does reflect the time efficiency of a method. However, because the iterative search method consists of several steps, we add two auxiliary time-efficiency metrics, the number of search iterations and the size of the saturated candidate related-concept set, so that the distribution of time across the steps can be analyzed and the efficiency bottleneck identified. Each search iteration queries the feature-vector forward index, and the size of the saturated candidate related-concept set directly determines the cost of the subsequent exact relevance computation, so these two metrics reflect changes in the efficiency of the iterative search well.
We take the Top-M related concepts obtained by exhaustive relevance computation under the vector space model as the reference standard and compute the recall of relation discovery against it. The Top-M related-concept set from the exhaustive computation is denoted $\Phi_{std}$, the related-concept set under evaluation is denoted $\Phi_{est}$, and $card(\Phi)$ denotes the number of elements in a countable set $\Phi$. The recall is then computed as follows:
$$recall(\Phi_{est} \mid \Phi_{std}) = \frac{card(\Phi_{std} \cap \Phi_{est})}{card(\Phi_{std})} \times 100\% \qquad (3)$$
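Formula (3) translates directly into code. The following Python sketch is for illustration only; the function name `recall` and the use of Python sets for the two concept sets are assumptions.

```python
def recall(estimated, standard):
    """Recall of the evaluated set against the standard set, in percent (formula 3)."""
    standard = set(standard)
    if not standard:
        raise ValueError("the standard set must not be empty")
    return len(standard & set(estimated)) / len(standard) * 100.0
```

For example, if the standard Top-4 set is {1, 2, 3, 4} and the evaluated set is {1, 2, 3, 5}, three of the four standard concepts are recovered and the recall is 75%.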
Although recall cannot reflect the difference between related concepts that are missing at different rank positions, the quality evaluation of related concepts places more weight on the quality of the result set as a whole. Recall therefore directly reflects the cost in result quality that the search-based fast relation discovery method pays for computational efficiency.
The relation discovery method based on sovereign-feature iterative search has three parameters that must be determined experimentally: the amplitude threshold $\delta_1$ of the candidate concept selection strategy, and the sliding-window size $winLen_2$ and amplitude threshold $\delta_2$ of the candidate-set saturation control strategy, where $\delta_1 \in (0,1)$, $winLen_2 \in [M, +\infty)$, and $\delta_2 \in (0,1)$. These three variables interact with one another and with the sliding-window size of the candidate concept selection strategy, $winLen_1 = M$, in a complex way. Testing all three variables at once would multiply the number of experiment groups and make the results too complex to analyze.
The amplitude threshold $\delta_1$ of the candidate concept selection strategy affects the number of concepts selected by a single search and therefore, once $winLen_2$ and $\delta_2$ are fixed, the number of search iterations. If $\delta_1$ is too large, a single search selects too many concepts, which may reduce the number of iterations and lower recall; if $\delta_1$ is too small, a single search selects too few concepts, which may increase the number of iterations and reduce time efficiency. Repeated trials with the sliding window of the candidate concept selection strategy showed empirically that the number of concepts selected by a single search is best kept within [M, 2M]. The range of the amplitude threshold can therefore be narrowed first, using the number of concepts selected per search as feedback.
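The effect of $\delta_1$ on the number of concepts selected per search can be seen in a minimal Python sketch of the candidate concept selection strategy. The sketch rests on assumptions that the text does not state: the function name `select_candidates`, the list-of-pairs layout, the definition of the head-to-tail decline as the relative drop (head - tail) / head, and the choice to select the whole list if the window reaches its end.

```python
def select_candidates(postings, m, delta1):
    """Sketch of the candidate concept selection strategy.

    postings: list of (concept id, weight) sorted by descending weight
              (the candidate list returned for one association feature).
    m:        sliding-window size winLen1 = M.
    delta1:   amplitude threshold in (0, 1).
    """
    # No more than M elements: all of them become candidates.
    if len(postings) <= m:
        return list(postings)
    start = 0
    while start + m <= len(postings):
        head_w = postings[start][1]
        tail_w = postings[start + m - 1][1]
        # Assumed definition of the head-to-tail weight decline.
        drop = (head_w - tail_w) / head_w if head_w > 0 else 0.0
        if drop > delta1:
            # Cut everything behind the window; the rest are candidates.
            return postings[:start + m]
        start += 1  # decline still small: move the window by one position
    # Assumption: the window reached the end, so every element is selected.
    return list(postings)
```

In this sketch a larger `delta1` lets the window travel further before the cut, so more concepts are selected per search, which matches the behavior described above.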
Once the parameter of the candidate concept selection strategy is fixed, the two parameters of the saturation control strategy of the candidate related-concept set can be tested jointly. The sliding-window size is determined first from the trend of recall; several amplitude thresholds are then combined with it, and the recall and time efficiency of each combination are analyzed to determine the optimal parameter combination.
With the parameters determined, the relation discovery method based on sovereign-feature iterative search is in its best known configuration. It is then compared in recall and time efficiency with the search method under the Boolean model and the enumeration method under the vector space model, to verify the effectiveness of the method.
As shown in Fig. 7, the comparison of the sovereign-feature iterative search relation discovery method with the search method under the Boolean model and the enumeration method under the vector space model, in terms of recall and time efficiency, is divided roughly into three stages: determining the amplitude threshold of the candidate concept selection strategy, determining the saturation control parameters of the candidate related-concept set, and comparing the efficiency and recall of the relation discovery methods.
(1) Amplitude threshold experiment for the candidate concept selection strategy
A set of 2*9 experiments is designed: M takes the values 10 and 20, and $\delta_1$ takes the values 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, and 45%. Each experiment computes the average number of concepts selected by a single search over all target concepts in the target concept set; for each concept, its top M features are used for the association search. The results are shown in Table 2 below:
Table 2: Mean number of concepts selected per single search (provided as image BDA0000157388440000131 in the original; values not reproduced)
As shown in Fig. 8, the mean number of concepts selected per single search reaches twice the M value at an amplitude threshold between 25% and 30%. Because the parameters interact, a small variation in one parameter can be compensated by adjusting the others to reach the optimum. Based on this result, we take 25%, the value closer to 2M, as the amplitude threshold of the candidate concept selection strategy for the subsequent experiments.
(2) Parameter experiment for the saturation control strategy of the candidate related-concept set
The saturation control strategy parameters are an important part of the experiments of the present invention, because this strategy has a decisive influence on the recall and time efficiency of relation discovery. In practice, this experiment was completed only after many iterations, and it showed that the strategy parameters differ greatly when the number M of relations to be discovered lies in different ranges. For ease of description, the experiment is divided into two parts: the first describes, analyzes, and summarizes the experiments for M ≥ 10, and the second analyzes the experiments for 0 < M < 10.
First, the first part of the experiment. A set of 4*10*7 experiment parameter groups is designed: 4 values of M, 10 sliding-window sizes, and 7 amplitude thresholds. The amplitude thresholds are set to 20%, 30%, 40%, 50%, 60%, 70%, and 80%. The sliding-window size starts at the M value, first grows by doubling until it approaches or reaches $M^2$, and then grows linearly in equal increments. The combinations of M value and sliding window are shown in Table 3 below:
M value   Sliding-window sizes
10        10   20    40    60    80    100    150    200    250    300
20        20   40    80    160   240   320    400    500    600    700
30        30   60    120   240   480   900    1000   1100   1200   1300
50        50   100   200   400   800   1600   2500   2600   2700   2800
Table 3: M values and sliding-window experiment values
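The parameter grid described by Table 3 and the seven amplitude thresholds can be enumerated with a short Python sketch. The names `WINDOWS`, `THRESHOLDS`, and `experiment_grid` are hypothetical; the values are copied from Table 3 and from the threshold list in the text.

```python
# Sliding-window sizes per M value, copied from Table 3.
WINDOWS = {
    10: [10, 20, 40, 60, 80, 100, 150, 200, 250, 300],
    20: [20, 40, 80, 160, 240, 320, 400, 500, 600, 700],
    30: [30, 60, 120, 240, 480, 900, 1000, 1100, 1200, 1300],
    50: [50, 100, 200, 400, 800, 1600, 2500, 2600, 2700, 2800],
}
# Amplitude thresholds of the saturation control strategy.
THRESHOLDS = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]


def experiment_grid():
    """Return every (M, window size, amplitude threshold) combination."""
    return [
        (m, win, delta)
        for m, windows in WINDOWS.items()
        for win in windows
        for delta in THRESHOLDS
    ]
```

The grid contains 4 * 10 * 7 = 280 parameter groups, one saturation control strategy per group.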
Each group of experiment parameters yields a specific saturation control strategy. For each group, a relation discovery experiment is run in iterative search mode to obtain the related concepts of the target concept set, and the saturation size of the candidate concept set is recorded. At the same time, the standard related-concept set is computed by the enumeration method, and the recall of relation discovery is then computed with the recall formula.
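The enumeration method that produces the standard related-concept set can be sketched as follows. This Python sketch is illustrative only: it assumes that feature vectors are unit vectors stored as dictionaries and that relevance is their dot product (cosine similarity), which this passage does not specify; the function name `exhaustive_top_m` is hypothetical.

```python
import heapq


def exhaustive_top_m(target_id, vectors, m):
    """Score every other concept against the target and keep the Top-M.

    vectors: dict mapping concept id -> {feature: weight}, assumed to be unit vectors.
    """
    target = vectors[target_id]
    scored = []
    for concept_id, vec in vectors.items():
        if concept_id == target_id:
            continue
        # Iterate over the shorter vector and look features up in the longer one.
        small, large = (target, vec) if len(target) <= len(vec) else (vec, target)
        score = sum(w * large.get(f, 0.0) for f, w in small.items())
        scored.append((score, concept_id))
    # Concepts with the M highest scores, best first.
    return [concept_id for _, concept_id in heapq.nlargest(m, scored)]
```

Every concept is visited once per target concept, which is why this method serves as the recall standard but is too slow for practical use on a large concept base.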
We average the recorded number of concepts in the saturated related-concept set. For each of the 4 M values, with the amplitude threshold on the horizontal axis and the number of concepts on the vertical axis, we plot the number of concepts in the saturated set against the amplitude threshold for each sliding window. Because the number of concepts grows faster than linearly as the sliding window increases, we use a logarithmic vertical axis so that the curves for different sliding windows are easier to compare.
Fig. 9 shows the mean number of concepts in the saturated related set. Its four panels, from left to right and top to bottom, show the curves for M = 10, 20, 30, and 50, labeled saturated set 10, saturated set 20, saturated set 30, and saturated set 50. The panels show that when the sliding window is small, the size of the saturated set is fairly sensitive to the amplitude threshold and grows as the threshold increases. As the sliding window grows, this sensitivity weakens greatly; beyond a certain window size the size of the saturated set is essentially constant, growing slightly only at large amplitude thresholds, by a relative amount that is essentially negligible.
All four panels show cases where the saturated set for a smaller sliding window exceeds that for a larger one. This can be explained as follows: when the sliding window is very small, the amplitude monitor observes only a relatively small range; when this range is smaller than the local resolution of the concept relevance, the iterative search fails to terminate in time, and the final number of concepts in the candidate related-concept set may exceed the level reached with a larger sliding window under the same amplitude threshold. The sliding window should therefore be enlarged appropriately as M increases, to avoid an iterative search that does not terminate in time.
Next, the recall of relation discovery is averaged over the target concept set, and line charts of average recall are drawn for the 4 M values under the different combinations of sliding window and amplitude threshold. The horizontal axis is the sliding-window size, the vertical axis is recall, and each amplitude threshold corresponds to one colored curve. The four panels of Fig. 10, from left to right and top to bottom, show the recall curves for M = 10, 20, 30, and 50, labeled recall 10, recall 20, recall 30, and recall 50.
The four panels of Fig. 10 show that the trend of recall with increasing sliding window is similar for every amplitude threshold. With the amplitude threshold fixed, recall rises as the sliding window grows, but the rate of increase slows from fast to very slow, and recall levels off after entering the interval (0.95, 1). With the sliding-window size fixed, recall rises somewhat as the amplitude threshold increases, but it cannot reach a satisfactory value when the sliding window is small.
Analyzing and summarizing the recall results gives the following conclusions:
1) With the sliding window fixed, recall can be improved by increasing the amplitude threshold, but the adjusting power of the threshold is limited: if the sliding window is too small, adjusting the amplitude threshold alone cannot yield satisfactory recall.
2) With the amplitude threshold fixed, recall can be improved by increasing the sliding window, but the window need not simply keep growing: once it reaches a certain size, recall plateaus and shows no further significant improvement.
3) Once the sliding window reaches a certain size, the amplitude threshold shows no significant effect in the experiments. Recall may drop as the amplitude threshold approaches zero, but this case has no practical significance.
4) Recall no longer changes significantly once it exceeds 95%. A sufficiently large sliding window brings recall to 95% in every case, with only a very weak dependence on the amplitude threshold.
Increasing the amplitude threshold does not guarantee that recall will rise, and on a base of 3,000,000, increasing the sliding-window size has a large impact on computational efficiency; moreover, the recall gained does not guarantee an increase in the correlation coefficient. We therefore choose a smaller sliding window subject to reaching 95% recall.
The dark line in the figure marks, for each M value, the boundary between the region of rapid recall growth and the stable region; the experiments show that different amplitude thresholds have very little effect on this boundary. Because determining a sliding-window size is very important for the saturation control strategy of the candidate related-concept set, we try to fit the relation between the M value and the boundary. The recorded relation between the M value and the boundary window is shown in Table 4 below:
M value                    10    20    30     50
Boundary sliding window    96    385   1017   2530
Table 4: Correspondence between M value and boundary sliding window
We use a common fitting method, polynomial fitting, to process the relation data between the M value and the boundary sliding window; simplifying the fitting result gives the following formula:
$$winLen_2 = M^2 \qquad (4)$$
Although there are only four pairs of reference data for the fit, the formula agrees with them closely. For the engineering needs of this method, establishing the relation matters more than its precision, and this sliding-window value is sufficient for relation discovery.
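How closely formula (4) matches the four measured boundaries of Table 4 can be checked with a few lines of Python. The sketch is illustrative; the names `BOUNDARY` and `relative_errors` are hypothetical, and the data are copied from Table 4.

```python
# Boundary sliding window measured for each M value (Table 4).
BOUNDARY = {10: 96, 20: 385, 30: 1017, 50: 2530}


def relative_errors():
    """Relative deviation of the simplified fit M**2 from each measured boundary."""
    return {m: (m * m - window) / window for m, window in BOUNDARY.items()}
```

The deviations are about +4.2% (M = 10), +3.9% (M = 20), -11.5% (M = 30), and -1.2% (M = 50), so the simplified formula stays within roughly 12% of every measured boundary.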
To compare the time efficiency and recall of the three relation discovery methods (search-based relation discovery under the Boolean model, enumeration-based relation discovery under the vector space model, and relation discovery based on sovereign-feature iterative search), we designed a set of 3*10 experiments. In each experiment, the Top-M related concepts are computed for all concepts in the target concept set, the average recall and time are recorded, and curves are plotted.
As shown in Fig. 11, the left panel compares the efficiency of the three relation discovery methods. The horizontal axis is the M value from 5 to 50, and the vertical axis is the time taken for relation discovery. Because the efficiency of the three methods differs too widely for a linear scale, the vertical axis uses a base-10 logarithmic scale ranging from 0.0001 seconds to 10000 seconds.
The right panel compares the recall of the three methods; again the horizontal axis is the M value from 5 to 50, and the vertical axis is recall in percent.
The efficiency curves show the ordering of time efficiency: Boolean search is more efficient than sovereign-feature iterative search, which in turn is more efficient than vector enumeration. Boolean search and sovereign-feature iterative search are of the same order of magnitude, while vector enumeration is 5 to 6 orders of magnitude less efficient, that is, its average time is hundreds of thousands of times longer. The average time of all three methods grows as M increases. Because the vertical axis is logarithmic, the reference line y = x is drawn in the figure. Comparing the slopes of the three efficiency curves with the reference line shows that the enumeration time grows essentially linearly with M, whereas the time of Boolean search and of sovereign-feature iterative search grows faster than linearly, most noticeably at small M.
The recall curves show the ordering of recall: vector enumeration is higher than sovereign-feature iterative search, which is higher than Boolean search. Because vector enumeration is the reference standard for recall, its recall is 100%. Although Boolean search is very efficient, its recall is unsatisfactory: the overall average is below 50%, and at larger M it cannot even reach 30%. By contrast, the average recall of sovereign-feature iterative search exceeds 95% and is insensitive to M.
Overall, sovereign-feature iterative search is slightly less efficient than Boolean search but of the same order of magnitude, while its recall is far higher than that of Boolean search and close to 100%.
The fast concept-relation discovery method based on sovereign-feature iterative search provided by the present invention has been described in detail above. For those of ordinary skill in the art, any obvious modification made to it without departing from the substance of the present invention will constitute an infringement of the patent right of the present invention and will carry the corresponding legal liability.

Claims (6)

1. A fast concept-relation discovery method based on sovereign-feature iterative search, characterized by comprising the following steps:
Using Boolean search to eliminate concept pairs that share no nonzero elements or share only a few nonzero elements, wherein the Boolean search comprises: converting the semantic feature vectors of concepts into Boolean expressions; building a feature-vector forward index and a feature inverted index; and forming a logical query from the semantic features of the target concept and searching the set of logical expressions to obtain the related-concept set of the target concept;
Further using the enumeration method to compute concept relevance in the vector space, and obtaining the related concepts by sorting.
2. The fast concept-relation discovery method according to claim 1, characterized in that:
The step of using the enumeration method to compute concept relevance in the vector space comprises:
a. The searcher looks up the feature vector in the feature-vector forward index;
b. Sovereign features are obtained from the retrieved feature vector;
c. The searcher performs an association search, selects candidate related concepts, and puts them into the candidate related-concept set;
d. Candidate concepts are obtained using the candidate concept selection strategy;
e. The candidate-set saturation control strategy, after the list of the association feature has been searched, selects candidate concepts;
f. It is determined whether the candidate concept set has reached saturation; if it has not, return to b and repeat b to f until the candidate concept set reaches saturation;
g. If the candidate concept set has reached saturation, the iterative search stops, and the relevance between the target concept and the candidate related concepts is computed exactly.
3. The fast concept-relation discovery method according to claim 2, characterized in that:
The candidate concept selection strategy comprises:
3a. The searcher uses the association feature as the query key and queries the feature inverted index to obtain the candidate concept list;
3b. A sliding window of size M is placed at the head of the list; if there are no more than M elements, all of them are selected as candidate concepts, M being a natural number;
3c. If there are more than M elements, the monitor checks the monitoring condition, that is, it compares the head-to-tail weight decline with the amplitude threshold;
3d. If the head-to-tail weight decline is less than or equal to the amplitude threshold, the sliding window moves by one position and the flow returns to 3c; 3c to 3d are repeated until the head-to-tail weight decline is greater than the amplitude threshold;
3e. If the head-to-tail weight decline is greater than the amplitude threshold, the elements behind the window are cut off, and the remaining elements are selected as candidate concepts.
4. The fast concept-relation discovery method according to claim 2, characterized in that:
The candidate-set saturation control strategy comprises:
4a. The candidate concepts are searched: if a candidate concept is already in the candidate related-concept set, its pseudo-relevance is incrementally updated for the association feature; if the candidate concept is new, its pseudo-relevance is computed using the association feature;
4b. The order of the candidate concept list is readjusted;
4c. The sliding window is placed at the head of the list, and the number of elements is compared with the window size: if the number of elements is not greater than the window size, the candidate concept selection strategy is executed again; if the number of elements is not less than the window size, the pseudo-relevance decline between the head and tail candidate concepts is checked and compared with the amplitude threshold;
4d. If the pseudo-relevance decline is less than the amplitude threshold, the sliding window is moved toward the tail of the list; if the tail of the list has been reached, the candidate concept selection strategy is executed again; otherwise the flow returns to 4c, and 4c to 4d are repeated until the pseudo-relevance decline is not less than the amplitude threshold;
4e. If the pseudo-relevance decline is not less than the amplitude threshold, the candidate related-concept set has reached saturation, and the iterative search stops.
5. The fast concept-relation discovery method according to claim 1, characterized in that:
The building of the feature-vector forward index comprises:
Numbering all concepts, sorting the semantic features of every concept in descending order of weight, and normalizing each feature vector to a unit vector;
Building the feature-vector forward index as key-value pairs of "concept number - feature vector".
6. The fast concept-relation discovery method according to claim 1, characterized in that:
The building of the feature inverted index comprises:
Treating each unit feature vector as a document and the vector features as index terms, with the concept numbers in each index list sorted in descending order of feature weight, to build the feature inverted index.
CN201210125040.XA 2012-04-25 2012-04-25 Based on the conceptual relation rapid discovery method of feature iterative search with sovereign right Expired - Fee Related CN102750315B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210125040.XA CN102750315B (en) 2012-04-25 2012-04-25 Based on the conceptual relation rapid discovery method of feature iterative search with sovereign right

Publications (2)

Publication Number Publication Date
CN102750315A true CN102750315A (en) 2012-10-24
CN102750315B CN102750315B (en) 2016-03-23

Family

ID=47030502

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210125040.XA Expired - Fee Related CN102750315B (en) 2012-04-25 2012-04-25 Based on the conceptual relation rapid discovery method of feature iterative search with sovereign right

Country Status (1)

Country Link
CN (1) CN102750315B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294662A (en) * 2016-08-05 2017-01-04 华东师范大学 Inquiry based on context-aware theme represents and mixed index method for establishing model
CN106933813A (en) * 2017-02-16 2017-07-07 牡丹江师范学院 A kind of text data processing method for English Translation

Also Published As

Publication number Publication date
CN102750315B (en) 2016-03-23

Similar Documents

Publication Publication Date Title
US7401073B2 (en) Term-statistics modification for category-based search
CN104199857B (en) A kind of tax document hierarchy classification method based on multi-tag classification
CN106599054B (en) Method and system for classifying and pushing questions
CN107577785A (en) A kind of level multi-tag sorting technique suitable for law identification
CN103279478B (en) A kind of based on distributed mutual information file characteristics extracting method
CN101819601A (en) Method for automatically classifying academic documents
CN104298715B (en) A kind of more indexed results ordering by merging methods based on TF IDF
CN101944099A (en) Method for automatically classifying text documents by utilizing body
CN102622373A (en) Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm
CN103218436A (en) Similar problem retrieving method fusing user category labels and device thereof
CN112307182B (en) Question-answering system-based pseudo-correlation feedback extended query method
CN105930531A (en) Method for optimizing cloud dimensions of agricultural domain ontological knowledge on basis of hybrid models
Mohsen et al. On the automatic construction of an Arabic thesaurus
Al-Obaydy et al. Document classification using term frequency-inverse document frequency and K-means clustering
CN102750315A (en) Rapid discovering method of conceptual relations based on sovereignty iterative search
Peng et al. An integrated feature selection and classification scheme
Phadnis et al. Framework for document retrieval using latent semantic indexing
Kostkina et al. Document categorization based on usage of features reduction with synonyms clustering in weak semantic map
Mei et al. Learning probabilistic box embeddings for effective and efficient ranking
CN102651014A (en) Processing method and retrieval method for conceptual relation-based field data semantics
Cui et al. A new Chinese text clustering algorithm based on WRD and improved K-means
Meadi et al. New use of the HITS algorithm for fast web page classification
Lee et al. A neural network document classifier with linguistic feature selection
Zhao et al. The application of vector space model in the information retrieval system
CN106202405A (en) A kind of compactedness Text Extraction based on text similarity relation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160323