CN100418093C - Topic- or query-oriented multi-document summarization method based on manifold ranking - Google Patents

Topic- or query-oriented multi-document summarization method based on manifold ranking

Info

Publication number
CN100418093C
CN100418093C · CNB2006100725872A · CN200610072587A
Authority
CN
China
Prior art keywords
sentence
document
arrangement
value
inquiry
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2006100725872A
Other languages
Chinese (zh)
Other versions
CN1828609A (en)
Inventor
万小军
杨建武
吴於茜
陈晓鸥
肖建国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
BEIDA FANGZHENG TECHN INST Co Ltd BEIJING
Peking University
Peking University Founder Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIDA FANGZHENG TECHN INST Co Ltd BEIJING, Peking University, Peking University Founder Group Co Ltd filed Critical BEIDA FANGZHENG TECHN INST Co Ltd BEIJING
Priority to CNB2006100725872A priority Critical patent/CN100418093C/en
Publication of CN1828609A publication Critical patent/CN1828609A/en
Application granted granted Critical
Publication of CN100418093C publication Critical patent/CN100418093C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The present invention relates to a topic- or query-oriented multi-document summarization method based on manifold ranking, belonging to the technical field of natural language text processing. When a user searches for topics of interest, existing multi-document summarization methods cannot accurately return relevant news information and user-oriented summaries according to attributes such as user-defined interests. The method of the present invention builds on a new semi-supervised learning algorithm, the manifold-ranking algorithm, and is characterized in that it jointly considers the relationships among sentences and the relationship between the sentences and the user's topic or query; consequently, the generated summary both covers the main information of the document collection and explains the topic or answers the query, while a diversity-penalty algorithm ensures the novelty of the summary. By adopting the present invention, relevant news information can be returned according to requirements such as the user's interests, yielding a good topic- or query-oriented multi-document summary that meets the personalized needs of different users.

Description

A topic- or query-oriented multi-document summarization method based on manifold ranking
Technical field
The invention belongs to the technical field of natural language text processing, and specifically relates to a topic- or query-oriented multi-document summarization method based on manifold ranking.
Background technology
Multi-document summarization is a key problem in natural language processing and has been widely applied in recent years to text/Web retrieval and similar applications. For example, search engines such as Google and Baidu provide news services that collect news from the Web and organize it into news topics; so that users can conveniently browse the topics they are interested in, multi-document summarization is needed to generate a brief, concise summary for each news topic. Topic- or query-oriented multi-document summarization can be regarded as a special multi-document summarization task: the generated summary must reflect a topic or query (also called a user profile) specified by the user, that is, the summary should explain the focus of the user's interest or answer the information need the user raises. Among such news products, personalized news services receive more and more attention; a user usually cares only about the news topics of personal interest and, according to attributes such as user-defined interests, expects the news service to return relevant news information and a summary oriented to those attributes. Some intelligent question-answering systems likewise require the system to generate, from relevant documents, a summary that answers the user's question; such a summary is also a typical topic- or query-oriented multi-document summary.
The difficulties of topic- or query-oriented multi-document summarization are twofold. First, as with generic multi-document summarization, the information contained in different documents overlaps substantially and is redundant, so a good method must merge information across documents effectively: the generated summary should retain the main information of the documents while keeping the information in the summary reasonably novel. Second, unlike generic multi-document summarization, topic- or query-oriented summarization requires that the information in the summary be relevant to the topic or query and able to explain the topic or answer the query, so the topic or query information supplied by the user must be fully exploited during summarization. In recent years multi-document summarization has become a popular research topic in natural language processing and information retrieval, and its progress is reflected in a series of academic venues on automatic document summarization, including NTCIR, DUC, ACL, COLING and SIGIR.
In short, generic multi-document summarization methods can be divided into sentence-extraction methods (extraction) and sentence-generation methods (abstraction). Extraction-based methods are relatively simple and practical and do not require deep natural language understanding: after splitting the text into sentences, each sentence is assigned a weight reflecting its importance, and the sentences with the largest weights are selected to form the summary. Abstraction-based methods require deep natural language understanding: after syntactic and semantic analysis of the original documents, information extraction or natural language generation is used to produce new sentences that form the summary.
Most existing multi-document summarization methods are based on sentence extraction, and the literature records many such methods. The article "Centroid-based summarization of multiple documents" (D. R. Radev, H. Y. Jing, M. Stys and D. Tam, Information Processing and Management, 2004) discloses a centroid-based sentence-extraction method, currently a popular extraction approach; MEAD is a prototype summarization system implementing this method, and when weighting sentences it considers both sentence-level and cross-sentence features, including cluster centroids, sentence position and TF*IDF. The article "From Single to Multi-document Summarization: A Prototype System and its Evaluation" (C.-Y. Lin and E. H. Hovy, Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics, ACL-02, 2002) discloses a sentence-extraction system named NeATS, a multi-document summarization system developed at ISI that evolved from the single-document summarizer SUMMARIST; when selecting important sentences it considers features such as sentence position, word frequency, topic signatures and word clusters, and it uses the MMR technique to discount sentence weights. The article "Cross-document summarization by concept classification" (H. Hardy, N. Shimizu, T. Strzalkowski, L. Ting, G. B. Wise, and X. Zhang, Proceedings of SIGIR'02) discloses a sentence-extraction system named XdoX suited to generating summaries for large document sets; it first detects the most important themes in the document set by paragraph clustering and then extracts sentences reflecting those themes to form the summary. The article "Topic themes for multi-document summarization" (S. Harabagiu and F. Lacatusu, Proceedings of SIGIR'05, 2005) discloses the method of Harabagiu and Lacatusu, which examines five different ways of representing themes across documents and proposes a new theme representation.
Graph-based methods have also been used to rank sentence importance. The article "Summarizing Similarities and Differences Among Related Documents" (I. Mani and E. Bloedorn, Information Retrieval, 2000) discloses a method named WebSumm, which uses a graph link model and ranks sentences under the assumption that vertices connected to many other vertices are more important. The article "LexPageRank: prestige in multi-document text summarization" (G. Erkan and D. Radev, Proceedings of EMNLP'04, 2004) discloses a method named LexPageRank, which first builds a sentence connectivity matrix and then computes sentence importance with a PageRank-like algorithm. The article "A language independent algorithm for single and multiple document summarization" (R. Mihalcea and P. Tarau, Proceedings of IJCNLP'05, 2005) discloses the method of Mihalcea and Tarau, which likewise proposes PageRank- and HITS-like algorithms for computing sentence importance.
Topic- or query-oriented multi-document summarization methods are usually built on generic multi-document summarization methods, integrating the topic or query information into the summarization process so that the summary satisfies the user's specific information need; the literature also records many such methods. The article "Robust generic and query-based summarization" (H. Saggion, K. Bontcheva, and H. Cunningham, Proceedings of EACL-2003) discloses a topic- or query-oriented multi-document summarization method that computes the similarity between each sentence and the query with a query-based weighting scheme and then uses that similarity value during query-based summarization. The article "Approaches to event-focused summarization based on named entities and query words" (J. Ge, X. Huang, and L. Wu, Proceedings of the 2003 Document Understanding Workshop, 2003) discloses a topic- or query-oriented multi-document summarization method, and the article "CLASSY query-based multi-document summarization" (J. M. Conroy and J. D. Schlesinger, Proceedings of the 2005 Document Understanding Workshop, 2005) discloses another; these two methods examine the effect of query words and named entities in the topic description on event- or query-oriented multi-document summarization. The article "CATS a topic-oriented multi-document summarization system at DUC 2005" (A. Farzindar, F. Rozon, and G. Lapalme, Proceedings of the 2005 Document Understanding Workshop, 2005) discloses a topic- or query-oriented multi-document summarization method that first performs topic analysis on the documents, matches the obtained topics against the topics provided by the user, and finally produces a topic-oriented multi-document summary. However, these methods still have shortcomings: they fail to jointly consider how rich a sentence's information is with respect to the topic or query and how novel that information is, and therefore cannot accurately return relevant news information and user-oriented summaries according to attributes such as user-defined interests.
Summary of the invention
In view of the defects of the prior art, the purpose of this invention is to provide a topic- or query-oriented multi-document summarization method based on manifold ranking. The method jointly considers how rich a sentence's information is with respect to the topic or query and how novel that information is, and it uses the manifold-ranking algorithm to naturally and jointly account for the relationships among sentences and the user's topic or query, so that, given a topic or query, a summary that better meets the user's need can be formed from multiple documents.
To achieve the above purpose, the technical solution adopted by the present invention is a topic- or query-oriented multi-document summarization method based on manifold ranking, comprising the following steps:
(1) Read in the topic and the documents, or read in the query and the documents; split the topic and the documents (or the query and the documents) into sentences. The sentence set is χ = {x_1, ..., x_p, x_{p+1}, ..., x_n} ⊂ R^m, where x_1 to x_p are the p sentences obtained from the topic or query and x_{p+1} to x_n are the n-p sentences obtained from the documents. Compute the similarity between any two of these n sentences and build the sentence relation graph; the corresponding normalized sentence similarity matrix is S.
(2) Use the manifold-ranking algorithm to iteratively compute the ranking score of each sentence in the documents; this ranking score is the sentence's initial weight.
(3) Apply a diversity penalty to each sentence in the documents to obtain the final weight of each sentence.
(4) According to the final weights, select the sentences with the largest final weights from the documents to form the summary.
More specifically, the manifold-ranking algorithm in step (2) proceeds as follows:
Let f: χ → R denote a ranking function that assigns to each sentence x_i in the sentence set χ, 1 ≤ i ≤ n, a ranking score f_i; regard f as a vector f = [f_1, ..., f_n]^T. At the same time, define a vector y = [y_1, ..., y_n]^T with y_i = 1 for 1 ≤ i ≤ p, indicating that these p sentences come from the topic or query given by the user, and y_i = 0 for the n-p document sentences (p+1 ≤ i ≤ n); here T denotes vector transposition.
Iteratively compute the ranking score of each sentence according to the following formula until convergence:
f(t+1) = αSf(t) + (1-α)y    (1)
where f(t) is the vector obtained at the t-th iteration, t is a positive integer, S is the normalized sentence similarity matrix from step (1), and α is a parameter in [0, 1] that balances the contribution of the ranking scores of a sentence's neighbours against the sentence's own initial ranking score. Each iteration uses the ranking scores of the previous iteration to compute new scores with the formula above, until the ranking scores of all sentences no longer change between two consecutive iterations; in practice the algorithm stops once the change of every sentence's ranking score is smaller than a threshold. The initialization is f(1) = y. Let f_i* denote the ranking score of sentence x_i after convergence.
The underlying idea is that adjacent sentences should have similar ranking scores, so each sentence diffuses its ranking score to its neighbours, and the process repeats until a global steady state is reached; in the end every document sentence has a ranking score that reflects how rich its information is with respect to the user's topic or query.
It can be shown theoretically that the above iteration converges to
f* = β(I-αS)^(-1)·y    (2)
where β = 1-α, f* is the resulting vector of ranking scores, and I is the identity matrix.
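The iteration of formula (1) and the closed form (2) can be sketched in a few lines of code. The following is a minimal illustration only, assuming the normalized similarity matrix S and the indicator vector y from step (1) are available as NumPy arrays; the function names, the default tolerance and the iteration cap are illustrative choices, not part of the claimed method.

```python
import numpy as np

def manifold_ranking(S, y, alpha=0.6, tol=1e-4, max_iter=1000):
    """Iterate f(t+1) = alpha*S*f(t) + (1-alpha)*y (formula (1)) until every
    ranking score changes by less than `tol` between two iterations.
    S : (n, n) normalized sentence similarity matrix from step (1).
    y : (n,) vector, 1 for topic/query sentences and 0 for document sentences."""
    f = y.astype(float)                                  # f(1) = y
    for _ in range(max_iter):
        f_next = alpha * S.dot(f) + (1.0 - alpha) * y
        if np.abs(f_next - f).max() < tol:               # all scores changed less than the threshold
            return f_next
        f = f_next
    return f

def manifold_ranking_closed_form(S, y, alpha=0.6):
    """Closed form of formula (2): f* = (1-alpha) * (I - alpha*S)^(-1) * y,
    useful as a cross-check on small problems."""
    n = S.shape[0]
    return (1.0 - alpha) * np.linalg.solve(np.eye(n) - alpha * S, y)
```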
Further, for a better effect, the topic or query information in step (1) is a personalized description related to a specific user, such as a user profile, a user question or a user query; these descriptions are provided directly by the user, or are obtained from an analysis of the user's behaviour.
Further, in step (1) the topic or query information is split into 1 to 5 sentences, that is, p ranges from 1 to 5.
Further, for a better effect, the sentence similarities in step (1) are computed and the sentence relation graph is built as follows:
1) Split the topic or query given by the user into sentences, obtaining the p sentences x_1 to x_p, and split all documents into sentences, obtaining the n-p sentences x_{p+1} to x_n. Segment these n sentences into words, and then compute the similarity between any two sentences x_i and x_j in the sentence set χ = {x_1, ..., x_p, x_{p+1}, ..., x_n} ⊂ R^m with the cosine formula:
sim(x_i, x_j) = cos(v_i, v_j) = (v_i · v_j) / (‖v_i‖ · ‖v_j‖)    (3)
where v_i and v_j are the term vectors of the two sentences; the weight of a term t in a vector is computed as tf_t × isf_t, where tf_t is the frequency of term t in the sentence and isf_t is the inverse sentence frequency of t, namely 1 + log(N/n_t), with N the total number of sentences and n_t the number of sentences containing t;
2) Treat each sentence as a vertex; if the similarity between two sentences x_i and x_j exceeds a threshold, create an edge between them whose weight is the similarity value, thereby obtaining a weighted graph G. Let W denote the adjacency matrix of G: W_ij = sim(x_i, x_j) if there is an edge between x_i and x_j, and W_ii = 0 for all i;
3) In the weighted graph G the invention distinguishes intra-document sentence relations from inter-document sentence relations: if two sentences belong to the same document, the relation between them is an intra-document relation; if they belong to different documents, it is an inter-document relation. To distinguish the different importance of these two kinds of relations, the adjacency matrix is decomposed as
W~ = λ_1·W_intra + λ_2·W_inter    (4)
where W_intra is the adjacency matrix containing only the edges of intra-document relations (the weights of inter-document edges are set to 0), W_inter is the adjacency matrix containing only the edges of inter-document relations (the weights of intra-document edges are set to 0), and λ_1, λ_2 ∈ [0, 1];
4) Normalize the new adjacency matrix W~ to obtain the similarity matrix S = D^(-1/2) W~ D^(-1/2), where D is a diagonal matrix whose (i, i) entry equals the sum of the elements in the i-th row of W~; the matrix obtained by normalizing the original adjacency matrix W in the same way is denoted Ŝ. (A code sketch of steps 1) to 4) follows.)
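The following is a compact sketch of steps 1) to 4) above, for illustration only. It assumes the sentences are already split and whitespace-segmented into words (the patent does not prescribe a particular word segmenter), all function names are illustrative, and assigning the topic/query sentences a reserved document id so that their edges count as inter-document edges is an assumption not fixed by the text; the default parameter values follow the preferred settings stated below (edge threshold 0.01, λ_1 = 0.3, λ_2 = 1).

```python
import math
from collections import Counter
import numpy as np

def tf_isf_vectors(sentences):
    """tf*isf term vectors; isf_t = 1 + log(N / n_t) as defined in step 1)."""
    N = len(sentences)
    tokenized = [s.split() for s in sentences]                 # assumes whitespace-segmented words
    sent_freq = Counter(t for toks in tokenized for t in set(toks))
    return [{t: c * (1.0 + math.log(N / sent_freq[t])) for t, c in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    """Cosine similarity of two sparse term vectors (formula (3))."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_matrices(sentences, doc_ids, threshold=0.01, lam_intra=0.3, lam_inter=1.0):
    """Builds the weighted sentence graph, decomposes it as in formula (4), and
    returns S = D^(-1/2) W~ D^(-1/2) together with S_hat, the same normalization
    applied to the original adjacency matrix W (used by the diversity penalty).
    doc_ids[i] identifies the document of sentence i."""
    n = len(sentences)
    vecs = tf_isf_vectors(sentences)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine(vecs[i], vecs[j])
            if sim > threshold:                                # step 2): edge only above the threshold
                W[i, j] = W[j, i] = sim
    same_doc = np.equal.outer(doc_ids, doc_ids)                # step 3): intra vs. inter split
    W_tilde = lam_intra * np.where(same_doc, W, 0.0) + lam_inter * np.where(same_doc, 0.0, W)

    def normalize(M):
        d = M.sum(axis=1)
        d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)    # guard for isolated sentences (not specified in the patent)
        return M * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    return normalize(W_tilde), normalize(W)
```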
Further, when deciding in step (1) whether the similarity between two sentences x_i and x_j exceeds the threshold, the threshold is set to 0.01.
Further, for a better effect, when distinguishing intra-document from inter-document sentence relations in step (1), λ_1 in formula (4) is set to 0.3 and λ_2 is set to 1.
Further, for a better effect, α in formula (1) of step (2) is set to 0.6.
Further, for a better effect, the threshold on the change of the ranking scores used as the stopping criterion in step (2) is set to 0.0001.
Further, for a better effect, the diversity penalty in step (3) is applied to each sentence with a greedy algorithm, thereby guaranteeing the novelty of the candidate sentences. The procedure is as follows (a code sketch is given after this list):
1) Initialize two sets A = ∅ and B = {x_i | i = p+1, ..., n}, and initialize the final weight of each sentence to its ranking score, that is, RankScore(x_i) = f_i*, i = p+1, ..., n;
2) Sort the sentences in B in descending order of their current final weights;
3) Let x_i be the top-ranked sentence, that is, the first sentence in the sorted order; move x_i from B to A, and apply the following diversity penalty to every sentence x_j in B adjacent to x_i (j ≠ i):
RankScore(x_j) = RankScore(x_j) - ω·Ŝ_ji·f_i*    (5)
where ω > 0 is the penalty degree factor; the larger ω is, the stronger the diversity penalty; if ω = 0, there is no diversity penalty;
4) Repeat steps 2) and 3) until B = ∅.
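A minimal sketch of this greedy penalty, together with the final selection of step (4), is given below for illustration only: f_star is the converged score vector from the manifold-ranking sketch, S_hat is the normalized version of the original adjacency matrix W, and the handling of ties and the re-ordering of the selected sentences are illustrative choices not fixed by the method.

```python
import numpy as np

def diversity_penalty(f_star, S_hat, p, omega=8.0):
    """Greedy diversity penalty of formula (5) applied to the document sentences
    (indices p..n-1); returns the final weight RankScore for each of them."""
    scores = {i: float(f_star[i]) for i in range(p, len(f_star))}   # RankScore(x_i) = f_i*
    remaining = set(scores)                                          # the set B (A is implicit)
    while remaining:
        i = max(remaining, key=lambda k: scores[k])                  # highest-ranked sentence in B
        remaining.remove(i)                                          # move x_i from B to A
        for j in remaining:
            if S_hat[j, i] > 0:                                      # neighbours of x_i in the graph
                scores[j] -= omega * S_hat[j, i] * f_star[i]         # formula (5)
    return scores

def select_summary(sentences, scores, k=8):
    """Step (4): pick the k document sentences with the largest final weights;
    re-ordering them by original position is an illustrative presentation choice."""
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```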
Further, the penalty degree factor ω in formula (5) of step (3) is set to 8.
Further, in step (4), the 2 to 10 document sentences (from x_{p+1} to x_n) with the largest final weights are selected to form the summary.
The effect of the invention is that, by adopting the method of the present invention, the relationships among sentences and the user's topic or query are considered jointly, so that the generated multi-document summary covers the main information of the document set while also explaining the topic or answering the query, yielding a better topic- or query-oriented multi-document summary.
The invention achieves this effect because it has the following characteristics: it proposes a brand-new summarization method built on a new semi-supervised learning algorithm, the manifold-ranking algorithm, which jointly considers the relationships among sentences and the user's topic or query, so that the generated summary both covers the main information of the document set and explains the topic or answers the query, while a diversity-penalty algorithm guarantees the novelty of the generated summary. Within the manifold-ranking algorithm the method also treats intra-document and inter-document sentence relations differently, giving inter-document relations a larger contribution weight.
Description of drawings
Fig. 1 is a flow chart of the method of the invention;
Fig. 2 is a schematic diagram of using the method proposed by the invention to improve document retrieval.
Embodiment
The invention is further described below with reference to the drawings and an embodiment:
As shown in Fig. 1, a topic- or query-oriented multi-document summarization method based on manifold ranking comprises the following steps:
(1) Read in the documents and the topic or query information; split each document and the topic or query into sentences, segment the sentences into words, compute the sentence similarities, and build the sentence relation graph;
The topic in this embodiment is a personalized description related to a specific user, such as a user profile, a user question or a user query; these descriptions are provided directly by the user, or can of course be obtained from an analysis of the user's behaviour. If the topic is long, it can be split into several sentences, preferably 1 to 5. Because the topic in this embodiment is short, it is treated as a single sentence, that is, p = 1.
In this embodiment the sentence similarities are computed and the sentence relation graph is built as follows:
The topic given by the user is treated as a single sentence x_1, and splitting the documents into sentences yields n-1 further sentences, giving the sentence set χ = {x_1, x_2, ..., x_n} ⊂ R^m, where x_1 is the topic or query given by the user and x_2, ..., x_n are the n-1 document sentences. These n sentences are segmented into words, and the similarity between any two sentences x_i and x_j in χ is computed with the cosine formula:
sim(x_i, x_j) = cos(v_i, v_j) = (v_i · v_j) / (‖v_i‖ · ‖v_j‖)    (3)
where v_i and v_j are the term vectors of the two sentences; the weight of a term t in a vector is computed as tf_t × isf_t, where tf_t is the frequency of t in the sentence and isf_t is the inverse sentence frequency of t, namely 1 + log(N/n_t), with N the total number of sentences and n_t the number of sentences containing t.
Each sentence is treated as a vertex; if the similarity between two sentences x_i and x_j exceeds the threshold (set to 0.01 in this embodiment), an edge is created between them whose weight is the similarity value, yielding a weighted graph G. Let W be the adjacency matrix of G: W_ij = sim(x_i, x_j) if there is an edge between x_i and x_j, and W_ii = 0 for all i.
In the weighted graph G the invention distinguishes intra-document from inter-document sentence relations: if two sentences belong to the same document, their relation is an intra-document relation; if they belong to different documents, it is an inter-document relation. To distinguish the different importance of these two kinds of relations, the adjacency matrix is decomposed as
W~ = λ_1·W_intra + λ_2·W_inter    (4)
where W_intra is the adjacency matrix containing only the edges of intra-document relations (the weights of inter-document edges are set to 0) and W_inter is the adjacency matrix containing only the edges of inter-document relations (the weights of intra-document edges are set to 0), with λ_1, λ_2 ∈ [0, 1]; in this embodiment λ_1 = 0.3 and λ_2 = 1, giving more importance to inter-document relations.
The new adjacency matrix W~ is normalized to obtain the similarity matrix S = D^(-1/2) W~ D^(-1/2), where D is a diagonal matrix whose (i, i) entry equals the sum of the elements in the i-th row of W~; the matrix obtained by normalizing the original adjacency matrix W in the same way is denoted Ŝ.
(2) Use the manifold-ranking algorithm to iteratively compute the ranking score of each sentence in the documents;
In this embodiment the manifold-ranking algorithm is as follows:
Let f: χ → R be a ranking function assigning to each sentence x_i (1 ≤ i ≤ n) a ranking score f_i; f can be regarded as a vector f = [f_1, ..., f_n]^T. At the same time we define a vector y = [y_1, ..., y_n]^T with y_1 = 1, reflecting that sentence x_1 is the topic or query given by the user, and y_i = 0 (2 ≤ i ≤ n) for all document sentences.
The ranking score of each sentence is computed iteratively according to the following formula until convergence:
f(t+1) = αSf(t) + (1-α)y    (1)
where f(t) is the vector obtained at the t-th iteration and α is a parameter in [0, 1] that balances, when computing a sentence's ranking score, the contribution of its neighbours' ranking scores against its own initial ranking score; in this embodiment α is set to 0.6. The initialization is f(1) = y; each iteration uses the ranking scores of the previous iteration to compute new scores with the formula above, until the scores of all sentences no longer change between two consecutive iterations; in practice the algorithm stops once the change of every sentence's ranking score is smaller than the threshold, set to 0.0001 in this embodiment. Let f_i* denote the ranking score of sentence x_i after convergence.
The underlying idea is that adjacent sentences should have similar ranking scores, so each sentence diffuses its ranking score to its neighbours until a global steady state is reached. In the end every document sentence has a ranking score that reflects how rich its information is with respect to the user's topic or query.
It can be shown theoretically that the above iteration converges to
f* = β(I-αS)^(-1)·y    (2)
where β = 1-α.
(3) Apply the diversity penalty to the sentences to obtain the final weight of each sentence;
The diversity penalty is applied to each sentence with a greedy algorithm, thereby guaranteeing the novelty of the candidate sentences. The procedure is as follows:
1) Initialize two sets A = ∅ and B = {x_i | i = 2, ..., n}, and initialize the final weight of each sentence to its ranking score, that is, RankScore(x_i) = f_i*, i = 2, ..., n;
2) Sort the sentences in B in descending order of their current final weights;
3) Let x_i be the top-ranked sentence, that is, the first sentence in the sorted order; move x_i from B to A, and apply the following diversity penalty to every sentence x_j in B adjacent to x_i (j ≠ i):
RankScore(x_j) = RankScore(x_j) - ω·Ŝ_ji·f_i*    (5)
where ω > 0 is the penalty degree factor; the larger ω is, the stronger the diversity penalty (in this embodiment ω is set to 8); if ω = 0, there is no diversity penalty;
4) Repeat steps 2) and 3) until B = ∅.
(4) According to the final weight of each sentence, select the sentences with the largest weights from x_2, ..., x_n to form the summary; in general, selecting the 2 to 10 sentences with the largest weights suffices, and in this embodiment the 8 sentences with the largest weights form the summary.
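Purely as an illustration of how the pieces fit together in this embodiment (p = 1 topic sentence, 8 summary sentences), the sketches given earlier could be glued together roughly as follows; split_into_sentences is an assumed, hypothetical sentence splitter, and all other names refer to the illustrative functions sketched above.

```python
import numpy as np

topic = "..."                       # the user's topic or query, treated as one sentence (p = 1)
docs = ["...", "..."]               # the documents to be summarized

sentences, doc_ids = [topic], [-1]  # reserved id for the topic sentence (an assumption)
for d, text in enumerate(docs):
    for sent in split_into_sentences(text):          # assumed, hypothetical sentence splitter
        sentences.append(sent)
        doc_ids.append(d)

S, S_hat = similarity_matrices(sentences, doc_ids, threshold=0.01, lam_intra=0.3, lam_inter=1.0)
y = np.zeros(len(sentences)); y[0] = 1.0             # y_1 = 1 for the topic sentence only
f_star = manifold_ranking(S, y, alpha=0.6, tol=1e-4)
scores = diversity_penalty(f_star, S_hat, p=1, omega=8.0)
summary = select_summary(sentences, scores, k=8)     # the 8 highest-weighted sentences
```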
Fig. 2 is a schematic diagram of using the method proposed by the invention to improve document retrieval.
To verify the effectiveness of the invention, the evaluation data and tasks of the Document Understanding Conference (DUC, http://duc.nist.gov) were adopted, namely the topic- or query-oriented multi-document summarization tasks of DUC 2003 and DUC 2005: task 2 and task 3 of DUC 2003 and the single task of DUC 2005. Task 2 of DUC 2003 provides 30 document sets and 30 TDT event topics and asks participants to produce an event-topic-oriented summary of at most 100 words; task 3 of DUC 2003 provides 30 document sets and 30 viewpoints and asks participants to produce a viewpoint-oriented summary of at most 100 words; the single task of DUC 2005 provides 50 document sets together with 50 user profiles and DUC topics and asks participants to produce a summary of at most 250 words oriented to the user profile and the DUC topic. The submitted summaries are compared with manual summaries. The popular ROUGE evaluation method was adopted to evaluate the method of the invention, using the three metrics ROUGE-1, ROUGE-2 and ROUGE-W; the larger the ROUGE value, the better the result, and ROUGE-1 is the principal metric. The method of the invention is compared with the three best-performing participating systems and two baseline systems; the experimental results are shown in Tables 1 to 3:
Table 1: Comparison results on task 2 of DUC 2003 (table image not reproduced in this text rendering)
Table 2: Comparison results on task 3 of DUC 2003 (table image not reproduced in this text rendering)
Table 3: Comparison results on the single task of DUC 2005 (table image not reproduced in this text rendering)
The experimental results show that the method of the invention performs excellently, outperforming the participating systems and the baseline systems on all three metrics.
For the ROUGE evaluation method, see "Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics" (C.-Y. Lin and E. H. Hovy, Proceedings of the 2003 Language Technology Conference, HLT-NAACL 2003).
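As a rough illustration of what the principal metric measures, ROUGE-1 is essentially unigram overlap between a system summary and reference summaries; a toy recall computation against a single reference might look as follows (real evaluations use the official ROUGE toolkit, with options such as stemming and stop-word removal that this sketch omits, and the function name is illustrative).

```python
from collections import Counter

def rouge_1_recall(system_tokens, reference_tokens):
    """Toy ROUGE-1 recall: clipped overlapping unigram count divided by the
    number of unigrams in the (single) reference summary."""
    sys_counts = Counter(system_tokens)
    ref_counts = Counter(reference_tokens)
    overlap = sum(min(c, sys_counts[t]) for t, c in ref_counts.items())
    return overlap / sum(ref_counts.values()) if ref_counts else 0.0
```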
The method of the present invention is not limited to the embodiment described above; other embodiments derived by those skilled in the art from the technical solution of the invention likewise fall within the scope of the technical innovation of the invention.

Claims (13)

1. A topic- or query-oriented multi-document summarization method based on manifold ranking, comprising the following steps:
(1) reading in a topic and documents, or reading in a query and documents; splitting the topic and the documents, or the query and the documents, into sentences, the sentence set being χ = {x_1, ..., x_p, x_{p+1}, ..., x_n} ⊂ R^m, where x_1 to x_p are the p sentences obtained from the topic or query and x_{p+1} to x_n are the n-p sentences obtained from the documents; computing the similarity between any two of these n sentences and building a sentence relation graph whose corresponding normalized sentence similarity matrix is S;
(2) using the manifold-ranking algorithm to iteratively compute a ranking score for each sentence in the documents, the ranking score being the sentence's initial weight;
(3) applying a diversity penalty to each sentence in the documents to obtain the final weight of each sentence;
(4) selecting the sentences with the largest final weights to form the summary.
2. The topic- or query-oriented multi-document summarization method based on manifold ranking of claim 1, characterized in that the manifold-ranking algorithm in step (2) proceeds as follows:
let f: χ → R be a ranking function assigning to each sentence x_i in the sentence set χ, 1 ≤ i ≤ n, a ranking score f_i, and regard f as a vector f = [f_1, ..., f_n]^T; at the same time, define a vector y = [y_1, ..., y_n]^T with y_i = 1 for 1 ≤ i ≤ p, indicating that these p sentences come from the topic or query given by the user, and y_i = 0 for the n-p document sentences, p+1 ≤ i ≤ n, where T denotes vector transposition;
iteratively compute the ranking score of each sentence according to the following formula until convergence:
f(t+1) = αSf(t) + (1-α)y    (1)
where f(t) is the vector obtained at the t-th iteration, t is a positive integer, S is the normalized sentence similarity matrix obtained in step (1), and α is a parameter in [0, 1] that balances the contribution of the ranking scores of a sentence's neighbours against the sentence's own initial ranking score; each iteration uses the ranking scores of the previous iteration to compute new scores with the formula above, and the algorithm stops when the change of every sentence's ranking score between two consecutive iterations is smaller than a threshold; the initialization is f(1) = y; let f_i* denote the ranking score of sentence x_i after convergence.
3. The topic- or query-oriented multi-document summarization method based on manifold ranking of claim 2, characterized in that the topic or query information in step (1) is a personalized description related to a specific user, including a user profile, a user question or a user query, these descriptions being provided directly by the user or obtained from an analysis of the user's behaviour.
4. The topic- or query-oriented multi-document summarization method based on manifold ranking of claim 3, characterized in that in step (1) the topic or query information is split into 1 to 5 sentences, that is, p ranges from 1 to 5.
5. The topic- or query-oriented multi-document summarization method based on manifold ranking of claim 2, 3 or 4, characterized in that the sentence similarities in step (1) are computed and the sentence relation graph is built as follows:
1) splitting the topic or query given by the user into sentences to obtain the p sentences x_1 to x_p, splitting all documents into sentences to obtain the n-p sentences x_{p+1} to x_n, segmenting these n sentences into words, and then computing the similarity between any two sentences x_i and x_j in the sentence set χ = {x_1, ..., x_p, x_{p+1}, ..., x_n} ⊂ R^m with the cosine formula:
sim(x_i, x_j) = cos(v_i, v_j) = (v_i · v_j) / (‖v_i‖ · ‖v_j‖)    (3)
where v_i and v_j are the term vectors of the two sentences, the weight of a term t in a vector being computed as tf_t × isf_t, where tf_t is the frequency of term t in the sentence and isf_t is the inverse sentence frequency of t, namely 1 + log(N/n_t), N being the total number of sentences and n_t the number of sentences containing t;
2) treating each sentence as a vertex and, if the similarity between two sentences x_i and x_j exceeds a threshold, creating an edge between them whose weight is the similarity value, thereby obtaining a weighted graph G; letting W denote the adjacency matrix of G, with W_ij = sim(x_i, x_j) if there is an edge between x_i and x_j, and W_ii = 0 for all i;
3) in the weighted graph G, distinguishing intra-document from inter-document sentence relations: if two sentences belong to the same document, their relation is an intra-document relation, and if they belong to different documents, their relation is an inter-document relation; to distinguish the different importance of these two kinds of relations, the adjacency matrix is decomposed as
W~ = λ_1·W_intra + λ_2·W_inter    (4)
where W_intra is the adjacency matrix containing only the edges of intra-document relations, with the weights of inter-document edges set to 0, W_inter is the adjacency matrix containing only the edges of inter-document relations, with the weights of intra-document edges set to 0, and λ_1, λ_2 ∈ [0, 1];
4) normalizing the new adjacency matrix W~ to obtain the similarity matrix S = D^(-1/2) W~ D^(-1/2), where D is a diagonal matrix whose (i, i) entry equals the sum of the elements in the i-th row of W~; the matrix obtained by normalizing the original adjacency matrix W in the same way is denoted Ŝ.
6. The topic- or query-oriented multi-document summarization method based on manifold ranking of claim 5, characterized in that, when judging whether the similarity between two sentences x_i and x_j exceeds the threshold, the threshold is set to 0.01.
7. The topic- or query-oriented multi-document summarization method based on manifold ranking of claim 5, characterized in that, when distinguishing intra-document from inter-document sentence relations in step (1), λ_1 in formula (4) is set to 0.3 and λ_2 is set to 1.
8. The topic- or query-oriented multi-document summarization method based on manifold ranking of claim 2, 3 or 4, characterized in that α in formula (1) of step (2) is set to 0.6.
9. The topic- or query-oriented multi-document summarization method based on manifold ranking of claim 2, 3 or 4, characterized in that the threshold on the change of the ranking scores used as the stopping criterion in step (2) is set to 0.0001.
10. The topic- or query-oriented multi-document summarization method based on manifold ranking of claim 5, characterized in that the diversity penalty in step (3) is applied to each sentence with a greedy algorithm, thereby guaranteeing the novelty of the candidate sentences, as follows:
a) initializing two sets A = ∅ and B = {x_i | i = p+1, ..., n}, and initializing the final weight of each sentence to its ranking score, that is, RankScore(x_i) = f_i*, i = p+1, ..., n;
b) sorting the sentences in B in descending order of their current final weights;
c) letting x_i be the top-ranked sentence, that is, the first sentence in the sorted order, moving x_i from B to A, and applying the following diversity penalty to every sentence x_j in B adjacent to x_i, j ≠ i:
RankScore(x_j) = RankScore(x_j) - ω·Ŝ_ji·f_i*    (5)
where ω > 0 is the penalty degree factor, a larger ω giving a stronger diversity penalty, and ω = 0 giving no diversity penalty;
d) repeating steps b) and c) until B = ∅.
11. The topic- or query-oriented multi-document summarization method based on manifold ranking of claim 10, characterized in that the penalty degree factor ω in formula (5) of step (3) is set to 8, and in step (4) the 2 to 10 document sentences from x_{p+1} to x_n with the largest weights are selected to form the summary.
12. The topic- or query-oriented multi-document summarization method based on manifold ranking of claim 1, 2, 3 or 4, characterized in that in step (4) the 2 to 10 document sentences from x_{p+1} to x_n with the largest weights are selected to form the summary.
CNB2006100725872A 2006-04-13 2006-04-13 Multiple file summarization method facing subject or inquiry based on cluster arrangement Expired - Fee Related CN100418093C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2006100725872A CN100418093C (en) 2006-04-13 2006-04-13 Multiple file summarization method facing subject or inquiry based on cluster arrangement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2006100725872A CN100418093C (en) 2006-04-13 2006-04-13 Multiple file summarization method facing subject or inquiry based on cluster arrangement

Publications (2)

Publication Number Publication Date
CN1828609A CN1828609A (en) 2006-09-06
CN100418093C true CN100418093C (en) 2008-09-10

Family

ID=36947001

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2006100725872A Expired - Fee Related CN100418093C (en) 2006-04-13 2006-04-13 Multiple file summarization method facing subject or inquiry based on cluster arrangement

Country Status (1)

Country Link
CN (1) CN100418093C (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101398814B (en) * 2007-09-26 2010-08-25 北京大学 Method and system for simultaneously abstracting document summarization and key words
US8402369B2 (en) * 2008-05-28 2013-03-19 Nec Laboratories America, Inc. Multiple-document summarization using document clustering
CN101620596B (en) * 2008-06-30 2012-02-15 东北大学 Multi-document auto-abstracting method facing to inquiry
US9727556B2 (en) 2012-10-26 2017-08-08 Entit Software Llc Summarization of a document
CN105868175A (en) * 2015-12-03 2016-08-17 乐视网信息技术(北京)股份有限公司 Abstract generation method and device
CN108573045B (en) * 2018-04-18 2021-12-24 同方知网数字出版技术股份有限公司 Comparison matrix similarity retrieval method based on multi-order fingerprints
US10831793B2 (en) 2018-10-23 2020-11-10 International Business Machines Corporation Learning thematic similarity metric from article text units
WO2020082272A1 (en) * 2018-10-24 2020-04-30 Alibaba Group Holding Limited Intelligent customer services based on a vector propagation on a click graph model
CN109582967B (en) * 2018-12-03 2023-08-18 深圳前海微众银行股份有限公司 Public opinion abstract extraction method, device, equipment and computer readable storage medium
CN111368066B (en) * 2018-12-06 2024-02-09 北京京东尚科信息技术有限公司 Method, apparatus and computer readable storage medium for obtaining dialogue abstract

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1341899A (en) * 2000-09-07 2002-03-27 国际商业机器公司 Method for automatic generating abstract from word or file
US6397209B1 (en) * 1996-08-30 2002-05-28 Telexis Corporation Real time structured summary search engine
US6477534B1 (en) * 1998-05-20 2002-11-05 Lucent Technologies, Inc. Method and system for generating a statistical summary of a database using a join synopsis
CN1614587A (en) * 2003-11-07 2005-05-11 杨立伟 Method for digesting Chinese document automatically

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6397209B1 (en) * 1996-08-30 2002-05-28 Telexis Corporation Real time structured summary search engine
US6477534B1 (en) * 1998-05-20 2002-11-05 Lucent Technologies, Inc. Method and system for generating a statistical summary of a database using a join synopsis
CN1341899A (en) * 2000-09-07 2002-03-27 国际商业机器公司 Method for automatic generating abstract from word or file
CN1614587A (en) * 2003-11-07 2005-05-11 杨立伟 Method for digesting Chinese document automatically

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A new sentence similarity measure and its application in automatic text summarization (一种新的句子相似度度量及其在文本自动摘要中的应用). 张奇, 黄萱菁, 吴立德. Proceedings of NCIRCS2004, the First National Conference on Information Retrieval and Content Security. 2004 *

Also Published As

Publication number Publication date
CN1828609A (en) 2006-09-06

Similar Documents

Publication Publication Date Title
CN100418093C (en) Multiple file summarization method facing subject or inquiry based on cluster arrangement
Lim et al. Multiple sets of features for automatic genre classification of web documents
Rousseau et al. Main core retention on graph-of-words for single-document keyword extraction
CN101398814B (en) Method and system for simultaneously abstracting document summarization and key words
Hu et al. Auditing the partisanship of Google search snippets
Au Yeung et al. Contextualising tags in collaborative tagging systems
Noruzi Folksonomies: Why do we need controlled vocabulary?
Gupta et al. An overview of social tagging and applications
CN100435145C (en) Multiple file summarization method based on sentence relation graph
Belhadi et al. Exploring pattern mining algorithms for hashtag retrieval problem
CN100511214C (en) Method and system for abstracting batch single document for document set
Seo et al. Online community search using conversational structures
Han et al. Knowledge based collection selection for distributed information retrieval
Chirigati et al. Knowledge exploration using tables on the web
Shi et al. Mining related queries from web search engine query logs using an improved association rule mining model
Xu et al. Using social annotations to improve language model for information retrieval
Liu et al. The research of Web mining
García et al. Techniques for comparing and recommending conferences
Blooma et al. Quadripartite graph-based clustering of questions
Zhao et al. Modeling Chinese microblogs with five Ws for topic hashtags extraction
Mekthanavanh et al. Social web video clustering based on multi-modal and clustering ensemble
Limpens et al. Linking folksonomies and ontologies for supporting knowledge sharing: a state of the art
Rotella et al. A domain based approach to information retrieval in digital libraries
Zhu et al. Improving web search by categorization, clustering, and personalization
Bjelland et al. Web link analysis: estimating document’s importance from its context

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20220919

Address after: 3007, Hengqin international financial center building, No. 58, Huajin street, Hengqin new area, Zhuhai, Guangdong 519031

Patentee after: New founder holdings development Co.,Ltd.

Patentee after: Peking University

Patentee after: PEKING University FOUNDER R & D CENTER

Address before: 100871, fangzheng building, 298 Fu Cheng Road, Beijing, Haidian District

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: Peking University

Patentee before: PEKING University FOUNDER R & D CENTER

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20230330

Address after: 100871 No. 5, the Summer Palace Road, Beijing, Haidian District

Patentee after: Peking University

Address before: 3007, Hengqin international financial center building, No. 58, Huajin street, Hengqin new area, Zhuhai, Guangdong 519031

Patentee before: New founder holdings development Co.,Ltd.

Patentee before: Peking University

Patentee before: PEKING University FOUNDER R & D CENTER

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20080910