CN100418093C - Topic- or query-oriented multi-document summarization method based on manifold ranking - Google Patents

Topic- or query-oriented multi-document summarization method based on manifold ranking

Info

Publication number
CN100418093C
CN100418093C · CNB2006100725872A · CN200610072587A
Authority
CN
China
Prior art keywords
sentence
document
arrangement
value
inquiry
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2006100725872A
Other languages
Chinese (zh)
Other versions
CN1828609A (en)
Inventor
万小军
杨建武
吴於茜
陈晓鸥
肖建国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
BEIDA FANGZHENG TECHN INST Co Ltd BEIJING
Peking University
Peking University Founder Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIDA FANGZHENG TECHN INST Co Ltd BEIJING, Peking University, Peking University Founder Group Co Ltd filed Critical BEIDA FANGZHENG TECHN INST Co Ltd BEIJING
Priority to CNB2006100725872A priority Critical patent/CN100418093C/en
Publication of CN1828609A publication Critical patent/CN1828609A/en
Application granted granted Critical
Publication of CN100418093C publication Critical patent/CN100418093C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The present invention relates to a topic- or query-oriented multi-document summarization method based on manifold ranking, belonging to the technical field of natural language text processing. When a user searches for topics of interest, existing multi-document summarization methods cannot accurately return relevant news information and user-oriented summaries according to attributes such as user-defined interests. The method of the present invention builds on a new semi-supervised learning algorithm, the manifold-ranking algorithm, and is characterized in that it jointly considers the relationships among sentences and the relationship between the sentences and the user's topic or query; consequently, the generated summary both covers the main information of the document collection and explains the topic or answers the query, while a diversity-penalty algorithm ensures the novelty of the summary. By adopting the present invention, relevant news information can be returned according to requirements such as the user's interests, yielding a good topic- or query-oriented multi-document summary that meets the personalized needs of different users.

Description

A topic- or query-oriented multi-document summarization method based on manifold ranking
Technical field
The invention belongs to the technical field of natural language text processing, and specifically relates to a topic- or query-oriented multi-document summarization method based on manifold ranking.
Background technology
Multi-document summarization is a key problem in natural language processing and has been widely applied in recent years to text/Web retrieval and similar applications. For example, search engines such as Google and Baidu provide news services that collect news from the Web and organize it into news topics; so that users can conveniently browse the topics they are interested in, multi-document summarization is needed to generate a brief, concise summary for each news topic. Topic- or query-oriented multi-document summarization can be regarded as a special multi-document summarization task: the generated summary must reflect a topic or query (also called a user profile) specified by the user, that is, the summary should explain the focus of the user's interest or answer the information need the user raises. Among such news products, personalized news services receive more and more attention; a user usually cares only about the news topics of personal interest and, according to attributes such as user-defined interests, expects the news service to return relevant news information and a summary oriented to those attributes. Some intelligent question-answering systems likewise require the system to generate, from relevant documents, a summary that answers the user's question; such a summary is also a typical topic- or query-oriented multi-document summary.
The difficulties of topic- or query-oriented multi-document summarization are twofold. First, as with generic multi-document summarization, the information contained in different documents overlaps substantially and is redundant, so a good method must merge information across documents effectively: the generated summary should retain the main information of the documents while keeping the information in the summary reasonably novel. Second, unlike generic multi-document summarization, topic- or query-oriented summarization requires that the information in the summary be relevant to the topic or query and able to explain the topic or answer the query, so the topic or query information supplied by the user must be fully exploited during summarization. In recent years multi-document summarization has become a popular research topic in natural language processing and information retrieval, and its progress is reflected in a series of academic venues on automatic document summarization, including NTCIR, DUC, ACL, COLING and SIGIR.
In short, generic multi-document summarization methods can be divided into sentence-extraction methods (extraction) and sentence-generation methods (abstraction). Extraction-based methods are relatively simple and practical and do not require deep natural language understanding: after splitting the text into sentences, each sentence is assigned a weight reflecting its importance, and the sentences with the largest weights are selected to form the summary. Abstraction-based methods require deep natural language understanding: after syntactic and semantic analysis of the original documents, information extraction or natural language generation is used to produce new sentences that form the summary.
Most existing multi-document summarization methods are based on sentence extraction, and the literature records many such methods. The article "Centroid-based summarization of multiple documents" (D. R. Radev, H. Y. Jing, M. Stys and D. Tam, Information Processing and Management, 2004) discloses a centroid-based sentence-extraction method, currently a popular extraction approach; MEAD is a prototype summarization system implementing this method, and when weighting sentences it considers both sentence-level and cross-sentence features, including cluster centroids, sentence position and TF*IDF. The article "From Single to Multi-document Summarization: A Prototype System and its Evaluation" (C.-Y. Lin and E. H. Hovy, Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics, ACL-02, 2002) discloses a sentence-extraction system named NeATS, a multi-document summarization system developed at ISI that evolved from the single-document summarizer SUMMARIST; when selecting important sentences it considers features such as sentence position, word frequency, topic signatures and word clusters, and it uses the MMR technique to discount sentence weights. The article "Cross-document summarization by concept classification" (H. Hardy, N. Shimizu, T. Strzalkowski, L. Ting, G. B. Wise, and X. Zhang, Proceedings of SIGIR'02) discloses a sentence-extraction system named XdoX suited to generating summaries for large document sets; it first detects the most important themes in the document set by paragraph clustering and then extracts sentences reflecting those themes to form the summary. The article "Topic themes for multi-document summarization" (S. Harabagiu and F. Lacatusu, Proceedings of SIGIR'05, 2005) discloses the method of Harabagiu and Lacatusu, which examines five different ways of representing themes across documents and proposes a new theme representation.
Graph-based methods have also been used to rank sentence importance. The article "Summarizing Similarities and Differences Among Related Documents" (I. Mani and E. Bloedorn, Information Retrieval, 2000) discloses a method named WebSumm, which uses a graph link model and ranks sentences under the assumption that vertices connected to many other vertices are more important. The article "LexPageRank: prestige in multi-document text summarization" (G. Erkan and D. Radev, Proceedings of EMNLP'04, 2004) discloses a method named LexPageRank, which first builds a sentence connectivity matrix and then computes sentence importance with a PageRank-like algorithm. The article "A language independent algorithm for single and multiple document summarization" (R. Mihalcea and P. Tarau, Proceedings of IJCNLP'05, 2005) discloses the method of Mihalcea and Tarau, which likewise proposes PageRank- and HITS-like algorithms for computing sentence importance.
Topic- or query-oriented multi-document summarization methods are usually built on generic multi-document summarization methods, integrating the topic or query information into the summarization process so that the summary satisfies the user's specific information need; the literature also records many such methods. The article "Robust generic and query-based summarization" (H. Saggion, K. Bontcheva, and H. Cunningham, Proceedings of EACL-2003) discloses a topic- or query-oriented multi-document summarization method that computes the similarity between each sentence and the query with a query-based weighting scheme and then uses that similarity value during query-based summarization. The article "Approaches to event-focused summarization based on named entities and query words" (J. Ge, X. Huang, and L. Wu, Proceedings of the 2003 Document Understanding Workshop, 2003) discloses a topic- or query-oriented multi-document summarization method, and the article "CLASSY query-based multi-document summarization" (J. M. Conroy and J. D. Schlesinger, Proceedings of the 2005 Document Understanding Workshop, 2005) discloses another; these two methods examine the effect of query words and named entities in the topic description on event- or query-oriented multi-document summarization. The article "CATS a topic-oriented multi-document summarization system at DUC 2005" (A. Farzindar, F. Rozon, and G. Lapalme, Proceedings of the 2005 Document Understanding Workshop, 2005) discloses a topic- or query-oriented multi-document summarization method that first performs topic analysis on the documents, matches the obtained topics against the topics provided by the user, and finally produces a topic-oriented multi-document summary. However, these methods still have shortcomings: they fail to jointly consider how rich a sentence's information is with respect to the topic or query and how novel that information is, and therefore cannot accurately return relevant news information and user-oriented summaries according to attributes such as user-defined interests.
Summary of the invention
In view of the defects of the prior art, the purpose of this invention is to provide a topic- or query-oriented multi-document summarization method based on manifold ranking. The method jointly considers how rich a sentence's information is with respect to the topic or query and how novel that information is, and it uses the manifold-ranking algorithm to naturally and jointly account for the relationships among sentences and the user's topic or query, so that, given a topic or query, a summary that better meets the user's need can be formed from multiple documents.
To achieve the above purpose, the technical solution adopted by the present invention is a topic- or query-oriented multi-document summarization method based on manifold ranking, comprising the following steps:
(1) Read in the topic and the documents, or read in the query and the documents; split the topic and the documents (or the query and the documents) into sentences. The sentence set is χ = {x_1, ..., x_p, x_{p+1}, ..., x_n} ⊂ R^m, where x_1 to x_p are the p sentences obtained from the topic or query and x_{p+1} to x_n are the n-p sentences obtained from the documents. Compute the similarity between any two of these n sentences and build the sentence relation graph; the corresponding normalized sentence similarity matrix is S.
(2) Use the manifold-ranking algorithm to iteratively compute the ranking score of each sentence in the documents; this ranking score is the sentence's initial weight.
(3) Apply a diversity penalty to each sentence in the documents to obtain the final weight of each sentence.
(4) According to the final weights, select the sentences with the largest final weights from the documents to form the summary.
More specifically, the manifold-ranking algorithm in step (2) proceeds as follows:
Let f: χ → R denote a ranking function that assigns to each sentence x_i in the sentence set χ, 1 ≤ i ≤ n, a ranking score f_i; regard f as a vector f = [f_1, ..., f_n]^T. At the same time, define a vector y = [y_1, ..., y_n]^T with y_i = 1 for 1 ≤ i ≤ p, indicating that these p sentences come from the topic or query given by the user, and y_i = 0 for the n-p document sentences (p+1 ≤ i ≤ n); here T denotes vector transposition.
Iteratively compute the ranking score of each sentence according to the following formula until convergence:
f(t+1) = αSf(t) + (1-α)y    (1)
where f(t) is the vector obtained at the t-th iteration, t is a positive integer, S is the normalized sentence similarity matrix from step (1), and α is a parameter in [0, 1] that balances the contribution of the ranking scores of a sentence's neighbours against the sentence's own initial ranking score. Each iteration uses the ranking scores of the previous iteration to compute new scores with the formula above, until the ranking scores of all sentences no longer change between two consecutive iterations; in practice the algorithm stops once the change of every sentence's ranking score is smaller than a threshold. The initialization is f(1) = y. Let f_i* denote the ranking score of sentence x_i after convergence.
The underlying idea is that adjacent sentences should have similar ranking scores, so each sentence diffuses its ranking score to its neighbours, and the process repeats until a global steady state is reached; in the end every document sentence has a ranking score that reflects how rich its information is with respect to the user's topic or query.
It can be shown theoretically that the above iteration converges to
f* = β(I-αS)^(-1)·y    (2)
where β = 1-α, f* is the resulting vector of ranking scores, and I is the identity matrix.
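The iteration of formula (1) and the closed form (2) can be sketched in a few lines of code. The following is a minimal illustration only, assuming the normalized similarity matrix S and the indicator vector y from step (1) are available as NumPy arrays; the function names, the default tolerance and the iteration cap are illustrative choices, not part of the claimed method.

```python
import numpy as np

def manifold_ranking(S, y, alpha=0.6, tol=1e-4, max_iter=1000):
    """Iterate f(t+1) = alpha*S*f(t) + (1-alpha)*y (formula (1)) until every
    ranking score changes by less than `tol` between two iterations.
    S : (n, n) normalized sentence similarity matrix from step (1).
    y : (n,) vector, 1 for topic/query sentences and 0 for document sentences."""
    f = y.astype(float)                                  # f(1) = y
    for _ in range(max_iter):
        f_next = alpha * S.dot(f) + (1.0 - alpha) * y
        if np.abs(f_next - f).max() < tol:               # all scores changed less than the threshold
            return f_next
        f = f_next
    return f

def manifold_ranking_closed_form(S, y, alpha=0.6):
    """Closed form of formula (2): f* = (1-alpha) * (I - alpha*S)^(-1) * y,
    useful as a cross-check on small problems."""
    n = S.shape[0]
    return (1.0 - alpha) * np.linalg.solve(np.eye(n) - alpha * S, y)
```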
Further, for a better effect, the topic or query information in step (1) is a personalized description related to a specific user, such as a user profile, a user question or a user query; these descriptions are provided directly by the user, or are obtained from an analysis of the user's behaviour.
Further, in step (1) the topic or query information is split into 1 to 5 sentences, that is, p ranges from 1 to 5.
Further, for a better effect, the sentence similarities in step (1) are computed and the sentence relation graph is built as follows:
1) Split the topic or query given by the user into sentences, obtaining the p sentences x_1 to x_p, and split all documents into sentences, obtaining the n-p sentences x_{p+1} to x_n. Segment these n sentences into words, and then compute the similarity between any two sentences x_i and x_j in the sentence set χ = {x_1, ..., x_p, x_{p+1}, ..., x_n} ⊂ R^m with the cosine formula:
sim(x_i, x_j) = cos(v_i, v_j) = (v_i · v_j) / (‖v_i‖ · ‖v_j‖)    (3)
where v_i and v_j are the term vectors of the two sentences; the weight of a term t in a vector is computed as tf_t × isf_t, where tf_t is the frequency of term t in the sentence and isf_t is the inverse sentence frequency of t, namely 1 + log(N/n_t), with N the total number of sentences and n_t the number of sentences containing t;
2) Treat each sentence as a vertex; if the similarity between two sentences x_i and x_j exceeds a threshold, create an edge between them whose weight is the similarity value, thereby obtaining a weighted graph G. Let W denote the adjacency matrix of G: W_ij = sim(x_i, x_j) if there is an edge between x_i and x_j, and W_ii = 0 for all i;
3) In the weighted graph G the invention distinguishes intra-document sentence relations from inter-document sentence relations: if two sentences belong to the same document, the relation between them is an intra-document relation; if they belong to different documents, it is an inter-document relation. To distinguish the different importance of these two kinds of relations, the adjacency matrix is decomposed as
W~ = λ_1·W_intra + λ_2·W_inter    (4)
where W_intra is the adjacency matrix containing only the edges of intra-document relations (the weights of inter-document edges are set to 0), W_inter is the adjacency matrix containing only the edges of inter-document relations (the weights of intra-document edges are set to 0), and λ_1, λ_2 ∈ [0, 1];
4) Normalize the new adjacency matrix W~ to obtain the similarity matrix S = D^(-1/2) W~ D^(-1/2), where D is a diagonal matrix whose (i, i) entry equals the sum of the elements in the i-th row of W~; the matrix obtained by normalizing the original adjacency matrix W in the same way is denoted Ŝ. (A code sketch of steps 1) to 4) follows.)
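The following is a compact sketch of steps 1) to 4) above, for illustration only. It assumes the sentences are already split and whitespace-segmented into words (the patent does not prescribe a particular word segmenter), all function names are illustrative, and assigning the topic/query sentences a reserved document id so that their edges count as inter-document edges is an assumption not fixed by the text; the default parameter values follow the preferred settings stated below (edge threshold 0.01, λ_1 = 0.3, λ_2 = 1).

```python
import math
from collections import Counter
import numpy as np

def tf_isf_vectors(sentences):
    """tf*isf term vectors; isf_t = 1 + log(N / n_t) as defined in step 1)."""
    N = len(sentences)
    tokenized = [s.split() for s in sentences]                 # assumes whitespace-segmented words
    sent_freq = Counter(t for toks in tokenized for t in set(toks))
    return [{t: c * (1.0 + math.log(N / sent_freq[t])) for t, c in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    """Cosine similarity of two sparse term vectors (formula (3))."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_matrices(sentences, doc_ids, threshold=0.01, lam_intra=0.3, lam_inter=1.0):
    """Builds the weighted sentence graph, decomposes it as in formula (4), and
    returns S = D^(-1/2) W~ D^(-1/2) together with S_hat, the same normalization
    applied to the original adjacency matrix W (used by the diversity penalty).
    doc_ids[i] identifies the document of sentence i."""
    n = len(sentences)
    vecs = tf_isf_vectors(sentences)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine(vecs[i], vecs[j])
            if sim > threshold:                                # step 2): edge only above the threshold
                W[i, j] = W[j, i] = sim
    same_doc = np.equal.outer(doc_ids, doc_ids)                # step 3): intra vs. inter split
    W_tilde = lam_intra * np.where(same_doc, W, 0.0) + lam_inter * np.where(same_doc, 0.0, W)

    def normalize(M):
        d = M.sum(axis=1)
        d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)    # guard for isolated sentences (not specified in the patent)
        return M * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    return normalize(W_tilde), normalize(W)
```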
Further, when deciding in step (1) whether the similarity between two sentences x_i and x_j exceeds the threshold, the threshold is set to 0.01.
Further, for a better effect, when distinguishing intra-document from inter-document sentence relations in step (1), λ_1 in formula (4) is set to 0.3 and λ_2 is set to 1.
Further, for a better effect, α in formula (1) of step (2) is set to 0.6.
Further, for a better effect, the threshold on the change of the ranking scores used as the stopping criterion in step (2) is set to 0.0001.
Further, for a better effect, the diversity penalty in step (3) is applied to each sentence with a greedy algorithm, thereby guaranteeing the novelty of the candidate sentences. The procedure is as follows (a code sketch is given after this list):
1) Initialize two sets A = ∅ and B = {x_i | i = p+1, ..., n}, and initialize the final weight of each sentence to its ranking score, that is, RankScore(x_i) = f_i*, i = p+1, ..., n;
2) Sort the sentences in B in descending order of their current final weights;
3) Let x_i be the top-ranked sentence, that is, the first sentence in the sorted order; move x_i from B to A, and apply the following diversity penalty to every sentence x_j in B adjacent to x_i (j ≠ i):
RankScore(x_j) = RankScore(x_j) - ω·Ŝ_ji·f_i*    (5)
where ω > 0 is the penalty degree factor; the larger ω is, the stronger the diversity penalty; if ω = 0, there is no diversity penalty;
4) Repeat steps 2) and 3) until B = ∅.
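A minimal sketch of this greedy penalty, together with the final selection of step (4), is given below for illustration only: f_star is the converged score vector from the manifold-ranking sketch, S_hat is the normalized version of the original adjacency matrix W, and the handling of ties and the re-ordering of the selected sentences are illustrative choices not fixed by the method.

```python
import numpy as np

def diversity_penalty(f_star, S_hat, p, omega=8.0):
    """Greedy diversity penalty of formula (5) applied to the document sentences
    (indices p..n-1); returns the final weight RankScore for each of them."""
    scores = {i: float(f_star[i]) for i in range(p, len(f_star))}   # RankScore(x_i) = f_i*
    remaining = set(scores)                                          # the set B (A is implicit)
    while remaining:
        i = max(remaining, key=lambda k: scores[k])                  # highest-ranked sentence in B
        remaining.remove(i)                                          # move x_i from B to A
        for j in remaining:
            if S_hat[j, i] > 0:                                      # neighbours of x_i in the graph
                scores[j] -= omega * S_hat[j, i] * f_star[i]         # formula (5)
    return scores

def select_summary(sentences, scores, k=8):
    """Step (4): pick the k document sentences with the largest final weights;
    re-ordering them by original position is an illustrative presentation choice."""
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```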
Further, the penalty degree factor ω in formula (5) of step (3) is set to 8.
Further, in step (4), the 2 to 10 document sentences (from x_{p+1} to x_n) with the largest final weights are selected to form the summary.
The effect of the invention is that, by adopting the method of the present invention, the relationships among sentences and the user's topic or query are considered jointly, so that the generated multi-document summary covers the main information of the document set while also explaining the topic or answering the query, yielding a better topic- or query-oriented multi-document summary.
The invention achieves this effect because it has the following characteristics: it proposes a brand-new summarization method built on a new semi-supervised learning algorithm, the manifold-ranking algorithm, which jointly considers the relationships among sentences and the user's topic or query, so that the generated summary both covers the main information of the document set and explains the topic or answers the query, while a diversity-penalty algorithm guarantees the novelty of the generated summary. Within the manifold-ranking algorithm the method also treats intra-document and inter-document sentence relations differently, giving inter-document relations a larger contribution weight.
Description of drawings
Fig. 1 is a flow chart of the method of the invention;
Fig. 2 is a schematic diagram of using the method proposed by the invention to improve document retrieval.
Embodiment
The invention is further described below with reference to the drawings and an embodiment:
As shown in Fig. 1, a topic- or query-oriented multi-document summarization method based on manifold ranking comprises the following steps:
(1) Read in the documents and the topic or query information; split each document and the topic or query into sentences, segment the sentences into words, compute the sentence similarities, and build the sentence relation graph;
The topic in this embodiment is a personalized description related to a specific user, such as a user profile, a user question or a user query; these descriptions are provided directly by the user, or can of course be obtained from an analysis of the user's behaviour. If the topic is long, it can be split into several sentences, preferably 1 to 5. Because the topic in this embodiment is short, it is treated as a single sentence, that is, p = 1.
In this embodiment the sentence similarities are computed and the sentence relation graph is built as follows:
The topic given by the user is treated as a single sentence x_1, and splitting the documents into sentences yields n-1 further sentences, giving the sentence set χ = {x_1, x_2, ..., x_n} ⊂ R^m, where x_1 is the topic or query given by the user and x_2, ..., x_n are the n-1 document sentences. These n sentences are segmented into words, and the similarity between any two sentences x_i and x_j in χ is computed with the cosine formula:
sim(x_i, x_j) = cos(v_i, v_j) = (v_i · v_j) / (‖v_i‖ · ‖v_j‖)    (3)
where v_i and v_j are the term vectors of the two sentences; the weight of a term t in a vector is computed as tf_t × isf_t, where tf_t is the frequency of t in the sentence and isf_t is the inverse sentence frequency of t, namely 1 + log(N/n_t), with N the total number of sentences and n_t the number of sentences containing t.
Each sentence is treated as a vertex; if the similarity between two sentences x_i and x_j exceeds the threshold (set to 0.01 in this embodiment), an edge is created between them whose weight is the similarity value, yielding a weighted graph G. Let W be the adjacency matrix of G: W_ij = sim(x_i, x_j) if there is an edge between x_i and x_j, and W_ii = 0 for all i.
In the weighted graph G the invention distinguishes intra-document from inter-document sentence relations: if two sentences belong to the same document, their relation is an intra-document relation; if they belong to different documents, it is an inter-document relation. To distinguish the different importance of these two kinds of relations, the adjacency matrix is decomposed as
W~ = λ_1·W_intra + λ_2·W_inter    (4)
where W_intra is the adjacency matrix containing only the edges of intra-document relations (the weights of inter-document edges are set to 0) and W_inter is the adjacency matrix containing only the edges of inter-document relations (the weights of intra-document edges are set to 0), with λ_1, λ_2 ∈ [0, 1]; in this embodiment λ_1 = 0.3 and λ_2 = 1, giving more importance to inter-document relations.
The new adjacency matrix W~ is normalized to obtain the similarity matrix S = D^(-1/2) W~ D^(-1/2), where D is a diagonal matrix whose (i, i) entry equals the sum of the elements in the i-th row of W~; the matrix obtained by normalizing the original adjacency matrix W in the same way is denoted Ŝ.
(2) Use the manifold-ranking algorithm to iteratively compute the ranking score of each sentence in the documents;
In this embodiment the manifold-ranking algorithm is as follows:
Let f: χ → R be a ranking function assigning to each sentence x_i (1 ≤ i ≤ n) a ranking score f_i; f can be regarded as a vector f = [f_1, ..., f_n]^T. At the same time we define a vector y = [y_1, ..., y_n]^T with y_1 = 1, reflecting that sentence x_1 is the topic or query given by the user, and y_i = 0 (2 ≤ i ≤ n) for all document sentences.
The ranking score of each sentence is computed iteratively according to the following formula until convergence:
f(t+1) = αSf(t) + (1-α)y    (1)
where f(t) is the vector obtained at the t-th iteration and α is a parameter in [0, 1] that balances, when computing a sentence's ranking score, the contribution of its neighbours' ranking scores against its own initial ranking score; in this embodiment α is set to 0.6. The initialization is f(1) = y; each iteration uses the ranking scores of the previous iteration to compute new scores with the formula above, until the scores of all sentences no longer change between two consecutive iterations; in practice the algorithm stops once the change of every sentence's ranking score is smaller than the threshold, set to 0.0001 in this embodiment. Let f_i* denote the ranking score of sentence x_i after convergence.
The underlying idea is that adjacent sentences should have similar ranking scores, so each sentence diffuses its ranking score to its neighbours until a global steady state is reached. In the end every document sentence has a ranking score that reflects how rich its information is with respect to the user's topic or query.
It can be shown theoretically that the above iteration converges to
f* = β(I-αS)^(-1)·y    (2)
where β = 1-α.
(3) Apply the diversity penalty to the sentences to obtain the final weight of each sentence;
The diversity penalty is applied to each sentence with a greedy algorithm, thereby guaranteeing the novelty of the candidate sentences. The procedure is as follows:
1) Initialize two sets A = ∅ and B = {x_i | i = 2, ..., n}, and initialize the final weight of each sentence to its ranking score, that is, RankScore(x_i) = f_i*, i = 2, ..., n;
2) Sort the sentences in B in descending order of their current final weights;
3) Let x_i be the top-ranked sentence, that is, the first sentence in the sorted order; move x_i from B to A, and apply the following diversity penalty to every sentence x_j in B adjacent to x_i (j ≠ i):
RankScore(x_j) = RankScore(x_j) - ω·Ŝ_ji·f_i*    (5)
where ω > 0 is the penalty degree factor; the larger ω is, the stronger the diversity penalty (in this embodiment ω is set to 8); if ω = 0, there is no diversity penalty;
4) Repeat steps 2) and 3) until B = ∅.
(4) According to the final weight of each sentence, select the sentences with the largest weights from x_2, ..., x_n to form the summary; in general, selecting the 2 to 10 sentences with the largest weights suffices, and in this embodiment the 8 sentences with the largest weights form the summary.
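Purely as an illustration of how the pieces fit together in this embodiment (p = 1 topic sentence, 8 summary sentences), the sketches given earlier could be glued together roughly as follows; split_into_sentences is an assumed, hypothetical sentence splitter, and all other names refer to the illustrative functions sketched above.

```python
import numpy as np

topic = "..."                       # the user's topic or query, treated as one sentence (p = 1)
docs = ["...", "..."]               # the documents to be summarized

sentences, doc_ids = [topic], [-1]  # reserved id for the topic sentence (an assumption)
for d, text in enumerate(docs):
    for sent in split_into_sentences(text):          # assumed, hypothetical sentence splitter
        sentences.append(sent)
        doc_ids.append(d)

S, S_hat = similarity_matrices(sentences, doc_ids, threshold=0.01, lam_intra=0.3, lam_inter=1.0)
y = np.zeros(len(sentences)); y[0] = 1.0             # y_1 = 1 for the topic sentence only
f_star = manifold_ranking(S, y, alpha=0.6, tol=1e-4)
scores = diversity_penalty(f_star, S_hat, p=1, omega=8.0)
summary = select_summary(sentences, scores, k=8)     # the 8 highest-weighted sentences
```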
Fig. 2 is a schematic diagram of using the method proposed by the invention to improve document retrieval.
To verify the effectiveness of the invention, the evaluation data and tasks of the Document Understanding Conference (DUC, http://duc.nist.gov) were adopted, namely the topic- or query-oriented multi-document summarization tasks of DUC 2003 and DUC 2005: task 2 and task 3 of DUC 2003 and the single task of DUC 2005. Task 2 of DUC 2003 provides 30 document sets and 30 TDT event topics and asks participants to produce an event-topic-oriented summary of at most 100 words; task 3 of DUC 2003 provides 30 document sets and 30 viewpoints and asks participants to produce a viewpoint-oriented summary of at most 100 words; the single task of DUC 2005 provides 50 document sets together with 50 user profiles and DUC topics and asks participants to produce a summary of at most 250 words oriented to the user profile and the DUC topic. The submitted summaries are compared with manual summaries. The popular ROUGE evaluation method was adopted to evaluate the method of the invention, using the three metrics ROUGE-1, ROUGE-2 and ROUGE-W; the larger the ROUGE value, the better the result, and ROUGE-1 is the principal metric. The method of the invention is compared with the three best-performing participating systems and two baseline systems; the experimental results are shown in Tables 1 to 3:
Table 1: Comparison results on task 2 of DUC 2003 (table image not reproduced in this text rendering)
Table 2: Comparison results on task 3 of DUC 2003 (table image not reproduced in this text rendering)
Table 3: Comparison results on the single task of DUC 2005 (table image not reproduced in this text rendering)
The experimental results show that the method of the invention performs excellently, outperforming the participating systems and the baseline systems on all three metrics.
For the ROUGE evaluation method, see "Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics" (C.-Y. Lin and E. H. Hovy, Proceedings of the 2003 Language Technology Conference, HLT-NAACL 2003).
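As a rough illustration of what the principal metric measures, ROUGE-1 is essentially unigram overlap between a system summary and reference summaries; a toy recall computation against a single reference might look as follows (real evaluations use the official ROUGE toolkit, with options such as stemming and stop-word removal that this sketch omits, and the function name is illustrative).

```python
from collections import Counter

def rouge_1_recall(system_tokens, reference_tokens):
    """Toy ROUGE-1 recall: clipped overlapping unigram count divided by the
    number of unigrams in the (single) reference summary."""
    sys_counts = Counter(system_tokens)
    ref_counts = Counter(reference_tokens)
    overlap = sum(min(c, sys_counts[t]) for t, c in ref_counts.items())
    return overlap / sum(ref_counts.values()) if ref_counts else 0.0
```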
The method of the present invention is not limited to the embodiment described above; other embodiments derived by those skilled in the art from the technical solution of the invention likewise fall within the scope of the technical innovation of the invention.

Claims (13)

1. A topic- or query-oriented multi-document summarization method based on manifold ranking, comprising the following steps:
(1) reading in a topic and documents, or reading in a query and documents; splitting the topic and the documents, or the query and the documents, into sentences, the sentence set being χ = {x_1, ..., x_p, x_{p+1}, ..., x_n} ⊂ R^m, where x_1 to x_p are the p sentences obtained from the topic or query and x_{p+1} to x_n are the n-p sentences obtained from the documents; computing the similarity between any two of these n sentences and building a sentence relation graph whose corresponding normalized sentence similarity matrix is S;
(2) using the manifold-ranking algorithm to iteratively compute a ranking score for each sentence in the documents, the ranking score being the sentence's initial weight;
(3) applying a diversity penalty to each sentence in the documents to obtain the final weight of each sentence;
(4) selecting the sentences with the largest final weights to form the summary.
2. The topic- or query-oriented multi-document summarization method based on manifold ranking of claim 1, characterized in that the manifold-ranking algorithm in step (2) proceeds as follows:
let f: χ → R be a ranking function assigning to each sentence x_i in the sentence set χ, 1 ≤ i ≤ n, a ranking score f_i, and regard f as a vector f = [f_1, ..., f_n]^T; at the same time, define a vector y = [y_1, ..., y_n]^T with y_i = 1 for 1 ≤ i ≤ p, indicating that these p sentences come from the topic or query given by the user, and y_i = 0 for the n-p document sentences, p+1 ≤ i ≤ n, where T denotes vector transposition;
iteratively compute the ranking score of each sentence according to the following formula until convergence:
f(t+1) = αSf(t) + (1-α)y    (1)
where f(t) is the vector obtained at the t-th iteration, t is a positive integer, S is the normalized sentence similarity matrix obtained in step (1), and α is a parameter in [0, 1] that balances the contribution of the ranking scores of a sentence's neighbours against the sentence's own initial ranking score; each iteration uses the ranking scores of the previous iteration to compute new scores with the formula above, and the algorithm stops when the change of every sentence's ranking score between two consecutive iterations is smaller than a threshold; the initialization is f(1) = y; let f_i* denote the ranking score of sentence x_i after convergence.
3. The topic- or query-oriented multi-document summarization method based on manifold ranking of claim 2, characterized in that the topic or query information in step (1) is a personalized description related to a specific user, including a user profile, a user question or a user query, these descriptions being provided directly by the user or obtained from an analysis of the user's behaviour.
4. The topic- or query-oriented multi-document summarization method based on manifold ranking of claim 3, characterized in that in step (1) the topic or query information is split into 1 to 5 sentences, that is, p ranges from 1 to 5.
5. The topic- or query-oriented multi-document summarization method based on manifold ranking of claim 2, 3 or 4, characterized in that the sentence similarities in step (1) are computed and the sentence relation graph is built as follows:
1) splitting the topic or query given by the user into sentences to obtain the p sentences x_1 to x_p, splitting all documents into sentences to obtain the n-p sentences x_{p+1} to x_n, segmenting these n sentences into words, and then computing the similarity between any two sentences x_i and x_j in the sentence set χ = {x_1, ..., x_p, x_{p+1}, ..., x_n} ⊂ R^m with the cosine formula:
sim(x_i, x_j) = cos(v_i, v_j) = (v_i · v_j) / (‖v_i‖ · ‖v_j‖)    (3)
where v_i and v_j are the term vectors of the two sentences, the weight of a term t in a vector being computed as tf_t × isf_t, where tf_t is the frequency of term t in the sentence and isf_t is the inverse sentence frequency of t, namely 1 + log(N/n_t), N being the total number of sentences and n_t the number of sentences containing t;
2) treating each sentence as a vertex and, if the similarity between two sentences x_i and x_j exceeds a threshold, creating an edge between them whose weight is the similarity value, thereby obtaining a weighted graph G; letting W denote the adjacency matrix of G, with W_ij = sim(x_i, x_j) if there is an edge between x_i and x_j, and W_ii = 0 for all i;
3) in the weighted graph G, distinguishing intra-document from inter-document sentence relations: if two sentences belong to the same document, their relation is an intra-document relation, and if they belong to different documents, their relation is an inter-document relation; to distinguish the different importance of these two kinds of relations, the adjacency matrix is decomposed as
W~ = λ_1·W_intra + λ_2·W_inter    (4)
where W_intra is the adjacency matrix containing only the edges of intra-document relations, with the weights of inter-document edges set to 0, W_inter is the adjacency matrix containing only the edges of inter-document relations, with the weights of intra-document edges set to 0, and λ_1, λ_2 ∈ [0, 1];
4) normalizing the new adjacency matrix W~ to obtain the similarity matrix S = D^(-1/2) W~ D^(-1/2), where D is a diagonal matrix whose (i, i) entry equals the sum of the elements in the i-th row of W~; the matrix obtained by normalizing the original adjacency matrix W in the same way is denoted Ŝ.
6. The topic- or query-oriented multi-document summarization method based on manifold ranking of claim 5, characterized in that, when judging whether the similarity between two sentences x_i and x_j exceeds the threshold, the threshold is set to 0.01.
7. The topic- or query-oriented multi-document summarization method based on manifold ranking of claim 5, characterized in that, when distinguishing intra-document from inter-document sentence relations in step (1), λ_1 in formula (4) is set to 0.3 and λ_2 is set to 1.
8. The topic- or query-oriented multi-document summarization method based on manifold ranking of claim 2, 3 or 4, characterized in that α in formula (1) of step (2) is set to 0.6.
9. The topic- or query-oriented multi-document summarization method based on manifold ranking of claim 2, 3 or 4, characterized in that the threshold on the change of the ranking scores used as the stopping criterion in step (2) is set to 0.0001.
10. The topic- or query-oriented multi-document summarization method based on manifold ranking of claim 5, characterized in that the diversity penalty in step (3) is applied to each sentence with a greedy algorithm, thereby guaranteeing the novelty of the candidate sentences, as follows:
a) initializing two sets A = ∅ and B = {x_i | i = p+1, ..., n}, and initializing the final weight of each sentence to its ranking score, that is, RankScore(x_i) = f_i*, i = p+1, ..., n;
b) sorting the sentences in B in descending order of their current final weights;
c) letting x_i be the top-ranked sentence, that is, the first sentence in the sorted order, moving x_i from B to A, and applying the following diversity penalty to every sentence x_j in B adjacent to x_i, j ≠ i:
RankScore(x_j) = RankScore(x_j) - ω·Ŝ_ji·f_i*    (5)
where ω > 0 is the penalty degree factor, a larger ω giving a stronger diversity penalty, and ω = 0 giving no diversity penalty;
d) repeating steps b) and c) until B = ∅.
11. The topic- or query-oriented multi-document summarization method based on manifold ranking of claim 10, characterized in that the penalty degree factor ω in formula (5) of step (3) is set to 8, and in step (4) the 2 to 10 document sentences from x_{p+1} to x_n with the largest weights are selected to form the summary.
12. The topic- or query-oriented multi-document summarization method based on manifold ranking of claim 1, 2, 3 or 4, characterized in that in step (4) the 2 to 10 document sentences from x_{p+1} to x_n with the largest weights are selected to form the summary.
CNB2006100725872A 2006-04-13 2006-04-13 Multiple file summarization method facing subject or inquiry based on cluster arrangement Expired - Fee Related CN100418093C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2006100725872A CN100418093C (en) 2006-04-13 2006-04-13 Multiple file summarization method facing subject or inquiry based on cluster arrangement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2006100725872A CN100418093C (en) 2006-04-13 2006-04-13 Multiple file summarization method facing subject or inquiry based on cluster arrangement

Publications (2)

Publication Number Publication Date
CN1828609A CN1828609A (en) 2006-09-06
CN100418093C true CN100418093C (en) 2008-09-10

Family

ID=36947001

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2006100725872A Expired - Fee Related CN100418093C (en) 2006-04-13 2006-04-13 Multiple file summarization method facing subject or inquiry based on cluster arrangement

Country Status (1)

Country Link
CN (1) CN100418093C (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101398814B (en) * 2007-09-26 2010-08-25 北京大学 Method and system for simultaneously abstracting document summarization and key words
US8402369B2 (en) * 2008-05-28 2013-03-19 Nec Laboratories America, Inc. Multiple-document summarization using document clustering
CN101620596B (en) * 2008-06-30 2012-02-15 东北大学 Multi-document auto-abstracting method facing to inquiry
US9727556B2 (en) 2012-10-26 2017-08-08 Entit Software Llc Summarization of a document
CN105868175A (en) * 2015-12-03 2016-08-17 乐视网信息技术(北京)股份有限公司 Abstract generation method and device
CN108573045B (en) * 2018-04-18 2021-12-24 同方知网数字出版技术股份有限公司 Comparison matrix similarity retrieval method based on multi-order fingerprints
US10831793B2 (en) 2018-10-23 2020-11-10 International Business Machines Corporation Learning thematic similarity metric from article text units
WO2020082272A1 (en) * 2018-10-24 2020-04-30 Alibaba Group Holding Limited Intelligent customer services based on a vector propagation on a click graph model
CN109582967B (en) * 2018-12-03 2023-08-18 深圳前海微众银行股份有限公司 Public opinion abstract extraction method, device, equipment and computer readable storage medium
CN111368066B (en) * 2018-12-06 2024-02-09 北京京东尚科信息技术有限公司 Method, apparatus and computer readable storage medium for obtaining dialogue abstract

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1341899A (en) * 2000-09-07 2002-03-27 国际商业机器公司 Method for automatic generating abstract from word or file
US6397209B1 (en) * 1996-08-30 2002-05-28 Telexis Corporation Real time structured summary search engine
US6477534B1 (en) * 1998-05-20 2002-11-05 Lucent Technologies, Inc. Method and system for generating a statistical summary of a database using a join synopsis
CN1614587A (en) * 2003-11-07 2005-05-11 杨立伟 Method for digesting Chinese document automatically

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6397209B1 (en) * 1996-08-30 2002-05-28 Telexis Corporation Real time structured summary search engine
US6477534B1 (en) * 1998-05-20 2002-11-05 Lucent Technologies, Inc. Method and system for generating a statistical summary of a database using a join synopsis
CN1341899A (en) * 2000-09-07 2002-03-27 国际商业机器公司 Method for automatic generating abstract from word or file
CN1614587A (en) * 2003-11-07 2005-05-11 杨立伟 Method for digesting Chinese document automatically

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A new sentence similarity measure and its application in automatic text summarization (一种新的句子相似度度量及其在文本自动摘要中的应用). 张奇, 黄萱菁, 吴立德. Proceedings of NCIRCS2004, the First National Conference on Information Retrieval and Content Security. 2004 *

Also Published As

Publication number Publication date
CN1828609A (en) 2006-09-06

Similar Documents

Publication Publication Date Title
CN100418093C (en) Multiple file summarization method facing subject or inquiry based on cluster arrangement
Lim et al. Multiple sets of features for automatic genre classification of web documents
Rousseau et al. Main core retention on graph-of-words for single-document keyword extraction
CN101398814B (en) Method and system for simultaneously abstracting document summarization and key words
Hu et al. Auditing the partisanship of Google search snippets
Au Yeung et al. Contextualising tags in collaborative tagging systems
Noruzi Folksonomies: Why do we need controlled vocabulary?
Gupta et al. An overview of social tagging and applications
CN100435145C (en) Multiple file summarization method based on sentence relation graph
Belhadi et al. Exploring pattern mining algorithms for hashtag retrieval problem
CN100511214C (en) Method and system for abstracting batch single document for document set
Seo et al. Online community search using conversational structures
Han et al. Knowledge based collection selection for distributed information retrieval
Chirigati et al. Knowledge exploration using tables on the web
Shi et al. Mining related queries from web search engine query logs using an improved association rule mining model
Xu et al. Using social annotations to improve language model for information retrieval
Liu et al. The research of Web mining
García et al. Techniques for comparing and recommending conferences
Blooma et al. Quadripartite graph-based clustering of questions
Zhao et al. Modeling Chinese microblogs with five Ws for topic hashtags extraction
Mekthanavanh et al. Social web video clustering based on multi-modal and clustering ensemble
Limpens et al. Linking folksonomies and ontologies for supporting knowledge sharing: a state of the art
Rotella et al. A domain based approach to information retrieval in digital libraries
Zhu et al. Improving web search by categorization, clustering, and personalization
Bjelland et al. Web link analysis: estimating document’s importance from its context

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20220919

Address after: 3007, Hengqin international financial center building, No. 58, Huajin street, Hengqin new area, Zhuhai, Guangdong 519031

Patentee after: New founder holdings development Co.,Ltd.

Patentee after: Peking University

Patentee after: PEKING University FOUNDER R & D CENTER

Address before: 100871, fangzheng building, 298 Fu Cheng Road, Beijing, Haidian District

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: Peking University

Patentee before: PEKING University FOUNDER R & D CENTER

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20230330

Address after: 100871 No. 5, the Summer Palace Road, Beijing, Haidian District

Patentee after: Peking University

Address before: 3007, Hengqin international financial center building, No. 58, Huajin street, Hengqin new area, Zhuhai, Guangdong 519031

Patentee before: New founder holdings development Co.,Ltd.

Patentee before: Peking University

Patentee before: PEKING University FOUNDER R & D CENTER

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20080910