CN103324666A

CN103324666A - Topic tracing method and device based on micro-blog data

Info

Publication number: CN103324666A
Application number: CN2013101778909A
Authority: CN
Inventors: 杜毅; 罗峰; 黄苏支; 李娜
Original assignee: IZP (BEIJING) TECHNOLOGIES Co Ltd
Current assignee: IZP (BEIJING) TECHNOLOGIES Co Ltd
Priority date: 2013-05-14
Filing date: 2013-05-14
Publication date: 2013-09-25

Abstract

The invention provides a topic tracing method and device based on micro-blog data. The method comprises the following steps: collecting document data of multiple micro-blog webpages and setting up a first spatial vector for each document; acquiring a second spatial vector of a preset topic; sequentially computing the similarity between the first spatial vector of each document and the second spatial vector; and judging, according to the similarity, whether the document data corresponding to the first spatial vector is associated information of the preset topic. The invention overcomes the defects of the prior art, in which topic drift easily occurs and topic tracing quality is low.

Description

A topic tracking method and device based on microblog data

Technical field

The present invention relates to the technical field of network information processing, and in particular to a topic tracking method based on microblog data and a topic tracking device based on microblog data.

Background technology

With the rapid development of the Internet, how to make effective use of online public opinion has become an important research topic. Online public opinion is the sum of the cognition, attitudes, emotions, and behavioral tendencies that people express and spread over the Internet in response to various events. In the study of online public opinion, topic (event) tracking is an important technique: it addresses the question of "what information, now and in the future, is related to this topic".

As an emerging form of communication, the microblog has become one of the main platforms on which people obtain information and publish news. Users can freely and openly express opinions on any online hot spot or event and exchange views with others. Topic (event) tracking for microblogs means the following: for a given hot topic (event) on a microblog platform, the user can learn what information has been published around the topic in the past, and also what information related to it appears now and in the future. That is, given one or more predetermined topics and some microblog information about them, an algorithm tracks and identifies subsequently published microblog messages or comments that belong to a specific topic and provides page links to them.

At present, many topic tracking algorithms are built on training texts, such as the KNN algorithm and decision tree algorithms, and tracking tasks for microblog topics mostly use topic tracking algorithms based on a query vector. However, real-world events develop and change over time, so the corresponding topic also changes dynamically; this is the phenomenon of topic drift. Simply applying a query-vector-based topic tracking algorithm cannot solve the topic drift problem, which results in low tracking quality.

Therefore, a problem that those skilled in the art urgently need to solve is to provide a topic tracking mechanism based on microblog data that overcomes the shortcomings of the prior art, namely its susceptibility to topic drift and its low tracking quality.

Summary of the invention

The technical problem to be solved by the present invention is to provide a topic tracking method based on microblog data that reduces the occurrence of topic drift and improves topic tracking quality.

Accordingly, the present invention also provides a topic tracking device based on microblog data, to ensure that the above method can be applied in practice.

To solve the above problem, the invention discloses a topic tracking method based on microblog data, comprising:

Collecting the document data of a plurality of microblog web pages and establishing a first space vector for each document data;

Obtaining a second space vector of a preset topic;

Successively calculating the similarity between the first space vector of each document data and the second space vector;

Judging, according to the similarity, whether the document data corresponding to the first space vector is related information of the preset topic.

Preferably, judging according to the similarity whether the document data corresponding to the first space vector is related information of the preset topic specifically comprises:

If the similarity is greater than a preset threshold, judging that the document data corresponding to the first space vector is related information of the preset topic, storing the related information, and updating the second space vector according to the first space vector;

Returning to the step of successively calculating the similarity between the first space vector of each document data and the second space vector, until all of the document data have been processed.

Preferably, the method further comprises:

When all of the document data have been processed, calculating the display feature value of each piece of related information, and sorting the related information by display feature value;

Displaying the top N pieces of related information at preset time intervals, where N is a positive integer.

Preferably, the step of collecting the document data of a plurality of web pages and establishing a first space vector for each document data comprises:

Obtaining the feature information of the document data;

Performing word segmentation on the feature information to obtain the words that make up the feature information;

Calculating the weights of the words, and establishing the first space vector of the document data according to the weights of the words.

Preferably, the second space vector comprises a query vector, and the step of obtaining the second space vector of the preset topic comprises:

Obtaining the topic vector of the preset topic;

Using the topic vector as the query vector.

Preferably, the method further comprises:

Calculating the similarity between the first space vector and the second space vector by the cosine method.

The invention also discloses a topic tracking device based on microblog data, comprising:

A first space vector establishing module, configured to collect the document data of a plurality of microblog web pages and establish a first space vector for each document data;

A second space vector establishing module, configured to obtain a second space vector of a preset topic;

A similarity comparison module, configured to successively calculate the similarity between the first space vector of each document data and the second space vector;

A related information judging module, configured to judge, according to the similarity, whether the document data corresponding to the first space vector is related information of the preset topic.

Preferably, the related information judging module comprises the following submodules:

A related information storage submodule, configured to judge, when the similarity is greater than a preset threshold, that the document data corresponding to the first space vector is related information of the preset topic, and to store the related information;

A second space vector updating submodule, configured to update the second space vector according to the first space vector.

Preferably, the device further comprises:

A related information sorting module, configured to calculate the display feature value of each piece of related information when all of the document data have been processed, and to sort the related information by display feature value;

A related information display module, configured to display the top N pieces of related information at preset time intervals, where N is a positive integer.

Preferably, the first space vector establishing module comprises:

A feature information obtaining submodule, configured to obtain the feature information of the document data;

A word segmentation submodule, configured to perform word segmentation on the feature information to obtain the words that make up the feature information;

A word weight calculation submodule, configured to calculate the weights of the words and to establish the first space vector of the document data according to the weights of the words.

Compared with prior art, the present invention has the following advantages:

First, the embodiment of the present invention combines the characteristics of traditional news topic tracking with those of microblogs themselves and, based on the hot-topic model (established in the detection stage), proposes a topic tracking algorithm that dynamically adjusts the query vector. The query vector is built from existing document data: once the topic detection task is complete, the topic vector is used as the query vector, and the similarity between the text vector of each tracked microblog document and this query vector is calculated. By comparing the similarity with a preset threshold, the method determines whether the tracked microblog document is related information (a follow-up report) of the topic, and it dynamically adjusts the query vector to address topic drift in microblog data, thereby improving topic tracking quality.

Secondly, because a large number of follow-up reports may be tracked, presenting them directly would still leave the user unable to tell which reports attract the most attention and the liveliest discussion. Therefore, the heat of each follow-up report is calculated, and the more actively discussed reports in the topic are presented in order.

Description of drawings

Fig. 1 is a flowchart of the steps of embodiment 1 of a topic tracking method based on microblog data according to the present invention;

Fig. 2 is a flowchart of the steps of embodiment 2 of a topic tracking method based on microblog data according to the present invention;

Fig. 3 is a structural block diagram of an embodiment of a topic tracking device based on microblog data according to the present invention.

Embodiment

To make the above objects, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the drawings and specific embodiments.

Referring to Fig. 1, a flowchart of the steps of embodiment 1 of a topic tracking method based on microblog data according to the present invention is shown. The method may specifically comprise the following steps:

Step 101: collect the document data of a plurality of microblog web pages and establish a first space vector for each document data;

Step 102: obtain a second space vector of a preset topic;

Step 103: successively calculate the similarity between the first space vector of each document data and the second space vector;

Step 104: judge, according to the similarity, whether the document data corresponding to the first space vector is related information of the preset topic.

The embodiment of the present invention combines the characteristics of traditional news topic tracking with those of microblogs themselves and, based on the hot-topic model (established in the detection stage), proposes a topic tracking algorithm that dynamically adjusts the query vector. Specifically, when tracking related microblog information, the sparsity of prior reports need not be considered. The query vector is built from existing document data: once the topic detection task is complete, the topic vector is used as the query vector, and the similarity between the text vector of each tracked microblog document and this query vector is calculated. By comparing the similarity with a preset threshold, the method determines whether the tracked microblog document is related information (a follow-up report) of the topic, and it dynamically adjusts the query vector to address topic drift in microblog data, thereby improving topic tracking quality.

Referring to Fig. 2, a flowchart of the steps of embodiment 2 of a topic tracking method based on microblog data according to the present invention is shown. The method may specifically comprise the following steps:

Step 201: collect the document data of a plurality of microblog web pages and establish a first space vector for each document data;

Specifically, a microblog (short for micro-blogging, MicroBlog) is a platform for sharing, spreading, and obtaining information based on user relationships. Users can set up personal communities through clients such as WEB and WAP, post updates of about 140 characters, and share them instantly.

Microblogs have the following characteristics:

(1) Information acquisition on microblogs is highly autonomous and selective: users can decide, based on their own interests and on the category and quality of the content another user publishes, whether to "follow" that user, and can sort all the users they follow into groups;

(2) The influence of microblog publicity is highly elastic and closely tied to content quality. Influence is based on the number of users who already "follow" an account. The more attractive and newsworthy the information a user publishes, the more people take an interest in and follow that user, and the greater the influence. In addition, the microblog platform's own verification and recommendation also help increase the number of followers;

(3) Microblog content is short and concise. A post is limited to about 140 characters, so content is brief, needs no lengthy writing, and the barrier to entry is low;

(4) Information sharing is quick and convenient. Information can be published instantly, anytime and anywhere, through any network-connected platform, and it spreads faster than through traditional print media and online media.

In a specific implementation, the document data of a microblog may also be called microblog post data. It can be collected through an open interface and stored in a queue, and is then taken from the queue for processing.

In a preferred embodiment of the present invention, step 201 may comprise the following substeps:

Substep S01: obtain the feature information of the document data;

In practice, collected document data is stored in the database almost unprocessed, so the raw document data contains much worthless information, such as advertisements, repeated site navigation widgets, and semi-structured HTML code. This worthless information significantly affects the accuracy of topic detection. The raw microblog data therefore needs to be processed to extract the valuable feature information, which is stored in a more suitable form (such as XML or JSON) for flexible later processing.

As a preferred example of this embodiment, the feature information may include the posting time of the microblog, the repost count, the comment content, the comment time, the user who made each comment, the follower count, information about the microblog author, and so on.
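The storage of extracted feature information in JSON form can be sketched as follows. This is only an illustration under assumptions: the class names `PostFeatures` and `Comment`, the field names, and the helper `to_json` are hypothetical and do not come from the patent, which only lists the kinds of feature information that may be kept.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class Comment:
    # Hypothetical record for one comment on a post.
    text: str
    time: str   # comment time, here an ISO-8601 string
    user: str   # user who made the comment


@dataclass
class PostFeatures:
    # Hypothetical container for the feature information listed above.
    post_time: str        # posting time of the microblog
    repost_count: int     # number of reposts
    follower_count: int   # follower count of the author
    author: str           # microblog author information
    comments: list = field(default_factory=list)


def to_json(features: PostFeatures) -> str:
    # asdict() converts nested dataclasses (including those in lists) to dicts;
    # ensure_ascii=False keeps Chinese text readable in the output.
    return json.dumps(asdict(features), ensure_ascii=False)
```

A record built this way can be serialized with `to_json` and placed on the processing queue; XML would serve equally well, as the text notes.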

As an example, the Document Object Model (DOM) can be used to identify the feature information in the document data effectively. A web page is parsed into a DOM tree as follows:

1) Analyze the code structure of the web page and identify the minimal structural units, that is, units that contain no further structural markup;

2) Map each minimal structural unit to a terminal branch of the DOM tree; its content becomes the leaf node on that branch;

3) Identify the structural units one level above the minimal units; these correspond to nodes, and the minimal structural units that belong to the same structural unit are linked under that node;

4) Keep expanding outward, identifying structural units and generating and linking the corresponding nodes, until the only structural code left at the current level is <HTML>...</HTML>, which is mapped to the root node of the tree.
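A minimal sketch of building such a tree with Python's standard `html.parser` module is given below. Note the assumptions: the names `Node`, `DomBuilder`, and `collect_text` are hypothetical, and the standard parser works top-down as it reads the markup, rather than bottom-up from the minimal units as the steps above describe; the resulting tree has the same shape, with text content sitting at the leaves.

```python
from html.parser import HTMLParser


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # child Node objects
        self.text = []      # text fragments directly inside this element


class DomBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # Open a new element under the current one and descend into it.
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, then close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        # Keep non-blank text as content of the current element.
        if data.strip():
            self.current.text.append(data.strip())


def collect_text(node):
    # Depth-first walk that gathers the text content of the tree.
    out = list(node.text)
    for child in node.children:
        out.extend(collect_text(child))
    return out
```

Feeding a page to `DomBuilder().feed(...)` and walking `root` gives access to the text under specific elements, which is where feature information such as comment content would be picked out.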

Substep S02: perform word segmentation on the feature information to obtain the words that make up the feature information;

In practice, functional content in microblogs, such as reposts and comments, is repetitive. Moreover, natural language consists not only of the nouns, verbs, and adjectives that carry most of a text's meaning, but also of pronouns, articles, conjunctions, prepositions, and punctuation marks, which contribute little to the meaning and can be removed. To reduce the computation of subsequent processing and to improve the efficiency of the algorithm and the precision of topic detection, the feature information needs to be preprocessed. The preprocessing may include Chinese word segmentation, part-of-speech tagging, and so on, and yields the words that make up the feature information.

Chinese word segmentation cuts a sequence of Chinese characters into individual words; that is, it recombines a continuous character sequence into a word sequence according to a given standard. Chinese word segmentation is the basis of text mining, and through it a computer can automatically identify the meaning of a sentence. Common Chinese word segmentation algorithms fall into three categories: methods based on string matching, methods based on understanding, and methods based on statistics. Depending on whether segmentation is combined with part-of-speech tagging, they can also be divided into pure segmentation methods and integrated methods that combine segmentation with tagging. Those skilled in the art may use any one or several of these algorithms according to actual needs, and the embodiment of the present invention places no restriction here.
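As one illustration of the string-matching family, forward maximum matching can be sketched as below. The patent does not prescribe this particular algorithm; the function names, the toy dictionary, and the stop-word set are all hypothetical.

```python
def forward_max_match(text, vocab, max_len=4):
    """Segment text by repeatedly taking the longest dictionary word
    that starts at the current position (forward maximum matching)."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def remove_stop_words(words, stop_words):
    # Drop words that carry little meaning (particles, punctuation, ...).
    return [w for w in words if w not in stop_words]


# Toy data, for illustration only.
VOCAB = {"微博", "话题", "跟踪", "数据"}
STOP_WORDS = {"的", "了", "，", "。"}
```

With this toy dictionary, `forward_max_match("微博的话题", VOCAB)` splits the string into 微博 / 的 / 话题, and `remove_stop_words` then drops 的. A production system would more likely rely on an established segmenter with a full dictionary.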

Substep S03: calculate the weights of the words, and establish the first space vector of the document data according to the weights of the words.

In document data, different words contribute differently to the main idea the document expresses. To reflect the importance of different words within a document, and their ability to distinguish the meaning of one document from another, different weights need to be assigned to the words in the feature information.

As a preferred example of this embodiment, the weight of a word can be calculated with the following formula:

w_{ij} = \frac{\sqrt{tf_{ij}} \times \log(\frac{N}{m_{ij}} + 0.01)}{\sqrt{\sum_{j=1}^{M} [\sqrt{tf_{ij}} \times \log(\frac{N}{m_{ij}} + 0.01)]^{2}}}

Here, D_i is the i-th document data; t_{ij} is the j-th feature item in the i-th document data; w_{ij} is the weight of feature item t_{ij}; tf_{ij} is the number of times t_{ij} occurs in document data D_i; \log(\frac{N}{m_{ij}} + 0.01) is the inverse document frequency (IDF); N is the current total number of document data; M is the total number of feature items in document data D_i; and m_{ij} is the number of document data that contain feature item t_{ij} or contain a feature item whose similarity to t_{ij} is greater than α (α is a preset value, usually between 0.8 and 1).

Of course, the above weight calculation is only one example of the embodiment of the present invention; those skilled in the art may use other weight calculation methods, and the present invention places no restriction here.

In a specific implementation, the first space vector may be called a text vector. Building a vector space model (VSM) converts a document into a space vector, turning a language processing problem into a mathematical problem that is easy to compute. Each word in the feature information corresponds to one dimension of the vector, the dimensions obtained from the whole word set make up the first space vector, and the weight on each dimension expresses how representative that word is of the document.
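The weight formula and the resulting text vectors can be sketched as follows. Two simplifications are assumptions of this sketch, not of the patent: m_{ij} counts only the documents that contain the exact word (the similarity-greater-than-α part is omitted), and the natural logarithm is used, which is harmless because any constant factor from the log base cancels in the normalization. The function name `document_weights` is hypothetical.

```python
import math
from collections import Counter


def document_weights(docs):
    """docs: list of token lists. Returns one {word: weight} dict per
    document, i.e. a sparse text vector with unit Euclidean length."""
    n = len(docs)
    # m: number of documents containing each word (simplified m_ij).
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        # Numerator of the formula: sqrt(tf) * log(N / m + 0.01).
        raw = {
            word: math.sqrt(count) * math.log(n / doc_freq[word] + 0.01)
            for word, count in tf.items()
        }
        # Denominator: Euclidean norm over the document's feature items.
        norm = math.sqrt(sum(v * v for v in raw.values()))
        vectors.append(
            {word: (v / norm if norm else 0.0) for word, v in raw.items()}
        )
    return vectors
```

Storing each vector as a dictionary keeps only the non-zero dimensions, which suits short microblog posts where most vocabulary words are absent.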

Step 202: obtain a second space vector of a preset topic;

In a preferred embodiment of the present invention, step 202 may comprise the following substeps:

Substep S03: obtain the topic vector of the preset topic;

Substep S04: use the topic vector as the query vector.

In the embodiment of the present invention, the second space vector may be a query vector. In topic detection, building a vector space model for each topic is a basic approach; the vector space model of a topic is represented by a topic vector, and querying with the topic vector retrieves the feature information corresponding to it. The embodiment of the present invention therefore uses the topic vector built during topic detection as the query vector.

Step 203: successively calculate the similarity between the first space vector of each document data and the second space vector;

In a specific implementation, the similarity can be calculated in many ways. Common similarity measures include cosine similarity, the Pearson coefficient, adjusted cosine similarity, Minkowski distance, KL distance, the Dice coefficient, and so on. Those skilled in the art may use any similarity calculation method to calculate the similarity between the text vector and the query vector, and the embodiment of the present invention places no restriction here.
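Of the measures listed, cosine similarity is the one named elsewhere in this document as a preferred choice. A minimal sketch over sparse vectors stored as dictionaries is shown below; the dictionary representation and the function name are assumptions of the sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors given as
    {word: weight} dicts. Returns 0.0 if either vector is all zeros."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

For non-negative weights the result lies between 0 (no shared words) and 1 (same direction), so a preset threshold in that range can be compared against it directly.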

Step 204: judge, according to the similarity, whether the document data corresponding to the first space vector is related information of the preset topic;

Step 205: if the similarity is greater than a preset threshold, judge that the document data corresponding to the first space vector is related information of the preset topic, store the related information, and update the second space vector according to the first space vector; return to step 203 until all of the document data have been processed;

In a specific implementation, the related information may be a follow-up to the topic. If the similarity between the text vector of a document data and the query vector of the preset topic is greater than the preset threshold, the document data is a follow-up or follow-up report of the preset topic. The follow-up is stored in the database where the preset topic resides, and the query vector is adjusted according to the text vector, which reduces the occurrence of topic drift and improves topic tracking quality.

As a preferred example of this embodiment, the second space vector may be updated according to the first space vector as follows:

(1) Obtain the contrast feature item of the query vector, which is the lowest-weighted feature item in the query vector apart from the four highest-weighted feature items;

(2) Locate the user ID to which the original document data of the text vector whose similarity exceeds the preset threshold belongs;

(3) Judge whether, among the microblog post data of this user ID within a preset time period, other document data besides the above document data belongs to the topic of the query vector; if so, go to (4);

(4) Among the text vectors of the document data that belong to the same user ID and fall under the same topic, obtain the feature item with the largest weight;

(5) Judge whether the weight of that feature item is greater than that of the contrast feature item; if so, replace the contrast feature item with the largest-weight feature item; if not, do nothing.

In the embodiment of the present invention, the query vector of the topic is continuously adjusted during the topic tracking query process, which reduces redundant information in the tracking process and improves tracking quality. However, because it cannot be guaranteed that every adjustment is accurate, the embodiment of the present invention keeps the four highest-weighted feature items of the query vector unchanged, to limit the effect of erroneous adjustments.

Of course, the query vector may also be updated manually to achieve the purpose of the present invention.
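Update steps (1) to (5) can be sketched as below. The function name and argument layout are hypothetical, and the caller is assumed to have already carried out steps (2) and (3), passing in the text vectors of the same user's other posts on this topic within the preset time period. One further assumption is labeled in the code: if the heaviest candidate word is already in the query vector, the sketch leaves the query unchanged, a case the text does not address.

```python
def update_query_vector(query, current_vector, other_vectors, keep_top=4):
    """query, current_vector: {word: weight} dicts.
    other_vectors: text vectors of the same user's other documents that
    belong to this topic within the preset period (steps 2 and 3)."""
    # Step (3): adjust only if such other documents exist. A query with no
    # items outside the protected top ones has no contrast item either.
    if not other_vectors or len(query) <= keep_top:
        return dict(query)

    # Step (1): the contrast item is the lowest-weighted item outside the
    # keep_top highest-weighted ones, i.e. the overall minimum here.
    ranked = sorted(query.items(), key=lambda kv: kv[1], reverse=True)
    contrast_word, contrast_weight = ranked[-1]

    # Step (4): heaviest feature item across the user's vectors on the topic.
    candidates = [current_vector, *other_vectors]
    best_word, best_weight = max(
        ((w, wt) for vec in candidates for w, wt in vec.items()),
        key=lambda pair: pair[1],
        default=(None, 0.0),
    )

    # Step (5): replace only if the candidate is heavier.
    # Assumption (not in the source): skip words already in the query.
    if best_word is None or best_weight <= contrast_weight or best_word in query:
        return dict(query)

    updated = dict(query)
    del updated[contrast_word]
    updated[best_word] = best_weight
    return updated
```

Because only the single weakest item can ever be swapped out per document, the four strongest items of the query vector stay fixed, matching the safeguard described above.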

After the query vector has been adjusted for the first document data taken from the queue and the new query vector has been obtained, the second document data is taken from the queue and steps 203-205 continue, until the queue is empty, that is, until all of the document data of the web pages have been processed.
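The loop over steps 203-205 can be sketched as follows. The names `track_topic` and `cosine`, the `(doc_id, vector)` queue entries, and the pluggable `update` callback are assumptions made for illustration; cosine similarity is used because it is named as a preferred measure.

```python
import math
from collections import deque


def cosine(a, b):
    # Cosine similarity between sparse {word: weight} vectors.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def track_topic(doc_queue, query, threshold, update=None):
    """doc_queue: deque of (doc_id, text_vector) pairs.
    Returns the ids judged to be related information, plus the final
    query vector after any adjustments."""
    related = []
    while doc_queue:                        # until the queue is empty
        doc_id, vector = doc_queue.popleft()
        if cosine(vector, query) > threshold:   # steps 203-204
            related.append(doc_id)              # step 205: store
            if update is not None:
                # Step 205: adjust the query before the next document.
                query = update(query, vector)
    return related, query
```

Each document is compared against the query vector as it stands at that moment, so an adjustment triggered by one document affects how every later document in the queue is judged.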

Step 206: when all of the document data have been processed, calculate the display feature value of each piece of related information, and sort the related information by display feature value;

Step 207: display the top N pieces of related information at preset time intervals, where N is a positive integer.

Specifically, the display feature value may also be called the heat of a follow-up, and the heat can be calculated as follows:

hc_{n} = \frac{T_{n} - t_{n1}}{(t_{n2} - t_{n1} + 1) \times \sqrt[3]{cm_{n}}} \times \log(fl_{n} + 1)

Here, hc_{n} is the display feature value of the n-th comment; T_{n} is the time of the n-th comment; t_{n1} is the time of the first comment on the microblog to which the n-th comment belongs; t_{n2} is the time of the last comment on that microblog; cm_{n} is the number of times that microblog has been commented on; and fl_{n} is the follower count of the user who published the microblog to which the n-th comment belongs.

Because a large number of follow-up reports may be tracked, presenting them directly would still leave the user unable to tell which reports attract the most attention and the liveliest discussion. Therefore, the heat of each follow-up is calculated, and the more actively discussed reports in the topic are presented in order.
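The heat formula and the top-N selection can be sketched as below. The text does not specify time units or the log base, so this sketch assumes times are plain numbers in one consistent unit (for example hours) and uses the natural logarithm; the function names are hypothetical.

```python
import math


def heat(comment_time, first_comment_time, last_comment_time,
         comment_count, follower_count):
    """Display feature value hc_n of one comment.
    Assumes comment_count >= 1 and times in one consistent numeric unit."""
    elapsed = comment_time - first_comment_time          # T_n - t_n1
    span = last_comment_time - first_comment_time + 1    # t_n2 - t_n1 + 1
    return (elapsed / (span * comment_count ** (1 / 3))
            * math.log(follower_count + 1))


def top_n(scored_items, n):
    # scored_items: list of (item_id, heat) pairs; highest heat first.
    return sorted(scored_items, key=lambda pair: pair[1], reverse=True)[:n]
```

Calling `top_n` again at each preset interval with freshly computed heat values gives the periodically refreshed list of the N hottest follow-ups.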

It should be noted that, for simplicity of description, the method embodiments are expressed as a series of action combinations, but those skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the present invention.

Referring to Fig. 3, a structural block diagram of an embodiment of a topic tracking device based on microblog data according to the present invention is shown. The device may specifically comprise the following modules:

A first space vector establishing module 301, configured to collect the document data of a plurality of microblog web pages and establish a first space vector for each document data;

In a preferred embodiment of the present invention, the first space vector establishing module 301 may comprise the following submodules:

A second space vector establishing module 302, configured to obtain a second space vector of a preset topic;

In a preferred embodiment of the present invention, the second space vector comprises a query vector, and the second space vector establishing module 302 may comprise the following submodules:

A topic vector obtaining submodule, configured to obtain the topic vector of the preset topic;

A query vector obtaining submodule, configured to use the topic vector as the query vector.

A similarity comparison module 303, configured to successively calculate the similarity between the first space vector of each document data and the second space vector;

As a preferred example of the embodiment of the present invention, the similarity between the first space vector and the second space vector may be calculated by the cosine method.

A related information judging module 304, configured to judge, according to the similarity, whether the document data corresponding to the first space vector is related information of the preset topic;

In a preferred embodiment of the present invention, the related information judging module 304 may comprise the following submodules:

A related information storage submodule, configured to judge, when the similarity is greater than a preset threshold, that the document data corresponding to the first space vector is related information of the preset topic, and to store the related information;

A second space vector updating submodule, configured to update the second space vector according to the first space vector;

A related information sorting module 305, configured to calculate the display feature value of each piece of related information when all of the document data have been processed, and to sort the related information by display feature value;

A related information display module 306, configured to display the top N pieces of related information at preset time intervals, where N is a positive integer.

Because the device embodiment of described Fig. 3 is substantially corresponding to preceding method embodiment, so not detailed part in the description of present embodiment can just not given unnecessary details at this referring to the related description among the preceding method embodiment.

Each embodiment in this instructions all adopts the mode of going forward one by one to describe, and what each embodiment stressed is and the difference of other embodiment that identical similar part is mutually referring to getting final product between each embodiment.For device embodiment, because it is similar substantially to method embodiment, so description is fairly simple, relevant part gets final product referring to the part explanation of method embodiment.

Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) that contain computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Although preferred embodiments of the present invention have been described, those skilled in the art may make further changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the invention.

Finally, it should also be noted that, in this document, the terms "comprise", "include", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that comprises the element.

The topic tracking method and device based on microblog data provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and embodiments of the invention, and the description of the above embodiments is intended only to aid understanding of the method of the invention and its core idea. Meanwhile, those of ordinary skill in the art may, following the idea of the invention, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A topic tracking method based on microblog data, characterized by comprising:

acquiring a second space vector of a preset topic;

2. The method according to claim 1, characterized in that the judging, according to the similarity, whether the document data corresponding to the first space vector is related information of the preset topic specifically comprises:

3. The method according to claim 1 or 2, characterized by further comprising:

4. The method according to claim 1, characterized in that the step of collecting document data of a plurality of webpages and establishing a first space vector for each document data comprises:

acquiring feature information of the document data;

5. The method according to claim 1, characterized in that the second space vector comprises a query vector, and the step of acquiring the second space vector of the preset topic comprises:

acquiring a topic vector of the preset topic;

using the topic vector as the query vector.

6. The method according to claim 1 or 2, characterized by further comprising:

7. A topic tracking device based on microblog data, characterized by comprising:

8. The device according to claim 7, characterized in that the related-information judging module comprises the following sub-modules:

9. The device according to claim 7 or 8, characterized by further comprising:

10. The device according to claim 7, characterized in that the first-space-vector establishing module comprises: