CN103324666A - Topic tracing method and device based on micro-blog data - Google Patents

Publication number
CN103324666A
Authority
CN
China
Prior art keywords
space vector
document data
topic
related information
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013101778909A
Other languages
Chinese (zh)
Inventor
杜毅
罗峰
黄苏支
李娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
IZP (BEIJING) TECHNOLOGIES Co Ltd
Original Assignee
IZP (BEIJING) TECHNOLOGIES Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by IZP (BEIJING) TECHNOLOGIES Co Ltd
Priority to CN2013101778909A
Publication of CN103324666A
Legal status: Pending

Abstract

The invention provides a topic tracking method and device based on microblog data. The method comprises the following steps: collecting document data of multiple microblog webpages and building a first space vector for each piece of document data; acquiring a second space vector of a preset topic; computing, in turn, the similarity between the first space vector of each piece of document data and the second space vector; and judging, according to the similarity, whether the document data corresponding to the first space vector is related information of the preset topic. The invention addresses the shortcomings of the prior art, in which topic drift occurs easily and topic tracking quality is low.

Description

A topic tracking method and device based on microblog data
Technical Field
The present invention relates to the technical field of network information processing, and in particular to a topic tracking method based on microblog data and a topic tracking device based on microblog data.
Background
With the rapid development of the Internet, how to make effective use of online public opinion has become an important research subject. Online public opinion is the sum of the cognition, attitudes, emotions and behavioral tendencies that people express about an event, triggered by that event and propagated over the Internet. In the study of online public opinion, topic (event) tracking is an important technique: it answers the question "what information, now and in the future, is related to this topic?".
Microblogging, an emerging form of communication, has become one of the main platforms on which people obtain information and publish news. Users can freely and openly express opinions on any hot spot or event of online public opinion and discuss it with others. Topic (event) tracking for microblogs means the following: for a given hot topic (event) on a microblog, the user can learn what information has been published about the topic in the past, and also what information related to the topic appears now and in the future. Given one or several predetermined topics and some microblog information about them, a tracking algorithm identifies subsequently published microblog messages or comments that belong to a specific topic and provides links to the pages of those messages or comments.
At present, many topic tracking algorithms are built on training texts, for example the KNN algorithm and decision tree algorithms, while tracking tasks for microblog topics mostly use topic tracking algorithms based on a query vector. However, real-world events develop and change over time, so the corresponding topic also changes dynamically; this is the phenomenon of topic drift. Simply using a topic tracking algorithm based on a query vector cannot solve the topic drift problem, which results in low tracking quality.
Therefore, a problem that those skilled in the art urgently need to solve is to provide a topic tracking mechanism based on microblog data that overcomes the shortcomings of the prior art, in which topic drift occurs easily and topic tracking quality is low.
Summary of the Invention
The technical problem to be solved by the present invention is to provide a topic tracking method based on microblog data that reduces the occurrence of topic drift and improves topic tracking quality.
Correspondingly, the present invention also provides a topic tracking device based on microblog data, to ensure that the above method can be applied in practice.
To solve the above problem, the invention discloses a topic tracking method based on microblog data, comprising:
collecting document data of a plurality of microblog webpages, and building a first space vector for each piece of document data;
obtaining a second space vector of a preset topic;
calculating, in turn, the similarity between the first space vector of each piece of document data and the second space vector;
judging, according to the similarity, whether the document data corresponding to the first space vector is related information of the preset topic.
Preferably, judging according to the similarity whether the document data corresponding to the first space vector is related information of the preset topic specifically comprises:
if the similarity is greater than a preset threshold, judging that the document data corresponding to the first space vector is related information of the preset topic, storing the related information, and updating the second space vector according to the first space vector;
returning to the step of calculating, in turn, the similarity between the first space vector of each piece of document data and the second space vector, until the document data of the plurality of webpages have all been processed.
Preferably, the method further comprises:
when processing of the plurality of document data is complete, calculating a display feature value for each piece of related information, and sorting the related information by display feature value;
displaying the top N ranked pieces of related information at preset time intervals, where N is a positive integer.
Preferably, the step of collecting document data of a plurality of webpages and building a first space vector for each piece of document data comprises:
obtaining feature information of the document data;
performing word segmentation on the feature information to obtain the words that make up the feature information;
calculating the weight of each word, and building the first space vector of the document data from the word weights.
Preferably, the second space vector comprises a query vector, and the step of obtaining the second space vector of the preset topic comprises:
obtaining a topic vector of the preset topic;
using the topic vector as the query vector.
Preferably, the method further comprises:
calculating the similarity between the first space vector and the second space vector using the cosine method.
The invention also discloses a topic tracking device based on microblog data, comprising:
a first space vector building module, configured to collect document data of a plurality of microblog webpages and build a first space vector for each piece of document data;
a second space vector building module, configured to obtain a second space vector of a preset topic;
a similarity comparison module, configured to calculate, in turn, the similarity between the first space vector of each piece of document data and the second space vector;
a related information judging module, configured to judge, according to the similarity, whether the document data corresponding to the first space vector is related information of the preset topic.
Preferably, the related information judging module comprises the following submodules:
a related information storage submodule, configured to judge, when the similarity is greater than a preset threshold, that the document data corresponding to the first space vector is related information of the preset topic, and to store the related information;
a second space vector updating submodule, configured to update the second space vector according to the first space vector.
Preferably, the device further comprises:
a related information sorting module, configured to calculate, when processing of the plurality of document data is complete, a display feature value for each piece of related information, and to sort the related information by display feature value;
a related information display module, configured to display the top N ranked pieces of related information at preset time intervals, where N is a positive integer.
Preferably, the first space vector building module comprises:
a feature information obtaining submodule, configured to obtain feature information of the document data;
a word segmentation submodule, configured to perform word segmentation on the feature information to obtain the words that make up the feature information;
a word weight calculation submodule, configured to calculate the weight of each word and build the first space vector of the document data from the word weights.
Compared with the prior art, the present invention has the following advantages:
First, the embodiments of the present invention combine the characteristics of traditional news topic tracking with those of microblogs and, based on the hot topic model (built in the monitoring stage), propose a topic tracking algorithm in which the query vector is adjusted dynamically. A query vector is built from existing document data: once the topic detection task is complete, the topic vector is used as the query vector. The similarity between the text vector of each tracked microblog document and the query vector is calculated, and by comparing the similarity with a preset threshold it is determined whether the tracked document is related information (a follow-up report) of the topic. The query vector is then adjusted dynamically, which addresses topic drift in microblog data and thereby improves topic tracking quality.
Second, because many follow-up reports may be tracked, presenting them directly would still leave the user unable to tell which reports attract the most attention and the liveliest discussion. The hotness of each follow-up is therefore calculated, so that the more actively discussed reports in the topic are presented in order.
Brief Description of the Drawings
Fig. 1 is a flow chart of the steps of Embodiment 1 of a topic tracking method based on microblog data according to the present invention;
Fig. 2 is a flow chart of the steps of Embodiment 2 of a topic tracking method based on microblog data according to the present invention;
Fig. 3 is a structural block diagram of an embodiment of a topic tracking device based on microblog data according to the present invention.
Detailed Description
To make the above objects, features and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and specific embodiments.
Referring to Fig. 1, a flow chart of the steps of Embodiment 1 of a topic tracking method based on microblog data according to the present invention is shown, which may specifically comprise the following steps:
Step 101: collect document data of a plurality of microblog webpages, and build a first space vector for each piece of document data;
Step 102: obtain a second space vector of a preset topic;
Step 103: calculate, in turn, the similarity between the first space vector of each piece of document data and the second space vector;
Step 104: judge, according to the similarity, whether the document data corresponding to the first space vector is related information of the preset topic.
The embodiments of the present invention combine the characteristics of traditional news topic tracking with those of microblogs and, based on the hot topic model (built in the monitoring stage), propose a topic tracking algorithm in which the query vector is adjusted dynamically. Specifically, when tracking related microblog information, the sparsity of prior reports need not be considered. A query vector is built from existing document data: once the topic detection task is complete, the topic vector is used as the query vector. The similarity between the text vector of each tracked microblog document and the query vector is calculated, and by comparing the similarity with a preset threshold it is determined whether the tracked document is related information (a follow-up report) of the topic. The query vector is then adjusted dynamically, which addresses topic drift in microblog data and thereby improves topic tracking quality.
Referring to Fig. 2, a flow chart of the steps of Embodiment 2 of a topic tracking method based on microblog data according to the present invention is shown, which may specifically comprise the following steps:
Step 201: collect document data of a plurality of microblog webpages, and build a first space vector for each piece of document data;
Specifically, a microblog (MicroBlog) is a platform for sharing, propagating and obtaining information based on user relationships. Users can set up personal communities through various clients such as WEB and WAP, post updates of about 140 characters, and share them instantly.
Microblogs have the following characteristics:
(1) Information acquisition on microblogs is highly autonomous and selective. Users can decide whether to "follow" a given user according to their own interests and the category and quality of the content that user publishes, and can group all the users they "follow";
(2) The influence of microblog publicity is highly elastic and closely tied to content quality; it depends on how many users currently "follow" the publisher. The more attractive and newsworthy the information a user publishes, the more people become interested in and follow that user, and the greater the user's influence. In addition, the authentication and recommendation mechanisms of the microblog platform itself also help increase the number of followers;
(3) Microblog content is short and concise. Content is limited to about 140 characters; it is brief, need not be long-winded, and the barrier to posting is low;
(4) Information sharing is fast and convenient. Information can be published instantly, anytime and anywhere, from any network-connected platform, and the speed of publication exceeds that of traditional print media and online media.
In a specific implementation, the document data of a microblog may also be called microblog post data. Microblog post data can be collected through an open interface and stored in a queue, from which it is then taken out for processing.
In a preferred embodiment of the present invention, step 201 may comprise the following substeps:
Substep S01: obtain feature information of the document data;
In practice, collected document data is stored in the database with almost no processing, so the raw document data contains a great deal of worthless information, such as advertisements, repeated site navigation bars and semi-structured HTML code. This worthless information considerably reduces the accuracy of topic detection. The raw microblog data therefore needs to be processed to extract the valuable feature information, which is stored in a more suitable form (such as XML or JSON) to allow flexible subsequent processing.
As a preferred example of this embodiment, the feature information may include the posting time of the microblog, the number of reposts, the comment content, the comment time, the user who made each comment, the number of followers, information about the blogger, and so on.
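As an illustration of storing extracted feature information in JSON form, the following is a minimal sketch. The field names (`post_time`, `repost_count`, `follower_count`, `blogger`, `comments`) and all values are made up for the example; the patent lists the kinds of feature information but does not define a schema.

```python
import json

# Hypothetical record for one microblog post; every field name and value
# here is an illustrative assumption, not a schema from the patent.
record = {
    "post_time": "2013-05-14T09:30:00",
    "repost_count": 12,
    "follower_count": 3400,
    "blogger": {"user_id": "u1001", "verified": True},
    "comments": [
        {"time": "2013-05-14T09:42:10", "user_id": "u2002", "content": "stay safe"},
    ],
}

# ensure_ascii=False keeps Chinese text readable in the stored string.
serialized = json.dumps(record, ensure_ascii=False)
restored = json.loads(serialized)
print(restored["comments"][0]["user_id"])
```

Keeping comments as a nested list means the per-comment time and user stay attached to the post they belong to, which the later ranking step relies on.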
As an example, the Document Object Model (DOM) can be used to identify the feature information in the document data effectively. The process of parsing a webpage into a DOM tree is as follows:
1) Analyze the code structure of the webpage and identify the minimal structural units in it, that is, units that contain no further webpage structure markup;
2) Map each minimal structural unit to a terminal branch of the DOM tree; its content is the leaf node on that terminal branch;
3) Identify the structural units one level above these minimal units and map them to nodes; the minimal structural units that belong to the same structural unit are linked under that node;
4) Keep expanding outward, identifying structural units, generating the corresponding nodes and linking the existing lower-level nodes under them, and stop when the only structural code left to identify is <HTML>...</HTML>; <HTML> is mapped to the root node of the tree.
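The tree that results from this process can be sketched with Python's standard `html.parser` module. Note the swap: the text describes building the tree bottom-up from the minimal units, whereas this sketch builds it top-down as the parser encounters tags; the resulting parent/child structure is the same. The `Node`, `TreeBuilder` and `leaf_texts` names and the sample page are assumptions made for the example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not become the "current" node.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One structural unit of the page: a tag, its text, and its child units."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM-like tree while the parser walks the markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag and close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text += data.strip()


def leaf_texts(node):
    """Collect the text of the minimal structural units (leaf nodes) in document order."""
    if not node.children:
        return [node.text] if node.text else []
    texts = []
    for child in node.children:
        texts.extend(leaf_texts(child))
    return texts


page = "<html><body><div><p>flood warning issued</p><span>12 reposts</span></div></body></html>"
builder = TreeBuilder()
builder.feed(page)
print(leaf_texts(builder.root))
```

In a real crawler the leaf texts would then be mapped to feature fields such as posting time or repost count; that mapping depends on the page layout and is not shown.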
Substep S02: perform word segmentation on the feature information to obtain the words that make up the feature information;
In practice, functional information in microblogs, such as repost and comment markers, is repetitive. Moreover, natural language consists not only of the nouns, verbs and adjectives that carry most of the meaning of a text, but also of pronouns, articles, conjunctions, prepositions, punctuation marks and the like, which contribute little to the meaning and can be removed. To reduce the computation required by subsequent processing and to improve the efficiency of the algorithm and the accuracy of topic detection, the feature information needs to be preprocessed. The preprocessing may include Chinese word segmentation, part-of-speech tagging and so on, and yields the words that make up the feature information.
Chinese word segmentation means cutting a sequence of Chinese characters into individual words, that is, recombining a continuous character sequence into a word sequence according to a given standard. Chinese word segmentation is the foundation of text mining, and it allows a computer to identify the meaning of a sentence automatically. Commonly used Chinese word segmentation algorithms fall into three categories: methods based on string matching, methods based on understanding, and methods based on statistics. Depending on whether segmentation is combined with part-of-speech tagging, they can also be divided into pure segmentation methods and integrated methods that combine segmentation with tagging. Those skilled in the art may use any one or more of the above algorithms as actually needed; the embodiments of the present invention place no restriction on this.
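To make the string-matching category concrete, here is a minimal sketch of forward maximum matching, one common dictionary-based segmentation method, followed by stop-word removal. The patent does not prescribe this particular algorithm; the tiny dictionary, the stop-word set and the `max_len` parameter are assumptions for illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text by repeatedly taking the longest dictionary word at the cursor.

    A single character is always accepted as a fallback, so the loop
    advances even when no dictionary word matches.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def remove_stop_words(words, stop_words):
    """Drop words that carry little meaning (particles, punctuation and so on)."""
    return [w for w in words if w not in stop_words]


# Hypothetical dictionary and stop-word list for the example.
dictionary = {"微博", "话题", "跟踪", "数据"}
stop_words = {"的", "，", "。"}

tokens = forward_max_match("微博的话题跟踪", dictionary)
print(remove_stop_words(tokens, stop_words))
```

A production system would typically use a full lexicon or a statistical segmenter; the sketch only shows where segmentation and filtering sit in the pipeline.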
Substep S03: calculate the weight of each word, and build the first space vector of the document data from the word weights.
Within document data, different words contribute differently to the overall meaning. To reflect the importance of each word in the document data, and its ability to distinguish the meaning of one piece of document data from another, the words in the feature information need to be given different weights.
As a preferred example of this embodiment, the weight of a word can be calculated with the following formula:
$$w_{ij} = \frac{tf_{ij} \times \log\left(\frac{N}{m_{ij}} + 0.01\right)}{\sqrt{\sum_{j=1}^{M}\left[tf_{ij} \times \log\left(\frac{N}{m_{ij}} + 0.01\right)\right]^{2}}}$$
where $D_i$ is the i-th piece of document data, $t_{ij}$ is the j-th feature item in the i-th piece of document data, $w_{ij}$ is the weight of feature item $t_{ij}$, $tf_{ij}$ is the number of times $t_{ij}$ occurs in document data $D_i$, $\log\left(\frac{N}{m_{ij}} + 0.01\right)$ is the inverse document frequency (IDF), $N$ is the current total number of pieces of document data, $M$ is the total number of feature items in document data $D_i$, and $m_{ij}$ is the number of pieces of document data that contain feature item $t_{ij}$ or contain a feature item whose similarity to $t_{ij}$ is greater than α (α is a preset value, usually between 0.8 and 1).
Of course, the above method of calculating weights is only one example of the embodiments of the present invention; those skilled in the art may use other weight calculation methods, and the present invention places no restriction on this.
In a specific implementation, the first space vector may be called a text vector. Building a vector space model (VSM) converts a piece of document data into a space vector, turning a language processing problem into a mathematical problem that is easy to compute. Each word in the feature information corresponds to one dimension of the vector; the dimensions obtained from the whole set of words make up the first space vector, and how representative each word is of the document data is expressed by the weight on its dimension.
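The weighting and vector construction can be sketched as follows. This is a simplified reading of the weight formula above, under three stated assumptions: the denominator is taken as the square root of the sum of squares (the usual normalised TF-IDF form), the logarithm is natural, and $m_{ij}$ counts only documents that contain the exact word (the "similar feature item" clause with threshold α is omitted). The function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn segmented documents (lists of words) into sparse weight vectors.

    Each vector is a dict mapping word -> normalised TF-IDF weight, so one
    dict key plays the role of one dimension of the space vector.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for words in docs:
        doc_freq.update(set(words))

    vectors = []
    for words in docs:
        term_freq = Counter(words)
        raw = {
            term: count * math.log(n_docs / doc_freq[term] + 0.01)
            for term, count in term_freq.items()
        }
        norm = math.sqrt(sum(value * value for value in raw.values()))
        vectors.append({t: (v / norm if norm else 0.0) for t, v in raw.items()})
    return vectors


docs = [
    ["flood", "rescue"],
    ["flood", "warning"],
    ["flood", "flood", "shelter"],
]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

Because "flood" occurs in every document, its IDF is close to zero and the rarer word dominates each vector, which is the discriminating behaviour the weighting is meant to produce.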
Step 202: obtain a second space vector of a preset topic;
In a preferred embodiment of the present invention, step 202 may comprise the following substeps:
Substep S03: obtain the topic vector of the preset topic;
Substep S04: use the topic vector as the query vector.
In the embodiments of the present invention, the second space vector may be a query vector. In topic detection, building a vector space model for each topic is a basic approach, and the vector space model of a topic is represented by a topic vector. Querying with the topic vector retrieves the feature information that corresponds to it, so the embodiments of the present invention use the topic vector built during topic detection as the query vector.
Step 203: calculate, in turn, the similarity between the first space vector of each piece of document data and the second space vector;
In a specific implementation, similarity can be calculated in many ways. Common similarity measures include cosine similarity, the Pearson coefficient, adjusted cosine similarity, Minkowski distance, KL distance and the Dice coefficient. Those skilled in the art may use any similarity calculation method to calculate the similarity between the text vector and the query vector; the embodiments of the present invention place no restriction on this.
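Of the measures listed, cosine similarity is the one the patent names as preferred elsewhere. A minimal sketch for sparse vectors stored as dictionaries follows; the dict representation and the function name are assumptions for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (dicts of word -> weight).

    Returns 0.0 when either vector is empty or all-zero, so callers can
    compare the result against a threshold without special cases.
    """
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


text_vector = {"flood": 0.6, "rescue": 0.8}
query_vector = {"flood": 1.0}
print(cosine_similarity(text_vector, query_vector))
```

Only the words the two vectors share contribute to the dot product, so the cost is proportional to the size of the smaller document rather than the whole vocabulary.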
Step 204: judge, according to the similarity, whether the document data corresponding to the first space vector is related information of the preset topic;
Step 205: if the similarity is greater than a preset threshold, judge that the document data corresponding to the first space vector is related information of the preset topic, store the related information, and update the second space vector according to the first space vector; return to step 203 until the document data of the plurality of webpages have all been processed;
In a specific implementation, the related information may be a follow-up of the topic. If the degree of correlation between the text vector of a piece of document data and the query vector of the preset topic is greater than the preset threshold, the document data is a follow-up post or follow-up report of the preset topic. The follow-up is stored in the database that holds the preset topic, and the query vector is adjusted according to the text vector, which reduces the occurrence of topic drift and improves topic tracking quality.
As a preferred example of this embodiment, the second space vector can be updated according to the first space vector as follows:
(1) Obtain the contrast feature item of the query vector. The contrast feature item is the feature item with the lowest weight among the feature items of the query vector other than the four highest-weighted ones;
(2) Locate the user ID to which the original document data belongs, that is, the document data whose text vector has a similarity greater than the preset threshold;
(3) Judge whether, within a preset time period, the microblog post data belonging to this user ID contains other document data, apart from the above document data, that belongs to the topic of the query vector; if so, go to (4);
(4) Among the text vectors of the document data that belong to the same user ID and fall under the same topic, obtain the feature item with the largest weight;
(5) Judge whether the weight of that feature item is greater than the weight of the contrast feature item; if so, replace the contrast feature item with the feature item of largest weight; if not, do nothing.
In the embodiments of the present invention, the query vector of the topic is continually adjusted and corrected during topic tracking, which reduces redundant information in the tracking process and improves tracking quality. Because not every adjustment can be guaranteed to be accurate, the embodiments keep the four highest-weighted feature items of the query vector unchanged, to limit the effect of mistaken adjustments.
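The update rule in steps (1) to (5) can be sketched as below. Several details are assumptions, because the text leaves them open: vectors are dicts of feature item to weight, the caller has already gathered the same user's other on-topic documents for the preset time period (steps 2 and 3), and candidate items that already appear in the query vector are skipped so that a protected top-four weight is never overwritten. The function name and `keep_top` parameter are hypothetical.

```python
def update_query_vector(query, matched_vector, other_vectors, keep_top=4):
    """Return a new query vector with at most one feature item replaced.

    query, matched_vector: dicts mapping feature item -> weight.
    other_vectors: text vectors of the same user's other documents on the
    same topic within the preset time period; empty means "no update".
    """
    if not other_vectors or len(query) <= keep_top:
        return dict(query)

    # Step 1: the contrast item is the lowest-weighted item outside the top `keep_top`.
    ranked = sorted(query.items(), key=lambda item: item[1], reverse=True)
    contrast_term, contrast_weight = min(ranked[keep_top:], key=lambda item: item[1])

    # Step 4: highest-weighted item across the user's documents on this topic.
    # Assumption: items already present in the query vector are not candidates.
    candidates = [
        (term, weight)
        for vector in [matched_vector, *other_vectors]
        for term, weight in vector.items()
        if term not in query
    ]
    if not candidates:
        return dict(query)
    best_term, best_weight = max(candidates, key=lambda item: item[1])

    # Step 5: replace only if the new item outweighs the contrast item.
    updated = dict(query)
    if best_weight > contrast_weight:
        del updated[contrast_term]
        updated[best_term] = best_weight
    return updated


query = {"flood": 0.9, "river": 0.8, "rain": 0.7, "dam": 0.6, "traffic": 0.1}
new_query = update_query_vector(query, {"rescue": 0.5, "flood": 0.4}, [{"shelter": 0.3}])
print(new_query)
```

Returning a new dict rather than mutating the input makes it easy to compare the query vector before and after an adjustment, or to discard an adjustment that is later judged to be mistaken.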
Of course, the query vector can also be updated manually to achieve the purpose of the present invention.
After the query vector has been adjusted for the first document data taken from the queue and the new query vector has been obtained, the second document data is taken from the queue and steps 203 to 205 are repeated, and so on until the queue is empty, that is, until the document data of the plurality of webpages have all been processed.
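The queue-driven loop over steps 203 to 205 can be sketched as follows. The similarity measure and the update rule are passed in as functions, since the patent allows several choices for each; the toy `dot` and `absorb` helpers, the sample posts and the threshold of 0.4 are assumptions used only to show how an adjusted query vector changes what is tracked.

```python
from collections import deque


def track_topic(documents, query, threshold, similarity, update):
    """Process (doc_id, vector) pairs in order, adjusting the query as matches arrive.

    Returns the ids judged to be related information and the final query vector.
    """
    queue = deque(documents)
    related = []
    while queue:  # stop when the queue is empty
        doc_id, vector = queue.popleft()
        if similarity(vector, query) > threshold:
            related.append(doc_id)          # store the related information
            query = update(query, vector)   # adjust the query vector
    return related, query


def dot(a, b):
    """Toy similarity: plain dot product of two sparse vectors."""
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())


def absorb(query, vector):
    """Toy update: add the matched document's new terms, keep existing query weights."""
    merged = dict(vector)
    merged.update(query)
    return merged


posts = [
    ("d1", {"flood": 1.0}),
    ("d3", {"flood": 0.5, "rescue": 0.5}),
    ("d2", {"rescue": 1.0}),
]
related, final_query = track_topic(posts, {"flood": 1.0}, 0.4, dot, absorb)
print(related, final_query)
```

With a static query (an `update` that returns the query unchanged), the last post would not be tracked, because it shares no term with the original topic vector. That difference is the drift-handling effect the dynamic adjustment is aiming for.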
Step 206: when processing of the plurality of document data is complete, calculate the display feature value of each piece of related information, and sort the related information by display feature value;
Step 207: display the top N ranked pieces of related information at preset time intervals, where N is a positive integer.
Specifically, the display feature value may also be called the hotness of a follow-up, and it can be calculated as follows:
$$hc_n = \frac{T_n - t_{n1}}{t_{n2} - t_{n1} + 1} \times cm_n^{3} \times \log(fl_n + 1)$$
where $hc_n$ is the display feature value of the n-th comment; $T_n$ is the comment time of the n-th comment; $t_{n1}$ is the time of the first comment on the microblog to which the n-th comment belongs; $t_{n2}$ is the time of the last comment on that microblog; $cm_n$ is the number of times that microblog has been commented on; and $fl_n$ is the number of followers of the user who published that microblog.
Because many follow-up reports may be tracked, presenting them directly would still leave the user unable to tell which reports attract the most attention and the liveliest discussion. The hotness of each follow-up is therefore calculated, so that the more actively discussed reports in the topic are presented in order.
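The hotness calculation and top-N selection can be sketched as below. The sketch follows the hotness formula above with several assumptions: times are plain numbers in a common unit such as seconds, the logarithm is natural, and the power applied to the comment count is exposed as an `exponent` parameter defaulting to 3, because the extracted formula can be read either as a cube or as a cube root (in which case `exponent=1/3`). The function names and sample items are hypothetical.

```python
import math


def hotness(comment_time, first_time, last_time, comment_count, follower_count, exponent=3):
    """Display feature value of one follow-up.

    comment_time, first_time, last_time: times in the same numeric unit.
    exponent: power applied to the comment count (assumed, see lead-in).
    """
    time_factor = (comment_time - first_time) / (last_time - first_time + 1)
    return time_factor * comment_count ** exponent * math.log(follower_count + 1)


def top_n(items, n):
    """Sort follow-ups by hotness, highest first, and keep the first n."""
    return sorted(items, key=lambda item: item["hotness"], reverse=True)[:n]


follow_ups = [
    {"id": "c1", "hotness": hotness(50, 0, 99, 2, 100)},
    {"id": "c2", "hotness": hotness(50, 0, 99, 5, 100)},
    {"id": "c3", "hotness": hotness(10, 0, 99, 5, 100)},
]
print([item["id"] for item in top_n(follow_ups, 2)])
```

Re-running `top_n` on a schedule would give the periodic refresh described in step 207; the scheduling itself is left out.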
It should be noted that, for simplicity of description, the method embodiments are expressed as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the present invention.
With reference to Fig. 3, show the structural framing figure of a kind of Topic Tracking device embodiment based on the microblogging data of the present invention, specifically can comprise with lower module:
First space vector is set up module 301, is used for gathering the document data of a plurality of microblogging webpages, sets up first space vector of each document data;
In a kind of preferred embodiment of the embodiment of the invention, first space vector is set up module 301 can comprise following submodule:
Characteristic information obtains submodule, is used for obtaining the characteristic information of described document data;
The participle submodule is used for described characteristic information is carried out participle, obtains the vocabulary of composition characteristic information;
Vocabulary weight calculation submodule is used for calculating the weight of described vocabulary, and sets up first space vector of document data according to the weight of described vocabulary.
Second space vector is set up module 302, is used for obtaining second space vector of default topic;
In a kind of preferred embodiment of the embodiment of the invention, described second space vector comprises query vector, and second space vector is set up module 302 can comprise following submodule:
The topic vector obtains submodule, is used for obtaining the topic vector of described default topic;
Query vector is obtained submodule, is used for described topic vector as query vector.
Similarity comparison module 303 is for the similarity of calculating first space vector and second space vector of described document data successively;
As a kind of preferred exemplary of the embodiment of the invention, can adopt Method of Cosine to calculate the similarity of described first space vector and second space vector.
A related information judging module 304, configured to judge, according to the similarity, whether the document data corresponding to the first space vector is related information of the preset topic;
In a preferred embodiment of the present invention, the related information judging module 304 may comprise the following submodules:
A related information storage submodule, configured to, when the similarity is greater than a preset threshold, determine that the document data corresponding to the first space vector is related information of the preset topic, and store the related information;
A second space vector updating submodule, configured to update the second space vector according to the first space vector;
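The judging and updating steps can be sketched as below. This section does not state how the second space vector is updated, so the blending rule here (a weighted average controlled by a hypothetical `alpha` parameter) is purely an assumption for illustration, as are the function name and the default threshold value.

```python
def judge_and_update(similarity, doc_vec, topic_vec, threshold=0.3, alpha=0.2):
    """Return (is_related, topic_vec_after_update).

    The document counts as related only when similarity is strictly greater
    than the threshold; only then is the topic vector moved toward it.
    """
    if similarity <= threshold:
        return False, topic_vec
    words = set(topic_vec) | set(doc_vec)
    # Assumed update rule: blend the old topic vector with the new document.
    updated = {
        word: (1.0 - alpha) * topic_vec.get(word, 0.0) + alpha * doc_vec.get(word, 0.0)
        for word in words
    }
    return True, updated
```

Updating the topic vector only with documents that pass the threshold lets the tracked topic follow new vocabulary while limiting the influence of unrelated posts.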
A related information sorting module 305, configured to, when processing of the plurality of pieces of document data is complete, calculate a display feature value for each piece of related information and sort the related information by the display feature value;
A related information display module 306, configured to display the top N pieces of related information in the sorted order at preset time intervals, where N is a positive integer.
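The sorting and display modules can be illustrated with the short sketch below. How the display feature value is computed is not given in this section, so the score used here (a weighted sum of repost and comment counts) and the `RelatedItem` fields are hypothetical; the periodic refresh at preset intervals is left out.

```python
from dataclasses import dataclass


@dataclass
class RelatedItem:
    text: str
    reposts: int   # hypothetical popularity signal
    comments: int  # hypothetical popularity signal


def display_value(item, repost_weight=0.6, comment_weight=0.4):
    # Assumed scoring formula; the weights are illustrative only.
    return repost_weight * item.reposts + comment_weight * item.comments


def top_n(items, n):
    """Return the n items with the highest display value, best first."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return sorted(items, key=display_value, reverse=True)[:n]
```

A display component could call `top_n` once per preset time period and show the returned items.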
Because the device embodiment of Fig. 3 basically corresponds to the foregoing method embodiment, for details not covered in the description of this embodiment, reference may be made to the relevant description of the foregoing method embodiment, which is not repeated here.
The embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the others, and for parts that are the same or similar, the embodiments may be referred to one another. Because the device embodiment is basically similar to the method embodiment, its description is relatively brief; for relevant details, see the corresponding description of the method embodiment.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
Finally, it should also be noted that, in this document, the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the statement "comprising a ..." does not exclude the existence of additional identical elements in the process, method, article, or device that comprises the element.
The topic tracking method and device based on microblog data provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and embodiments of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. At the same time, those of ordinary skill in the art may, in accordance with the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (10)

1. A topic tracking method based on microblog data, characterized by comprising:
collecting document data of a plurality of microblog webpages, and establishing a first space vector for each piece of document data;
acquiring a second space vector of a preset topic;
calculating, in turn, the similarity between the first space vector of each piece of document data and the second space vector;
judging, according to the similarity, whether the document data corresponding to the first space vector is related information of the preset topic.
2. The method according to claim 1, characterized in that the judging, according to the similarity, whether the document data corresponding to the first space vector is related information of the preset topic specifically comprises:
if the similarity is greater than a preset threshold, determining that the document data corresponding to the first space vector is related information of the preset topic, storing the related information, and updating the second space vector according to the first space vector;
returning to the step of calculating, in turn, the similarity between the first space vector of the document data and the second space vector, until the plurality of pieces of webpage document data have all been processed.
3. The method according to claim 1 or 2, characterized by further comprising:
when processing of the plurality of pieces of document data is complete, calculating a display feature value for each piece of related information, and sorting the related information by the display feature value;
displaying the top N pieces of related information in the sorted order at preset time intervals, where N is a positive integer.
4. The method according to claim 1, characterized in that the step of collecting document data of a plurality of webpages and establishing a first space vector for each piece of document data comprises:
acquiring feature information of the document data;
performing word segmentation on the feature information to obtain the words that make up the feature information;
calculating the weights of the words, and establishing the first space vector of the document data according to the weights of the words.
5. The method according to claim 1, characterized in that the second space vector comprises a query vector, and the step of acquiring a second space vector of a preset topic comprises:
acquiring the topic vector of the preset topic;
using the topic vector as the query vector.
6. The method according to claim 1 or 2, characterized by further comprising:
using the cosine method to calculate the similarity between the first space vector and the second space vector.
7. A topic tracking device based on microblog data, characterized by comprising:
a first space vector establishing module, configured to collect document data of a plurality of microblog webpages and establish a first space vector for each piece of document data;
a second space vector establishing module, configured to acquire a second space vector of a preset topic;
a similarity comparison module, configured to calculate, in turn, the similarity between the first space vector of each piece of document data and the second space vector;
a related information judging module, configured to judge, according to the similarity, whether the document data corresponding to the first space vector is related information of the preset topic.
8. The device according to claim 7, characterized in that the related information judging module comprises the following submodules:
a related information storage submodule, configured to, when the similarity is greater than a preset threshold, determine that the document data corresponding to the first space vector is related information of the preset topic, and store the related information;
a second space vector updating submodule, configured to update the second space vector according to the first space vector.
9. The device according to claim 7 or 8, characterized by further comprising:
a related information sorting module, configured to, when processing of the plurality of pieces of document data is complete, calculate a display feature value for each piece of related information and sort the related information by the display feature value;
a related information display module, configured to display the top N pieces of related information in the sorted order at preset time intervals, where N is a positive integer.
10. The device according to claim 7, characterized in that the first space vector establishing module comprises:
a feature information acquisition submodule, configured to acquire feature information of the document data;
a word segmentation submodule, configured to perform word segmentation on the feature information to obtain the words that make up the feature information;
a word weight calculation submodule, configured to calculate the weights of the words and establish the first space vector of the document data according to the weights of the words.
CN2013101778909A 2013-05-14 2013-05-14 Topic tracing method and device based on micro-blog data Pending CN103324666A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013101778909A CN103324666A (en) 2013-05-14 2013-05-14 Topic tracing method and device based on micro-blog data

Publications (1)

Publication Number Publication Date
CN103324666A true CN103324666A (en) 2013-09-25

Family

ID=49193409

Country Status (1)

Country Link
CN (1) CN103324666A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040205457A1 (en) * 2001-10-31 2004-10-14 International Business Machines Corporation Automatically summarising topics in a collection of electronic documents
CN101980199A (en) * 2010-10-28 2011-02-23 北京交通大学 Method and system for discovering network hot topic based on situation assessment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
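孙胜平 (Sun Shengping): "Research on Hot Topic Detection and Tracking Technology for Chinese Microblogs", China Master's Theses Full-text Database, Information Science and Technology, 15 September 2011 (2011-09-15) *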

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572770A (en) * 2013-10-25 2015-04-29 华为技术有限公司 Method and device for extracting subjects
CN103729420B (en) * 2013-12-20 2017-05-03 广西贝腾科技服务有限公司 Microblog hotspot tracking system and method
CN103984731A (en) * 2014-05-19 2014-08-13 北京大学 Self-adaption topic tracing method and device under microblog environment
CN103984729A (en) * 2014-05-19 2014-08-13 北京大学 Microblog information tracing method and microblog information tracing method
CN103984731B (en) * 2014-05-19 2017-03-08 北京大学 Self adaptation topic tracking method and apparatus under microblogging environment
CN104484459B (en) * 2014-12-29 2019-07-23 北京奇虎科技有限公司 The method and device that entity in a kind of pair of knowledge mapping merges
CN104484459A (en) * 2014-12-29 2015-04-01 北京奇虎科技有限公司 Method and device for combining entities in knowledge map
TWI570579B (en) * 2015-07-23 2017-02-11 葆光資訊有限公司 An information retrieving method utilizing webpage visual features and webpage language features and a system using thereof
CN105528335B (en) * 2015-12-22 2018-10-09 北京奇虎科技有限公司 The method and apparatus for determining correlation between news
CN105630928A (en) * 2015-12-22 2016-06-01 北京奇虎科技有限公司 Text marking method and apparatus
CN105528335A (en) * 2015-12-22 2016-04-27 北京奇虎科技有限公司 Method and device for determining correlation among news
CN105630928B (en) * 2015-12-22 2019-06-21 北京奇虎科技有限公司 The identification method and device of text
CN105389397A (en) * 2015-12-22 2016-03-09 北京奇虎科技有限公司 Method and device for sequencing news
US10217025B2 (en) 2015-12-22 2019-02-26 Beijing Qihoo Technology Company Limited Method and apparatus for determining relevance between news and for calculating relevance among multiple pieces of news
CN105654113A (en) * 2015-12-23 2016-06-08 北京奇虎科技有限公司 Article fingerprint characteristic generation method and device
CN105654113B (en) * 2015-12-23 2020-02-21 北京奇虎科技有限公司 Article fingerprint feature generation method and device
CN105528336B (en) * 2015-12-23 2018-09-21 北京奇虎科技有限公司 The method and apparatus that more mark posts determine article correlation
CN105528336A (en) * 2015-12-23 2016-04-27 北京奇虎科技有限公司 Method and device for determining article correlation by multiple marks
CN106970924A (en) * 2016-01-14 2017-07-21 北京国双科技有限公司 A kind of topic sort method and device
CN106970924B (en) * 2016-01-14 2020-10-20 北京国双科技有限公司 Topic sorting method and device
CN107436877A (en) * 2016-05-25 2017-12-05 北京京东尚科信息技术有限公司 Much-talked-about topic method for pushing and device
CN107436877B (en) * 2016-05-25 2021-03-30 北京京东尚科信息技术有限公司 Hot topic pushing method and device
CN106202530A (en) * 2016-07-22 2016-12-07 北京邮电大学 Data processing method and device
CN106202530B (en) * 2016-07-22 2019-09-27 北京邮电大学 Data processing method and device
CN109885760A (en) * 2019-01-22 2019-06-14 上海交通大学 Information source tracing method and system based on user interest
CN109885763A (en) * 2019-01-26 2019-06-14 北京工业大学 A kind of blog article recommended method based on user's head portrait
CN110069732A (en) * 2019-03-29 2019-07-30 腾讯科技(深圳)有限公司 A kind of method, device and equipment that information is shown
CN110069732B (en) * 2019-03-29 2022-11-22 腾讯科技(深圳)有限公司 Information display method, device and equipment
CN111914597A (en) * 2019-05-09 2020-11-10 杭州睿琪软件有限公司 Document comparison identification method and device, electronic equipment and readable storage medium
CN111914597B (en) * 2019-05-09 2024-03-15 杭州睿琪软件有限公司 Document comparison identification method and device, electronic equipment and readable storage medium
CN110457599A (en) * 2019-08-15 2019-11-15 中国电子信息产业集团有限公司第六研究所 Hot topic method for tracing, device, server and readable storage medium storing program for executing
CN113420234A (en) * 2021-07-02 2021-09-21 青海师范大学 Microblog data acquisition method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20130925