CN102110140A

CN102110140A - Network-based method for analyzing opinion information in discrete text

Info

Publication number: CN102110140A
Application number: CN 201110030156
Authority: CN
Inventors: 赵峰; 李生红; 陈秀真; 李海燕; 黄慧琼
Original assignee: Guilin University of Electronic Technology
Current assignee: Guilin University of Electronic Technology
Priority date: 2011-01-26
Filing date: 2011-01-26
Publication date: 2011-06-29

Abstract

The invention relates to a network-based system for analyzing public opinion information in discrete text, belonging to the field of network information security. The system comprises the following modules: a discrete text information acquisition module, which collects network information at a preset analysis cycle; a discrete text tracking and restoration module, which restores ellipsis and remote anaphora in the original content to obtain text with a relatively complete structure and semantic information; a semantic information mining and feature extraction module, which mines semantic information and extracts features from the text using latent semantic indexing; a public opinion information clustering module, which clusters the information by combining a niche genetic algorithm with the K-Means method; a hot public opinion event discovery module, which mines hot public opinion from the topics and events obtained; and a background information processing and data support center, which analyzes the data and provides a network-specific expression lexicon, a dictionary of new network words, existing category information and existing hot topics. The invention addresses the problem that analysis is impaired because current online public opinion text has an incomplete structure, frequent ellipsis and remote anaphora, and many new network words, and it improves the accuracy of public opinion and hot event discovery by using an efficient clustering method.

Description

Network-based method for analyzing public opinion information in discrete text

Technical field

The present invention relates to network information analysis, and specifically to a network-based method for analyzing public opinion information in discrete text.

Background technology

With the development of Internet technology and rising living standards, the network has become the most important platform through which people obtain information and communicate day to day. According to the 26th Statistical Report on Internet Development in China issued by CNNIC, the number of Internet users in China has reached 420 million, and mobile phone users account for 62% of the users added in the first half of the year. These figures reflect the current scale and prospects of the Internet in China, and they also show that the way people communicate is shifting from traditional channels to computer networks and mobile networks. Speech on the Internet reflects users' views in real time and can strongly influence public opinion and its direction; in serious cases it can even trigger social incidents. The documents formed by such speech lack the complete structure of traditional documents, contain frequent ellipsis and remote anaphora, and include many new network words. It is therefore necessary to study them and to develop a corresponding public opinion analysis system.

Chinese patent CN200810147645.2 (title: a method for collecting network public opinion viewpoints) computes the frequency of hot words and the change in that frequency, takes the verbs and nouns in key sentences as feature values, clusters the key sentences by the cosine similarity between their feature vectors to obtain several sets of viewpoint topic sentences, and finally computes the sentiment orientation of each viewpoint topic sentence by combining a weighted sentiment dictionary with manual judgment. That method extracts hot words and clusters key sentences statistically, with the word as the unit, and it is feasible for text that has a complete structure.

We have found, however, that the structure of public opinion text has changed in the current network environment. With the sharp increase in mobile phone users and the development of network technology, communication platforms such as microblogs have emerged, and more and more of the information contributed to topic discussions comes from mobile phones. Such public opinion information no longer has a reasonably complete structure with a certain length and organization. The object of network public opinion processing is instead brief, highly elliptical information in the form of structurally incomplete discrete text, and the omitted terms and remote anaphora in it are problems that must be handled. At the same time, on today's online platforms, new words and network expressions that have acquired special meanings matter a great deal for reflecting users' opinions. Statistical methods alone cannot recover the semantics of these words, so the accuracy of topic and event clustering suffers. In addition, besides the many topic documents on the Internet, the comment documents attached to them also contain users' views and are an important part of the trend of online public opinion.

Summary of the invention

In view of the characteristics of current network public opinion information described above, the present invention proposes a network-based method for analyzing public opinion information in discrete text. By tracking and restoring the discrete text in the collected network information, it effectively reconstructs the ellipsis and remote anaphora in the network text stream. On this basis, latent semantic indexing is used to mine semantic information and select features. Finally, the public opinion information is analyzed.

The present invention is achieved by the following technical solution. The network-based method for analyzing public opinion information in discrete text involves discrete text information acquisition, discrete text information processing and the corresponding databases, and comprises the following steps:

a. The discrete text information acquisition module first collects network information at the set analysis cycle and saves it to the local database.

b. Next, the discrete text tracking and restoration module restores the elliptical parts and the remote anaphoric parts of the original content.

c. On the basis of step b, the semantic information mining and feature extraction module uses latent semantic indexing to mine semantics and extract features from the text.

d. The data obtained in step c enters the public opinion information clustering module, which clusters the information by combining a niche genetic algorithm with the K-Means method. The category information supplied by the background information processing and data support center is used at the same time to guide the topic and event clustering of the network information.

e. Finally, the hot public opinion event discovery module mines hot public opinion from the topics and events obtained by clustering. The final result is handed to the system administrator so that follow-up work can be carried out as required.

The discrete text information acquisition (step a) obtains the information stream from the network and saves it to the local database in HTML format. Because network information streams usually contain pictures and audio, and often a large number of advertising images, the locally saved HTML must be denoised to remove pictures, audio, advertisements and similar content so that only the text is kept. The web page denoising steps are as follows. First, the data in the HTML file is normalized: where element tags cross, as in <abc><def></abc></def>, the tags are paired and restored to the complete form <abc><def></def></abc><def></def>. Next, the HTML page is stored as a tree-shaped linked structure, so that after processing each HTML page corresponds to one HTML tree. Finally, current dynamic web technology generally stores page content in a database and uses templates for layout; at display time the content is fetched from the database and placed in the template. These templates typically divide the layout with table elements and use a single table to display the main text. Web page text is therefore handled as follows: based on the HTML tree generated above, the text within each table element is merged, and the text of the table with the largest amount of information is taken as the main text. The corresponding text, including the title, the body and the replies, is extracted in this way to give the discrete text information. A document index is also built during denoising to save important information such as the user ID, the time of participation in the discussion and the number of participants.
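The tag normalization in the first denoising step can be sketched with a stack of open elements. This is a minimal illustration, not the patent's implementation: the function name `repair_crossed_tags`, the regular expression and the small set of void elements are assumptions, and attributes are dropped for brevity.

```python
import re

# Matches an opening or closing tag and captures its name (attributes are ignored).
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")
VOID = {"br", "hr", "img", "input", "meta", "link"}  # elements that never get a closing tag

def repair_crossed_tags(html):
    """Re-nest crossed tags so that every element closes in stack order."""
    out, stack, pos = [], [], 0
    for match in TAG.finditer(html):
        out.append(html[pos:match.start()])  # copy the text between tags unchanged
        pos = match.end()
        closing, name = match.group(1) == "/", match.group(2).lower()
        if not closing:
            out.append(f"<{name}>")
            if name not in VOID:
                stack.append(name)
        elif name in stack:
            # Close everything opened after `name`, close `name`, then reopen the rest.
            reopened = []
            while stack[-1] != name:
                inner = stack.pop()
                out.append(f"</{inner}>")
                reopened.append(inner)
            stack.pop()
            out.append(f"</{name}>")
            for inner in reversed(reopened):
                stack.append(inner)
                out.append(f"<{inner}>")
        # A closing tag with no matching open tag is dropped.
    out.append(html[pos:])
    while stack:  # close anything still open at the end of the input
        out.append(f"</{stack.pop()}>")
    return "".join(out)

print(repair_crossed_tags("<abc><def></abc></def>"))
```

On the crossed example from the text, the sketch produces the same restored form, `<abc><def></def></abc><def></def>`. A production system would more likely rely on a full HTML parser, since real pages contain comments, scripts and attribute values that this regular expression does not handle.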

The discrete text tracking and restoration (step b) first uses the network-specific expression lexicon provided by the background information processing and data support center to locate, by the maximum matching principle, the parts of the discrete text that need attention, such as ellipsis and remote anaphora. Typical forms of ellipsis and remote anaphora include comments that only briefly cite another person's view (for example "support the original poster") without clearly stating the commenter's own view, and comments in the form of remote hyperlinks. The omitted or referenced original content is then located, either through the hierarchy of the HTML tree built in the discrete text information acquisition module or by visiting the remote hyperlink. Finally, the elliptical and anaphoric parts are replaced with the original content thus located. This module also removes special symbols from the discrete text. After this series of steps the discrete text has a more complete structure. The network-specific expression lexicon covers the unconventional expressions that occur in the language environment of network discrete text. It is built up gradually: the background information processing and data support center discovers and adds new network-specific terms, which accumulate alongside the existing ones.
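The maximum matching step can be illustrated with forward maximum matching, one common variant of the principle; the patent does not say which variant it uses. The lexicon entries below are hypothetical examples of forum expressions ("楼主" for the original poster, "楼上" for the post above), and the function name is an assumption.

```python
def forward_max_match(text, lexicon, max_len=None):
    """Scan left to right and report the longest lexicon phrase at each position."""
    if max_len is None:
        max_len = max((len(phrase) for phrase in lexicon), default=0)
    hits = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon:
                hits.append((i, candidate))
                i += size
                break
        else:
            i += 1  # no phrase starts here; move on one character
    return hits

# Hypothetical lexicon: "support the original poster", "original poster", "the post above".
LEXICON = {"支持楼主", "楼主", "楼上"}
print(forward_max_match("我支持楼主，楼上说得对", LEXICON))
```

Because the longest candidate is tried first, "支持楼主" is reported as one phrase rather than as the shorter "楼主" inside it. Each hit gives the offset at which a reference needs to be resolved in the later replacement step.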

The semantic information mining and feature extraction (step c) first segments the restored discrete text, that is, the public opinion documents, with the ICTCLAS word segmentation system of the Chinese Academy of Sciences, and computes term weights with TF-IDF to obtain a word-document matrix. The resulting word-document matrix usually has a high dimension, and latent semantic indexing can reduce dimension effectively while preserving semantics. To reduce the amount of computation, latent semantic indexing is therefore applied to the word-document matrix to reduce its dimension and to find the relations between words and concepts and between concepts and public opinion documents. Features are extracted on this basis, giving a reduced-dimension concept-public opinion document matrix, which is passed to the next module as input for clustering. In addition, to cope with new network words during segmentation, the background information processing and data support center provides a user dictionary, which improves segmentation accuracy. The concept-public opinion document matrix is obtained by latent semantic indexing as follows. Let $A_{M \times n}$ be the matrix obtained by segmentation and weight calculation, whose rows represent words and whose columns represent documents. Its singular value decomposition is:

$$A_{M \times n} = U_0 \Sigma_0 V_0^{T}$$

Here $U_0$ expresses the word-concept relation, $V_0$ expresses the concept-document relation, and the elements of the diagonal matrix $\Sigma_0$ are arranged in descending order. The first $m$ elements of $\Sigma_0$ are kept, and the first $m$ columns of $U_0$ and $V_0$ are taken, forming the matrices $\Sigma$, $U$ and $V$ respectively. This yields the approximation $A' = U \Sigma V^{T}$ of $A_{M \times n}$. $A'$ is the concept-public opinion document matrix; it preserves as much of the semantic information of $A_{M \times n}$ as possible while reducing the dimension of the feature space to $m$.
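The shapes involved in this truncation can be written out explicitly. The rank symbol $r$ and the singular values $\sigma_i$ are notation introduced here for illustration, and the error identity in the last line is the standard Eckart-Young property of a truncated singular value decomposition rather than a statement taken from the patent.

```latex
\begin{aligned}
A_{M \times n} &= U_0 \Sigma_0 V_0^{T},
  \qquad U_0 \in \mathbb{R}^{M \times r},\;
  \Sigma_0 = \operatorname{diag}(\sigma_1, \dots, \sigma_r),\;
  V_0 \in \mathbb{R}^{n \times r},\;
  \sigma_1 \ge \dots \ge \sigma_r > 0 \\
A' &= U \Sigma V^{T},
  \qquad U \in \mathbb{R}^{M \times m},\;
  \Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_m),\;
  V \in \mathbb{R}^{n \times m},\;
  m \le r \\
\lVert A_{M \times n} - A' \rVert_F^{2} &= \sum_{i=m+1}^{r} \sigma_i^{2}
\end{aligned}
```

The last line makes "preserves as much of the semantic information as possible" concrete: among all matrices of rank at most $m$, $A'$ has the smallest Frobenius-norm distance to $A_{M \times n}$, and the discarded singular values measure exactly what is lost.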

The public opinion information clustering (step d) uses a method combining a niche genetic algorithm with K-Means to group public opinion information that shares the same subject or topic category into the same class. A genetic algorithm simulates biological evolution. Its steps are: set the initial population, compute individual fitness, perform selection, crossover and mutation, form the next population, and check whether the stopping criterion is met. The basic principle of a genetic algorithm is natural selection and survival of the fittest. Biological evolution, however, involves not only competition but also a degree of cooperation, and the niche genetic algorithm exploits this idea. In the scheme of the invention the population is divided into several niches. Within each niche, the average intra-class similarity of the documents influences selection and thereby the fitness of the individuals in the niche, whereas crossover and mutation are carried out across the whole population. In each evolutionary iteration, K-Means clustering is performed in order to compute individual fitness and average intra-class similarity. The initial K-Means cluster centers of the initial population are chosen at random; after each subsequent evolution, the K individuals with the largest fitness values are chosen as initial centers.

The hot public opinion event discovery (step e) analyzes the clustering result from the clustering module together with the document index information saved during discrete text denoising to obtain the current hot topics and hot events. Hot public opinion events within a given period are mined according to the number of public opinion documents in each class of the clustering result and the number of discussion participants recorded in the document index. After each periodic collection of HTML pages, the hot public opinion events found are compared with the existing hotspots provided by the background information processing and data support center, and the result is handed to the system administrator.

The background information processing and data support center saves the analysis results obtained by the above process and compares them with previously saved results in order to discover and add new network-specific terms. It maintains the user dictionary used by the word segmentation system, the network-specific expression lexicon used for discrete text tracking and restoration, the category information obtained after clustering, and the hot public opinion events found so far.

Compared with the prior art, the present invention has the following beneficial effects: 1) given the growing number of mobile phone users on the network and the resulting discrete text phenomena, in which public opinion text lacks a complete structure and contains frequent remote anaphora and ellipsis, it tracks and restores the public opinion information and so improves the accuracy of public opinion analysis; 2) it adds a back-end support module to existing public opinion processing, which extracts the network-specific expression lexicon and the dictionary of new network words from network discrete text and further improves the analysis; 3) it provides a clustering method based on the combination of a niche genetic algorithm and K-Means, achieving efficient clustering of public opinion information.

Description of drawings

Fig. 1 is a workflow diagram of the method of the invention.

Fig. 2 is a schematic diagram of the discrete text tracking and restoration module of the system of the invention.

Fig. 3 is a flow chart of the public opinion information clustering module of the system of the invention.

Embodiment

An embodiment of the invention is described in detail below with reference to the accompanying drawings. The embodiment is implemented on the basis of the technical solution of the invention, and a detailed implementation and a concrete operating procedure are given, but the scope of protection of the invention is not limited to the following embodiment.

As shown in Fig. 1, the method of the invention comprises a discrete text information acquisition module, a discrete text tracking and restoration module, a semantic information mining and feature extraction module, a public opinion information clustering module, a hot public opinion event discovery module and a background information processing and data support center, together with the user dictionary used by the word segmentation system, the network-specific expression lexicon used for discrete text tracking and restoration, the category information obtained after clustering, and a database of the hotspots found so far. The processing flow is as follows:

1) Discrete text information acquisition

The information stream is obtained from the network and saved to the local database in HTML format. The saved HTML is then normalized and denoised to remove pictures, audio, advertisements, navigation bars and the like, and the corresponding text, including the title, the body and the replies, is extracted to give the discrete text information. A document index is also built during denoising to save important information such as the user ID, the time of participation in the discussion and the number of participants.

The web page denoising proceeds as follows. HTML describes document structure with tags, and all elements can be nested. For convenience of processing, tags of the form <abc><def></abc></def> are therefore first paired and restored to the form <abc><def></def></abc><def></def>. After this step, the framework of an HTML page file can be represented simply as follows:

<HTML>
<HEAD>
<TITLE></TITLE> // title
<SCRIPT></SCRIPT> // script
</HEAD>
<BODY></BODY> // content, comments
</HTML>

Next, the HTML page in the above form is stored as a tree-shaped linked structure, so that after processing each HTML page corresponds to one HTML tree. Finally, current dynamic web technology generally stores page content in a database and uses templates for layout; at display time the content is fetched from the database and placed in the template. These templates typically divide the layout with table elements and use a single table to display the main text. Web page text is therefore handled as follows: based on the HTML tree generated above, the text within each table element is merged, and the text of the table with the largest amount of information is taken as the main text. This gives the discrete text made up of the title, the body and the replies.
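The selection of the main text can be sketched with the standard-library `html.parser` module. Two simplifications here are assumptions rather than details from the patent: "amount of information" is approximated by the length of the merged text, and text is credited only to the innermost enclosing table. The class and function names are illustrative.

```python
from html.parser import HTMLParser

class TableTextCollector(HTMLParser):
    """Collect the visible text found inside each <table> element."""

    def __init__(self):
        super().__init__()
        self.open_tables = []  # indices of the tables currently open, innermost last
        self.texts = []        # text fragments collected for each table
        self.skip_depth = 0    # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag == "table":
            self.texts.append([])
            self.open_tables.append(len(self.texts) - 1)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "table" and self.open_tables:
            self.open_tables.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.open_tables and not self.skip_depth:
            self.texts[self.open_tables[-1]].append(text)

def extract_main_text(html):
    """Merge the text of each table and return the longest merged text."""
    parser = TableTextCollector()
    parser.feed(html)
    parser.close()
    merged = [" ".join(parts) for parts in parser.texts]
    return max(merged, key=len, default="")

PAGE = (
    "<html><body>"
    "<table><tr><td>Home | Login</td></tr></table>"
    "<table><tr><td>Title</td></tr><tr><td>This is the main post text.</td></tr></table>"
    "</body></html>"
)
print(extract_main_text(PAGE))
```

In this toy page the navigation table loses to the content table because its merged text is shorter. Other measures of information content, such as the ratio of text to links, could be substituted in the `key` argument.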

2) Discrete text tracking and restoration

As shown in Fig. 2, the network-specific expression lexicon provided by the background information processing and data support center is used to locate, by the maximum matching principle, the parts of the discrete text that need attention, such as ellipsis and remote anaphora. The omitted or referenced original content is then located, either through the hierarchy of the HTML tree built in the discrete text information acquisition module or by visiting the remote hyperlink. Finally, the elliptical and anaphoric parts are replaced with the original content thus located. This module also removes special symbols from the discrete text. After this series of steps the discrete text has a more complete structure.

This module mainly handles replies and remote links. References that appear in the text, such as "support the original poster" (supplied by the network-specific expression lexicon) and URL links, are replaced with the content they refer to. In step 1) the main text was extracted, namely the content of one subtree of the HTML tree. The root node of the subtree represents the content of the original post; direct replies to the original post are children of the root node, and a reply to a reply is a child of that reply's node. References such as "the original poster" and "the post above" can therefore be resolved by finding the corresponding node and substituting its content, while remote URLs are followed and their text content is extracted recursively and substituted.
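The substitution on the reply tree can be sketched as follows. The `Post` class, the bracket notation for substituted content and the two reference terms are assumptions made for illustration; "楼主" (the original poster) resolves to the root post and "楼上" (the post above) resolves here to the parent node in the reply tree. Fetching remote URLs is left out.

```python
import re

class Post:
    """A node in the reply tree: the original post is the root, replies are children."""

    def __init__(self, text, parent=None):
        self.text = text
        self.parent = parent
        self.children = []

    def reply(self, text):
        child = Post(text, parent=self)
        self.children.append(child)
        return child

# Hypothetical lexicon entries: "楼主" = original poster, "楼上" = the post above.
REFERENCE = re.compile("楼主|楼上")

def root_of(post):
    while post.parent is not None:
        post = post.parent
    return post

def restore(post):
    """Return the post text with each reference replaced by the content it points to."""
    if post.parent is None:
        return post.text  # the original post has nothing above it to refer to

    def expand(match):
        target = root_of(post) if match.group(0) == "楼主" else post.parent
        return f"[{restore(target)}]"  # restore recursively so chains are expanded

    return REFERENCE.sub(expand, post.text)

root = Post("Fuel prices rise again")
first = root.reply("支持楼主")
second = first.reply("楼上 is right")
print(restore(first))
print(restore(second))
```

Using a single regular-expression pass with a callback avoids re-scanning text that has just been substituted, which matters when the referenced post itself contains a reference term.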

3) Semantic information mining and feature extraction

The concrete steps are as follows. First, the ICTCLAS word segmentation system developed by the Institute of Computing Technology, Chinese Academy of Sciences, is used to segment the restored discrete text. Next, term weights are computed with TF-IDF to obtain the word-document matrix $A_{M \times n}$. Finally, the singular value decomposition of $A_{M \times n}$ is computed:

$$A_{M \times n} = U_0 \Sigma_0 V_0^{T}$$

The first $m$ elements of $\Sigma_0$ are kept, and the first $m$ columns of $U_0$ and $V_0$ are taken, forming the matrices $\Sigma$, $U$ and $V$ respectively. This gives the concept-public opinion document matrix $A'$: $A' = U \Sigma V^{T}$.
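The weighting step that produces the word-document matrix can be sketched in plain Python. The patent does not specify which TF-IDF variant is used; this sketch assumes term frequency normalized by document length and an unsmoothed logarithmic inverse document frequency, and it takes already segmented documents as input, since ICTCLAS itself is not called here.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """docs is a list of token lists. Returns (vocab, A) with A[i][j] the weight of term i in document j."""
    vocab = sorted({token for doc in docs for token in doc})
    n = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(token for doc in docs for token in set(doc))
    counts = [Counter(doc) for doc in docs]
    matrix = []
    for term in vocab:
        idf = math.log(n / df[term])
        row = [counts[j][term] / len(docs[j]) * idf if docs[j] else 0.0 for j in range(n)]
        matrix.append(row)
    return vocab, matrix

# Toy documents standing in for segmented public opinion texts.
DOCS = [["price", "rise", "price"], ["price", "fall"], ["rain", "fall"]]
vocab, A = tfidf_matrix(DOCS)
for term, row in zip(vocab, A):
    print(term, [round(w, 3) for w in row])
```

Rows correspond to words and columns to documents, matching the orientation of $A_{M \times n}$ in the text. A term that occurs in every document gets weight zero under this variant, which is one reason smoothed forms of the inverse document frequency are often preferred in practice.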

4) Public opinion information clustering

As shown in Fig. 3, a method combining a niche genetic algorithm with K-Means is used to group public opinion information that shares the same subject or topic category into the same class. The concrete steps are:

S1: Select texts at random as initial cluster centers to form the initial population, and use an initial k-means clustering to compute the initial fitness of the individuals in the population.

S2: Perform selection, crossover and mutation on the population.

S3: For each individual in the population, cluster the texts with the k-means algorithm.

S4: Compute the fitness of each individual.

S5: For each individual, if its fitness is higher than that of its parent, it replaces the parent and enters the next cycle.

S6: If the termination condition is met, go to S7; otherwise go to S2.

S7: Choose the individuals with high fitness as initial cluster centers, and carry out clustering with k-means.
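Steps S1 to S7 can be sketched in a compact form. Many details below are assumptions, because the patent leaves them open: an individual is encoded as a tuple of K document indices used as initial centers, fitness is the average cosine similarity of documents to their own cluster center, niches are contiguous slices of the population, selection is a two-way tournament inside a niche, a child competes with the individual occupying its slot, and termination is a fixed number of generations.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(doc, centers):
    return max(range(len(centers)), key=lambda c: cosine(doc, centers[c]))

def kmeans(docs, centers, rounds=10):
    """Cosine K-Means. Returns (labels, average similarity of documents to their own center)."""
    centers = [list(c) for c in centers]
    for _ in range(rounds):
        labels = [nearest(d, centers) for d in docs]
        for c in range(len(centers)):
            members = [d for d, lab in zip(docs, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    labels = [nearest(d, centers) for d in docs]
    score = sum(cosine(d, centers[lab]) for d, lab in zip(docs, labels)) / len(docs)
    return labels, score

def fitness(individual, docs):
    return kmeans(docs, [docs[i] for i in individual])[1]

def evolve(docs, k, pop_size=8, niches=2, generations=10, mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    population = [tuple(rng.sample(range(len(docs)), k)) for _ in range(pop_size)]  # S1
    scores = [fitness(ind, docs) for ind in population]
    size = max(1, pop_size // niches)
    for _ in range(generations):  # S6: stop after a fixed number of generations
        parents = []
        for start in range(0, pop_size, size):  # S2: selection inside each niche
            niche = range(start, min(start + size, pop_size))
            for _slot in niche:
                a, b = rng.choice(niche), rng.choice(niche)
                parents.append(population[a] if scores[a] >= scores[b] else population[b])
        for i, parent in enumerate(parents):  # S2: crossover and mutation over the whole population
            mate = rng.choice(parents)
            cut = rng.randrange(1, k) if k > 1 else 0
            genes = list(parent[:cut] + mate[cut:])
            if rng.random() < mutation_rate:
                genes[rng.randrange(k)] = rng.randrange(len(docs))
            genes = list(dict.fromkeys(genes))  # drop duplicate centers
            while len(genes) < k:  # refill so that the K centers are distinct documents
                g = rng.randrange(len(docs))
                if g not in genes:
                    genes.append(g)
            child = tuple(genes)
            child_score = fitness(child, docs)  # S3 and S4
            if child_score > scores[i]:  # S5: the child replaces the incumbent only if fitter
                population[i], scores[i] = child, child_score
    best = max(range(pop_size), key=scores.__getitem__)  # S7
    return kmeans(docs, [docs[i] for i in population[best]])

DOCS = [(1.0, 0.1), (1.0, 0.2), (0.9, 0.0), (0.1, 1.0), (0.2, 1.0), (0.0, 0.9)]
labels, score = evolve(DOCS, k=2)
print(labels, round(score, 3))
```

The toy vectors stand in for columns of the reduced concept-document matrix. Evaluating fitness by running K-Means for every child is the expensive part of this design, which is one reason the dimension reduction in the previous step matters.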

5) Hot public opinion event discovery

The clustering result from the clustering module and the document index information saved during discrete text denoising are analyzed to obtain the current hot topics and hot events. Hot public opinion events within a given period are mined on the basis of the number of public opinion documents in each class of the clustering result and the number of discussion participants recorded in the document index. After each periodic collection of HTML pages, the hot public opinion events found are compared with the existing hotspots provided by the background information processing and data support center, and the result is handed to the system administrator.

6) Background information processing and data support center

The center saves the analysis results obtained by the above process and compares them with previously saved results in order to discover and add new network-specific terms. It maintains the user dictionary used by the word segmentation system, the network-specific expression lexicon used for discrete text tracking and restoration, the category information obtained after clustering, and the hot public opinion events found so far.

This embodiment gives the analysis flow of the system and the concrete processing at each stage, including the data structures used for storage. It produces real-time analysis results for periodically updated online public opinion information, including the clustering results, hot topics and hot events, and it also saves and updates the back-end data used in the analysis. Maintaining this back-end data requires the participation of the system administrator to ensure the reliability of the next round of analysis.
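The two criteria named above, the number of documents per class and the number of discussion participants within a period, can be combined in a short ranking function. The thresholds, the one-day window and the layout of the index as `(user_id, timestamp)` pairs are assumptions made for this sketch; the comparison with previously known hotspots is omitted.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def find_hot_topics(index, labels, now, window=timedelta(days=1), min_docs=2, min_users=2):
    """index[i] is (user_id, timestamp) for document i and labels[i] is its cluster id."""
    doc_count = defaultdict(int)
    users = defaultdict(set)
    for (user_id, timestamp), label in zip(index, labels):
        if now - window <= timestamp <= now:  # only documents inside the analysis period count
            doc_count[label] += 1
            users[label].add(user_id)
    hot = [c for c in doc_count if doc_count[c] >= min_docs and len(users[c]) >= min_users]
    # Rank by document count first, then by the number of distinct participants.
    return sorted(hot, key=lambda c: (doc_count[c], len(users[c])), reverse=True)

now = datetime(2011, 1, 26, 12)
INDEX = [
    ("u1", now - timedelta(hours=1)),
    ("u2", now - timedelta(hours=2)),
    ("u1", now - timedelta(hours=3)),
    ("u3", now - timedelta(days=3)),   # outside the one-day window
    ("u4", now - timedelta(hours=1)),
]
LABELS = [0, 0, 0, 1, 1]
print(find_hot_topics(INDEX, LABELS, now))
```

Counting distinct user IDs rather than raw posts keeps a single prolific user from making a topic look hot, which is presumably why the document index stores the user ID alongside the participation time.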

Claims

1. A network-based method for analyzing public opinion information in discrete text, involving discrete text information acquisition, discrete text information processing and the corresponding databases, characterized in that it comprises the following steps:

e. Finally, the hot public opinion event discovery module mines hot public opinion from the topics and events obtained by clustering, and the final result is handed to the system administrator.

2. The network-based method for analyzing public opinion information in discrete text according to claim 1, characterized in that, in step a, the information stream is first obtained from the network and saved to the local database in HTML format, and the locally saved HTML is then denoised; a document index is also built during denoising to save the user ID, the time of participation in the discussion and the number of participants.

3. The network-based method for analyzing public opinion information in discrete text according to claim 1, characterized in that, in step b, the network-specific expression lexicon provided by the background information processing and data support center is first used to locate, by the maximum matching principle, the elliptical and remote anaphoric parts of the discrete text; the omitted original content and the remotely referenced original content are then located, either through the hierarchy of the HTML tree built in the discrete text information acquisition module or by visiting the remote hyperlink; finally, the elliptical and anaphoric parts are replaced with the original content thus located; the special symbols in the discrete text are also removed.

4. The network-based method for analyzing public opinion information in discrete text according to claim 1, characterized in that, in step c, the restored discrete text, that is, the public opinion documents, is segmented with the ICTCLAS word segmentation system of the Chinese Academy of Sciences, and term weights are computed with TF-IDF to obtain a word-document matrix; latent semantic indexing is then applied to the word-document matrix to reduce its dimension and to find the relations between words and concepts and between concepts and public opinion documents; features are extracted on this basis to obtain a reduced-dimension concept-public opinion document matrix for use in the clustering of the next step.

5. The network-based method for analyzing public opinion information in discrete text according to claim 1, characterized in that, in step d, the niche genetic algorithm divides the population into several niches; within each niche, the average intra-class similarity of the documents influences selection and thereby the fitness of the individuals in the niche, whereas crossover and mutation are carried out across the whole population; in each evolutionary iteration, K-Means clustering is performed in order to compute individual fitness and average intra-class similarity; the initial K-Means cluster centers of the initial population are chosen at random, and after each subsequent evolution the K individuals with the largest fitness values are chosen as initial centers; the K-Means method groups public opinion information that shares the same subject or topic category into the same class.

6. The network-based method for analyzing public opinion information in discrete text according to claim 1, characterized in that, in step e, using the clustering result from the clustering module and the document index information saved during discrete text denoising, hot public opinion events within a given period are mined according to the number of public opinion documents in each class of the clustering result and the number of discussion participants recorded in the document index; after each periodic collection of HTML pages, the hot public opinion events found are compared with the existing hotspots provided by the background information processing and data support center, and the result is handed to the system administrator.

7. The network-based method for analyzing public opinion information in discrete text according to claim 1, characterized in that the background information processing and data support center

8. The network-based method for analyzing public opinion information in discrete text according to claim 2, characterized in that the concrete steps of denoising are: first, the data in the HTML file is normalized, and tags that cross between elements are paired and restored to the complete form; next, the HTML page is stored as a tree-shaped linked structure, so that after processing each HTML page corresponds to one HTML tree; finally, based on the HTML tree generated above, the text within each table element is merged and the text of the table with the largest amount of information is taken as the main text, from which the corresponding text, including the title, the body and the replies, is extracted to give the discrete text information.