WO2016033907A1 - Statistical machine learning-based internet hidden link detection method

Info

Publication number: WO2016033907A1
Application number: PCT/CN2014/095168
Authority: WO
Inventors: 孟池洁; 王伟; 耿光刚; 隋鹏宇
Original assignee: 中国科学院计算机网络信息中心
Priority date: 2014-09-05
Filing date: 2014-12-26
Publication date: 2016-03-10
Also published as: CN104239485A; CN104239485B

Abstract

A statistical machine learning-based hidden link detection method comprises the following steps: 1) collecting real webpage source code data as a training set for a classification model, and dividing the data into a category containing hidden links and a category containing no hidden links; 2) extracting anchor texts, i.e., the character contents of link fields, from the HTML source code files of all the collected webpages of the two categories, and then segmenting the anchor texts into individual words; 3) vectorizing the two categories of word-segmented texts; 4) reducing the dimension of the vector corresponding to each text; 5) training a classifier on the two categories of data obtained in step 4) to obtain a classification model; and 6) applying the obtained classification model to an unknown webpage to be detected, to obtain a hidden link detection result. Whether a webpage contains hidden links is detected effectively and automatically from the source code of the webpage, so that theoretical and practical support can be provided for search engines to crack down on network cheating.

Description

Internet dark chain detection method based on statistical machine learning

Technical field

The invention belongs to the field of network technology and search technology, and particularly relates to an internet dark chain detection method based on statistical machine learning.

Background technique

As an important gateway to the Internet, search engines have become an indispensable everyday tool for Internet users, and the ranking of results is central to how search results are presented. Search engines use specialized algorithms (such as Google's PageRank) to measure the relative importance of web pages and thereby determine the ranking of search results. Because search engines use "crawlers" to fetch web content by following the links between web pages, most algorithms that measure the importance of a web page treat its external links as an important factor: the more links from external websites point to the target web page, the higher the weight of the target page and the easier it is for the page to reach a top position in the search results. A high ranking in search results can bring a website a great deal of attention, so many webmasters exchange links with related websites when building their own sites. Among cheaters who use black or gray techniques (known as black hat SEO), implanting dark chains in websites is one of the means used.

A dark chain, also known as a black chain, is a link that is written into a web page but made invisible to the human eye. Its purpose is to attract search engine crawlers; it is not displayed to the reader in the browser and can be found only by viewing the source code of the web page. Dark chain makers exploit the importance that web page weighting algorithms attach to links and write large numbers of dark chains into web pages in order to increase the weight of the target web pages. Those who take part in dark chains often illegally obtain control of other people's websites and write large numbers of unrelated dark chains into their pages, or they are webmasters who take part in dark chain exchange cooperation themselves. Because dark chains are hidden, they are difficult to find, and cheaters in the underground industry keep embedding them in bulk across the Internet, so they are difficult to eliminate completely. Dark chains resemble the small advertisements pasted on utility poles in the real world and are known as "network psoriasis." This cheating method not only seriously damages the image and reputation of a website, but also undermines the fair ranking mechanism of search engines and degrades the quality of search results. The detection of dark chains is therefore necessary.

Although search engines continue to penalize black hat SEO, many dark chains remain on the Internet. Large search engines have not published their specific algorithms or methods for discovering network cheating. Most detection is self-inspection by webmasters: checking the source code of a web page for unknown code, or using tools to check whether the modification time of website files is abnormal. These methods have limited power to eradicate dark chains, place high demands on the inspector, and cannot be applied automatically at large scale. The existing published Baidu patent on dark chain detection (patent number 201210049496.2, publication number CN102622435 A) is a rule-based method that identifies whether a dark chain exists by combining detection of hiding techniques with black and white lists. This method is weak at recognizing one of the hiding methods used by dark chains, in which the code that makes the link invisible is defined in a JavaScript script. Dark chains hidden in this way currently account for a large proportion, and the method cannot respond automatically to new hiding methods, so some dark chains are missed.

Summary of the invention

In view of the limitations of the prior art, the present invention provides a new Internet dark chain detection method that uses the source code of a webpage to detect effectively and automatically whether the webpage contains dark chains, providing theoretical and practical support for search engines to combat network cheating.

The invention uses content features of webpages for training: webpages are divided into two classes, those containing dark chains and those containing none, to train a model, and the webpage to be detected is then classified as containing dark chains or not. Machine learning-based methods are widely used in text classification, spam filtering, anomaly detection and similar tasks, and have proven effective. The method enables automatic mining and dynamic optimization of the classification model, and is a heuristic method.

Specifically, the technical solution adopted by the present invention is as follows:

A method for detecting a dark chain based on statistical machine learning, the steps of which include:

1) Collect the real webpage source data as a training set of the classification model, and divide it into two categories: dark chain and no dark chain;

2) Preprocessing stage: extract the anchor text, that is, the text content of the link fields, from the HTML source files of all webpages of the two categories collected in step 1), and then segment the anchor text into individual words;

3) vectorizing the data obtained in step 2), that is, the two types of text after the word segmentation;

4) performing dimension reduction on the vector corresponding to each text (step 3) yields a vector for each text, but its dimension is high and not all dimensions are meaningful, so dimension reduction, that is, feature selection, is needed to ensure the efficiency of model training);

5) using the classifier to train the two types of data obtained in step 4) to obtain a classification model;

6) The classification model obtained in step 5) is used for the unknown web page to be detected, and the dark chain detection result is obtained.

Further, step 1) classifies the web page by expert annotation.

Further, in step 2), if the webpage is Chinese, an open source word segmenter (such as the Paoding (庖丁) Chinese word segmenter or mmseg) is used to split the anchor text into individual words; if the webpage is English, no special word segmenter is needed, and individual words can be obtained through word splitting and lexical filtering alone.
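As an illustration of anchor text extraction and the English word-splitting path of step 2), the following is a minimal sketch using only the Python standard library. The class name `AnchorTextExtractor`, the helper functions, the tiny stop word list and the sample HTML are hypothetical and are not taken from the invention; Chinese text would instead be passed through a word segmenter such as those named above.

```python
import re
from html.parser import HTMLParser


class AnchorTextExtractor(HTMLParser):
    """Collects the text found between <a ...> and </a> tags."""

    def __init__(self):
        super().__init__()
        self._depth = 0          # > 0 while the parser is inside an <a> element
        self._buffer = []        # text pieces of the anchor being read
        self.anchor_texts = []   # one cleaned string per anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace inside the anchor text.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.anchor_texts.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


# Hypothetical stop word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "for"}


def extract_anchor_texts(html):
    """Return the anchor texts of an HTML document, in document order."""
    parser = AnchorTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.anchor_texts


def tokenize_english(text):
    """Word splitting (lowercase, letters only) followed by stop word filtering."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


# Invented sample page: one visible link and one link hidden with CSS.
sample_html = (
    '<p>News <a href="/home">Home page</a></p>'
    '<div style="display:none"><a href="http://x.example">Cheap loans '
    'for the people</a></div>'
)
anchors = extract_anchor_texts(sample_html)
tokens = [tokenize_english(t) for t in anchors]
```

Note that the sketch reads the anchor text of every link, visible or not, and never inspects how a link is hidden; this matches the content-based approach described in this document.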

Further, steps 3) to 5) are implemented using open source machine learning and data mining tools, such as Weka, Scikit, Orange, and the like.

The invention proposes a classification method that uses the anchor text in webpage source code as the classification training set. Before the classification model is trained, the anchor text is converted into vectors, and feature selection is applied to reduce their dimension. The classification model is then trained with a machine learning classification algorithm. The resulting model can be used to classify unknown webpages automatically in batches and detect whether they contain dark chains.

Compared with the prior art, the beneficial effects of the present invention are as follows:

1) The classification model can be trained with a dataset annotated by experts, and an unknown webpage can then be input into the model to be classified automatically as containing dark chains or not. No further investment of personnel with specialist knowledge of dark chains is needed.

2) The method uses the content features of the webpage source code and does not depend on detecting the techniques used to hide dark chains, so it can adapt dynamically and remain effective when new hiding techniques appear.

Drawings

Figure 1 is a general flow chart of the method of the present invention.

Figure 2 is a flow chart of data preparation and preprocessing of the present invention.

Figure 3 is a flow chart of the classification model training of the present invention.

Detailed description

To make the above objects, features and advantages of the present invention more apparent, the invention is further described below through specific embodiments and the accompanying drawings.

Figure 1 is a general flow chart of the dark chain detection method based on statistical machine learning of the present invention, which includes data preparation and preprocessing (collecting and classifying webpage source code samples, extracting anchor text, word segmentation and vectorization), training the classification model, and applying the classification model to unknown webpages to be detected.

Figure 2 illustrates the data preparation and preprocessing flow of the present invention. The steps are as follows:

1) Collect HTML source files that contain dark chains and HTML source files that contain none, separately. The former are selected through manual screening and identification; the latter are the HTML source files of the homepages of various websites listed in the DMOZ directory (an open directory jointly maintained by volunteers around the world, and the most important website directory on the Internet). The HTML text of both categories can be obtained by using a crawler to fetch the website homepages.

2) Extract the anchor text from the source files of each of the two categories and segment it into individual words. For Chinese webpages, this involves a Chinese word segmentation tool (such as the Paoding (庖丁) word segmenter or mmseg). To reduce meaningless words and retain important words during segmentation, a stop word list (meaningless function words, pronouns, quantifiers, etc.) and a custom lexicon (words specific to dark chain anchor text) are added to the Chinese word segmenter.

3) Convert the anchor text after the two types of word segmentation into the data format that Weka needs.

4) Input the data obtained in the previous step into the open source machine learning and data mining tool Weka for vectorization: each word is a dimension, and if the word occurs in a text the corresponding dimension is 1, otherwise 0, so that every text is converted into a corresponding vector.
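The presence/absence vectorization of step 4) can be sketched in a few lines of plain Python. This is only an illustration of the 0/1 encoding, not of Weka's own implementation; the function names and the toy word lists are hypothetical.

```python
def build_vocabulary(tokenized_texts):
    """Map each distinct word to a dimension index, in sorted order."""
    words = sorted({w for text in tokenized_texts for w in text})
    return {word: i for i, word in enumerate(words)}


def vectorize(tokens, vocabulary):
    """Return a 0/1 vector: 1 if the word occurs in the text, otherwise 0."""
    vector = [0] * len(vocabulary)
    for word in tokens:
        index = vocabulary.get(word)
        if index is not None:   # words outside the vocabulary are ignored
            vector[index] = 1
    return vector


# Toy word-segmented anchor texts, invented for illustration.
texts = [["casino", "bonus", "casino"], ["news", "sports"], ["bonus", "news"]]
vocabulary = build_vocabulary(texts)
vectors = [vectorize(t, vocabulary) for t in texts]
```

Repeated words do not change the vector (the first text contains "casino" twice but its dimension is still 1), which is the difference between this encoding and a word-count encoding.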

Figure 3 illustrates the training process for the classification model of the present invention. The steps are as follows:

1) To ensure the efficiency of model training, Weka's feature selection function is used to reduce the dimension of the vector corresponding to each text, that is, to evaluate how strongly each dimension of the vector influences the category. Weka can perform feature selection with different evaluation algorithms. A feature selection algorithm that gives better classification results can be chosen, such as the information gain method or the chi-square test method shown in FIG. 2.

Taking the chi-square test as an example, feature selection in text classification proceeds as follows. Count the total number of documents N in the sample set. For each word, count the number of positive documents in which it appears (A), the number of negative documents in which it appears (B), the number of positive documents in which it does not appear (C), and the number of negative documents in which it does not appear (D). For each word, calculate the chi-square value as follows:

$$\chi^2 = \frac{N\,(AD - BC)^2}{(A + C)(A + B)(B + D)(C + D)}$$

The words are sorted by chi-square value from largest to smallest, and the top K words are selected as features, that is, the dimension is reduced to K.
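The chi-square ranking can be sketched as follows, assuming the usual 2x2 contingency-table form of the statistic, χ² = N(AD − BC)² / ((A + C)(A + B)(B + D)(C + D)), with A, B, C and D counted as document frequencies of a word in the positive (dark chain) and negative classes. The function names and the toy documents are hypothetical, and Weka's own evaluator may differ in detail.

```python
def chi_square(n, a, b, c, d):
    """Chi-square value of one word from its 2x2 document-frequency table."""
    denominator = (a + c) * (a + b) * (b + d) * (c + d)
    if denominator == 0:
        return 0.0   # degenerate table: the word carries no information
    return n * (a * d - b * c) ** 2 / denominator


def select_top_k(docs, labels, k):
    """docs: list of word sets; labels: True for positive (dark chain) documents."""
    n = len(docs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    vocabulary = set().union(*docs)
    scores = {}
    for word in vocabulary:
        a = sum(1 for doc, y in zip(docs, labels) if y and word in doc)
        b = sum(1 for doc, y in zip(docs, labels) if not y and word in doc)
        c = n_pos - a   # positive documents without the word
        d = n_neg - b   # negative documents without the word
        scores[word] = chi_square(n, a, b, c, d)
    # Sort by descending score; ties are broken alphabetically for determinism.
    ranked = sorted(vocabulary, key=lambda w: (-scores[w], w))
    return ranked[:k], scores


# Toy sample set, invented for illustration.
docs = [{"casino", "bonus"}, {"casino", "loans"}, {"news", "sports"}, {"news", "bonus"}]
labels = [True, True, False, False]
selected, scores = select_top_k(docs, labels, k=2)
```

In this toy set "casino" and "news" each appear in every document of one class and none of the other, so they receive the highest score, while "bonus" appears equally in both classes and scores zero.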

2) Based on the reduced vectors obtained in the previous step, the classifiers provided by Weka are used to train the classification model. A variety of classification methods can be used for training, such as the Bayes, SVM, SMO and AdaBoost methods shown in Figure 2; according to the performance of the training results, the classifier best suited to the dataset is selected. Taking the AdaBoost algorithm as an example, the process of training the classifier is described. Let an item to be classified be x = {a_1, a_2, ..., a_m}, where each a is a feature attribute of x, and let the categories be C_1, C_2, ..., C_n. Calculate the frequency of each category in the training samples and the conditional probability estimate of each feature attribute for each category (calculated as P(C_i|x) = P(x|C_i)P(C_i)/P(x)), and record the results.
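The class frequencies and per-attribute conditional probabilities described above, together with the Bayes-rule formula, correspond to the training step of a naive Bayes classifier over 0/1 features. The sketch below illustrates that computation only. The function name, the toy data and the Laplace smoothing are assumptions added for illustration and are not stated in this document.

```python
from collections import Counter


def train_naive_bayes(vectors, labels):
    """Estimate P(C_i) and P(a_j = 1 | C_i) from 0/1 feature vectors."""
    class_counts = Counter(labels)
    total = len(labels)
    n_features = len(vectors[0])
    # Frequency of each category in the training samples.
    priors = {c: class_counts[c] / total for c in class_counts}
    cond = {}
    for c in class_counts:
        ones = [0] * n_features
        for vector, label in zip(vectors, labels):
            if label == c:
                for j, value in enumerate(vector):
                    ones[j] += value
        # Laplace smoothing (an added assumption) avoids zero probabilities.
        cond[c] = [(count + 1) / (class_counts[c] + 2) for count in ones]
    return priors, cond


# Toy reduced vectors and expert labels, invented for illustration.
vectors = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
labels = ["dark", "dark", "normal", "normal"]
priors, cond = train_naive_bayes(vectors, labels)
```

The recorded `priors` and `cond` values are all that is needed later to score P(x|C_i)P(C_i) for a new vector; P(x) is the same for every category and can be ignored when only the most probable category is wanted.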

The trained model is then used to classify unknown web pages. The steps are as follows:

1) Enter the domain name of the web page to be detected into the crawler program, and manually grab the HTML source code of the webpage and store it as a file.

2) The pre-processing steps of the source code obtained in step 1) are the same as the data pre-processing method above, that is, anchor text extraction, word segmentation, and vectorization.

3) On the test set obtained in step 2), use the already trained classification model to classify. The trained classification model can be used to automatically classify unknown web pages in batches to detect whether they contain dark chains.
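Step 3) can be illustrated with a small sketch that applies already-trained parameters to the vectors of unknown pages. It assumes a naive Bayes model over 0/1 features purely as an example; any of the classifiers mentioned in this document could take its place. The parameter values, page names and function name are hypothetical.

```python
import math


def classify(vector, priors, cond):
    """Return the category maximizing log P(C) + sum_j log P(a_j | C)."""
    best_class, best_score = None, -math.inf
    for c in sorted(priors):
        score = math.log(priors[c])
        for value, p in zip(vector, cond[c]):
            # p is P(a_j = 1 | C); use 1 - p when the word is absent.
            score += math.log(p if value else 1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Hypothetical model parameters, as a training step might have produced them.
priors = {"dark": 0.5, "normal": 0.5}
cond = {"dark": [0.75, 0.5, 0.25], "normal": [0.25, 0.5, 0.75]}

# Vectors of the pages to be detected, built with the training vocabulary.
unknown_pages = {"page-a": [1, 0, 0], "page-b": [0, 1, 1]}
results = {name: classify(v, priors, cond) for name, v in unknown_pages.items()}
```

Because the vectors of unknown pages must use the same dimensions as the training vectors, the vocabulary and the selected features from training have to be stored alongside the model.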

The three stages above, vectorization, feature selection and classification model training, can also be implemented independently of existing integrated tools (such as the Weka, Scikit and Orange mentioned above) by writing one's own programs; the open source tools mentioned above are used to shorten the work cycle and simplify the working steps.

Table 1 lists the precision and recall obtained with the method of the present invention for five classifiers and four feature extraction algorithms. The dataset consists of Chinese webpages (manually screened Chinese webpages containing dark chains, and normal Chinese webpages containing no dark chains collected from the DMOZ directory). Among the indicators, Precision is the precision rate, Recall is the recall rate, F-measure is a combined index of the former two, and ROC Area is the area under the ROC curve. The closer the four indicators are to 1, the better the performance. Bold type indicates the better results.

Table 1. Accuracy and recall rates for five classifiers and four feature extraction algorithms
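For reference, the precision, recall and F-measure indicators can be computed from true and predicted labels as in the sketch below. It assumes the balanced F-measure (the harmonic mean of precision and recall, often written F1); the function name and the toy labels are hypothetical and the numbers are not results from Table 1.

```python
def precision_recall_f1(true_labels, predicted_labels, positive="dark"):
    """Precision, recall and balanced F-measure for the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy labels, invented for illustration.
true_labels = ["dark", "dark", "dark", "normal", "normal", "normal"]
predicted = ["dark", "dark", "normal", "dark", "normal", "normal"]
precision, recall, f1 = precision_recall_f1(true_labels, predicted)
```

Here two of the three pages predicted as "dark" are correct and two of the three true "dark" pages are found, so precision and recall are both 2/3.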

The above embodiments are only used to illustrate the technical solutions of the present invention, and the present invention is not limited thereto, and those skilled in the art can modify or replace the technical solutions of the present invention without departing from the spirit and scope of the present invention. The scope of protection shall be as stated in the claims.

Claims

A method for detecting a dark chain based on statistical machine learning, the steps of which include:

1) Collect the real webpage source data as a training set of the classification model, and divide it into two categories: dark chain and no dark chain;

2) Extract the anchor text from the HTML source files of the two types of web pages, and segment the anchor text into individual words;

3) Vectorize the two types of text after the word segmentation;

4) performing a dimension reduction process on the vector corresponding to each text, that is, performing feature selection;

5) using the classifier to train the two types of data obtained in step 4) to obtain a classification model;

6) The classification model obtained in step 5) is used for the unknown web page to be detected, and the dark chain detection result is obtained.
The method of claim 1, wherein in step 1) the web pages are divided into the two categories by expert annotation.
The method of claim 1, wherein in step 1) a crawler is used to crawl website homepages to obtain the two types of HTML text.
The method according to claim 1, wherein in step 2), if the dataset consists of Chinese webpages, an open source Chinese word segmenter is used to segment the anchor text into individual words; if it consists of English webpages, individual words are obtained directly by word splitting and lexical filtering.
The method according to claim 4, wherein in step 2) a stop word list and a custom lexicon are added to the Chinese word segmentation process to reduce meaningless words and retain important words, the custom lexicon consisting of words specific to dark chain anchor text.
The method of claim 1, wherein steps 3) to 5) are implemented using open source machine learning and data mining tools, including but not limited to Weka, Scikit and Orange.
The method according to claim 6, wherein in step 3), when vectorization is performed, each word is used as a dimension, and if the word exists in the text the corresponding dimension is 1, otherwise 0, whereby all the texts are converted into corresponding vectors.
The method of claim 1, wherein in step 6) the classification model is applied to the unknown web page to be detected as follows:

a) input the domain name of the webpage to be detected into the crawler program, and manually capture the Html source code of the webpage and store it as a file;

b) performing a pre-processing step on the source code obtained in step a), namely performing anchor text extraction, word segmentation and vectorization;

c) On the test set obtained in step b), use the trained classification model to classify to detect the presence or absence of dark chains.