CN104731919A - Wechat public account user classifying method based on AdaBoost algorithm - Google Patents

Wechat public account user classifying method based on AdaBoost algorithm

Info

Publication number
CN104731919A
CN104731919A (application CN201510135936.XA)
Authority
CN
China
Prior art keywords
algorithm
data
minority class
oversampling
data boundary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510135936.XA
Other languages
Chinese (zh)
Inventor
李云超
孙晓燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Group Co Ltd
Original Assignee
Inspur Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Group Co Ltd filed Critical Inspur Group Co Ltd
Priority to CN201510135936.XA priority Critical patent/CN104731919A/en
Publication of CN104731919A publication Critical patent/CN104731919A/en
Pending legal-status Critical Current

Abstract

The invention discloses a WeChat public account user classification method based on the AdaBoost algorithm. When WeChat public account users perform their daily operations and queries, the operation and query information is iterated by a boundary-data oversampling algorithm; the data with large weights are then taken as boundary data, the boundary data are oversampled, and the users are classified, so that when a user clicks again the system pushes hot-spot information. Compared with the prior art, the method has the advantages that, after multiple iterations of the AdaBoost algorithm, experimental results demonstrate the effectiveness of the algorithm; the accuracy of information pushed to users can be improved; and the method is highly practical and easy to popularize.

Description

A WeChat public account user classification method based on the AdaBoost algorithm
Technical field
The present invention relates to the technical field of data processing, and specifically to a practical WeChat public account user classification method based on the AdaBoost algorithm.
Background technology
Classification refers to finding the rules hidden in data through the analysis and study of existing data, and using these rules to predict and judge future data. With the rapid development of the information society and in the face of massive data, classification techniques are particularly important; classification is also an important method in machine learning and data mining. Existing classical classification algorithms, such as KNN, Bayesian algorithms, neural network algorithms, support vector machines and ensemble learning, have all been widely applied and achieve good results.
In the mobile Internet era, the wide spread of WeChat public accounts has opened up new directions for marketing, but the large amount of user information is rather dispersed and the class distribution is often skewed: the number of samples of one class in the data set is far smaller than the number of samples of the other classes, i.e. the data set distribution is imbalanced.
Because traditional algorithms assume that the data set distribution is roughly balanced and take the overall classification accuracy as the objective, classifiers trained on imbalanced data sets show an obvious preference for the numerically dominant majority class: the classification accuracy of the majority class is improved, while the classification accuracy of the minority class is low.
The AdaBoost algorithm is a commonly used boosting algorithm. Its basic idea is: initially give every sample in the sample set the same weight and train a weak classifier; then re-adjust the sample weights according to the classification result, decreasing the weights of correctly classified samples and increasing the weights of misclassified samples, so that subsequent rounds focus more on the misclassified samples. After multiple iterations the weights of the misclassified samples keep growing, and the final classification result is produced by a strong classifier formed by weighted voting over the multiple weak classifiers.
Usually, in a data set, the data on the class boundary have a larger influence on the classification result but are harder to classify, i.e. they are easily misclassified. In the AdaBoost algorithm, the weights of these hard-to-classify boundary data keep increasing as the iterations proceed. Therefore, we train on the data set with the AdaBoost algorithm and, after multiple iterations, take the hard-to-classify data with large weights as boundary data and oversample them. On this basis, a WeChat public account user classification method based on the AdaBoost algorithm is provided.
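The patent text contains no code; the following is a minimal Python sketch of the standard AdaBoost re-weighting loop that this idea relies on, showing how samples that are repeatedly misclassified (typically those near the class boundary) end up with the largest weights. The function name adaboost_sample_weights and the use of decision stumps are illustrative assumptions, not part of the patent.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_sample_weights(X, y, n_rounds=10):
    """Run AdaBoost re-weighting and return the final per-sample weights.

    Illustrative helper (not from the patent): samples that keep being
    misclassified accumulate large weights, which is the signal later
    used to identify boundary data.
    """
    n = len(y)
    w = np.full(n, 1.0 / n)                    # equal initial weights
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # weak-classifier weight
        # decrease weights of correctly classified samples,
        # increase weights of misclassified ones
        w *= np.exp(-alpha * np.where(pred == y, 1.0, -1.0))
        w /= w.sum()                           # renormalise
    return w
```

The returned weight vector is what the oversampling algorithms described below would consume.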
Summary of the invention
The technical task of the present invention is to address the above shortcomings and provide a practical WeChat public account user classification method based on the AdaBoost algorithm.
A WeChat public account user classification method based on the AdaBoost algorithm, whose specific implementation process is:
When a WeChat public account user performs daily operations and queries, the operation and query information is iterated by the boundary-data oversampling algorithm; the data with large weights are taken as boundary data, the boundary data are sampled, and the users are classified; when the user performs a click operation again, the system pushes hot-spot messages.
The boundary-data oversampling algorithm is divided into the following two algorithms, according to how the boundary data are selected: a random oversampling algorithm, suitable when the boundary-data weights are large, which selects boundary data probabilistically according to their weights; and a minority-class boundary-data selection algorithm, which forcibly selects the minority-class data with large weights.
The detailed process of the random oversampling algorithm is:
Input: an original training set T, a data set containing n samples;
Output: a classification model;
For the original training set T, use the AdaBoost algorithm for preprocessing, then separate the majority class and the minority class in T, storing them in a majority-class set and a minority-class set respectively and recording the size of each;
According to the sample weights of the minority-class set, randomly select samples and add them to a set of newly generated minority-class samples; repeat until the oversampling multiple k is satisfied;
Merge all newly generated minority-class samples into the minority-class set;
Merge the minority-class set with the majority-class set to form a new training set;
Train the classifier on the new training set;
The algorithm ends.
The detailed process of the minority-class boundary-data selection algorithm is:
Input: an original training set T, a data set containing n samples;
Output: a classification model;
For the original training set T, use the AdaBoost algorithm for preprocessing, then separate the majority class and the minority class in T, storing them in a majority-class set and a minority-class set respectively and recording the size of each, and compute the average weight of the samples in the minority-class set;
According to this average weight, divide the minority-class set into two parts: the samples whose weight is greater than the average are stored in one subset and the samples whose weight is less than the average are stored in another subset, recording the size of each;
According to the sampling multiple k, randomly oversample the subset with above-average weights and add all newly generated minority-class samples to the minority-class set;
Merge the minority-class set with the majority-class set to form a new training set;
Train the classifier on the new training set;
The algorithm ends.
The WeChat public account user classification method based on the AdaBoost algorithm of the present invention has the following advantages:
The method proposed by the present invention solves the problem of classifying WeChat public account users and pushing messages to them more accurately. After multiple iterations of the AdaBoost algorithm, experimental results demonstrate the effectiveness of the algorithm; the method can be used to improve the accuracy of message pushing, is highly practical, and is easy to popularize.
Accompanying drawing explanation
Figure 1 is a table of the characteristics of the data sets used in the present invention and the settings of the oversampling multiple k.
Figure 2 is a table comparing the algorithms on TP rate.
Figure 3 is a table comparing the algorithms on F-value.
Figure 4 is a table comparing the algorithms on G-mean.
Embodiment
The invention will be further described below with reference to the drawings and specific embodiments.
The invention discloses a WeChat public account user classification method based on the AdaBoost algorithm. After multiple iterations of the AdaBoost algorithm, the data with larger weights are selected as boundary data and this portion of boundary data is oversampled, so that the data set tends towards balance and the classification accuracy of the minority class is improved. The AdaBoost algorithm is executed within the WeChat public account framework; according to the users' daily operations and query information, users are classified so that bidding information, price information and the like can be pushed precisely; when a user clicks operations such as subscribing to bids, the system pushes hot-spot information.
The specific implementation process is:
When a WeChat public account user performs daily operations and queries, the operation and query information is iterated by the boundary-data oversampling algorithm; the data with large weights are taken as boundary data, the boundary data are sampled, and the users are classified; when the user performs a click operation again, the system pushes hot-spot messages.
The boundary-data oversampling algorithm is divided into the following two algorithms, according to how the boundary data are selected: a random oversampling algorithm, suitable when the boundary-data weights are large, which selects boundary data probabilistically according to their weights; and a minority-class boundary-data selection algorithm, which forcibly selects the minority-class data with large weights.
The random oversampling algorithm is referred to as the BOBA-1 algorithm, and its detailed process is:
Input: an original training set T, a data set containing n samples;
Output: a classification model;
For the original training set T, use the AdaBoost algorithm for preprocessing, then separate the majority class and the minority class in T, storing them in a majority-class set and a minority-class set respectively and recording the size of each;
According to the sample weights of the minority-class set, randomly select samples and add them to a set of newly generated minority-class samples; repeat until the oversampling multiple k is satisfied;
Merge all newly generated minority-class samples into the minority-class set;
Merge the minority-class set with the majority-class set to form a new training set;
Train the classifier on the new training set;
The algorithm ends.
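As a concrete illustration of the steps above, here is a minimal Python sketch of a BOBA-1-style oversampling pass. It assumes the per-sample weights come from an AdaBoost preprocessing step such as the earlier sketch; the function name boba1_oversample and the exact bookkeeping are illustrative assumptions, not the patent's reference implementation.

```python
import numpy as np

def boba1_oversample(X, y, weights, minority_label, k=2, rng=None):
    """BOBA-1-style sketch: draw extra minority samples with probability
    proportional to their AdaBoost weights until the minority class has
    roughly k times its original size, then merge with the majority class.
    """
    rng = np.random.default_rng(rng)
    minority = np.where(y == minority_label)[0]
    majority = np.where(y != minority_label)[0]

    # selection probability proportional to weight within the minority set
    p = weights[minority] / weights[minority].sum()
    n_new = (k - 1) * len(minority)        # extra copies to reach k times the size
    picked = rng.choice(minority, size=n_new, replace=True, p=p)

    new_idx = np.concatenate([majority, minority, picked])
    return X[new_idx], y[new_idx]
```

In line with the experiments described later, k would be chosen so that the oversampled minority class is roughly the same size as the majority class.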
From the BOBA-1 algorithm above it can be seen that, after preprocessing with the AdaBoost algorithm, it is easy to distinguish the boundary data from the other data in the minority-class set. Boundary data have larger weights and therefore a larger probability of being selected; this not only increases the number of minority-class samples, but specifically increases the boundary data that are more helpful for classification. Here we use a random oversampling algorithm instead of the SMOTE algorithm, because we believe that in some cases a simpler algorithm works better and makes it easier to verify the feasibility of the new algorithm. In this algorithm the boundary data are selected only according to probability, so some boundary data may never be selected. For this reason we designed the following algorithm, which strengthens the selection of minority-class boundary data by forcibly selecting the minority-class data with larger weights instead of selecting them probabilistically according to weight.
The minority-class boundary-data selection algorithm is referred to as the BOBA-2 algorithm, and its detailed process is:
Input: an original training set T, a data set containing n samples;
Output: a classification model;
For the original training set T, use the AdaBoost algorithm for preprocessing, then separate the majority class and the minority class in T, storing them in a majority-class set and a minority-class set respectively and recording the size of each, and compute the average weight of the samples in the minority-class set;
According to this average weight, divide the minority-class set into two parts: the samples whose weight is greater than the average are stored in one subset and the samples whose weight is less than the average are stored in another subset, recording the size of each;
According to the sampling multiple k, randomly oversample the subset with above-average weights and add all newly generated minority-class samples to the minority-class set;
Merge the minority-class set with the majority-class set to form a new training set;
Train the classifier on the new training set;
The algorithm ends.
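For comparison with the previous sketch, here is a minimal Python sketch of a BOBA-2-style pass: only the minority samples whose AdaBoost weight exceeds the minority average are treated as boundary data and oversampled. The function name boba2_oversample and the fallback for the degenerate case are illustrative assumptions, not the patent's reference implementation.

```python
import numpy as np

def boba2_oversample(X, y, weights, minority_label, k=2, rng=None):
    """BOBA-2-style sketch: forcibly oversample the above-average-weight
    minority samples (the assumed boundary data) by the factor k."""
    rng = np.random.default_rng(rng)
    minority = np.where(y == minority_label)[0]
    majority = np.where(y != minority_label)[0]

    avg_w = weights[minority].mean()
    boundary = minority[weights[minority] > avg_w]   # above-average weight
    if len(boundary) == 0:                           # degenerate case: equal weights
        boundary = minority
    n_new = (k - 1) * len(minority)                  # extra samples to generate
    picked = rng.choice(boundary, size=n_new, replace=True)

    new_idx = np.concatenate([majority, minority, picked])
    return X[new_idx], y[new_idx]
```

Because the new samples are drawn only from the small boundary subset, repeated copies are likely, which matches the overfitting risk discussed in the analysis below.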
In the BOBA-2 algorithm we select the minority-class samples with larger weights and oversample them accordingly; compared with the BOBA-1 algorithm, it focuses more on the selection of boundary data.
Embodiment:
An algorithm experiment is now carried out. The data sets used come from UCI; 14 imbalanced data sets were selected for the test, some of which are two-class imbalance problems and some of which were manually modified into two-class imbalanced data sets. Meanwhile, different settings of the oversampling multiple k were used so that each data set becomes roughly balanced; the characteristics of the data sets and the settings of k are shown in Figure 1.
The experiments use ten-fold cross-validation on the weka platform; the minority class is oversampled so that its size is roughly the same as that of the majority class, reaching balance. The evaluation criteria are TP rate, F-value and G-mean (a sketch of these metrics follows the list of compared algorithms below). The experiments compare the following algorithms:
C4.5 algorithm: the whole data set T participates in training;
C4.5+Random algorithm (abbreviated Ran): the minority-class set is randomly oversampled so that the data set becomes roughly balanced;
C4.5+BOBA-1 algorithm (abbreviated B-1): the data set is processed with the BOBA-1 algorithm and then trained with C4.5;
C4.5+BOBA-2 algorithm (abbreviated B-2): the data set is processed with the BOBA-2 algorithm and then trained with C4.5;
Considering the randomness of the random oversampling algorithms, each algorithm's result is averaged over 10 runs.
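For reference, the following is a minimal Python sketch of the three evaluation criteria (TP rate, F-value and G-mean) as they are usually defined for two-class imbalance experiments; the patent does not give the formulas, so this follows the standard definitions and the helper name imbalance_metrics is an assumption.

```python
import numpy as np

def imbalance_metrics(y_true, y_pred, minority_label):
    """Standard two-class imbalance metrics: TP rate (minority recall),
    F-value and G-mean."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pos = y_true == minority_label
    neg = ~pos
    tp = np.sum((y_pred == minority_label) & pos)
    fn = np.sum((y_pred != minority_label) & pos)
    fp = np.sum((y_pred == minority_label) & neg)
    tn = np.sum((y_pred != minority_label) & neg)

    tp_rate = tp / (tp + fn) if (tp + fn) else 0.0      # minority recall
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f_value = (2 * precision * tp_rate / (precision + tp_rate)
               if (precision + tp_rate) else 0.0)
    tn_rate = tn / (tn + fp) if (tn + fp) else 0.0       # majority recall
    g_mean = np.sqrt(tp_rate * tn_rate)                  # overall balance
    return tp_rate, f_value, g_mean
```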
BOBA algorithm experimental results and analysis:
Figure 2 shows the comparison of the algorithms on TP rate, which reflects the performance of the algorithms in classifying the minority class. Bold entries in the table indicate the maximum value in each row. It can be seen that the proposed algorithms perform better on 11 data sets, reflecting their advantage in improving the classification accuracy of the minority class.
Figure 3 shows the comparison of the algorithms on F-value. The proposed algorithms perform better on 11 data sets; on the cmc data set they are slightly lower than the random oversampling algorithm but still improve the classification accuracy of the minority class; on the haberman data set they are also lower than the random oversampling algorithm while the minority-class accuracy is the same; on the hepatitis data set both C4.5+Random and our algorithms are lower than C4.5, although the minority-class accuracy is improved.
Figure 4 shows the comparison of the algorithms on G-mean, which reflects the overall classification of the data set. It can be seen that our algorithms perform better on 11 data sets; on the haberman and hepatitis data sets they are not as good as C4.5+Random, and on the vowel data set they are slightly lower than C4.5+Random, but TP rate and F-value both increase.
By observing the three tables it can be seen that the BOBA-2 algorithm in most cases does not perform as well as the BOBA-1 algorithm. In fact, BOBA-2 pays excessive attention to the boundary data and suffers from overfitting: especially when the number of minority-class samples above the average weight is small, they are oversampled repeatedly, which easily produces overfitting and causes the classification performance to decline or even become poor, for example on the mfeat-m data set.
Recalling the BOBA-1 algorithm, we can see that BOBA-1 selects samples probabilistically according to their weights, so samples with large weights have a large probability of being selected; it is therefore relatively stable, and the experimental results also show that the results of each run of BOBA-1 are almost identical. In contrast, the C4.5+Random and BOBA-2 algorithms have greater randomness; because the random numbers differ each time, the results fluctuate considerably, and to ensure reliable results the average of multiple runs is needed. Therefore BOBA-1 also has an advantage in running time.
From the above analysis it can be seen that the BOBA-1 and BOBA-2 algorithms improve the minority-class accuracy without excessively sacrificing the classification accuracy of the majority class, and can effectively improve the performance of the classifier.
The boundary-data oversampling algorithm based on AdaBoost of the present invention selects, after multiple iterations of the AdaBoost algorithm, the data with larger weights as boundary data and oversamples this portion of boundary data, so that the data set tends towards balance and the classification accuracy of the minority class is improved; the experimental results demonstrate the effectiveness of the algorithm.
The above embodiment is only a concrete case of the present invention; the scope of patent protection of the present invention includes but is not limited to the above embodiment. Any WeChat public account user classification method based on the AdaBoost algorithm according to the claims of the invention, and any appropriate changes or replacements made to it by a person of ordinary skill in the relevant technical field, shall fall within the scope of patent protection of the present invention.

Claims (4)

1. A WeChat public account user classification method based on the AdaBoost algorithm, characterized in that its specific implementation process is: when a WeChat public account user performs daily operations and queries, the operation and query information is iterated by a boundary-data oversampling algorithm; the data with large weights are taken as boundary data, the boundary data are sampled, and the users are classified; when the user performs a click operation again, the system pushes hot-spot messages.
2. The WeChat public account user classification method based on the AdaBoost algorithm according to claim 1, characterized in that the boundary-data oversampling algorithm is divided into the following two algorithms, according to how the boundary data are selected: a random oversampling algorithm, suitable when the boundary-data weights are large, which selects boundary data probabilistically according to their weights; and a minority-class boundary-data selection algorithm, which forcibly selects the minority-class data with large weights.
3. The WeChat public account user classification method based on the AdaBoost algorithm according to claim 2, characterized in that the detailed process of the random oversampling algorithm is:
Input: an original training set T, a data set containing n samples;
Output: a classification model;
For the original training set T, use the AdaBoost algorithm for preprocessing, then separate the majority class and the minority class in T, storing them in a majority-class set and a minority-class set respectively and recording the size of each;
According to the sample weights of the minority-class set, randomly select samples and add them to a set of newly generated minority-class samples; repeat until the oversampling multiple k is satisfied;
Merge all newly generated minority-class samples into the minority-class set;
Merge the minority-class set with the majority-class set to form a new training set;
Train the classifier on the new training set;
The algorithm ends.
4. The WeChat public account user classification method based on the AdaBoost algorithm according to claim 2, characterized in that the detailed process of the minority-class boundary-data selection algorithm is:
Input: an original training set T, a data set containing n samples;
Output: a classification model;
For the original training set T, use the AdaBoost algorithm for preprocessing, then separate the majority class and the minority class in T, storing them in a majority-class set and a minority-class set respectively and recording the size of each, and compute the average weight of the samples in the minority-class set;
According to this average weight, divide the minority-class set into two parts: the samples whose weight is greater than the average are stored in one subset and the samples whose weight is less than the average are stored in another subset, recording the size of each;
According to the sampling multiple k, randomly oversample the subset with above-average weights and add all newly generated minority-class samples to the minority-class set;
Merge the minority-class set with the majority-class set to form a new training set;
Train the classifier on the new training set;
The algorithm ends.
CN201510135936.XA 2015-03-26 2015-03-26 Wechat public account user classifying method based on AdaBoost algorithm Pending CN104731919A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510135936.XA CN104731919A (en) 2015-03-26 2015-03-26 Wechat public account user classifying method based on AdaBoost algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510135936.XA CN104731919A (en) 2015-03-26 2015-03-26 Wechat public account user classifying method based on AdaBoost algorithm

Publications (1)

Publication Number Publication Date
CN104731919A true CN104731919A (en) 2015-06-24

Family

ID=53455806

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510135936.XA Pending CN104731919A (en) 2015-03-26 2015-03-26 Wechat public account user classifying method based on AdaBoost algorithm

Country Status (1)

Country Link
CN (1) CN104731919A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105262819A (en) * 2015-10-29 2016-01-20 努比亚技术有限公司 Mobile terminal and method thereof for achieving push
CN105787025A (en) * 2016-02-24 2016-07-20 腾讯科技(深圳)有限公司 Network platform public account classifying method and device
WO2017125020A1 (en) * 2016-01-22 2017-07-27 腾讯科技(深圳)有限公司 Message processing method, device and system
CN107067032A (en) * 2017-03-30 2017-08-18 东软集团股份有限公司 The method and apparatus of data classification
CN112819020A (en) * 2019-11-15 2021-05-18 富士通株式会社 Method and device for training classification model and classification method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101668176A (en) * 2009-09-25 2010-03-10 北京酷联天下科技有限公司 Multimedia content-on-demand and sharing method based on social interaction graph
US7844085B2 (en) * 2007-06-07 2010-11-30 Seiko Epson Corporation Pairwise feature learning with boosting for use in face detection

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7844085B2 (en) * 2007-06-07 2010-11-30 Seiko Epson Corporation Pairwise feature learning with boosting for use in face detection
CN101668176A (en) * 2009-09-25 2010-03-10 北京酷联天下科技有限公司 Multimedia content-on-demand and sharing method based on social interaction graph

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SUN XIAOYAN (孙晓燕): "Research on the Classification Problem of Imbalanced Data Sets", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105262819A (en) * 2015-10-29 2016-01-20 努比亚技术有限公司 Mobile terminal and method thereof for achieving push
CN105262819B (en) * 2015-10-29 2019-02-15 努比亚技术有限公司 A kind of mobile terminal and its method for realizing push
WO2017125020A1 (en) * 2016-01-22 2017-07-27 腾讯科技(深圳)有限公司 Message processing method, device and system
CN105787025A (en) * 2016-02-24 2016-07-20 腾讯科技(深圳)有限公司 Network platform public account classifying method and device
CN105787025B (en) * 2016-02-24 2021-07-09 腾讯科技(深圳)有限公司 Network platform public account classification method and device
CN107067032A (en) * 2017-03-30 2017-08-18 东软集团股份有限公司 The method and apparatus of data classification
CN107067032B (en) * 2017-03-30 2020-04-07 东软集团股份有限公司 Data classification method and device
CN112819020A (en) * 2019-11-15 2021-05-18 富士通株式会社 Method and device for training classification model and classification method

Similar Documents

Publication Publication Date Title
CN104731919A (en) Wechat public account user classifying method based on AdaBoost algorithm
CN109492026B (en) Telecommunication fraud classification detection method based on improved active learning technology
CN111860638B (en) Parallel intrusion detection method and system based on unbalanced data deep belief network
Zhao et al. A weighted hybrid ensemble method for classifying imbalanced data
Fernandes et al. A proactive intelligent decision support system for predicting the popularity of online news
Rodrigues et al. Gaussian process classification and active learning with multiple annotators
CN101587493B (en) Text classification method
CN103902570B (en) A kind of text classification feature extracting method, sorting technique and device
CN105787025B (en) Network platform public account classification method and device
CN107609121A (en) Newsletter archive sorting technique based on LDA and word2vec algorithms
CN107908715A (en) Microblog emotional polarity discriminating method based on Adaboost and grader Weighted Fusion
CN105975518B (en) Expectation cross entropy feature selecting Text Classification System and method based on comentropy
CN102156885B (en) Image classification method based on cascaded codebook generation
Goh et al. Comprehensive literature review on machine learning structures for web spam classification
CN104751182A (en) DDAG-based SVM multi-class classification active learning algorithm
CN105320967A (en) Multi-label AdaBoost integration method based on label correlation
CN105512916A (en) Advertisement accurate delivery method and advertisement accurate delivery system
CN107944460A (en) One kind is applied to class imbalance sorting technique in bioinformatics
Alsaafin et al. A minimal subset of features using feature selection for handwritten digit recognition
CN109766911A (en) A kind of behavior prediction method
CN106251241A (en) A kind of feature based selects the LR Bagging algorithm improved
CN106529726A (en) Method of performing classification and recommendation based on stock prediction trends
CN108537279A (en) Based on the data source grader construction method for improving Adaboost algorithm
Foozy et al. A comparative study with RapidMiner and WEKA tools over some classification techniques for SMS spam
Peng Adaptive sampling with optimal cost for class-imbalance learning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150624

WD01 Invention patent application deemed withdrawn after publication