CN104731919A - Wechat public account user classifying method based on AdaBoost algorithm - Google Patents

Wechat public account user classifying method based on AdaBoost algorithm

Info

Publication number
CN104731919A
CN104731919A (application CN201510135936.XA)
Authority
CN
China
Prior art keywords
algorithm
data
minority class
oversampling
data boundary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510135936.XA
Other languages
Chinese (zh)
Inventor
李云超
孙晓燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Group Co Ltd
Original Assignee
Inspur Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Group Co Ltd filed Critical Inspur Group Co Ltd
Priority to CN201510135936.XA priority Critical patent/CN104731919A/en
Publication of CN104731919A publication Critical patent/CN104731919A/en
Pending legal-status Critical Current

Abstract

The invention discloses a WeChat public account user classification method based on the AdaBoost algorithm. When WeChat public account users perform their daily operations and queries, the operation and query information is iterated by a boundary-data oversampling algorithm; the data with large weights are then taken as boundary data, the boundary data are oversampled, and the users are classified, so that when a user clicks again the system pushes hot-spot information. Compared with the prior art, the method has the advantages that, after multiple iterations of the AdaBoost algorithm, experimental results demonstrate the effectiveness of the algorithm; the accuracy of information pushed to users can be improved; and the method is highly practical and easy to popularize.

Description

A WeChat public account user classification method based on the AdaBoost algorithm
Technical field
The present invention relates to the technical field of data processing, and specifically to a practical WeChat public account user classification method based on the AdaBoost algorithm.
Background technology
Classification refers to finding the rules hidden in data through the analysis and study of existing data, and using these rules to predict and judge future data. With the rapid development of the information society and in the face of massive data, classification techniques are particularly important; classification is also an important method in machine learning and data mining. Existing classical classification algorithms, such as KNN, Bayesian algorithms, neural network algorithms, support vector machines and ensemble learning, have all been widely applied and achieve good results.
In the mobile Internet era, the wide spread of WeChat public accounts has opened up new directions for marketing, but the large amount of user information is rather dispersed and the class distribution is often skewed: the number of samples of one class in the data set is far smaller than the number of samples of the other classes, i.e. the data set distribution is imbalanced.
Because traditional algorithms assume that the data set distribution is roughly balanced and take the overall classification accuracy as the objective, classifiers trained on imbalanced data sets show an obvious preference for the numerically dominant majority class: the classification accuracy of the majority class is improved, while the classification accuracy of the minority class is low.
The AdaBoost algorithm is a commonly used boosting algorithm. Its basic idea is: initially give every sample in the sample set the same weight and train a weak classifier; then re-adjust the sample weights according to the classification result, decreasing the weights of correctly classified samples and increasing the weights of misclassified samples, so that subsequent rounds focus more on the misclassified samples. After multiple iterations the weights of the misclassified samples keep growing, and the final classification result is produced by a strong classifier formed by weighted voting over the multiple weak classifiers.
Usually, in a data set, the data on the class boundary have a larger influence on the classification result but are harder to classify, i.e. they are easily misclassified. In the AdaBoost algorithm, the weights of these hard-to-classify boundary data keep increasing as the iterations proceed. Therefore, we train on the data set with the AdaBoost algorithm and, after multiple iterations, take the hard-to-classify data with large weights as boundary data and oversample them. On this basis, a WeChat public account user classification method based on the AdaBoost algorithm is provided.
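The patent text contains no code; the following is a minimal Python sketch of the standard AdaBoost re-weighting loop that this idea relies on, showing how samples that are repeatedly misclassified (typically those near the class boundary) end up with the largest weights. The function name adaboost_sample_weights and the use of decision stumps are illustrative assumptions, not part of the patent.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_sample_weights(X, y, n_rounds=10):
    """Run AdaBoost re-weighting and return the final per-sample weights.

    Illustrative helper (not from the patent): samples that keep being
    misclassified accumulate large weights, which is the signal later
    used to identify boundary data.
    """
    n = len(y)
    w = np.full(n, 1.0 / n)                    # equal initial weights
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # weak-classifier weight
        # decrease weights of correctly classified samples,
        # increase weights of misclassified ones
        w *= np.exp(-alpha * np.where(pred == y, 1.0, -1.0))
        w /= w.sum()                           # renormalise
    return w
```

The returned weight vector is what the oversampling algorithms described below would consume.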
Summary of the invention
The technical task of the present invention is to address the above shortcomings and provide a practical WeChat public account user classification method based on the AdaBoost algorithm.
A WeChat public account user classification method based on the AdaBoost algorithm, whose specific implementation process is:
When a WeChat public account user performs daily operations and queries, the operation and query information is iterated by the boundary-data oversampling algorithm; the data with large weights are taken as boundary data, the boundary data are sampled, and the users are classified; when the user performs a click operation again, the system pushes hot-spot messages.
The boundary-data oversampling algorithm is divided into the following two algorithms, according to how the boundary data are selected: a random oversampling algorithm, suitable when the boundary-data weights are large, which selects boundary data probabilistically according to their weights; and a minority-class boundary-data selection algorithm, which forcibly selects the minority-class data with large weights.
The detailed process of the random oversampling algorithm is:
Input: an original training set T, a data set containing n samples;
Output: a classification model;
For the original training set T, use the AdaBoost algorithm for preprocessing, then separate the majority class and the minority class in T, storing them in a majority-class set and a minority-class set respectively and recording the size of each;
According to the sample weights of the minority-class set, randomly select samples and add them to a set of newly generated minority-class samples; repeat until the oversampling multiple k is satisfied;
Merge all newly generated minority-class samples into the minority-class set;
Merge the minority-class set with the majority-class set to form a new training set;
Train the classifier on the new training set;
The algorithm ends.
The detailed process of the minority-class boundary-data selection algorithm is:
Input: an original training set T, a data set containing n samples;
Output: a classification model;
For the original training set T, use the AdaBoost algorithm for preprocessing, then separate the majority class and the minority class in T, storing them in a majority-class set and a minority-class set respectively and recording the size of each, and compute the average weight of the samples in the minority-class set;
According to this average weight, divide the minority-class set into two parts: the samples whose weight is greater than the average are stored in one subset and the samples whose weight is less than the average are stored in another subset, recording the size of each;
According to the sampling multiple k, randomly oversample the subset with above-average weights and add all newly generated minority-class samples to the minority-class set;
Merge the minority-class set with the majority-class set to form a new training set;
Train the classifier on the new training set;
The algorithm ends.
The WeChat public account user classification method based on the AdaBoost algorithm of the present invention has the following advantages:
The method proposed by the present invention solves the problem of classifying WeChat public account users and pushing messages to them more accurately. After multiple iterations of the AdaBoost algorithm, experimental results demonstrate the effectiveness of the algorithm; the method can be used to improve the accuracy of message pushing, is highly practical, and is easy to popularize.
Accompanying drawing explanation
Figure 1 is a table of the characteristics of the data sets used in the present invention and the settings of the oversampling multiple k.
Figure 2 is a table comparing the algorithms on TP rate.
Figure 3 is a table comparing the algorithms on F-value.
Figure 4 is a table comparing the algorithms on G-mean.
Embodiment
The invention will be further described below with reference to the drawings and specific embodiments.
The invention discloses a WeChat public account user classification method based on the AdaBoost algorithm. After multiple iterations of the AdaBoost algorithm, the data with larger weights are selected as boundary data and this portion of boundary data is oversampled, so that the data set tends towards balance and the classification accuracy of the minority class is improved. The AdaBoost algorithm is executed within the WeChat public account framework; according to the users' daily operations and query information, users are classified so that bidding information, price information and the like can be pushed precisely; when a user clicks operations such as subscribing to bids, the system pushes hot-spot information.
The specific implementation process is:
When a WeChat public account user performs daily operations and queries, the operation and query information is iterated by the boundary-data oversampling algorithm; the data with large weights are taken as boundary data, the boundary data are sampled, and the users are classified; when the user performs a click operation again, the system pushes hot-spot messages.
The boundary-data oversampling algorithm is divided into the following two algorithms, according to how the boundary data are selected: a random oversampling algorithm, suitable when the boundary-data weights are large, which selects boundary data probabilistically according to their weights; and a minority-class boundary-data selection algorithm, which forcibly selects the minority-class data with large weights.
The random oversampling algorithm is referred to as the BOBA-1 algorithm, and its detailed process is:
Input: an original training set T, a data set containing n samples;
Output: a classification model;
For the original training set T, use the AdaBoost algorithm for preprocessing, then separate the majority class and the minority class in T, storing them in a majority-class set and a minority-class set respectively and recording the size of each;
According to the sample weights of the minority-class set, randomly select samples and add them to a set of newly generated minority-class samples; repeat until the oversampling multiple k is satisfied;
Merge all newly generated minority-class samples into the minority-class set;
Merge the minority-class set with the majority-class set to form a new training set;
Train the classifier on the new training set;
The algorithm ends.
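As a concrete illustration of the steps above, here is a minimal Python sketch of a BOBA-1-style oversampling pass. It assumes the per-sample weights come from an AdaBoost preprocessing step such as the earlier sketch; the function name boba1_oversample and the exact bookkeeping are illustrative assumptions, not the patent's reference implementation.

```python
import numpy as np

def boba1_oversample(X, y, weights, minority_label, k=2, rng=None):
    """BOBA-1-style sketch: draw extra minority samples with probability
    proportional to their AdaBoost weights until the minority class has
    roughly k times its original size, then merge with the majority class.
    """
    rng = np.random.default_rng(rng)
    minority = np.where(y == minority_label)[0]
    majority = np.where(y != minority_label)[0]

    # selection probability proportional to weight within the minority set
    p = weights[minority] / weights[minority].sum()
    n_new = (k - 1) * len(minority)        # extra copies to reach k times the size
    picked = rng.choice(minority, size=n_new, replace=True, p=p)

    new_idx = np.concatenate([majority, minority, picked])
    return X[new_idx], y[new_idx]
```

In line with the experiments described later, k would be chosen so that the oversampled minority class is roughly the same size as the majority class.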
From the BOBA-1 algorithm above it can be seen that, after preprocessing with the AdaBoost algorithm, it is easy to distinguish the boundary data from the other data in the minority-class set. Boundary data have larger weights and therefore a larger probability of being selected; this not only increases the number of minority-class samples, but specifically increases the boundary data that are more helpful for classification. Here we use a random oversampling algorithm instead of the SMOTE algorithm, because we believe that in some cases a simpler algorithm works better and makes it easier to verify the feasibility of the new algorithm. In this algorithm the boundary data are selected only according to probability, so some boundary data may never be selected. For this reason we designed the following algorithm, which strengthens the selection of minority-class boundary data by forcibly selecting the minority-class data with larger weights instead of selecting them probabilistically according to weight.
The minority-class boundary-data selection algorithm is referred to as the BOBA-2 algorithm, and its detailed process is:
Input: an original training set T, a data set containing n samples;
Output: a classification model;
For the original training set T, use the AdaBoost algorithm for preprocessing, then separate the majority class and the minority class in T, storing them in a majority-class set and a minority-class set respectively and recording the size of each, and compute the average weight of the samples in the minority-class set;
According to this average weight, divide the minority-class set into two parts: the samples whose weight is greater than the average are stored in one subset and the samples whose weight is less than the average are stored in another subset, recording the size of each;
According to the sampling multiple k, randomly oversample the subset with above-average weights and add all newly generated minority-class samples to the minority-class set;
Merge the minority-class set with the majority-class set to form a new training set;
Train the classifier on the new training set;
The algorithm ends.
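For comparison with the previous sketch, here is a minimal Python sketch of a BOBA-2-style pass: only the minority samples whose AdaBoost weight exceeds the minority average are treated as boundary data and oversampled. The function name boba2_oversample and the fallback for the degenerate case are illustrative assumptions, not the patent's reference implementation.

```python
import numpy as np

def boba2_oversample(X, y, weights, minority_label, k=2, rng=None):
    """BOBA-2-style sketch: forcibly oversample the above-average-weight
    minority samples (the assumed boundary data) by the factor k."""
    rng = np.random.default_rng(rng)
    minority = np.where(y == minority_label)[0]
    majority = np.where(y != minority_label)[0]

    avg_w = weights[minority].mean()
    boundary = minority[weights[minority] > avg_w]   # above-average weight
    if len(boundary) == 0:                           # degenerate case: equal weights
        boundary = minority
    n_new = (k - 1) * len(minority)                  # extra samples to generate
    picked = rng.choice(boundary, size=n_new, replace=True)

    new_idx = np.concatenate([majority, minority, picked])
    return X[new_idx], y[new_idx]
```

Because the new samples are drawn only from the small boundary subset, repeated copies are likely, which matches the overfitting risk discussed in the analysis below.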
In the BOBA-2 algorithm we select the minority-class samples with larger weights and oversample them accordingly; compared with the BOBA-1 algorithm, it focuses more on the selection of boundary data.
Embodiment:
An algorithm experiment is now carried out. The data sets used come from UCI; 14 imbalanced data sets were selected for the test, some of which are two-class imbalance problems and some of which were manually modified into two-class imbalanced data sets. Meanwhile, different settings of the oversampling multiple k were used so that each data set becomes roughly balanced; the characteristics of the data sets and the settings of k are shown in Figure 1.
The experiments use ten-fold cross-validation on the weka platform; the minority class is oversampled so that its size is roughly the same as that of the majority class, reaching balance. The evaluation criteria are TP rate, F-value and G-mean (a sketch of these metrics follows the list of compared algorithms below). The experiments compare the following algorithms:
C4.5 algorithm: the whole data set T participates in training;
C4.5+Random algorithm (abbreviated Ran): the minority-class set is randomly oversampled so that the data set becomes roughly balanced;
C4.5+BOBA-1 algorithm (abbreviated B-1): the data set is processed with the BOBA-1 algorithm and then trained with C4.5;
C4.5+BOBA-2 algorithm (abbreviated B-2): the data set is processed with the BOBA-2 algorithm and then trained with C4.5;
Considering the randomness of the random oversampling algorithms, each algorithm's result is averaged over 10 runs.
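For reference, the following is a minimal Python sketch of the three evaluation criteria (TP rate, F-value and G-mean) as they are usually defined for two-class imbalance experiments; the patent does not give the formulas, so this follows the standard definitions and the helper name imbalance_metrics is an assumption.

```python
import numpy as np

def imbalance_metrics(y_true, y_pred, minority_label):
    """Standard two-class imbalance metrics: TP rate (minority recall),
    F-value and G-mean."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pos = y_true == minority_label
    neg = ~pos
    tp = np.sum((y_pred == minority_label) & pos)
    fn = np.sum((y_pred != minority_label) & pos)
    fp = np.sum((y_pred == minority_label) & neg)
    tn = np.sum((y_pred != minority_label) & neg)

    tp_rate = tp / (tp + fn) if (tp + fn) else 0.0      # minority recall
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f_value = (2 * precision * tp_rate / (precision + tp_rate)
               if (precision + tp_rate) else 0.0)
    tn_rate = tn / (tn + fp) if (tn + fp) else 0.0       # majority recall
    g_mean = np.sqrt(tp_rate * tn_rate)                  # overall balance
    return tp_rate, f_value, g_mean
```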
BOBA algorithm experimental results and analysis:
Figure 2 shows the comparison of the algorithms on TP rate, which reflects the performance of the algorithms in classifying the minority class. Bold entries in the table indicate the maximum value in each row. It can be seen that the proposed algorithms perform better on 11 data sets, reflecting their advantage in improving the classification accuracy of the minority class.
Figure 3 shows the comparison of the algorithms on F-value. The proposed algorithms perform better on 11 data sets; on the cmc data set they are slightly lower than the random oversampling algorithm but still improve the classification accuracy of the minority class; on the haberman data set they are also lower than the random oversampling algorithm while the minority-class accuracy is the same; on the hepatitis data set both C4.5+Random and our algorithms are lower than C4.5, although the minority-class accuracy is improved.
Figure 4 shows the comparison of the algorithms on G-mean, which reflects the overall classification of the data set. It can be seen that our algorithms perform better on 11 data sets; on the haberman and hepatitis data sets they are not as good as C4.5+Random, and on the vowel data set they are slightly lower than C4.5+Random, but TP rate and F-value both increase.
By observing the three tables it can be seen that the BOBA-2 algorithm in most cases does not perform as well as the BOBA-1 algorithm. In fact, BOBA-2 pays excessive attention to the boundary data and suffers from overfitting: especially when the number of minority-class samples above the average weight is small, they are oversampled repeatedly, which easily produces overfitting and causes the classification performance to decline or even become poor, for example on the mfeat-m data set.
Recalling the BOBA-1 algorithm, we can see that BOBA-1 selects samples probabilistically according to their weights, so samples with large weights have a large probability of being selected; it is therefore relatively stable, and the experimental results also show that the results of each run of BOBA-1 are almost identical. In contrast, the C4.5+Random and BOBA-2 algorithms have greater randomness; because the random numbers differ each time, the results fluctuate considerably, and to ensure reliable results the average of multiple runs is needed. Therefore BOBA-1 also has an advantage in running time.
From the above analysis it can be seen that the BOBA-1 and BOBA-2 algorithms improve the minority-class accuracy without excessively sacrificing the classification accuracy of the majority class, and can effectively improve the performance of the classifier.
The boundary-data oversampling algorithm based on AdaBoost of the present invention selects, after multiple iterations of the AdaBoost algorithm, the data with larger weights as boundary data and oversamples this portion of boundary data, so that the data set tends towards balance and the classification accuracy of the minority class is improved; the experimental results demonstrate the effectiveness of the algorithm.
The above embodiment is only a concrete case of the present invention; the scope of patent protection of the present invention includes but is not limited to the above embodiment. Any WeChat public account user classification method based on the AdaBoost algorithm according to the claims of the invention, and any appropriate changes or replacements made to it by a person of ordinary skill in the relevant technical field, shall fall within the scope of patent protection of the present invention.

Claims (4)

1. A WeChat public account user classification method based on the AdaBoost algorithm, characterized in that its specific implementation process is: when a WeChat public account user performs daily operations and queries, the operation and query information is iterated by a boundary-data oversampling algorithm; the data with large weights are taken as boundary data, the boundary data are sampled, and the users are classified; when the user performs a click operation again, the system pushes hot-spot messages.
2. The WeChat public account user classification method based on the AdaBoost algorithm according to claim 1, characterized in that the boundary-data oversampling algorithm is divided into the following two algorithms, according to how the boundary data are selected: a random oversampling algorithm, suitable when the boundary-data weights are large, which selects boundary data probabilistically according to their weights; and a minority-class boundary-data selection algorithm, which forcibly selects the minority-class data with large weights.
3. The WeChat public account user classification method based on the AdaBoost algorithm according to claim 2, characterized in that the detailed process of the random oversampling algorithm is:
Input: an original training set T, a data set containing n samples;
Output: a classification model;
For the original training set T, use the AdaBoost algorithm for preprocessing, then separate the majority class and the minority class in T, storing them in a majority-class set and a minority-class set respectively and recording the size of each;
According to the sample weights of the minority-class set, randomly select samples and add them to a set of newly generated minority-class samples; repeat until the oversampling multiple k is satisfied;
Merge all newly generated minority-class samples into the minority-class set;
Merge the minority-class set with the majority-class set to form a new training set;
Train the classifier on the new training set;
The algorithm ends.
4. The WeChat public account user classification method based on the AdaBoost algorithm according to claim 2, characterized in that the detailed process of the minority-class boundary-data selection algorithm is:
Input: an original training set T, a data set containing n samples;
Output: a classification model;
For the original training set T, use the AdaBoost algorithm for preprocessing, then separate the majority class and the minority class in T, storing them in a majority-class set and a minority-class set respectively and recording the size of each, and compute the average weight of the samples in the minority-class set;
According to this average weight, divide the minority-class set into two parts: the samples whose weight is greater than the average are stored in one subset and the samples whose weight is less than the average are stored in another subset, recording the size of each;
According to the sampling multiple k, randomly oversample the subset with above-average weights and add all newly generated minority-class samples to the minority-class set;
Merge the minority-class set with the majority-class set to form a new training set;
Train the classifier on the new training set;
The algorithm ends.
CN201510135936.XA 2015-03-26 2015-03-26 Wechat public account user classifying method based on AdaBoost algorithm Pending CN104731919A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510135936.XA CN104731919A (en) 2015-03-26 2015-03-26 Wechat public account user classifying method based on AdaBoost algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510135936.XA CN104731919A (en) 2015-03-26 2015-03-26 Wechat public account user classifying method based on AdaBoost algorithm

Publications (1)

Publication Number Publication Date
CN104731919A true CN104731919A (en) 2015-06-24

Family

ID=53455806

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510135936.XA Pending CN104731919A (en) 2015-03-26 2015-03-26 Wechat public account user classifying method based on AdaBoost algorithm

Country Status (1)

Country Link
CN (1) CN104731919A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105262819A (en) * 2015-10-29 2016-01-20 努比亚技术有限公司 Mobile terminal and method thereof for achieving push
CN105787025A (en) * 2016-02-24 2016-07-20 腾讯科技(深圳)有限公司 Network platform public account classifying method and device
WO2017125020A1 (en) * 2016-01-22 2017-07-27 腾讯科技(深圳)有限公司 Message processing method, device and system
CN107067032A (en) * 2017-03-30 2017-08-18 东软集团股份有限公司 The method and apparatus of data classification
CN112819020A (en) * 2019-11-15 2021-05-18 富士通株式会社 Method and device for training classification model and classification method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101668176A (en) * 2009-09-25 2010-03-10 北京酷联天下科技有限公司 Multimedia content-on-demand and sharing method based on social interaction graph
US7844085B2 (en) * 2007-06-07 2010-11-30 Seiko Epson Corporation Pairwise feature learning with boosting for use in face detection

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7844085B2 (en) * 2007-06-07 2010-11-30 Seiko Epson Corporation Pairwise feature learning with boosting for use in face detection
CN101668176A (en) * 2009-09-25 2010-03-10 北京酷联天下科技有限公司 Multimedia content-on-demand and sharing method based on social interaction graph

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SUN XIAOYAN (孙晓燕): "Research on the Classification Problem of Imbalanced Data Sets", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105262819A (en) * 2015-10-29 2016-01-20 努比亚技术有限公司 Mobile terminal and method thereof for achieving push
CN105262819B (en) * 2015-10-29 2019-02-15 努比亚技术有限公司 A kind of mobile terminal and its method for realizing push
WO2017125020A1 (en) * 2016-01-22 2017-07-27 腾讯科技(深圳)有限公司 Message processing method, device and system
CN105787025A (en) * 2016-02-24 2016-07-20 腾讯科技(深圳)有限公司 Network platform public account classifying method and device
CN105787025B (en) * 2016-02-24 2021-07-09 腾讯科技(深圳)有限公司 Network platform public account classification method and device
CN107067032A (en) * 2017-03-30 2017-08-18 东软集团股份有限公司 The method and apparatus of data classification
CN107067032B (en) * 2017-03-30 2020-04-07 东软集团股份有限公司 Data classification method and device
CN112819020A (en) * 2019-11-15 2021-05-18 富士通株式会社 Method and device for training classification model and classification method

Similar Documents

Publication Publication Date Title
CN104731919A (en) Wechat public account user classifying method based on AdaBoost algorithm
CN109492026B (en) Telecommunication fraud classification detection method based on improved active learning technology
CN111860638B (en) Parallel intrusion detection method and system based on unbalanced data deep belief network
Zhao et al. A weighted hybrid ensemble method for classifying imbalanced data
Fernandes et al. A proactive intelligent decision support system for predicting the popularity of online news
Rodrigues et al. Gaussian process classification and active learning with multiple annotators
CN101587493B (en) Text classification method
CN103902570B (en) A kind of text classification feature extracting method, sorting technique and device
CN105787025B (en) Network platform public account classification method and device
CN107609121A (en) Newsletter archive sorting technique based on LDA and word2vec algorithms
CN107908715A (en) Microblog emotional polarity discriminating method based on Adaboost and grader Weighted Fusion
CN105975518B (en) Expectation cross entropy feature selecting Text Classification System and method based on comentropy
CN102156885B (en) Image classification method based on cascaded codebook generation
Goh et al. Comprehensive literature review on machine learning structures for web spam classification
CN104751182A (en) DDAG-based SVM multi-class classification active learning algorithm
CN105320967A (en) Multi-label AdaBoost integration method based on label correlation
CN105512916A (en) Advertisement accurate delivery method and advertisement accurate delivery system
CN107944460A (en) One kind is applied to class imbalance sorting technique in bioinformatics
Alsaafin et al. A minimal subset of features using feature selection for handwritten digit recognition
CN109766911A (en) A kind of behavior prediction method
CN106251241A (en) A kind of feature based selects the LR Bagging algorithm improved
CN106529726A (en) Method of performing classification and recommendation based on stock prediction trends
CN108537279A (en) Based on the data source grader construction method for improving Adaboost algorithm
Foozy et al. A comparative study with RapidMiner and WEKA tools over some classification techniques for SMS spam
Peng Adaptive sampling with optimal cost for class-imbalance learning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150624

WD01 Invention patent application deemed withdrawn after publication