CN104598452A - Method and device for analyzing user gender - Google Patents

Info

Publication number
CN104598452A
Authority
CN
China
Prior art keywords
user
cis
sex
monogram
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310526980.4A
Other languages
Chinese (zh)
Other versions
CN104598452B (en)
Inventor
丁若谷
陈家耀
冯是聪
吴明辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING SIBOTU INFORMATION TECHNOLOGY Co Ltd
Original Assignee
BEIJING SIBOTU INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING SIBOTU INFORMATION TECHNOLOGY Co Ltd
Priority to CN201310526980.4A
Publication of CN104598452A
Application granted
Publication of CN104598452B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Abstract

The invention provides a method and a device for analyzing a user gender, and relates to the field of data analysis. The problem that a traditional analysis mode is not suitable for an occasion that a personalized domain name and a name are weak in association can be solved. The method comprises the following steps: collecting a sample data set, wherein the sample data set comprises multiple pairs of user personalized domain names and the corresponding user genders; calculating an occurrence probability of the combinations of different letters on each sequential position and different letters on a plurality of adjacent sequential positions in each user personalized domain name in the sample data set; and taking a proportion of males in the sample data set and the probability as reference parameters to analyze the user personalized domain names with the unknown user gender, and judging the user gender. The technical scheme provided by the invention is suitable for data analysis and realizes the analysis of the user gender on the basis of an automation algorithm.

Description

User gender analysis method and device
Technical field
The present invention relates to the field of data analysis, and in particular to a user gender analysis method and device.
Background
On the Internet, a user's gender is an important piece of information. Knowing it, an Internet content provider (ICP) can present different content to different users. For example, male users may be more interested in e-sports than female users, and female users may be more interested in fashion apparel than male users. If a user's gender is identified, an Internet advertising provider can show e-sports advertisements to male users and fashion apparel advertisements to female users, making advertising better targeted and more effective.
When a user registers for a blog, microblog, or other social networking site, many service providers, once the required registration information is complete, invite the user to fill in personal attributes such as gender, age, and occupation, and to set a personalized domain name. Attributes that touch on user privacy are usually optional rather than required, so a considerable share of users leave them blank; for example, a user who wants to keep personal information from leaking may decline to give age or gender. As a result, data analysis organizations and the providers themselves cannot obtain the user's gender directly. Optional items that do not involve privacy, by contrast, tend to have a high completion rate. The personalized domain name is one example: to improve user experience and engagement, service providers often let users set a custom URL for their microblog or personal home page that reflects their personality. Users can set it to their own name, to numbers they like, or to a combination of letters, which is both fashionable and convenient. Because of gender differences, men and women often instinctively choose domain names that reflect their own attributes. For example, a user might register the personalized domain name http://weibo.com/basketballfans, where weibo.com is the domain name of the microblog service provider and basketballfans is the part chosen by the user. Inferring a user's gender from such a personalized domain name therefore collects user information without intruding on the user.
Among existing technologies, the closest is US Patent 7,447,996 [1]. It proposes a software module that infers the gender of a user of an instant messaging system from the user name and displays a different avatar for each gender. The approach depends on specific onomastic (name-study) data, that is, on the relationship between names and gender in a particular language. For example, the patent mentions that, for Chinese names, an onomastic database lookup suggests that "xiuxiu" and "lili" are more likely to be women's names.
An onomastic database does not suit many network application scenarios, in particular those where the personalized domain name is only weakly associated with the user's name. Personalized domain names typically contain many elements that fall outside the range of common names, and these are hard to analyze with onomastic data. For example, a personalized domain name may contain "basketball"; among the basketball fans who put that word in their domain name, men probably predominate. Adding entries such as "basketball corresponds to male" to the database would greatly increase the work required and would be difficult to complete.
Summary of the invention
The present invention provides a user gender analysis method and device, which solve the problem that existing analysis approaches are unsuitable where the personalized domain name is only weakly associated with the user's name.
A user gender analysis method comprises:
collecting a sample data set, the sample data set comprising multiple pairs of a user personalized domain name and the corresponding user gender;
counting, by gender, the probability of occurrence of the different letters at each position, and of the different letter combinations at several adjacent positions, in the user personalized domain names of the sample data set;
using the proportion of males in the sample data set and said probabilities as reference parameters, analyzing the user personalized domain name of a user of unknown gender and determining the user's gender.
Preferably, before the step of counting, by gender, the probability of occurrence of the different letters at each position and of the different letter combinations at several adjacent positions in the user personalized domain names of the sample data set, the method further comprises:
calculating the proportion of males in the sample data set.
Preferably, counting, by gender, the probability of occurrence of the different letters at each position, and of the letter combinations at several adjacent positions, in the user personalized domain names of the sample data set comprises:
Step a: take the user-specified part of one user personalized domain name and record the corresponding user gender;
Step b: count the number of occurrences of the letter at each position of the user-specified part and/or the number of occurrences of the different letter combinations at several adjacent positions;
Step c: process every user personalized domain name in the sample data set as in steps a to b, until the whole sample data set has been traversed;
Step d: total, for each gender, the number of occurrences of the letter at each position and/or of the letter combinations at several adjacent positions in the user personalized domain names, and calculate the probability with which each letter and/or letter combination occurs for each gender.
Preferably, totaling, for each gender, the number of occurrences of the letter at each position and/or of the letter combinations at several adjacent positions, and calculating the probability with which each letter and/or letter combination occurs for each gender, is specifically:
according to the expression

$$P(\text{n-gram corresponds to male}) = \frac{\mathrm{freq}(\text{n-gram},\ \text{male})}{\mathrm{freq}(\text{n-gram},\ \text{male}) + \mathrm{freq}(\text{n-gram},\ \text{female})}$$

calculating the probability that each letter at each position, and each letter combination at several adjacent positions, corresponds to a male. On the left-hand side, P(n-gram corresponds to male) is the probability that a letter combination of length n at adjacent positions corresponds to a male; when n is 1, it is the probability that the letter at a single position corresponds to a male. On the right-hand side, freq(n-gram, male) is the number of times the letter at a single position, or the letter combination of length n at adjacent positions, corresponds to a male, and freq(n-gram, female) is the number of times it corresponds to a female.
Preferably, using said probabilities as reference parameters, analyzing the user personalized domain name of a user of unknown gender and determining the user's gender comprises:
Step a: obtain the length of the user personalized domain name of the user of unknown gender, denoted k;
Step b: calculate, according to the expression, the probability that the user is male, where url denotes the user-specified part of the personalized domain name; substr(url, j, i) denotes the substring of url of length i that starts at the j-th character, which is a single letter when i is 1; n denotes the number of substrings substr(url, j, i); w_h denotes the weight of the letter or letter combination; and P(substr(url, j, i) corresponds to male in the sample data set) denotes the male probability of the letter or letter combination forming that substring;
Step c: compare the result of step b with the proportion of males in the sample data set;
Step d: when the result of step b is greater than or equal to the proportion of males in the sample data set, determine that the user of unknown gender is male.
Preferably, after step d, the method further comprises:
Step e: when the result of step b is less than the proportion of males in the sample data set, determine that the user of unknown gender is female.
The present invention also provides a user gender analysis device, comprising:
a sampling module, configured to collect a sample data set, the sample data set comprising multiple pairs of a user personalized domain name and the corresponding user gender;
a reference parameter calculation module, configured to count, by gender, the probability of occurrence of the different letters at each position, and of the different letter combinations at several adjacent positions, in the user personalized domain names of the sample data set;
an analysis module, configured to use the proportion of males in the sample data set and said probabilities as reference parameters, analyze the user personalized domain name of a user of unknown gender, and determine the user's gender.
Preferably, the device further comprises:
a reference proportion calculation module, configured to calculate the proportion of males in the sample data set.
Preferably, the reference parameter calculation module comprises:
a gender extraction unit, configured to take the user-specified part of one user personalized domain name and record the corresponding user gender;
a counting unit, configured to count the number of occurrences of the letter at each position of the user-specified part and/or the number of occurrences of the different letter combinations at several adjacent positions;
a statistics unit, configured to apply the processing of the counting unit to every user personalized domain name in the sample data set until the whole sample data set has been traversed, to total, for each gender, the number of occurrences of the letter at each position and/or of the letter combinations at several adjacent positions, and to calculate the probability with which each letter and/or letter combination occurs for each gender.
Preferably, the statistics unit calculates the probability with which each letter at each position and/or each letter combination at several adjacent positions occurs for each gender specifically as follows:
according to the expression

$$P(\text{n-gram corresponds to male}) = \frac{\mathrm{freq}(\text{n-gram},\ \text{male})}{\mathrm{freq}(\text{n-gram},\ \text{male}) + \mathrm{freq}(\text{n-gram},\ \text{female})}$$

it calculates the male probability of each letter at each position and of each letter combination at several adjacent positions. P(n-gram corresponds to male) is the probability that a letter combination of length n at adjacent positions corresponds to a male; when n is 1, it is the probability that a letter at a single position corresponds to a male. freq(n-gram, male) is the number of times a letter at a single position, or a letter combination of length n at adjacent positions, corresponds to a male, and freq(n-gram, female) is the number of times it corresponds to a female.
Preferably, the analysis module comprises:
a domain name length acquisition unit, configured to obtain the length of the user personalized domain name of the user of unknown gender, denoted k;
a probability calculation unit, configured to calculate, according to the expression, the probability that the user is male, where url denotes the user-specified part of the personalized domain name; substr(url, j, i) denotes the substring of url formed by the i adjacent characters starting at the j-th character; n denotes the number of substrings substr(url, j, i); w_h denotes the weight of the letter or letter combination; and P(substr(url, j, i) corresponds to male in the sample data set) denotes the male probability of the letter or letter combination forming that substring;
a comparison unit, configured to compare the result of the probability calculation unit with the proportion calculated by the reference proportion calculation module;
a determination unit, configured to determine that the user of unknown gender is male when the comparison unit finds that the result of the probability calculation unit is greater than or equal to the proportion calculated by the reference proportion calculation module.
Preferably, the determination unit is further configured to determine that the user of unknown gender is female when the comparison unit finds that the result of the probability calculation unit is less than the proportion calculated by the reference proportion calculation module.
The present invention provides a user gender analysis method and device. A sample data set comprising multiple pairs of a user personalized domain name and the corresponding user gender is collected; the probability of occurrence of the different letters at each position, and of the different letter combinations at several adjacent positions, in the user personalized domain names of the sample data set is counted by gender; and, using said probabilities as reference parameters, the user personalized domain name of a user of unknown gender is analyzed and the user's gender is determined. User gender analysis is thus performed by an automated algorithm, which is more flexible and more accurate, and which solves the problem that existing analysis approaches are unsuitable where the personalized domain name is only weakly associated with the user's name.
Brief description of the drawings
Fig. 1 is a flow chart of the user gender analysis method provided by embodiment one of the present invention;
Fig. 2 is a schematic structural diagram of the user gender analysis device provided by embodiment two of the present invention;
Fig. 3 is a schematic structural diagram of the reference parameter calculation module 202 in Fig. 2;
Fig. 4 is a schematic structural diagram of the analysis module 203 in Fig. 2.
Detailed description
The embodiments of the present invention provide a user gender analysis method and device that use an automated algorithm and thereby avoid dependence on an onomastic (name-study) database.
The embodiments of the present invention are described in detail below with reference to the accompanying drawings. Provided they do not conflict, the embodiments of this application and the features in them may be combined with one another in any way.
Embodiment one of the present invention is described first, with reference to the accompanying drawings.
This embodiment provides a user gender analysis method. The flow for performing user gender analysis with this method is shown in Fig. 1 and comprises:
Step 101: collect a sample data set comprising multiple pairs of a user personalized domain name and the corresponding user gender;
Step 102: calculate the proportion of males in the sample data set;
Step 103: for the sample data set, determine the proportion of males and count, by gender, the probability of occurrence of the different letters at each position, and of the different letter combinations at several adjacent positions, in the user personalized domain names;
This step specifically comprises:
Step a: take the user-specified part of one user personalized domain name and record the corresponding user gender;
Step b: calculate the proportion of males in the sample data set;
Step c: count the number of occurrences of the letter at each position of the user-specified part and/or the number of occurrences of the different letter combinations at several adjacent positions;
The occurrences of letters at single positions are counted as follows:
Add 1 to the occurrence count of the letter at the first position of the user personalized domain name, then add 1 to the occurrence count of the letter at the second position, and so on up to the last position of the domain name.
The occurrences of letter combinations at several adjacent positions are counted as follows:
First determine the length n of the string formed by the adjacent positions. Take the n positions starting at the first position of the user personalized domain name to form a string, and add 1 to the occurrence count of the letter combination in that string; then take the n positions starting at the second position to form a string, and add 1 to the occurrence count of its letter combination. Continue in this way until the last position of the string is the last position of the user personalized domain name. The value of n ranges from 2 to the length of the user personalized domain name.
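As a worked tally of the windowing rule above (an illustration derived from that rule, not a figure stated in the source): for a user-specified part of length k, a window of length n can start at k - n + 1 positions, so counting single letters (n = 1) together with every combination length up to k gives

$$\sum_{n=1}^{k} (k - n + 1) = \frac{k(k+1)}{2}$$

counted strings per domain name. For a nine-letter name, this is 9 · 10 / 2 = 45: nine single letters, eight two-letter combinations, and so on down to the single nine-letter string.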
Step d: process every user personalized domain name in the sample data set as in steps a to c, until the whole sample data set has been traversed;
Step e: total, for each gender, the number of occurrences of the letter at each position and/or of the letter combinations at several adjacent positions in the user personalized domain names, and calculate the probability with which each letter and/or letter combination occurs for each gender.
In this step, according to the expression

$$P(\text{n-gram corresponds to male}) = \frac{\mathrm{freq}(\text{n-gram},\ \text{male})}{\mathrm{freq}(\text{n-gram},\ \text{male}) + \mathrm{freq}(\text{n-gram},\ \text{female})}$$

the probability that each letter at each position, and each letter combination at several adjacent positions, corresponds to a male is calculated. On the left-hand side, P(n-gram corresponds to male) is the probability that a letter combination of length n at adjacent positions corresponds to a male; when n is 1, it is the probability that a letter at a single position corresponds to a male. On the right-hand side, freq(n-gram, male) is the number of times a letter at a single position, or a letter combination of length n at adjacent positions, corresponds to a male, and freq(n-gram, female) is the number of times it corresponds to a female.
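The counting and probability steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes the male probability of a letter combination is its male count divided by the sum of its male and female counts (which is how the surrounding definitions read), it treats combinations as position-independent n-grams, and the function names and the "M"/"F" labels are hypothetical.

```python
from collections import Counter


def count_ngrams(name):
    """Count every contiguous letter combination of length 1..len(name)."""
    k = len(name)
    return Counter(name[j:j + n] for n in range(1, k + 1) for j in range(k - n + 1))


def male_probabilities(samples):
    """Map each n-gram to male_count / (male_count + female_count).

    `samples` is an iterable of (name, gender) pairs; the gender labels
    "M" and "F" are an assumption made for this sketch.
    """
    male, female = Counter(), Counter()
    for name, gender in samples:
        (male if gender == "M" else female).update(count_ngrams(name))
    return {g: male[g] / (male[g] + female[g]) for g in male.keys() | female.keys()}


# Hypothetical toy sample set, chosen only to keep the arithmetic visible.
probs = male_probabilities([("ab", "M"), ("ab", "F"), ("ac", "M")])
print(probs["a"], probs["b"], probs["c"])
```

In this toy set the letter a occurs twice under male names and once under a female name, so its male probability is 2/3, while c occurs only under a male name and gets 1.0.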
Step 104: using said probabilities as reference parameters, analyze the user personalized domain name of a user of unknown gender and determine the user's gender;
This step specifically comprises:
Step a: obtain the length of the user personalized domain name of the user of unknown gender, denoted k;
Step b: calculate, according to the expression, the probability that the user is male, where url denotes the user-specified part of the personalized domain name; substr(url, j, i) denotes the substring of url formed by the i adjacent characters starting at the j-th character; n denotes the number of substrings substr(url, j, i); w_h denotes the weight of the letter or letter combination; and P(substr(url, j, i) corresponds to male in the sample data set) denotes the male probability of the letter or letter combination forming that substring;
Step c: compare the result of step b with the proportion of males calculated in step 102;
Step d: when the result of step b is greater than or equal to the proportion calculated in step 102, determine that the user of unknown gender is male;
Step e: when the result of step b is less than the proportion calculated in step 102, determine that the user of unknown gender is female.
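The decision procedure of step 104 can be sketched as follows. The exact expression is not reproduced in this text, so the aggregation here is an assumption: a weighted average of the male probabilities of all substrings, with uniform weights by default and with substrings missing from the probability table skipped. The names `male_score` and `classify`, and the probability table, are hypothetical.

```python
def male_score(name, probs, weight=lambda gram: 1.0):
    """Weighted average of per-substring male probabilities (assumed form)."""
    k = len(name)
    total = 0.0
    weight_sum = 0.0
    for i in range(1, k + 1):          # substring length
        for j in range(k - i + 1):     # 0-based start position
            gram = name[j:j + i]
            if gram in probs:          # assumption: unseen substrings are skipped
                w = weight(gram)
                total += w * probs[gram]
                weight_sum += w
    return total / weight_sum if weight_sum else None


def classify(name, probs, male_ratio):
    """Return "male" if the score reaches the sample male proportion."""
    score = male_score(name, probs)
    if score is None:
        return None  # no known substring; the source does not cover this case
    return "male" if score >= male_ratio else "female"


# Hypothetical probability table and threshold, for illustration only.
probs = {"a": 0.75, "b": 0.25, "ab": 1.0}
result = classify("ab", probs, male_ratio=0.5)
print(result)
```

With this table the three substrings of "ab" average to 2/3, which is at least the assumed male proportion of 0.5, so the name is classified as male. A different weighting function can be passed in to favour longer combinations.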
Embodiment two of the present invention is described below with reference to the accompanying drawings.
This embodiment provides a user gender analysis device, whose structure is shown in Fig. 2 and comprises:
a sampling module 201, configured to collect a sample data set, the sample data set comprising multiple pairs of a user personalized domain name and the corresponding user gender;
a reference parameter calculation module 202, configured to count, by gender, the probability of occurrence of the different letters at each position, and of the different letter combinations at several adjacent positions, in the user personalized domain names of the sample data set;
an analysis module 203, configured to use the proportion of males in the sample data set and said probabilities as reference parameters, analyze the user personalized domain name of a user of unknown gender, and determine the user's gender.
Preferably, the device further comprises:
a reference proportion calculation module 204, configured to calculate the proportion of males in the sample data set.
Preferably, the structure of the reference parameter calculation module 202 is shown in Fig. 3 and comprises:
a gender extraction unit 2021, configured to take the user-specified part of one user personalized domain name and record the corresponding user gender;
a counting unit 2022, configured to count the number of occurrences of the letter at each position of the user-specified part and/or the number of occurrences of the different letter combinations at several adjacent positions;
a statistics unit 2023, configured to apply the processing of the counting unit to every user personalized domain name in the sample data set until the whole sample data set has been traversed, to total, for each gender, the number of occurrences of the letter at each position and/or of the letter combinations at several adjacent positions, and to calculate the probability with which each letter and/or letter combination occurs for each gender.
Preferably, the statistics unit 2023 calculates the probability with which each letter at each position and/or each letter combination at several adjacent positions occurs for each gender specifically as follows:
according to the expression

$$P(\text{n-gram corresponds to male}) = \frac{\mathrm{freq}(\text{n-gram},\ \text{male})}{\mathrm{freq}(\text{n-gram},\ \text{male}) + \mathrm{freq}(\text{n-gram},\ \text{female})}$$

it calculates the male probability of each letter at each position and of each letter combination at several adjacent positions. P(n-gram corresponds to male) is the probability that a letter combination of length n at adjacent positions corresponds to a male; when n is 1, it is the probability that a letter at a single position corresponds to a male. freq(n-gram, male) is the number of times a letter at a single position, or a letter combination of length n at adjacent positions, corresponds to a male, and freq(n-gram, female) is the number of times it corresponds to a female.
Preferably, the structure of the analysis module 203 is shown in Fig. 4 and comprises:
a domain name length acquisition unit 2031, configured to obtain the length of the user personalized domain name of the user of unknown gender, denoted k;
a probability calculation unit 2032, configured to calculate, according to the expression, the probability that the user is male, where url denotes the user-specified part of the personalized domain name; substr(url, j, i) denotes the substring of url formed by the i adjacent characters starting at the j-th character; n denotes the number of substrings substr(url, j, i); w_h denotes the weight of the letter or letter combination; and P(substr(url, j, i) corresponds to male in the sample data set) denotes the male probability of the letter or letter combination forming that substring;
a comparison unit 2033, configured to compare the result of the probability calculation unit 2032 with the proportion calculated by the reference proportion calculation module 204;
a determination unit 2034, configured to determine that the user of unknown gender is male when the comparison unit 2033 finds that the result of the probability calculation unit 2032 is greater than or equal to the proportion calculated by the reference proportion calculation module 204.
Preferably, the determination unit 2034 is further configured to determine that the user of unknown gender is female when the comparison unit 2033 finds that the result of the probability calculation unit 2032 is less than the proportion calculated by the reference proportion calculation module 204.
Below in conjunction with accompanying drawing, embodiments of the invention three are described.
The embodiment of the invention discloses a kind of user's gender analysis system, the individual character domain name apply for according to user, had or use, classifies to the sex of user automatically.First the embodiment of the present invention obtains the sample data collection of the corresponding relation of individual character domain name and user's sex by means such as user data statistics, business associate, then the part that in individual character domain name, user specifies is analyzed, use the method for machine learning, train the sorter using individual character domain name to classify to user's sex.When needing to classify to the individual character domain name of unknown subscriber's sex, using this sorter, getting final product user's sex of the prediction of output.
Concrete steps are as follows.
Step one: the sample data collection gathering the corresponding relation of individual character domain name and user's sex, analyzes the part that in individual character domain name, user specifies.
Step 2: calculate described sample data and concentrate ratio shared by the male sex.
Step 3: get the part that in property domain name one by one, user specifies, be designated as character string one, records respective user sex simultaneously.
Step 4: the length of character string one is designated as k, the frequency of occurrences of all 1-gram in statistics character string one, the frequency of occurrences of 2-gram, the frequency of occurrences of 3-gram until the frequency of occurrences of k-gram (k represents the length of character string, it can be the integer of more than 1 or 1, the value upper limit of k), corresponding n-gram(n being represented the length of character string, can be 1 to k) the frequency of occurrences by respective user sex add up.
Step 5: repeat step 3, until the sample data collection gathered in step one has traveled through.
Step 6: calculate the probability of user's sex and the occurrence number of this n-gram corresponding to the n-gram that occurred of institute, the probability of statistical sample data centralization different sexes appearance simultaneously, jointly as the parameter of sorter.
Step 7: To use the classifier on a personalized domain name whose user's gender is unknown, extract the user-specified part, denote its length as k, obtain its 1-grams through its k-gram, and calculate the probability that the user is male using the following formula:
In this formula, url denotes the user-specified part of the personalized domain name; substr(url, j, i) denotes the substring of url of length i that starts at the j-th character; n denotes the number of substrings substr(url, j, i); and w_h denotes the weight of the letter combination.
Step 8: If the probability calculated in Step 7 is greater than the proportion of males in the sample data set calculated in Step 2, classify the personalized domain name as belonging to a male user; otherwise, classify it as belonging to a female user.
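The substring notation substr(url, j, i) used in the n-gram counting and classification steps can be illustrated with a minimal Python sketch. The function name `ngrams` is hypothetical and not part of the patent; the sketch only assumes that j is a 1-based start position and i is the substring length, as the definitions above state.

```python
def ngrams(s):
    """Yield every contiguous substring of s, shortest first.

    Mirrors substr(url, j, i): the length i runs from 1 to k = len(s),
    and the 1-based start position j runs from 1 to k - i + 1.
    """
    k = len(s)
    for i in range(1, k + 1):          # n-gram length
        for j in range(1, k - i + 2):  # 1-based start position
            yield s[j - 1:j - 1 + i]


grams = list(ngrams("eleven"))
print(len(grams))  # a string of length k has k * (k + 1) // 2 substrings
print(grams[:6])   # the six 1-grams, in order of position
```

Because the number of substrings grows quadratically with k, an implementation might cap the n-gram length for long names; the patent text itself counts up to the k-gram.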
Embodiment 4 of the invention is described below with reference to the accompanying drawings.
This embodiment provides a user gender analysis method; the specific flow is as follows:
Step 1: Collect the following three personalized domain names: http://weibo.com/nickleave, http://weibo.com/inferpku, http://t.qq.com/bankofdota. Their user-specified parts are nickleave, inferpku, and bankofdota respectively. In this example, the system obtains information from the service providers of weibo.com and t.qq.com through business partnerships, and learns that the user corresponding to nickleave is female and that the users corresponding to inferpku and bankofdota are male.
Step 2: Calculate the proportion of males among all samples. Step 1 collected three samples in total: nickleave (female), inferpku (male), and bankofdota (male). Males therefore account for 2/3 of the three samples.
Step 3: Take nickleave, female.
Step 4: Add the 1-grams, 2-grams, 3-grams, and so on up to the 9-gram of nickleave to the female counts; the statistics are shown in Table 1. For brevity, the tables below list only the 1-gram, 2-gram, and 3-gram cases.
Table 1
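The counts that Step 4 accumulates for nickleave can be re-derived with a short Python sketch. The names `count_ngrams` and `max_n` are illustrative and not taken from the patent; `max_n=3` is used only to match the 1-gram to 3-gram range that the tables list.

```python
from collections import Counter


def count_ngrams(s, max_n=None):
    """Count occurrences of every n-gram of s for n = 1..max_n (default: len(s))."""
    k = len(s)
    max_n = k if max_n is None else min(max_n, k)
    counts = Counter()
    for n in range(1, max_n + 1):
        for start in range(k - n + 1):
            counts[s[start:start + n]] += 1
    return counts


# nickleave belongs to a female user, so these counts go to the female column.
female_counts = count_ngrams("nickleave", max_n=3)
for gram in ["n", "e", "ni", "le", "nic", "ave"]:
    print(gram, female_counts[gram])
```

The letter e is the only n-gram that occurs twice in nickleave; every other 1-gram, 2-gram, and 3-gram occurs once.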
Step 5: Repeat the above process for inferpku. After accumulation, the results in Table 1 are updated to Table 2:
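Table 2

| 1-gram | Male frequency | Female frequency | 2-gram | Male frequency | Female frequency | 3-gram | Male frequency | Female frequency |
|---|---|---|---|---|---|---|---|---|
| n | 1 | 1 | ni | 0 | 1 | nic | 0 | 1 |
| i | 1 | 1 | ic | 0 | 1 | ick | 0 | 1 |
| c | 0 | 1 | ck | 0 | 1 | ckl | 0 | 1 |
| k | 1 | 1 | kl | 0 | 1 | kle | 0 | 1 |
| l | 0 | 1 | le | 0 | 1 | lea | 0 | 1 |
| e | 1 | 2 | ea | 0 | 1 | eav | 0 | 1 |
| a | 0 | 1 | av | 0 | 1 | ave | 0 | 1 |
| v | 0 | 1 | ve | 0 | 1 | inf | 1 | 0 |
| f | 1 | 0 | in | 1 | 0 | nfe | 1 | 0 |
| r | 1 | 0 | nf | 1 | 0 | fer | 1 | 0 |
| p | 1 | 0 | fe | 1 | 0 | erp | 1 | 0 |
| u | 1 | 0 | er | 1 | 0 | rpk | 1 | 0 |
|   |   |   | rp | 1 | 0 | pku | 1 | 0 |
|   |   |   | pk | 1 | 0 |   |   |   |
|   |   |   | ku | 1 | 0 |   |   |   |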
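The per-gender accumulation behind Table 2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: `count_ngrams`, `freq`, and the gender labels "male" and "female" are names assumed for the example.

```python
from collections import Counter, defaultdict


def count_ngrams(s):
    """Count every contiguous substring (n-gram) of s, for n = 1..len(s)."""
    k = len(s)
    return Counter(s[a:a + n] for n in range(1, k + 1) for a in range(k - n + 1))


# The first two samples of the worked example.
samples = [("nickleave", "female"), ("inferpku", "male")]

freq = defaultdict(Counter)  # gender -> n-gram -> accumulated frequency
for name, gender in samples:
    freq[gender].update(count_ngrams(name))

# Print a few rows as: n-gram, male frequency, female frequency.
for gram in ["n", "e", "ve", "in", "pku"]:
    print(gram, freq["male"][gram], freq["female"][gram])
```

A `Counter` returns 0 for an n-gram it has never seen, which matches the zero entries in the table without storing them.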
The same process is then repeated for bankofdota. After accumulation, Table 2 is updated to Table 3:
Table 3
Step 6: For every n-gram that has occurred, calculate the probability of each user gender and the number of occurrences of that n-gram. The probability that the user corresponding to an n-gram is male (the male probability in the table below) is calculated as:
$$P(\text{n-gram corresponds to male}) = \frac{\text{male frequency of the n-gram}}{\text{male frequency of the n-gram} + \text{female frequency of the n-gram}}$$
For example, the male probability of the 1-gram n in Table 4 is calculated from the male frequency 2 and the female frequency 1 of the 1-gram n in Table 3, that is, 2 / (2 + 1) = 0.666667.
For brevity, Table 4 lists only the 1-gram, 2-gram, and 3-gram cases.
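Table 4

| 1-gram | Male probability | 2-gram | Male probability | 3-gram | Male probability |
|---|---|---|---|---|---|
| n | 0.666667 | ni | 0 | nic | 0 |
| i | 0.5 | ic | 0 | ick | 0 |
| c | 0 | ck | 0 | ckl | 0 |
| k | 0.666667 | kl | 0 | kle | 0 |
| l | 0 | le | 0 | lea | 0 |
| e | 0.333333 | ea | 0 | eav | 0 |
| a | 0.666667 | av | 0 | ave | 0 |
| v | 0 | ve | 0 | inf | 1 |
| f | 1 | in | 1 | nfe | 1 |
| r | 1 | nf | 1 | fer | 1 |
| p | 1 | fe | 1 | erp | 1 |
| u | 1 | er | 1 | rpk | 1 |
| b | 1 | rp | 1 | pku | 1 |
| o | 1 | pk | 1 | ban | 1 |
| d | 1 | ku | 1 | ank | 1 |
| t | 1 | ba | 1 | nko | 1 |
|   |   | an | 1 | kof | 1 |
|   |   | nk | 1 | ofd | 1 |
|   |   | ko | 1 | fdo | 1 |
|   |   | of | 1 | dot | 1 |
|   |   | fd | 1 | ota | 1 |
|   |   | do | 1 |   |   |
|   |   | ot | 1 |   |   |
|   |   | ta | 1 |   |   |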
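The male probabilities in Table 4 follow from the accumulated frequencies of all three samples, as the Python sketch below shows. The function names `count_ngrams` and `train` and the dictionary `p_male` are hypothetical; the ratio itself is the male frequency divided by the sum of male and female frequencies, as in the example for the 1-gram n.

```python
from collections import Counter


def count_ngrams(s):
    """Count every contiguous substring (n-gram) of s, for n = 1..len(s)."""
    k = len(s)
    return Counter(s[a:a + n] for n in range(1, k + 1) for a in range(k - n + 1))


def train(samples):
    """Return male counts, female counts, and the male probability of each n-gram."""
    male, female = Counter(), Counter()
    for name, gender in samples:
        (male if gender == "male" else female).update(count_ngrams(name))
    p_male = {g: male[g] / (male[g] + female[g])
              for g in male.keys() | female.keys()}
    return male, female, p_male


samples = [("nickleave", "female"), ("inferpku", "male"), ("bankofdota", "male")]
male, female, p_male = train(samples)

for gram in ["n", "i", "e", "a", "ve", "ban"]:
    print(gram, round(p_male[gram], 6))
```

Every n-gram in `p_male` has occurred at least once, so the denominator is never zero.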
Step 7: Suppose the personalized domain name to be classified is www.renren.com/eleven, whose user-specified part is eleven. The n-grams of eleven that appear in the frequency table are e (three times), l (once), v (once), n (once), le (once), and ve (once). In the third gender frequency table above (Table 3), these n-grams occur as e (three times), l (once), v (once), n (three times), le (once), and ve (once), so the n-grams of the user name eleven occur 10 times in total in the gender frequency table. The weight w_h of each letter or letter combination is calculated from these counts. Substituting the above values into the formula gives:
$$P = \frac{0.33 \times 3 + 0 \times 1 + 0 \times 1 + 0.67 \times 1 + 0 \times 1 + 0 \times 1}{10} = \frac{1.66}{10} = 0.166$$
Step 8: Since the probability 0.166 that the user corresponding to eleven is male, calculated in Step 7, is less than the proportion of males in the sample, 0.67 (Step 2), www.renren.com/eleven is classified as belonging to a female user.
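The worked example for eleven can be reproduced with the Python sketch below. The weighting used here is an assumption chosen to match the arithmetic of the example, not a formula confirmed by the text: each known n-gram's male probability is multiplied by its number of occurrences inside the name, and the sum is divided by the total number of times those n-grams occur in the frequency table. All function and variable names are hypothetical.

```python
from collections import Counter


def count_ngrams(s):
    """Count every contiguous substring (n-gram) of s, for n = 1..len(s)."""
    k = len(s)
    return Counter(s[a:a + n] for n in range(1, k + 1) for a in range(k - n + 1))


samples = [("nickleave", "female"), ("inferpku", "male"), ("bankofdota", "male")]

male, female = Counter(), Counter()
for name, gender in samples:
    (male if gender == "male" else female).update(count_ngrams(name))

# Proportion of males in the sample data set (Step 2).
male_ratio = sum(gender == "male" for _, gender in samples) / len(samples)


def male_probability(name):
    """Weighted male probability of name; None if none of its n-grams was seen."""
    in_name = count_ngrams(name)
    known = [g for g in in_name if male[g] + female[g] > 0]
    if not known:
        return None
    # Assumed weighting: occurrences inside `name` times the male probability.
    numerator = sum(in_name[g] * male[g] / (male[g] + female[g]) for g in known)
    # Assumed normalisation: total table frequency of the known n-grams.
    denominator = sum(male[g] + female[g] for g in known)
    return numerator / denominator


p = male_probability("eleven")
verdict = "male" if p > male_ratio else "female"
print(round(p, 3), round(male_ratio, 3), verdict)
```

With exact fractions the score is 1/6, about 0.167; the value 0.166 in the text results from rounding the male probabilities to 0.33 and 0.67 before summing. Either way the score is below the male proportion of 2/3, so the name is classified as female.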
The embodiments of the invention provide a user gender analysis method and device. A sample data set is collected that comprises multiple pairs of a user's personalized domain name and the corresponding user gender. For the personalized domain names in the sample data set, the probability with which each letter at each position, and each letter combination over several adjacent positions, occurs for each gender is counted. With these probabilities as reference parameters, the personalized domain name of a user of unknown gender is analyzed to determine that user's gender. This achieves user gender analysis based on an automated algorithm that is more flexible and accurate, and solves the problem that existing analysis approaches are unsuitable where the personalized domain name is only weakly associated with the user's name.
The technical solution provided by the embodiments uses an automated algorithm and avoids dependence on a name database. Existing analysis approaches depend on the user's name and are therefore unsuitable for cases, such as personalized domain names, that are only weakly associated with a name; the technical solution provided by the embodiments does not have this problem. In addition, by analyzing personalized domain names, the embodiments can be used in broader applications such as display advertising optimization.
A person of ordinary skill in the art will appreciate that all or some of the steps of the above embodiments may be implemented as a computer program flow. The computer program may be stored in a computer-readable storage medium and executed on a corresponding hardware platform (such as a system, unit, or device); when executed, it performs one of the steps of the method embodiment or a combination thereof.
Alternatively, all or some of the steps of the above embodiments may be implemented with integrated circuits: the steps may each be made into an individual integrated circuit module, or several of the modules or steps may be made into a single integrated circuit module. The invention is thus not limited to any specific combination of hardware and software.
Each device, functional module, or functional unit in the above embodiments may be implemented with general-purpose computing devices; they may be concentrated on a single computing device or distributed over a network of multiple computing devices.
When a device, functional module, or functional unit in the above embodiments is implemented as a software functional module and sold or used as a stand-alone product, it may be stored in a computer-readable storage medium. The computer-readable storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the protection scope of the invention. The protection scope of the invention shall therefore be defined by the claims.

Claims (12)

1. A user gender analysis method, characterized by comprising:
collecting a sample data set, the sample data set comprising multiple pairs of a user's personalized domain name and the corresponding user gender;
counting, by gender, the probability of occurrence of the different letters at each position and of the different letter combinations over several adjacent positions in the personalized domain names of the sample data set;
using the proportion of males in the sample data set and said probabilities as reference parameters, analyzing the personalized domain name of a user of unknown gender and determining that user's gender.
2. The user gender analysis method according to claim 1, characterized in that, before the step of counting, by gender, the probability of occurrence of the different letters at each position and of the different letter combinations over several adjacent positions in the personalized domain names of the sample data set, the method further comprises:
calculating the proportion of males in the sample data set.
3. The user gender analysis method according to claim 1, characterized in that counting, by gender, the probability of occurrence of the letters at each position and of the letter combinations over several adjacent positions in the personalized domain names of the sample data set comprises:
Step a: taking the user-specified part of one personalized domain name and recording the user gender corresponding to that personalized domain name;
Step b: counting the number of occurrences of the letter at each position of the specified part and/or the number of occurrences of the different letter combinations over several adjacent positions;
Step c: applying Steps a and b to all personalized domain names in the sample data set until the sample data set has been traversed;
Step d: counting, for each gender, the number of occurrences of the letter at each position of the personalized domain names and/or the number of occurrences of the letter combinations over several adjacent positions, and calculating the probability with which the letter at each position and/or the letter combination over several adjacent positions occurs for each gender.
4. The user gender analysis method according to claim 3, characterized in that counting, for each gender, the number of occurrences of the letter at each position of the personalized domain names and/or the number of occurrences of the letter combinations over several adjacent positions, and calculating the probability with which the letter at each position and/or the letter combination over several adjacent positions occurs for each gender, is specifically:
according to the expression
$$P(\text{n-gram corresponds to male}) = \frac{\text{male frequency of the n-gram}}{\text{male frequency of the n-gram} + \text{female frequency of the n-gram}}$$
calculating, for each letter at each position and each letter combination over several adjacent positions, the probability that it corresponds to a male; wherein P(n-gram corresponds to male), on the left-hand side of the equation, is the probability that a letter combination of length n over adjacent positions corresponds to a male and, when n is 1, is the probability that the letter at a single position corresponds to a male; on the right-hand side of the equation, the male frequency of the n-gram is the number of times the letter at a single position, or the letter combination of length n over adjacent positions, corresponds to a male, and the female frequency of the n-gram is the number of times it corresponds to a female.
5. The user gender analysis method according to claim 1, characterized in that using said probabilities as reference parameters, analyzing the personalized domain name of a user of unknown gender, and determining that user's gender comprises:
Step a: obtaining the length of the personalized domain name of the user of unknown gender, denoted k;
Step b: according to the expression
calculating the probability that the user is male, wherein url denotes the user-specified part of the personalized domain name; substr(url, j, i) denotes the substring formed by the letter combination over the i adjacent positions starting at the j-th character of url and, when i is 1, the substring formed by the letter at a single position; n denotes the number of substrings substr(url, j, i); w_h denotes the weight of the letter or letter combination; and P(substr(url, j, i) corresponds to male in the sample data set) denotes the male probability of the letter or letter combination in that substring;
Step c: comparing the result calculated in Step b with the proportion of males in the sample data set;
Step d: when the result calculated in Step b is greater than or equal to the proportion in Step c, determining that the user of unknown gender is male.
6. The user gender analysis method according to claim 5, characterized in that, after Step d, the method further comprises:
Step e: when the result calculated in Step b is less than the proportion in Step c, determining that the user of unknown gender is female.
7. A user gender analysis device, characterized by comprising:
a sampling module, configured to collect a sample data set, the sample data set comprising multiple pairs of a user's personalized domain name and the corresponding user gender;
a reference parameter computing module, configured to count, by gender, the probability of occurrence of the different letters at each position and of the different letter combinations over several adjacent positions in the personalized domain names of the sample data set;
an analysis module, configured to use the proportion of males in the sample data set and said probabilities as reference parameters, analyze the personalized domain name of a user of unknown gender, and determine that user's gender.
8. The user gender analysis device according to claim 7, characterized in that the device further comprises:
a reference ratio computing module, configured to calculate the proportion of males in the sample data set.
9. The user gender analysis device according to claim 8, characterized in that the reference parameter computing module comprises:
a gender extraction unit, configured to take the user-specified part of one personalized domain name and record the user gender corresponding to that personalized domain name;
a counting unit, configured to count the number of occurrences of the letter at each position of the specified part and/or the number of occurrences of the different letter combinations over several adjacent positions;
a statistics unit, configured to apply the processing of the counting unit to all personalized domain names in the sample data set until the sample data set has been traversed, to count, for each gender, the number of occurrences of the letter at each position of the personalized domain names and/or the number of occurrences of the letter combinations over several adjacent positions, and to calculate the probability with which the letter at each position and/or the letter combination over several adjacent positions occurs for each gender.
10. The user gender analysis device according to claim 9, characterized in that the statistics unit calculating the probability with which the letter at each position and/or the letter combination over several adjacent positions occurs for each gender is specifically:
according to the expression
$$P(\text{n-gram corresponds to male}) = \frac{\text{male frequency of the n-gram}}{\text{male frequency of the n-gram} + \text{female frequency of the n-gram}}$$
calculating the male probability of each letter at each position and of each letter combination over several adjacent positions, wherein P(n-gram corresponds to male) is the probability that a letter combination of length n over adjacent positions corresponds to a male and, when n is 1, is the probability that a letter at a single position corresponds to a male; the male frequency of the n-gram is the number of times a letter at a single position, or a letter combination of length n over adjacent positions, corresponds to a male; and the female frequency of the n-gram is the number of times it corresponds to a female.
11. The user gender analysis device according to claim 8, characterized in that the analysis module comprises:
a domain name length acquiring unit, configured to obtain the length of the personalized domain name of the user of unknown gender, denoted k;
a probability calculation unit, configured to, according to the expression
calculate the probability that the user is male, wherein url denotes the user-specified part of the personalized domain name; substr(url, j, i) denotes the substring formed by the i adjacent characters starting at the j-th character of url; n denotes the number of substrings substr(url, j, i); w_h denotes the weight of the letter or letter combination; and P(substr(url, j, i) corresponds to male in the sample data set) denotes the male probability of the letter or letter combination in the substring formed by the j-th character of url, or by the i adjacent characters starting at the j-th character;
a comparing unit, configured to compare the result calculated by the probability calculation unit with the proportion calculated by the reference ratio computing module;
a determining unit, configured to determine that the user of unknown gender is male when the comparison by the comparing unit shows that the result calculated by the probability calculation unit is greater than or equal to the proportion calculated by the reference ratio computing module.
12. The user gender analysis device according to claim 11, characterized in that
the determining unit is further configured to determine that the user of unknown gender is female when the comparison by the comparing unit shows that the result calculated by the probability calculation unit is less than the proportion calculated by the reference ratio computing module.
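The module decomposition of claims 7 to 12 can be pictured with the following Python sketch. It is an illustration under stated assumptions, not the claimed device: the class and method names are hypothetical, the scoring reuses the weighting that reproduces the description's worked example, and returning None for a name with no known n-grams is a choice made for this sketch. The comparison uses greater than or equal to, as in claims 5 and 11.

```python
from collections import Counter


def count_ngrams(s):
    """Count every contiguous substring (n-gram) of s, for n = 1..len(s)."""
    k = len(s)
    return Counter(s[a:a + n] for n in range(1, k + 1) for a in range(k - n + 1))


class ReferenceRatioModule:
    """Claim 8: proportion of males in the sample data set."""

    def compute(self, samples):
        return sum(gender == "male" for _, gender in samples) / len(samples)


class ReferenceParameterModule:
    """Claims 7, 9, 10: per-gender n-gram frequencies."""

    def compute(self, samples):
        male, female = Counter(), Counter()
        for name, gender in samples:
            (male if gender == "male" else female).update(count_ngrams(name))
        return male, female


class AnalysisModule:
    """Claims 11, 12: score a name and compare it with the male proportion."""

    def __init__(self, male, female, male_ratio):
        self.male, self.female, self.male_ratio = male, female, male_ratio

    def score(self, name):
        in_name = count_ngrams(name)
        total = {g: self.male[g] + self.female[g] for g in in_name}
        known = [g for g in in_name if total[g] > 0]
        if not known:
            return None  # assumption: no decision without any known n-gram
        num = sum(in_name[g] * self.male[g] / total[g] for g in known)
        return num / sum(total[g] for g in known)

    def judge(self, name):
        p = self.score(name)
        if p is None:
            return None
        return "male" if p >= self.male_ratio else "female"


samples = [("nickleave", "female"), ("inferpku", "male"), ("bankofdota", "male")]
male, female = ReferenceParameterModule().compute(samples)
analysis = AnalysisModule(male, female, ReferenceRatioModule().compute(samples))
print(analysis.judge("eleven"), analysis.judge("pku"))
```

Keeping the reference ratio, the reference parameters, and the analysis in separate classes mirrors the claimed modules and lets the training data be replaced without touching the decision rule.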
CN201310526980.4A 2013-10-30 2013-10-30 User's gender analysis method and apparatus Active CN104598452B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310526980.4A CN104598452B (en) 2013-10-30 2013-10-30 User's gender analysis method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310526980.4A CN104598452B (en) 2013-10-30 2013-10-30 User's gender analysis method and apparatus

Publications (2)

Publication Number Publication Date
CN104598452A true CN104598452A (en) 2015-05-06
CN104598452B CN104598452B (en) 2018-09-11

Family

ID=53124251

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310526980.4A Active CN104598452B (en) 2013-10-30 2013-10-30 User's gender analysis method and apparatus

Country Status (1)

Country Link
CN (1) CN104598452B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105809557A (en) * 2016-03-15 2016-07-27 微梦创科网络科技(中国)有限公司 Method and device for mining genders of users in social network
CN106656943A (en) * 2015-11-03 2017-05-10 秒针信息技术有限公司 Network user attribute matching method and device
CN106844687A (en) * 2017-01-23 2017-06-13 炫彩互动网络科技有限公司 A kind of method and system that user's sex is determined based on games log
CN107357782A (en) * 2017-06-29 2017-11-17 深圳市金立通信设备有限公司 One kind identification user's property method for distinguishing and terminal
CN111309913A (en) * 2020-02-26 2020-06-19 北京慧博科技有限公司 Method for analyzing gender by name

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1987852A (en) * 2005-12-21 2007-06-27 腾讯科技(深圳)有限公司 Method and device for determining communication object attribute according to news content
US20090043721A1 (en) * 2007-08-10 2009-02-12 Microsoft Corporation Domain name geometrical classification using character-based n-grams
US20090043720A1 (en) * 2007-08-10 2009-02-12 Microsoft Corporation Domain name statistical classification using character-based n-grams
CN103164470A (en) * 2011-12-15 2013-06-19 盛大计算机(上海)有限公司 Directional application method based on user gender distinguished results and system thereof

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106656943A (en) * 2015-11-03 2017-05-10 秒针信息技术有限公司 Network user attribute matching method and device
CN106656943B (en) * 2015-11-03 2019-09-17 秒针信息技术有限公司 A kind of matching process and device of network user's attribute
CN105809557A (en) * 2016-03-15 2016-07-27 微梦创科网络科技(中国)有限公司 Method and device for mining genders of users in social network
CN106844687A (en) * 2017-01-23 2017-06-13 炫彩互动网络科技有限公司 A kind of method and system that user's sex is determined based on games log
CN107357782A (en) * 2017-06-29 2017-11-17 深圳市金立通信设备有限公司 One kind identification user's property method for distinguishing and terminal
CN107357782B (en) * 2017-06-29 2020-12-18 深圳市金立通信设备有限公司 Method and terminal for identifying gender of user
CN111309913A (en) * 2020-02-26 2020-06-19 北京慧博科技有限公司 Method for analyzing gender by name

Also Published As

Publication number Publication date
CN104598452B (en) 2018-09-11

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20150506

Assignee: Beijing Interactive Technology Co., Ltd.

Assignor: Beijing Sibotu Information Technology Co., Ltd.

Contract record no.: 2015110000019

Denomination of invention: Method and device for analyzing user gender

License type: Exclusive License

Record date: 20150603

LICC Enforcement, change and cancellation of record of contracts on the licence for exploitation of a patent or utility model
EC01 Cancellation of recordation of patent licensing contract

Assignee: Beijing Interactive Technology Co., Ltd.

Assignor: Miaozhen Information Technology Co., Ltd.

Contract record no.: 2015110000019

Date of cancellation: 20160426

EM01 Change of recordation of patent licensing contract

Change date: 20160426

Contract record no.: 2015110000019

Assignor after: Miaozhen Information Technology Co., Ltd.

Assignor before: Beijing Sibotu Information Technology Co., Ltd.

LICC Enforcement, change and cancellation of record of contracts on the licence for exploitation of a patent or utility model
CB02 Change of applicant information

Address after: 100102 Beijing, Chaoyang District Fu Tong East Street, building 1, room 5, room 321008

Applicant after: Miaozhen Information Technology Co., Ltd.

Address before: Beijing City, a small town east of Changping District road 102218 in No. 398 Coal Construction Group No. 1 building, 4 floor second hand system

Applicant before: Beijing Sibotu Information Technology Co., Ltd.

COR Change of bibliographic data
GR01 Patent grant