CN102999496A - Method for establishing a demand analysis template, and method and device for search demand recognition

Info

Publication number
CN102999496A
Authority
CN
China
Prior art keywords
query
demand
template
seed
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011102667995A
Other languages
Chinese (zh)
Inventor
黄际洲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN2011102667995A
Publication of CN102999496A

Abstract

The invention provides a method for establishing demand analysis templates, and a method and device for search demand recognition. The method for establishing demand analysis templates comprises: acquiring a seed query set for a demand type; determining all n-grams (n-word phrases) of the seed query set, where n is one or more preset positive integers; based on the counted number of occurrences of each n-gram in the seed query set of the demand type, replacing the N1 least frequent n-grams in each seed query with a wildcard to obtain candidate demand analysis templates, where N1 is a preset positive integer; and scoring the confidence of each candidate demand analysis template and selecting the N2 candidates with the highest confidence scores as the demand analysis templates of the demand type, where N2 is a preset positive integer. The method saves labor cost, broadens applicability, and improves recall and recognition accuracy.

Description

Method for establishing a demand analysis template, and method and device for search demand recognition
[Technical Field]
The present invention relates to the field of computer technology, and in particular to a method for establishing a demand analysis template and a method and device for search demand recognition.
[Background]
With the rapid development and maturing of the Internet worldwide, information resources on the network keep growing and the volume of data is expanding rapidly; obtaining information through search engines has become the main way people obtain information today. Providing users with more convenient and more accurate query services is the current and future direction of search engine technology.
In search engine technology, recognizing the user's search demand is an important part of improving search accuracy and effectiveness, and its effect is especially significant in structured search (i.e. vertical search). Analysis shows that when users express a search demand through a query, their wording usually follows certain regular patterns. For example, when looking for novels, a user may enter queries such as "novel where the male lead is very handsome", "novel where the male lead is campy", or "novel where the male lead is a martial arts master". These queries follow one specific expression pattern, namely "novel where the male lead [W+]", where [W+] is a wildcard. If this expression pattern is taken as one pattern of the novel demand, then queries such as "novel where the male lead is unlucky in love" and "novel where the male lead is very manly" are easily recognized as novel demand. This gave rise to demand recognition based on demand analysis templates. Existing demand analysis templates, however, are written by hand from observation of common queries, which has the following defects:
Defect 1: the number of demand analysis templates is small. Manually summarized templates generally number only in the hundreds, so their applicability is narrow.
Defect 2: manual involvement consumes labor cost.
Defect 3: recall is low. Manually summarized templates often differ in form from the queries users actually enter, and this mismatch leaves the demand type of many queries unrecognized.
Defect 4: recognition accuracy is low. The accuracy of manually summarized and written templates is hard to check and guarantee comprehensively. For example, someone who observes the picture-demand queries "desktop background" and "windows desktop background" may write the template "[W+] background". When this template is used to recognize queries with picture demand, it causes many errors, for example wrongly recognizing queries without picture demand such as "Yao Jiaxin's background" and "father's background" as picture demand.
[Summary of the Invention]
The invention provides a method for establishing a demand analysis template and a method and device for search demand recognition, so as to save labor cost, broaden applicability, and improve recognition accuracy.
The specific technical solution is as follows:
A method for establishing a demand analysis template, in which the following steps are performed for each preset demand type:
S1: obtain the seed query set of the demand type;
S2: determine all n-grams (n-word phrases) of the seed query set, where n is one or more preset positive integers;
S3: according to the counted number of occurrences of each n-gram in the seed query set of the demand type, replace the N1 least frequent n-grams in each seed query of the seed query set with a wildcard to obtain candidate demand analysis templates, where N1 is a preset positive integer;
S4: score the confidence of each candidate demand analysis template, and select the N2 candidates with the highest confidence scores as the demand analysis templates of the demand type, where N2 is a preset positive integer.
According to a preferred embodiment of the invention, step S1 is carried out in one of the following ways:
Mode 1: from the search log of the vertical search for the demand type, obtain the queries whose search count is higher than a preset first threshold, to form the seed query set of the demand type; or,
Mode 2: from the web search log, obtain the queries for which a website of the demand type was clicked or a title containing a feature word of the demand type was clicked, and form the seed query set of the demand type from those obtained queries whose search count is higher than a preset second threshold; or,
Mode 3: take the intersection of the queries obtained in Mode 1 and the queries obtained in Mode 2 to obtain the seed query set of the demand type.
According to a preferred embodiment of the invention, in Mode 1 the weight of each obtained query is derived from the ratio of its click count to its search count; or,
in Mode 2 the weight of each obtained query is derived from the ratio of its click count to its search count; or,
in Mode 3 the weight of each query in the seed query set is obtained by the formula
$$\mathrm{query\_mi} = \frac{\mathrm{MIN}(\mathrm{click\_2},\ \mathrm{click\_1})^2}{\mathrm{search\_2} \times \mathrm{search\_1}}$$
where query_mi is the weight of the query, click_1 is the click count of the query in Mode 1, click_2 is the click count of the query in Mode 2, search_1 is the search count of the query in Mode 1, and search_2 is the search count of the query in Mode 2.
According to a preferred embodiment of the invention, in step S3 the granularity of the n-grams to be replaced with the wildcard is set in advance, and the step of replacing, in each seed query of the seed query set, either the N1 least frequent n-grams or the n-grams whose occurrence count is below a preset count threshold with the wildcard is performed at that granularity.
According to a preferred embodiment of the invention, step S3 further comprises, before the replacement is performed: replacing each n-gram that is a named entity in each query of the seed query set with a named entity type mark.
According to a preferred embodiment of the invention, scoring the confidence of each candidate demand analysis template in step S4 specifically comprises:
computing a weighted sum of the feature parameter values of a candidate demand analysis template to obtain its confidence score, where the feature parameters comprise at least one of the following:
the mean weight of all seed queries from which the candidate template was obtained; a score based on the number of fixed words the candidate template contains; a score based on whether the candidate template contains a named entity type mark; and a score based on the number of n-grams replaced in the candidate template.
A method for search demand recognition, comprising:
after a query to be recognized is received, matching the query against the demand analysis templates of each demand type, and determining the demand type of the successfully matched demand analysis template to be the demand type of the query;
where the demand analysis templates of each demand type are established by the above method for establishing a demand analysis template.
A device for establishing a demand analysis template, comprising:
a seed acquiring unit, configured to obtain the seed query set of a preset demand type;
a phrase determining unit, configured to determine all n-grams of the seed query set;
a candidate template determining unit, configured to replace, according to the counted number of occurrences of each n-gram in the seed query set of the demand type, the N1 least frequent n-grams in each seed query of the seed query set with a wildcard to obtain candidate demand analysis templates, where N1 is a preset positive integer;
a template selection unit, configured to score the confidence of each candidate demand analysis template and to select the N2 candidates with the highest confidence scores as the demand analysis templates of the demand type, where N2 is a preset positive integer.
According to a preferred embodiment of the invention, the seed acquiring unit obtains the seed query set in one of the following ways:
Mode 1: from the search log of the vertical search for the demand type, obtain the queries whose search count is higher than a preset first threshold, to form the seed query set of the demand type; or,
Mode 2: from the web search log, obtain the queries for which a website of the demand type was clicked or a title containing a feature word of the demand type was clicked, and form the seed query set of the demand type from those obtained queries whose search count is higher than a preset second threshold; or,
Mode 3: take the intersection of the queries obtained in Mode 1 and the queries obtained in Mode 2 to obtain the seed query set of the demand type.
According to a preferred embodiment of the invention, the device further comprises a weight determining unit, configured to derive, in Mode 1, the weight of each obtained query from the ratio of its click count to its search count; or,
to derive, in Mode 2, the weight of each obtained query from the ratio of its click count to its search count; or,
to obtain, in Mode 3, the weight of each query in the seed query set by the formula
$$\mathrm{query\_mi} = \frac{\mathrm{MIN}(\mathrm{click\_2},\ \mathrm{click\_1})^2}{\mathrm{search\_2} \times \mathrm{search\_1}}$$
where query_mi is the weight of the query, click_1 is the click count of the query in Mode 1, click_2 is the click count of the query in Mode 2, search_1 is the search count of the query in Mode 1, and search_2 is the search count of the query in Mode 2.
According to a preferred embodiment of the invention, the granularity of the n-grams to be replaced with the wildcard is set in advance, and the candidate template determining unit performs, at that granularity, the operation of replacing, in each seed query of the seed query set, either the N1 least frequent n-grams or the n-grams whose occurrence count is below a preset count threshold with the wildcard.
According to a preferred embodiment of the invention, before performing the replacement, the candidate template determining unit further replaces each n-gram that is a named entity in each query of the seed query set with a named entity type mark.
According to a preferred embodiment of the invention, the template selection unit computes a weighted sum of the feature parameter values of a candidate demand analysis template to obtain its confidence score, where the feature parameters comprise at least one of the following:
the mean weight of all seed queries from which the candidate template was obtained; a score based on the number of fixed words the candidate template contains; a score based on whether the candidate template contains a named entity type mark; and a score based on the number of n-grams replaced in the candidate template.
A device for search demand recognition, comprising:
a request acquiring unit, configured to receive a query to be recognized;
a template matching unit, configured to match the query to be recognized against the demand analysis templates of each demand type, and to determine the demand type of the successfully matched demand analysis template to be the demand type of the query;
where the demand analysis templates of each demand type are established by the above device for establishing a demand analysis template.
As can be seen from the above technical solution, the invention obtains the n-grams of the seed query set of a demand type, determines candidate demand analysis templates by occurrence-count-based wildcard replacement, and selects the final demand analysis templates of the demand type from the candidates according to their confidence scores. Demand analysis templates are thus mined automatically, which greatly saves labor cost. Increasing the number of seed queries in the seed query set allows demand analysis templates to be established in large numbers, which broadens applicability. Because the templates are mined from seed queries, they are largely consistent in form with common queries, which improves recall and recognition accuracy.
[Description of the Drawings]
Fig. 1 is a flowchart of the method for establishing a demand analysis template provided by Embodiment 1 of the invention;
Fig. 2 is a structural diagram of the device for establishing a demand analysis template provided by Embodiment 2 of the invention;
Fig. 3 is a structural diagram of the device for search demand recognition provided by Embodiment 3 of the invention;
Fig. 4 is an example of search demand recognition applied to general search ranking, provided by an embodiment of the invention;
Fig. 5 is an example of search demand recognition applied to vertical search, provided by an embodiment of the invention.
[Detailed Description]
To make the purpose, technical solution and advantages of the invention clearer, the invention is described below with reference to the drawings and specific embodiments.
In embodiments of the invention, the demand analysis templates of each demand type can be mined first; when demand recognition is performed on a query, the templates mined in advance are used to recognize the demand of the query.
The process of mining the demand analysis templates of each demand type is described in detail below through Embodiment 1.
Embodiment 1
Fig. 1 is a flowchart of the method for establishing a demand analysis template provided by Embodiment 1 of the invention. As shown in Fig. 1, establishing the demand analysis templates of one demand type comprises the following steps:
Step 101: obtain the seed query set of the demand type.
The seed query set of each demand type must first be prepared. These seed query sets can be configured manually, but because template mining usually needs a very large number of seed queries, manual labeling and review are too costly. Preferably, embodiments of the invention obtain the seed query set of a demand type automatically, in ways that include but are not limited to the following:
First way: from the search log of the vertical search for the demand type, obtain the queries whose search count is higher than a preset first threshold, to form the seed query set of the demand type.
The ratio of click count to search count of a seed query obtained in this way can be used to derive the weight of the seed query for the demand type; this weight is used later for ranking templates. The ratio can be used directly as the weight of the seed query for the demand type, or the ratio can be squared and used as the weight, and so on.
For example, queries whose search count exceeds a set threshold can be obtained from the search log of picture vertical search, together with the search count and click count of each query. An example of the seed queries obtained, with their search counts and click counts, is shown in Table 1.
Table 1
Seed query | Search count | Click count | Weight
Liu Dehua's wife | 10234 | 6590 | 0.6439
Zhang Xueyou's wife | 6842 | 4985 | 0.7286
Liu Ye's wife | 9527 | 6672 | 0.7003
Xu Fan's daughter | 20192 | 18652 | 0.9237
Fan Bingbing latest art picture collection | 23022 | 12244 | 0.5318
Zhang Ziyi art picture collection | 12021 | 7026 | 0.5845
Yang Mi latest art picture collection | 30801 | 9152 | 0.2971
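The weights in Table 1 can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `seed_weights`, the input layout (query mapped to a pair of search count and click count) and the threshold value 5000 are assumptions made for illustration and are not given in the text.

```python
from typing import Dict, Tuple


def seed_weights(
    stats: Dict[str, Tuple[int, int]],
    min_searches: int,
    square: bool = False,
) -> Dict[str, float]:
    """Return query -> weight for queries searched more often than min_searches.

    stats maps each query to (search_count, click_count). The weight is the
    click/search ratio, or its square when square=True (both variants are
    mentioned in the text).
    """
    weights = {}
    for query, (searches, clicks) in stats.items():
        if searches <= min_searches:
            continue  # below the first threshold: not a seed query
        ratio = clicks / searches
        weights[query] = ratio ** 2 if square else ratio
    return weights


# Three rows of Table 1; the threshold 5000 is a hypothetical value.
table1 = {
    "Liu Dehua's wife": (10234, 6590),
    "Xu Fan's daughter": (20192, 18652),
    "Yang Mi latest art picture collection": (30801, 9152),
}
weights = seed_weights(table1, min_searches=5000)
for query, weight in weights.items():
    print(f"{query}: {weight:.4f}")
```

Raising `min_searches` shrinks the seed set, which trades coverage for cleaner seeds.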
Second way: from the web search log, obtain the queries for which a website of the demand type was clicked or a title containing a feature word of the demand type was clicked, and form the seed query set of the demand type from those obtained queries whose search count is higher than a preset second threshold.
As before, the ratio of click count to search count of a seed query obtained in this way can be used to derive the weight of the seed query for the demand type; this weight is used later for ranking templates. Specifically, the ratio can be used directly as the weight of the seed query for the demand type, or the ratio can be squared and used as the weight, and so on.
For example, from the search log of ordinary web search, the queries for which a picture website was clicked or a web page title containing a picture feature word was clicked can be obtained, where picture feature words may be, for example, "picture group", "HD large picture", "picture", and so on. Queries whose search count is greater than the preset second threshold are then selected from them as seed queries, and the search count and click count of each seed query are obtained. An example of the seed queries obtained, with their search counts and click counts, is shown in Table 2.
Table 2
Seed query | Search count | Click count | Weight
Liu Dehua's wife | 75343 | 60873 | 0.8079
Zhang Xueyou's wife | 76932 | 52834 | 0.6878
Liu Ye's wife | 64859 | 48956 | 0.7548
Fan Bingbing latest art picture collection | 62534 | 44526 | 0.7120
Zhang Ziyi art picture collection | 76242 | 60109 | 0.7884
Yang Mi latest art picture collection | 92847 | 49628 | 0.5345
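The second way amounts to filtering a web search log by what was clicked. The sketch below assumes a hypothetical log layout with one `LogRecord` per search event; the record fields, the site names and the function `mode2_seeds` are illustrative assumptions, since the text does not describe the log format.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Sequence, Set, Tuple


class LogRecord(NamedTuple):
    """One search event (hypothetical layout)."""
    query: str
    clicked_site: str   # empty string when nothing was clicked
    clicked_title: str  # title of the clicked result, or empty


def mode2_seeds(
    log: Iterable[LogRecord],
    type_sites: Set[str],
    feature_words: Sequence[str],
    min_searches: int,
) -> Dict[str, Tuple[int, int]]:
    """Return query -> (search_count, click_count) for seed queries.

    A click counts when the clicked site belongs to the demand type or the
    clicked title contains one of the demand type's feature words.
    """
    searches: Counter = Counter()
    clicks: Counter = Counter()
    for rec in log:
        searches[rec.query] += 1
        on_type_site = rec.clicked_site in type_sites
        has_feature = any(w in rec.clicked_title for w in feature_words)
        if on_type_site or has_feature:
            clicks[rec.query] += 1
    # Keep clicked queries whose search count exceeds the second threshold.
    return {q: (searches[q], clicks[q]) for q in clicks if searches[q] > min_searches}


log = [
    LogRecord("Liu Ye's wife", "pics.example", "Liu Ye family"),
    LogRecord("Liu Ye's wife", "news.example", "Liu Ye HD large picture"),
    LogRecord("Liu Ye's wife", "", ""),
    LogRecord("weather today", "weather.example", "Forecast"),
    LogRecord("rare query", "pics.example", "x"),
]
seeds = mode2_seeds(log, {"pics.example"}, ["picture"], min_searches=1)
print(seeds)
```

The returned pairs can be turned into weights with the same click/search ratio used for the first way.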
Third way: obtain queries by each of the two ways above and take their intersection to obtain the seed query set.
The weight of a seed query obtained in this way can be calculated with the following formula:
$$\mathrm{query\_mi} = \frac{\mathrm{MIN}(\mathrm{click\_2},\ \mathrm{click\_1})^2}{\mathrm{search\_2} \times \mathrm{search\_1}}$$
where query_mi is the weight of the seed query, click_1 is the click count of the seed query in the first way, click_2 is the click count of the seed query in the second way, search_1 is the search count of the seed query in the first way, and search_2 is the search count of the seed query in the second way.
For example, the seed queries obtained, with their search counts and click counts, are shown in Table 3.
Table 3
Seed query | search_2 | click_2 | search_1 | click_1 | query_mi
Liu Dehua's wife | 75343 | 60873 | 10234 | 6590 | 0.0563
Zhang Xueyou's wife | 76932 | 52834 | 6842 | 4985 | 0.0472
Liu Ye's wife | 64859 | 48956 | 9527 | 6672 | 0.0720
Fan Bingbing latest art picture collection | 62534 | 44526 | 23022 | 12244 | 0.1041
Zhang Ziyi art picture collection | 76242 | 60109 | 12021 | 7026 | 0.0539
Yang Mi latest art picture collection | 92847 | 49628 | 30801 | 9152 | 0.0293
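As a check on the formula, the following sketch intersects two seed sets and computes query_mi for the queries they share. The function names and the input layout (query mapped to a pair of search count and click count) are assumptions for illustration; the numbers are taken from Tables 1 and 2.

```python
from typing import Dict, Tuple

Stats = Dict[str, Tuple[int, int]]  # query -> (search_count, click_count)


def query_mi(click_1: int, search_1: int, click_2: int, search_2: int) -> float:
    """MIN(click_2, click_1)^2 / (search_2 * search_1), as in the formula above."""
    return min(click_2, click_1) ** 2 / (search_2 * search_1)


def intersect_seeds(mode1: Stats, mode2: Stats) -> Dict[str, float]:
    """Keep queries found by both ways and weight them with query_mi."""
    return {
        q: query_mi(mode1[q][1], mode1[q][0], mode2[q][1], mode2[q][0])
        for q in mode1.keys() & mode2.keys()
    }


mode1 = {
    "Liu Dehua's wife": (10234, 6590),
    "Xu Fan's daughter": (20192, 18652),  # only in the vertical search log
    "Fan Bingbing latest art picture collection": (23022, 12244),
}
mode2 = {
    "Liu Dehua's wife": (75343, 60873),
    "Fan Bingbing latest art picture collection": (62534, 44526),
}
merged = intersect_seeds(mode1, mode2)
for query, weight in sorted(merged.items()):
    print(f"{query}: {weight:.4f}")
```

"Xu Fan's daughter" appears in Table 1 but not in Table 2, so it drops out of the intersection, which is consistent with its absence from Table 3.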
Step 102: determine all n-grams (n-word phrases) of the seed query set.
An n-gram is a sequence of n consecutive minimum-granularity words in their original order, where n is one or more preset positive integers. For example, for the query "Fan Bingbing latest art picture collection", if n is set to 1, 2, 3 and 4, the n-grams obtained are:
1-gram: Fan Bingbing, latest, art, picture, collection
2-gram: Fan Bingbing latest, latest art, art picture, picture collection
3-gram: Fan Bingbing latest art, latest art picture, art picture collection
4-gram: Fan Bingbing latest art picture, latest art picture collection
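The enumeration above can be sketched as follows. The code assumes each query has already been segmented into minimum-granularity words (the text does not say which word segmenter is used), and the helper names `ngrams` and `count_ngrams` are illustrative.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All runs of n consecutive tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(seed_queries: Iterable[Sequence[str]], sizes: Iterable[int]) -> Counter:
    """Occurrence count of every n-gram over the whole seed query set."""
    sizes = tuple(sizes)
    counts: Counter = Counter()
    for tokens in seed_queries:
        for n in sizes:
            counts.update(ngrams(tokens, n))
    return counts


# Pre-segmented query; the segmentation itself is assumed to be given.
query = ["Fan Bingbing", "latest", "art", "picture", "collection"]
for n in (1, 2, 3, 4):
    print(f"{n}-gram:", ", ".join(" ".join(g) for g in ngrams(query, n)))

counts = count_ngrams([query, ["Zhang Ziyi", "art", "picture", "collection"]], (1, 2))
```

The counts collected here are what Step 103 consults when it decides which n-grams to replace.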
Step 103: according to the counted number of occurrences of each n-gram in the seed query set of the demand type, replace with a wildcard, in each seed query of the seed query set, either the N1 least frequent n-grams or the n-grams whose occurrence count is below a preset count threshold, to obtain candidate demand analysis templates, where N1 is a preset positive integer.
In this step, the granularity of the n-grams to be replaced with the wildcard, i.e. the value of n, can be set in advance, as can the maximum number of n-grams replaced in one query, i.e. the value of N1.
Take the query "Fan Bingbing latest art picture collection" as an example. Suppose the replacement granularity is 1-gram and at most one 1-gram is replaced. Among the n-grams of this query at that granularity, the 1-gram "Fan Bingbing" has the lowest occurrence count, so replacing it with the wildcard gives the candidate demand analysis template: [W+] latest art picture collection.
If the replacement granularity is 2-gram and at most one 2-gram is replaced, then because the 2-gram "Fan Bingbing latest" has the lowest frequency, replacing it with the wildcard gives the candidate demand analysis template: [W+] art picture collection.
Taking the replacement of at most one 1-gram and of at most one 2-gram as examples, the candidate demand analysis templates that each seed query in Table 3 can generate are listed in Table 4.
Table 4
(Table 4 appears only as image BDA0000090226640000101 in the original publication.)
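The two replacements worked through in Step 103 can be sketched in Python as below. This is an illustration under assumptions, not the patented implementation: queries are pre-segmented token lists, ties between equally rare n-grams are broken by position, and overlapping n-grams are not replaced twice; the text does not specify any of these details, and the names `count_ngrams` and `make_template` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Sequence

WILDCARD = "[W+]"


def count_ngrams(seeds: Iterable[Sequence[str]], n: int) -> Counter:
    """Occurrence count of every n-gram of size n over the seed query set."""
    return Counter(tuple(q[i:i + n]) for q in seeds for i in range(len(q) - n + 1))


def make_template(tokens: Sequence[str], counts: Counter, n: int, n1: int) -> List[str]:
    """Replace up to n1 of the rarest n-grams of size n with the wildcard."""
    # (count, start) pairs sorted so the rarest n-gram comes first;
    # ties fall back to the leftmost position (an assumption).
    spans = sorted((counts[tuple(tokens[i:i + n])], i) for i in range(len(tokens) - n + 1))
    covered, starts = set(), set()
    for _, start in spans:
        if len(starts) == n1:
            break
        span = range(start, start + n)
        if covered.isdisjoint(span):  # skip n-grams overlapping a replaced one
            covered.update(span)
            starts.add(start)
    template, i = [], 0
    while i < len(tokens):
        if i in starts:
            template.append(WILDCARD)
            i += n
        else:
            template.append(tokens[i])
            i += 1
    return template


seeds = [
    ["Fan Bingbing", "latest", "art", "picture", "collection"],
    ["Zhang Ziyi", "art", "picture", "collection"],
    ["Yang Mi", "latest", "art", "picture", "collection"],
]
t1 = make_template(seeds[0], count_ngrams(seeds, 1), n=1, n1=1)
t2 = make_template(seeds[0], count_ngrams(seeds, 2), n=2, n1=1)
print(" ".join(t1))
print(" ".join(t2))
```

The alternative mentioned in Step 103, replacing every n-gram whose count falls below a preset threshold, would swap the "rarest first" selection for a simple count comparison.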
By setting the maximum number of replaceable n-grams and the replacement granularity, different kinds of candidate demand analysis templates can be obtained. In some templates, many words remain besides the wildcard, and these words by themselves express a specific meaning; such templates carry a low risk of semantic drift. An example is "[W+] art picture collection": this template matches fewer queries, but its accuracy is very high. In other templates, few words remain besides the wildcard; although these words can express some specific meaning, ambiguity may still arise, and such templates carry a high risk of semantic drift. An example is "[W+] wife": it matches more queries, but its accuracy may suffer. For instance, it matches "I love my wife", a query with no picture demand.
To address the accuracy problem of templates with a high risk of semantic drift, the queries can be pre-processed before candidate templates are generated. Specifically, before the N1 least frequent n-grams, or the n-grams whose occurrence count is below the preset count threshold, in each query of the seed query set are replaced with the wildcard, each n-gram that is a named entity in each query of the seed query set is replaced with a named entity type mark. For example, person names in seed queries are replaced with the mark "[name]", place names with the mark "[place name]", and so on, and candidate demand analysis templates are then generated by the method above. The resulting candidate demand analysis templates are shown in Table 5.
Table 5
Seed query | Pre-processed seed query | Candidate demand analysis template
Liu Dehua's wife | [name]'s wife | [name] [W+] wife
Zhang Xueyou's wife | [name]'s wife | [name] [W+] wife
Liu Ye's wife | [name]'s wife | [name] [W+] wife
Fan Bingbing latest art picture collection | [name] latest art picture collection | [name] [W+] art picture collection
Zhang Ziyi art picture collection | [name] art picture collection | [name] art picture collection
Yang Mi latest art picture collection | [name] latest art picture collection | [name] [W+] art picture collection
After this pre-processing, the candidate demand analysis templates are additionally constrained semantically by the named entity type marks, which lowers the risk of semantic drift. And because the wildcard is still present, the templates can still cover many different phrasings at recall time.
For example, the template "[name] [W+] wife" matches the query "Tan Yonglin's first wife", which raises recall for queries with picture demand, while accuracy is also safeguarded: a query without picture demand such as "my barbarous wife" does not match this template.
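The effect of the named entity marks can be illustrated with a small token-level matcher. Everything here is a sketch under assumptions: the lexicon lookup stands in for whatever named entity recognizer is actually used (the text does not name one), the queries are pre-segmented, and `tag_entities` and `matches` are hypothetical helper names.

```python
from typing import Dict, List, Sequence

WILDCARD = "[W+]"

# Hypothetical entity lexicon standing in for a real named entity recognizer.
ENTITY_LEXICON: Dict[str, str] = {
    "Liu Dehua": "[name]",
    "Tan Yonglin": "[name]",
    "Beijing": "[place name]",
}


def tag_entities(tokens: Sequence[str], lexicon: Dict[str, str] = ENTITY_LEXICON) -> List[str]:
    """Replace every token found in the lexicon with its entity type mark."""
    return [lexicon.get(tok, tok) for tok in tokens]


def matches(template: Sequence[str], tokens: Sequence[str]) -> bool:
    """True when the tagged tokens fit the template; [W+] spans one or more tokens."""
    if not template:
        return not tokens
    head, rest = template[0], template[1:]
    if head == WILDCARD:
        return any(matches(rest, tokens[k:]) for k in range(1, len(tokens) + 1))
    return bool(tokens) and tokens[0] == head and matches(rest, tokens[1:])


template = ["[name]", WILDCARD, "wife"]
hit = matches(template, tag_entities(["Tan Yonglin", "'s", "first", "wife"]))
miss = matches(template, tag_entities(["my", "barbarous", "wife"]))
print(hit, miss)
```

The second query fails at the first token because "my" is not a person name, which is the safeguard the paragraph above describes.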
Step 104: score the confidence of each candidate demand analysis template, and select the N2 candidates with the highest confidence scores as the demand analysis templates of the demand type, where N2 is a preset positive integer.
Some of the candidate demand analysis templates obtained may be wrong; adopting all of them would inevitably hurt the accuracy of demand recognition. The final demand analysis templates can therefore be selected by scoring the confidence of the candidates.
When scoring the confidence of a candidate demand analysis template, a weighted sum of one or more feature parameter values can be used, including but not limited to the following feature parameters:
1) The mutual information value (mi_avg) of the candidate demand analysis template. mi_avg is the mean weight of the seed queries from which the candidate template was obtained, that is:
$$\mathrm{mi\_avg} = \frac{\sum_{k=0}^{M} \mathrm{query\_mi}_k}{M}$$
where query_mi_k is the weight of the k-th seed query from which the candidate demand analysis template was obtained, and M is the number of such seed queries.
2) A score based on the number of fixed words the candidate template contains (term_score). In a candidate demand analysis template, all words other than the wildcard and named entity type marks are called fixed words. The more fixed words, the stronger the template's ability to discriminate the demand of a query. term_score is determined by the number of fixed words the candidate template contains: the more fixed words, the larger term_score. For example, term_score is 0.02 for one fixed word, 0.04 for two, 0.06 for three, and so on.
3) A score based on whether the candidate template contains a named entity type mark (ne_score). A candidate template that contains a named entity type mark has stronger demand recognition ability and higher accuracy. For example, ne_score is 0.1 when the template contains a named entity type mark and 0 when it does not.
4) A score based on the number of n-grams replaced in the candidate template (ngram_sub). The fewer n-grams replaced, the lower the risk of semantic drift, and hence the higher the score. For example, if the candidate template was obtained by replacing 1 n-gram in the query, the score is 0.09; for 2 replaced n-grams it is 0.08; for 3 it is 0.07; and so on. If a candidate template corresponds to several queries, it is scored according to the case in which the most n-grams were replaced.
If the above four parameter values are used, the confidence score (score) of a candidate demand analysis template can be computed as:
$$score = \lambda_1 \cdot mi\_avg + \lambda_2 \cdot term\_score + \lambda_3 \cdot ne\_score + \lambda_4 \cdot ngram\_sub$$
where λ1, λ2, λ3 and λ4 are preset weight coefficients, which can be empirical values or can be obtained by machine learning; for example, λ1, λ2, λ3 and λ4 can be 0.5, 0.15, 0.2 and 0.15 respectively.
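As an illustration only, the weighted sum over the four feature parameters can be sketched in Python. The function names are hypothetical, and the linear rules used for term_score and ngram_sub (0.02 per fixed word; 0.10 minus 0.01 per replaced n-gram, floored at 0) are assumptions that merely extend the example values given in the description; the default weights are the example values 0.5, 0.15, 0.2 and 0.15.

```python
def mi_avg(seed_weights):
    """Average weight of the seed queries that produced the template."""
    return sum(seed_weights) / len(seed_weights)


def term_score(fixed_word_count):
    # Assumption: 0.02 per fixed word, matching the examples 0.02 / 0.04 / 0.06.
    return 0.02 * fixed_word_count


def ne_score(has_entity_marker):
    # 0.1 if the template contains a named entity type marker, else 0.
    return 0.1 if has_entity_marker else 0.0


def ngram_sub(max_replaced):
    # Assumption: 1 -> 0.09, 2 -> 0.08, 3 -> 0.07, extended linearly and floored at 0.
    # max_replaced is the largest number of n-grams replaced in any source query.
    return max(0.0, 0.10 - 0.01 * max_replaced)


def confidence(seed_weights, fixed_word_count, has_entity_marker, max_replaced,
               lambdas=(0.5, 0.15, 0.2, 0.15)):
    """Weighted sum of the four feature parameter values."""
    l1, l2, l3, l4 = lambdas
    return (l1 * mi_avg(seed_weights)
            + l2 * term_score(fixed_word_count)
            + l3 * ne_score(has_entity_marker)
            + l4 * ngram_sub(max_replaced))
```

In this sketch a template with more fixed words, a named entity type marker and fewer replaced n-grams always scores at least as high as one without them, which is the ordering the description asks for.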
For example, the final scores of the candidate demand analysis templates shown in Table 5 can be as shown in Table 6.
Table 6
[Table 6 is provided as an image in the original publication: Figure BDA0000090226640000131]
After the confidence score of each candidate demand analysis template has been calculated in this way, the candidate demand analysis templates are sorted by score, and the N2 highest-scoring templates are selected as the final demand analysis templates of this demand type.
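The selection step can be sketched as follows, assuming the scores are already available as a mapping from template string to confidence score. The function name and the data layout are hypothetical.

```python
import heapq


def select_templates(scored, top_n):
    """Return the top_n templates with the highest confidence scores,
    ordered from highest to lowest score."""
    best = heapq.nlargest(top_n, scored.items(), key=lambda item: item[1])
    return [template for template, _ in best]
```

`heapq.nlargest` avoids sorting the whole candidate list when only a few templates are kept; a plain `sorted(..., reverse=True)[:top_n]` would give the same result.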
After the demand analysis templates of each demand type have been mined in the manner of Embodiment one above, a received query to be identified is matched against each demand analysis template, and the demand type corresponding to the successfully matched demand analysis template is determined to be the demand type of the query to be identified.
Suppose the query to be identified is "Tang Wei picture collection". After this query is matched against each demand analysis template, the matching demand analysis template is "[W+] picture collection", which corresponds to the picture demand type; the query to be identified can therefore be determined to have a picture demand.
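A minimal sketch of this matching step is given below. It assumes that the wildcard "[W+]" stands for one or more arbitrary characters and that a template must match the whole query; both are assumptions, as are the function names and the English example strings.

```python
import re

WILDCARD = "[W+]"


def compile_template(template):
    """Turn a template such as '[W+] picture collection' into a regex in
    which every wildcard matches one or more characters."""
    fixed_parts = [re.escape(part) for part in template.split(WILDCARD)]
    return re.compile("(.+)".join(fixed_parts))


def identify(query, templates_by_type):
    """Return every demand type that has at least one template matching
    the whole query."""
    return [
        demand_type
        for demand_type, templates in templates_by_type.items()
        if any(compile_template(t).fullmatch(query) for t in templates)
    ]


templates = {
    "picture": ["[W+] picture collection"],
    "video": ["[W+] HD"],
}
matched = identify("sunset picture collection", templates)
```

In practice the compiled patterns would be cached rather than rebuilt for every query.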
The above describes the method provided by the present invention. The device for establishing demand analysis templates provided by the present invention is described below through Embodiment two.
Embodiment two,
Fig. 2 is a structural diagram of the device for establishing demand analysis templates provided by Embodiment two of the present invention. As shown in Fig. 2, the device can comprise: a seed acquiring unit 201, a phrase determining unit 202, a candidate template determining unit 203 and a template selection unit 204.
The seed acquiring unit 201 obtains the seed query set of a preset demand type. The seed query set can be configured manually, but because template mining usually requires a very large number of seed queries, manual annotation and review are too costly. Preferably, the embodiments of the present invention obtain the seed set of the demand type automatically, in ways that include but are not limited to the following:
Mode 1: from the search log of the vertical search for the demand type, obtain the queries whose search count is higher than a preset first threshold to form the seed query set of the demand type; or
Mode 2: from the search log of web search, obtain the queries for which a website of the demand type was clicked or a title containing a feature word of the demand type was clicked, and form the seed query set of the demand type from those obtained queries whose search count is higher than a preset second threshold; or
Mode 3: take the intersection of the queries obtained in Mode 1 and the queries obtained in Mode 2 to obtain the seed query set of the demand type.
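The three modes can be sketched as follows, under the simplifying assumption that each log has already been reduced to a list with one query string per search, and that the web search log contains only searches whose click satisfied the Mode 2 condition. The function names and the log layout are hypothetical.

```python
from collections import Counter


def frequent_queries(log, threshold):
    """Queries whose search count in the log is higher than the threshold."""
    counts = Counter(log)
    return {query for query, count in counts.items() if count > threshold}


def seed_set_mode3(vertical_log, web_click_log, first_threshold, second_threshold):
    """Mode 3: intersection of the Mode 1 and Mode 2 seed query sets."""
    mode1 = frequent_queries(vertical_log, first_threshold)
    mode2 = frequent_queries(web_click_log, second_threshold)
    return mode1 & mode2
```

Taking the intersection trades recall for precision: a query must be frequent in both logs to become a seed.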
The phrase determining unit 202 determines all n-grams of the seed query set. An n-gram is a combination of n minimum-granularity words occurring in sequence, where n is one or more preset positive integers.
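A short sketch of n-gram extraction and counting follows. It assumes that each seed query has already been segmented into a list of minimum-granularity words; the segmentation itself is outside the sketch, and the function names are hypothetical.

```python
from collections import Counter


def ngrams(words, n):
    """All runs of n consecutive words, in order of appearance."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def count_ngrams(segmented_queries, ns=(1, 2)):
    """Occurrence count of every n-gram in the seed query set, for each
    preset value of n."""
    counts = Counter()
    for words in segmented_queries:
        for n in ns:
            counts.update(ngrams(words, n))
    return counts
```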
The candidate template determining unit 203 uses the counted number of occurrences of each n-gram in the seed set of the demand type to replace, in each seed query of the seed query set, the N1 n-grams with the fewest occurrences with wildcards, obtaining candidate demand analysis templates, where N1 is a preset positive integer.
Here, the granularity of the n-grams to be replaced with wildcards can be set in advance. At that granularity, the candidate template determining unit 203 replaces with wildcards either the N1 n-grams with the fewest occurrences in each seed query of the seed query set or the n-grams whose number of occurrences is lower than a preset count threshold.
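The replacement step can be sketched at unigram granularity as follows. Several details are assumptions made only for illustration: the wildcard is written "[W+]", ties between equally rare words are broken by position, and adjacent wildcards are merged into one.

```python
from collections import Counter

WILDCARD = "[W+]"


def make_template(words, counts, replace_count):
    """Replace the replace_count rarest words of one seed query with a wildcard."""
    by_rarity = sorted(range(len(words)), key=lambda i: (counts[words[i]], i))
    rare_positions = set(by_rarity[:replace_count])
    template = []
    for i, word in enumerate(words):
        token = WILDCARD if i in rare_positions else word
        # Assumption: consecutive wildcards collapse into a single one.
        if token == WILDCARD and template and template[-1] == WILDCARD:
            continue
        template.append(token)
    return " ".join(template)


seeds = [["tang", "wei", "pictures"], ["sunset", "pictures"], ["cat", "pictures"]]
word_counts = Counter(word for query in seeds for word in query)
candidates = {make_template(query, word_counts, 1) for query in seeds}
```

Because frequent words such as "pictures" survive while rare words become wildcards, different seed queries tend to collapse onto the same candidate template.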
The template selection unit 204 assigns a confidence score to each candidate demand analysis template and selects the candidate demand analysis templates whose confidence scores rank in the top N2 as the demand analysis templates of the demand type, where N2 is a preset positive integer.
Further, the device can also comprise a weight determining unit 205 for determining the weight of each query in the seed query set. Different ways of obtaining the seed query set correspond to different weight calculations, as follows:
In Mode 1, the weight of each obtained query is the ratio of its number of clicks to its number of searches.
In Mode 2, the weight of each obtained query is the ratio of its number of clicks to its number of searches.
In Mode 3, the weight of each query in the seed query set is obtained by the formula
$$query\_mi = \frac{\mathrm{MIN}(click\_2,\ click\_1)^2}{search\_2 \times search\_1}$$
where query_mi is the weight of the query, click_1 is the number of clicks of the query in Mode 1, click_2 is the number of clicks of the query in Mode 2, search_1 is the number of searches of the query in Mode 1, and search_2 is the number of searches of the query in Mode 2.
The weights of the queries in the seed query set can be used in the subsequent confidence calculation for candidate demand analysis templates.
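The weight calculations can be sketched as follows. The Mode 3 function reads the formula as the square of the smaller click count divided by the product of the two search counts; this reading, like the function names, is an assumption.

```python
def ratio_weight(clicks, searches):
    """Mode 1 or Mode 2: ratio of click count to search count."""
    return clicks / searches


def intersection_weight(click_1, search_1, click_2, search_2):
    """Mode 3: MIN(click_2, click_1)^2 / (search_2 * search_1)."""
    return min(click_2, click_1) ** 2 / (search_2 * search_1)
```

For example, a query with 30 clicks in 100 vertical searches and 20 clicks in 50 web searches gets the Mode 3 weight 20² / (50 × 100) = 0.08 under this reading.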
To further improve the accuracy of demand identification, the candidate template determining unit 203 can, before performing the replacement, also replace the n-grams that are named entities in each query of the seed set with named entity type markers. The candidate demand analysis templates obtained in this way are further constrained semantically by the named entity types, which reduces the risk of semantic drift in the candidate demand analysis templates.
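A minimal sketch of this preprocessing is shown below. It assumes a dictionary lookup from word to entity type, which is only one possible way to recognise named entities; the marker format "[TYPE]", the type names and the function name are all hypothetical.

```python
def mark_entities(words, entity_types):
    """Replace every word found in the entity dictionary with a marker
    naming its entity type; other words are left unchanged."""
    return [
        "[{}]".format(entity_types[word]) if word in entity_types else word
        for word in words
    ]


# Hypothetical entity dictionary and segmented query.
entity_types = {"Beijing": "LOC"}
marked = mark_entities(["Beijing", "weather"], entity_types)
```

The marked query is then passed to the wildcard replacement step, where the marker counts neither as a wildcard nor as a fixed word.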
When scoring confidence, the template selection unit 204 computes the weighted sum of the feature parameter values of a candidate demand template to obtain the confidence score of that candidate demand template, where the feature parameters comprise at least one of the following:
the average weight of all seed queries from which the candidate demand template was obtained, a score based on the number of fixed words the candidate demand template contains, a score based on whether the candidate demand template contains a named entity type marker, and a score based on the number of n-grams replaced in the candidate demand template. For details, see the description in Embodiment one.
Embodiment three,
Fig. 3 is a structural diagram of the device for search demand identification provided by Embodiment three of the present invention. This device identifies the search demand of a query on the basis of the device for establishing demand analysis templates shown in Embodiment two. As shown in Fig. 3, the device can comprise:
A request acquiring unit 301, for receiving a query to be identified.
A template matching unit 302, for matching the query to be identified against the demand analysis templates of each demand type and determining that the demand type corresponding to the successfully matched demand analysis template is the demand type of the query to be identified.
The demand analysis templates of each demand type are established by the device for establishing demand analysis templates shown in Embodiment two.
After the demand type has been identified with the method or device provided by the embodiments of the present invention, it can be used in, but is not limited to, the following application scenarios:
1) Ranking in general search. After the user enters a query, the method and device of the embodiments of the present invention can identify the demand type of the query, and the pages corresponding to the demand type of the query can be ranked higher in the search results of the general search.
For example, when the user enters the query "Home Cooking HD", general search can identify that this query has a video demand, and the result page of the general search can contain video information related to the TV series "Home Cooking". This video information can be provided by the video vertical search and inserted into the search results of the general search, so that the video pages are ranked at the front of the search results, as shown in Fig. 4, which greatly improves user satisfaction and the search experience.
2) Vertical search. After the user enters a query, the method and device of the embodiments of the present invention can identify the demand type of the query, the query is distributed to the best-suited content resource or application provider for processing, and accurate matching results are efficiently returned to the user.
For example, when the user enters "from Baidu Building to Wudaokou", the query can be identified as having a map demand. The query is passed to the map vertical search, which computes the bus routes and then directly displays the bus travel map from Baidu Building to Wudaokou and the related bus information, as shown in Fig. 5.
3) Information recommendation. After the user enters a query, the method and device of the embodiments of the present invention can identify the demand type of the query, and information is recommended to the user based on this demand type, such as advertisement recommendation, recommendation on a knowledge question-and-answer platform, query recommendation, and so on.
For example, if the user enters the query "cheap MP3 player" and its demand type is identified as shopping, advertisements related to MP3 players can be recommended in the search results, so that the advertisements closely match the user's actual demand.
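Scenario 1) can be illustrated with a small sketch that moves results of the identified demand type to the front of the result list. The result records with a "type" field and the function name are hypothetical; the description does not specify how the ranking is adjusted.

```python
def promote(results, demand_type):
    """Move results whose type equals the identified demand type to the
    front, keeping the original relative order inside each group."""
    # sorted() is stable, and False sorts before True.
    return sorted(results, key=lambda result: result["type"] != demand_type)


results = [
    {"title": "encyclopedia entry", "type": "web"},
    {"title": "episode 1", "type": "video"},
    {"title": "news article", "type": "web"},
    {"title": "episode 2", "type": "video"},
]
reranked = promote(results, "video")
```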
The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (14)

1. A method for establishing a demand analysis template, characterized in that the following steps are performed for each preset demand type:
S1: obtaining a seed query set of the demand type;
S2: determining all n-grams of the seed query set, where n is one or more preset positive integers;
S3: according to the counted number of occurrences of each n-gram in the seed set of the demand type, replacing the N1 n-grams with the fewest occurrences in each seed query of the seed query set with wildcards to obtain candidate demand analysis templates, where N1 is a preset positive integer;
S4: assigning a confidence score to each candidate demand analysis template, and selecting the candidate demand analysis templates whose confidence scores rank in the top N2 as the demand analysis templates of the demand type, where N2 is a preset positive integer.
2. The method according to claim 1, characterized in that step S1 specifically comprises one of the following modes:
Mode 1: from the search log of the vertical search for the demand type, obtaining the queries whose search count is higher than a preset first threshold to form the seed query set of the demand type; or
Mode 2: from the search log of web search, obtaining the queries for which a website of the demand type was clicked or a title containing a feature word of the demand type was clicked, and forming the seed query set of the demand type from those obtained queries whose search count is higher than a preset second threshold; or
Mode 3: taking the intersection of the queries obtained in Mode 1 and the queries obtained in Mode 2 to obtain the seed query set of the demand type.
3. The method according to claim 2, characterized in that in Mode 1 the weight of each obtained query is the ratio of its number of clicks to its number of searches; or
in Mode 2 the weight of each obtained query is the ratio of its number of clicks to its number of searches; or
in Mode 3 the weight of each query in the seed query set is obtained by the formula $$query\_mi = \frac{\mathrm{MIN}(click\_2,\ click\_1)^2}{search\_2 \times search\_1}$$ where query_mi is the weight of the query, click_1 is the number of clicks of the query in Mode 1, click_2 is the number of clicks of the query in Mode 2, search_1 is the number of searches of the query in Mode 1, and search_2 is the number of searches of the query in Mode 2.
4. The method according to claim 1, characterized in that in step S3 the granularity of the n-grams to be replaced with wildcards is set in advance, and the step of replacing with wildcards the N1 n-grams with the fewest occurrences in each seed query of the seed query set, or the n-grams whose number of occurrences is lower than a preset count threshold, is performed at that granularity.
5. The method according to claim 1, characterized in that, before the replacement is performed in step S3, the method further comprises: replacing the n-grams that are named entities in each query of the seed query set with named entity type markers.
6. The method according to any one of claims 1 to 5, characterized in that assigning a confidence score to each candidate demand analysis template in step S4 specifically comprises:
computing the weighted sum of the feature parameter values of a candidate demand template to obtain the confidence score of that candidate demand template, where the feature parameters comprise at least one of the following:
the average weight of all seed queries from which the candidate demand template was obtained, a score based on the number of fixed words the candidate demand template contains, a score based on whether the candidate demand template contains a named entity type marker, and a score based on the number of n-grams replaced in the candidate demand template.
7. A method for search demand identification, characterized in that the method comprises:
after a query to be identified is received, matching the query to be identified against the demand analysis templates of each demand type, and determining that the demand type corresponding to the successfully matched demand analysis template is the demand type of the query to be identified;
wherein the demand analysis templates of each demand type are established by the method of claim 1, 2, 3, 4 or 5.
8. A device for establishing a demand analysis template, characterized in that the device comprises:
a seed acquiring unit, for obtaining a seed query set of a preset demand type;
a phrase determining unit, for determining all n-grams of the seed query set;
a candidate template determining unit, for replacing, according to the counted number of occurrences of each n-gram in the seed set of the demand type, the N1 n-grams with the fewest occurrences in each seed query of the seed query set with wildcards to obtain candidate demand analysis templates, where N1 is a preset positive integer;
a template selection unit, for assigning a confidence score to each candidate demand analysis template and selecting the candidate demand analysis templates whose confidence scores rank in the top N2 as the demand analysis templates of the demand type, where N2 is a preset positive integer.
9. The device according to claim 8, characterized in that the seed acquiring unit obtains the seed query set in one of the following modes:
Mode 1: from the search log of the vertical search for the demand type, obtaining the queries whose search count is higher than a preset first threshold to form the seed query set of the demand type; or
Mode 2: from the search log of web search, obtaining the queries for which a website of the demand type was clicked or a title containing a feature word of the demand type was clicked, and forming the seed query set of the demand type from those obtained queries whose search count is higher than a preset second threshold; or
Mode 3: taking the intersection of the queries obtained in Mode 1 and the queries obtained in Mode 2 to obtain the seed query set of the demand type.
10. The device according to claim 9, characterized in that the device further comprises a weight determining unit, for obtaining, in Mode 1, the weight of each obtained query as the ratio of its number of clicks to its number of searches; or
for obtaining, in Mode 2, the weight of each obtained query as the ratio of its number of clicks to its number of searches; or
for obtaining, in Mode 3, the weight of each query in the seed query set by the formula $$query\_mi = \frac{\mathrm{MIN}(click\_2,\ click\_1)^2}{search\_2 \times search\_1}$$ where query_mi is the weight of the query, click_1 is the number of clicks of the query in Mode 1, click_2 is the number of clicks of the query in Mode 2, search_1 is the number of searches of the query in Mode 1, and search_2 is the number of searches of the query in Mode 2.
11. The device according to claim 8, characterized in that the granularity of the n-grams to be replaced with wildcards is set in advance, and the candidate template determining unit performs, at that granularity, the operation of replacing with wildcards the N1 n-grams with the fewest occurrences in each seed query of the seed query set or the n-grams whose number of occurrences is lower than a preset count threshold.
12. The device according to claim 8, characterized in that the candidate template determining unit, before performing the replacement, further replaces the n-grams that are named entities in each query of the seed query set with named entity type markers.
13. The device according to any one of claims 8 to 12, characterized in that the template selection unit computes the weighted sum of the feature parameter values of a candidate demand template to obtain the confidence score of that candidate demand template, where the feature parameters comprise at least one of the following:
the average weight of all seed queries from which the candidate demand template was obtained, a score based on the number of fixed words the candidate demand template contains, a score based on whether the candidate demand template contains a named entity type marker, and a score based on the number of n-grams replaced in the candidate demand template.
14. A device for search demand identification, characterized in that the device comprises:
a request acquiring unit, for receiving a query to be identified;
a template matching unit, for matching the query to be identified against the demand analysis templates of each demand type and determining that the demand type corresponding to the successfully matched demand analysis template is the demand type of the query to be identified;
wherein the demand analysis templates of each demand type are established by the device of claim 8, 9, 10, 11 or 12.
CN2011102667995A 2011-09-09 2011-09-09 Method for building requirement analysis formwork and method and device for searching requirement recognition Pending CN102999496A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011102667995A CN102999496A (en) 2011-09-09 2011-09-09 Method for building requirement analysis formwork and method and device for searching requirement recognition

Publications (1)

Publication Number Publication Date
CN102999496A true CN102999496A (en) 2013-03-27

Family

ID=47928077

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011102667995A Pending CN102999496A (en) 2011-09-09 2011-09-09 Method for building requirement analysis formwork and method and device for searching requirement recognition

Country Status (1)

Country Link
CN (1) CN102999496A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1194416A (en) * 1997-09-29 1998-09-30 毕福君 Polynary confidence degree adapting system and its relative method
US20040243568A1 (en) * 2000-08-24 2004-12-02 Hai-Feng Wang Search engine with natural language-based robust parsing of user query and relevance feedback learning
CN1578955A (en) * 2001-09-04 2005-02-09 国际商业机器公司 Sampling approach for data mining of association rules
CN1750121A (en) * 2004-09-16 2006-03-22 北京中科信利技术有限公司 A kind of pronunciation evaluating method based on speech recognition and speech analysis
CN101055587A (en) * 2007-05-25 2007-10-17 清华大学 Search engine retrieving result reordering method based on user behavior information
CN101853308A (en) * 2010-06-11 2010-10-06 中兴通讯股份有限公司 Method and application terminal for personalized meta-search
CN102129422A (en) * 2010-01-14 2011-07-20 富士通株式会社 Template extraction method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
明悦: "语音识别与评测在汉语学习中的应用", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
李尘一: "基于联合得分的语音置信度评估系统的研究与设计", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104077320A (en) * 2013-03-29 2014-10-01 北京百度网讯科技有限公司 Method and device for generating to-be-published information
US9529856B2 (en) 2013-06-03 2016-12-27 Google Inc. Query suggestion templates
TWI650654B (en) * 2013-06-03 2019-02-11 谷歌有限責任公司 Query suggestion template
US10635717B2 (en) 2013-06-03 2020-04-28 Google Llc Query suggestion templates
CN107203501A (en) * 2016-03-16 2017-09-26 航天信息软件技术有限公司 A kind of information issuing method and device
CN107832414A (en) * 2017-11-07 2018-03-23 百度在线网络技术(北京)有限公司 Method and apparatus for pushed information
CN107832414B (en) * 2017-11-07 2021-10-22 百度在线网络技术(北京)有限公司 Method and device for pushing information

Similar Documents

Publication Publication Date Title
CN103729359B (en) A kind of method and system recommending search word
CN110704743B (en) Semantic search method and device based on knowledge graph
CN101681251B (en) From the semantic analysis of documents to rank phrase
CN101593200B (en) Method for classifying Chinese webpages based on keyword frequency analysis
CN101984422B (en) Fault-tolerant text query method and equipment
US20110225115A1 (en) Systems and methods for facilitating open source intelligence gathering
CN103336766A (en) Short text garbage identification and modeling method and device
CN105653671A (en) Similar information recommendation method and system
CN104615687A (en) Entity fine granularity classifying method and system for knowledge base updating
CN102831119B (en) Short text clustering Apparatus and method for
CN102722498A (en) Search engine and implementation method thereof
CN102402566A (en) Web user behavior analysis method based on Chinese webpage automatic classification technology
CN105095625B (en) Clicking rate prediction model method for building up, device and information providing method, system
CN103020066A (en) Method and device for recognizing search demand
CN110287329A (en) A kind of electric business classification attribute excavation method based on commodity text classification
CN104715063A (en) Search ranking method and search ranking device
CN105808541B (en) A kind of information matches treating method and apparatus
US20130110594A1 (en) Ad copy determination
CN103186556A (en) Method for obtaining and searching structural semantic knowledge and corresponding device
CN105468649A (en) Method and apparatus for determining matching of to-be-displayed object
TWI461942B (en) An ad management apparatus, an advertisement selecting apparatus, an advertisement management method, an advertisement management program, and a recording medium on which an advertisement management program is recorded
CN105389328B (en) A kind of extensive open source software searching order optimization method
CN102999496A (en) Method for building requirement analysis formwork and method and device for searching requirement recognition
CN103020083A (en) Automatic mining method of requirement identification template, requirement identification method and corresponding device
CN107665442B (en) Method and device for acquiring target user

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20130327
