CN102592595A

CN102592595A - Voice recognition method and system

Info

Publication number: CN102592595A
Application number: CN2012100734129A
Authority: CN
Inventors: 潘青华; 鹿晓亮; 何婷婷; 王智国; 胡国平; 胡郁; 刘庆峰
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2012-03-19
Filing date: 2012-03-19
Publication date: 2012-07-18
Anticipated expiration: 2032-03-19
Also published as: CN102592595B

Abstract

The invention relates to the technical field of voice recognition and discloses a voice recognition method and system. The method comprises the following steps: a decoding recognition network is built; each frame of a received voice signal is decoded according to the decoding recognition network, and during decoding the history paths of active nodes are excited according to hot words, so as to raise the cumulative probability of the history paths on which the hot words lie; after the last voice signal frame has been decoded, the active node with the maximum cumulative probability is selected as the optimal node; and the optimal path and the corresponding word sequence are obtained by tracing back through the decoding states from the optimal node. The invention avoids re-estimating system parameters and recognizes hot words and user-specific words quickly and accurately, thereby improving hot-word recognition.

Description

Voice recognition method and system

Technical field

The present invention relates to the field of speech recognition technology, and in particular to a speech recognition method and system.

Background art

Achieving natural, intelligent and effective human-machine interaction, and building an efficient and natural environment for human-machine communication, has become an urgent need in the application and development of information technology. In recent years, with the rapid development of speech recognition technology, online speech recognition applications such as voice input and voice search have attracted increasing attention. A system trained on massive data can meet the needs of common voice input, and recognition accuracy is usually high when the input content matches the probability distribution of the original language model. In practice, however, the rapid development of the mobile Internet and social networks continually produces new hot topics and corresponding hot words, and different users also need recognition of different personalized words, such as contact names. Because such hot words and personalized words are time-sensitive and specific, they tend to occur rarely in the originally collected corpus. The original language model therefore covers them poorly, and the recognition system cannot recognize them accurately.

To address this, the prior art usually re-estimates the system parameters: newly collected hot-word material is added to the original corpus and a new language model is trained to improve recognition accuracy for the newly added hot words. In practice, however, hot words are updated frequently, and the system often cannot collect enough material in time for the re-estimation, which degrades hot-word recognition. Moreover, retraining the language model and rebuilding the recognition resources (for example, constructing a decoding recognition network based on WFST (Weighted Finite-State Transducers)) is time-consuming and costly, so a fast response to hot-word recognition cannot be achieved.

Summary of the invention

Embodiments of the invention provide a speech recognition method and system to solve the technical problem that the prior art cannot recognize hot words and user-specific words quickly and accurately.

To this end, embodiments of the invention provide the following technical solutions:

A speech recognition method, comprising:

building a decoding recognition network;

for a received voice signal, decoding each voice signal frame according to the decoding recognition network and, during decoding, exciting the history paths of active nodes according to hot words, so as to raise the cumulative history path probability of the paths on which the hot words lie;

after the last voice signal frame has been decoded, selecting the active node with the maximum cumulative probability as the optimal node;

tracing back through the decoding states from the optimal node to obtain the optimal path and the corresponding word sequence.

A speech recognition system, comprising:

a network building unit, configured to build a decoding recognition network;

a decoding unit, configured to decode, for a received voice signal, each voice signal frame according to the decoding recognition network;

an excitation unit, configured to excite the history paths of active nodes according to hot words while the decoding unit is decoding, so as to raise the cumulative history path probability of the paths on which the hot words lie;

an optimal node determining unit, configured to select, after the decoding unit has decoded the last voice signal frame, the active node with the maximum cumulative probability as the optimal node;

a backtracking unit, configured to trace back through the decoding states from the optimal node to obtain the optimal path and the corresponding word sequence.

The speech recognition method and system of the embodiments excite the history paths of active nodes on the basis of hot-word matching, raising the cumulative history path probability of the paths on which hot words lie. This effectively boosts hot-word recognition and improves its results. No re-estimation of system parameters is needed, and hot words and user-specific words can be recognized quickly and accurately. The approach also offers a feasible way for the system to support customized personal dictionaries or personalized language models: a user can obtain recognition support for personal vocabulary simply by updating the hot-word entries the system supports.

Description of drawings

In order to explain the technical solutions of the embodiments more clearly, the drawings used in the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of the speech recognition method of an embodiment of the invention;

Fig. 2 is a schematic diagram of a WFST-based decoding recognition network in an embodiment of the invention;

Fig. 3 is a schematic diagram of the tree-structured hot-word dictionary in an embodiment of the invention;

Fig. 4 is a flowchart of exciting the history paths obtained by decoding according to the hot-word set in an embodiment of the invention;

Fig. 5 is a flowchart of exciting the history paths obtained by decoding according to the hot-word dictionary in an embodiment of the invention;

Fig. 6 is a flowchart of one implementation of updating the cumulative history path probability of a history path according to the extension result of the history path in an embodiment of the invention;

Fig. 7 is a flowchart of another implementation of updating the cumulative history path probability of a history path according to the extension result of the history path in an embodiment of the invention;

Fig. 8 is a schematic diagram of a specific hot-word dictionary in an embodiment of the invention;

Fig. 9 is a schematic structural diagram of the speech recognition system of an embodiment of the invention;

Fig. 10 is a schematic structural diagram of one specific application of the speech recognition system of an embodiment of the invention;

Fig. 11 is a schematic structural diagram of another specific application of the speech recognition system of an embodiment of the invention;

Fig. 12 is a schematic structural diagram of one excitation subunit in an embodiment of the invention;

Fig. 13 is a schematic structural diagram of another excitation subunit in an embodiment of the invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the invention.

To make the solutions of the embodiments easier to understand, and to show more clearly how they differ from existing speech recognition schemes, the basic speech recognition method of the prior art is first described briefly.

In the prior art, the semantic network of the language model is usually expanded, by means of the acoustic model, the dictionary and so on, into a search network at the model-state level; that is, a decoding recognition network is built. When an input voice signal is decoded, new effective extension paths are obtained by computing, for each input voice signal frame, the cumulative history path probability with respect to each acoustic model and the language model on the currently effective extension paths. After the last voice signal frame has been searched, the optimal decoding path is the one with the maximum cumulative history path probability, and the corresponding word sequence is obtained by tracing back through the states from the optimal node.

The prior art improves accuracy for newly added hot words by re-estimating system parameters and therefore cannot recognize hot words and user-specific words quickly and accurately. To address this problem, the speech recognition method and system of the embodiments excite the current history paths according to hot words, thereby raising the cumulative history path probability of the paths on which hot words lie and improving hot-word recognition. No re-estimation of system parameters is needed, and hot words and user-specific words can be recognized quickly and accurately.

Fig. 1 is a flowchart of the speech recognition method of an embodiment of the invention, which comprises the following steps:

Step 101: build a decoding recognition network.

In the embodiments of the invention, the decoding recognition network may be built online by the system, or built offline and loaded directly at system start-up, which reduces the system's computation and memory requirements and further improves decoding efficiency.

Step 102: for a received voice signal, decode each voice signal frame according to the decoding recognition network and, during decoding, excite the history paths of active nodes according to hot words, so as to raise the cumulative history path probability of the paths on which the hot words lie.

Decoding the user's voice input with the decoding recognition network is a process of searching the network for the optimal path, thereby converting speech to text.

Specifically, the received continuous voice signal may first be sampled into a series of discrete energy values and stored in a data buffer.

To further improve the robustness of the system, the received continuous voice signal may also first be denoised. The continuous voice signal is first divided into independent speech segments and non-speech segments by analysing its short-time energy and short-time zero-crossing rate. Speech enhancement is then applied to the speech segments; methods such as Wiener filtering may be used to further remove environmental noise from the voice signal, improving the ability of subsequent stages to process it.
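As a minimal sketch of the segmentation step, the two per-frame measures could be computed as shown below. The decision rule `is_speech` and its three thresholds are hypothetical illustrations; the text does not specify how the two measures are combined.

```python
def short_time_energy(frame):
    """Sum of squared sample values in one frame."""
    return sum(s * s for s in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def is_speech(frame, high_energy, low_energy, min_zcr):
    """Hypothetical two-threshold rule, not taken from the patent.

    A frame counts as speech if its energy is clearly high, or if it has
    moderate energy together with a high zero-crossing rate (as unvoiced
    consonants often do).
    """
    energy = short_time_energy(frame)
    if energy >= high_energy:
        return True
    return energy >= low_energy and zero_crossing_rate(frame) >= min_zcr
```

Consecutive frames labelled as speech would then be grouped into the speech segments that are passed on to enhancement.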

The denoised voice signal may still contain a large amount of redundant information irrelevant to speech recognition, and recognizing it directly would hurt both computational cost and recognition accuracy. Features effective for recognition may therefore be extracted from the denoised speech energy signal and stored in a feature buffer. Specifically, the MFCC (Mel Frequency Cepstrum Coefficient) features of the speech may be extracted: short-time analysis is performed on each frame of speech data, using a 25 ms window with a 10 ms frame shift, to obtain the MFCC parameters and their first-order and second-order differences, 39 dimensions in total. In other words, each voice signal frame is quantized into a 39-dimensional feature vector.
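The framing arithmetic and the way the 39 dimensions arise can be sketched as follows. The 16 kHz sampling rate, the split into 13 static coefficients, and the simple two-point difference are assumptions for illustration; the text gives only the window length, the frame shift and the 39-dimension total. The MFCC computation itself is omitted.

```python
def frame_starts(num_samples, sample_rate=16000, window_ms=25, shift_ms=10):
    """Start index of every full analysis window (sample rate is an assumption)."""
    win = sample_rate * window_ms // 1000
    hop = sample_rate * shift_ms // 1000
    if num_samples < win:
        return []
    return list(range(0, num_samples - win + 1, hop))


def deltas(frames):
    """Difference between the next and previous frame, halved (edges replicated)."""
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def stack_features(static):
    """Append first- and second-order differences to each static vector."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]
```

With 13 static coefficients per frame, `stack_features` yields 13 + 13 + 13 = 39 values per frame.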

Each voice signal frame is then decoded according to the decoding recognition network to obtain the optimal path, completing the decoding and recognition process.

In the prior art, the optimal path is searched as follows: in time order from left to right, the cumulative history path probability with which each voice signal frame reaches each active node in the decoding recognition network is computed.

Specifically, for each voice signal frame under consideration, the history paths and cumulative history path probabilities of all active nodes in the current decoding recognition network with respect to that frame are computed first.

For example, let the speech feature sequence of the current voice signal be $\{O_1, O_2, \ldots, O_t\}$. The path probability $\phi_j(t)$ with which the feature $O_t$ at time $t$ enters active node $j$, that is, the maximum probability over all possible history paths from an active node $i$ to node $j$, is computed as follows:

$$\phi_j(t) = \max_i \left\{ \phi_i(t-1)\, a_{ij} \right\} b_j(O_t)$$

where $i$ ranges over all active nodes connected to active node $j$ in the decoding recognition network;

$\phi_i(t-1)$ is the history path probability with which the feature $O_{t-1}$ at time $(t-1)$ falls on active node $i$; $a_{ij}$ is the transition probability from node $i$ to node $j$; and $b_j(O_t)$ is the likelihood of the feature $O_t$ at time $t$ with respect to node $j$.

The cumulative history path probability of active node j is the score of the path with the maximum cumulative path probability among all paths through nodes connected to active node j. In other words, computing the cumulative path probability of active node j also identifies the predecessor of j, and hence the history path of active node j.
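One frame of this search can be sketched as a Viterbi-style update. The sketch works with log probabilities, so the products of path probability, transition probability and likelihood become sums; the dictionary-based data layout and the name `viterbi_step` are illustrative assumptions rather than the disclosed implementation.

```python
import math


def viterbi_step(prev_scores, log_trans, log_emit):
    """Advance every active node by one frame.

    prev_scores: {i: log path probability of node i at time t-1}
    log_trans:   {(i, j): log transition probability from i to j}
    log_emit:    {j: log likelihood of the current feature at node j}
    Returns ({j: best log path probability at time t}, {j: best predecessor}).
    """
    scores, backptr = {}, {}
    for (i, j), log_a in log_trans.items():
        if i not in prev_scores or j not in log_emit:
            continue
        cand = prev_scores[i] + log_a + log_emit[j]
        if j not in scores or cand > scores[j]:
            scores[j] = cand
            backptr[j] = i  # remember which predecessor gave the maximum
    return scores, backptr


# Toy example: predecessors "a" and "b" compete to enter node "c".
prev = {"a": math.log(0.6), "b": math.log(0.4)}
trans = {("a", "c"): math.log(0.5), ("b", "c"): math.log(0.9)}
emit = {"c": math.log(0.5)}
scores, backptr = viterbi_step(prev, trans, emit)
```

In the toy example the path through "b" wins (0.4 × 0.9 × 0.5 = 0.18 against 0.6 × 0.5 × 0.5 = 0.15), so the recorded predecessor of "c" is "b".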

The next voice signal frame is then obtained, and decoding is extended forward from the history paths that satisfy the system's preset conditions. After the last voice signal frame has been decoded, the active node with the maximum cumulative history path probability is the optimal node; the history path obtained by tracing back from this node through the decoding states is the optimal path, and the word sequence on the optimal path is the decoding result.

A language model trained on massive data reflects the vocabulary distribution of the original corpus well and therefore recognizes most ordinary words well. Hot words and user-specific words, however, are individual in nature and have low probability in the original language model, so the scores of their decoding paths tend to be low and they cannot be recognized correctly.

For this reason, in the embodiments of the invention, the history paths of active nodes are excited according to hot words so that hot words survive longer during search path extension. Paths in the decoding recognition network that may match a hot word are thereby favoured, the success rate of hot-word matching rises, and the accuracy of hot-word recognition improves accordingly.

Different excitation methods may be applied to the paths in the decoding recognition network that may match a hot word; these are described in detail later.

Step 103: after the last voice signal frame has been decoded, select the active node with the maximum cumulative probability as the optimal node.

Step 104: trace back through the decoding states from the optimal node to obtain the optimal path and the corresponding word sequence.

The history path obtained by tracing back from the optimal node through the decoding states is the optimal path.
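Steps 103 and 104 can be illustrated with a small sketch in which every decoding state keeps a link to its predecessor. The `Token` class and its fields are hypothetical names introduced for illustration only.

```python
class Token:
    """One decoding state: cumulative score, optional word label, predecessor."""

    def __init__(self, score, word=None, prev=None):
        self.score = score
        self.word = word
        self.prev = prev


def best_token(tokens):
    """Step 103: pick the active state with the maximum cumulative score."""
    return max(tokens, key=lambda t: t.score)


def backtrace(token):
    """Step 104: follow predecessor links and collect the words in order."""
    words = []
    while token is not None:
        if token.word is not None:
            words.append(token.word)
        token = token.prev
    words.reverse()
    return words
```

Calling `backtrace(best_token(active_tokens))` after the last frame returns the word sequence on the optimal path.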

As can be seen, the speech recognition method of the embodiments excites history paths on the basis of hot-word matching, optimizing the cumulative history path probability of the paths on which hot words lie and improving hot-word recognition. With this method, no re-estimation of system parameters is needed, and hot words and user-specific words can be recognized quickly and accurately. The method also offers a feasible way for the system to support customized personal dictionaries or personalized language models: a user can obtain recognition support for personal vocabulary simply by updating the hot-word dictionary the system supports.

As mentioned above, in the embodiments of the invention the decoding recognition network may be built online by the system, or built offline and loaded directly at system start-up, which reduces the system's computation and memory requirements and further improves decoding efficiency.

Specifically, the decoding recognition network may be built from a preset acoustic model, a preset language model and so on.

The acoustic model mainly models the pronunciation characteristics of characters. It may be the HMM (Hidden Markov Model), based on transition probabilities and emission probabilities, that is commonly used in speech recognition. In large-vocabulary continuous speech recognition the vocabulary is so large that building one HMM per character would yield too many models, which is unfavourable for storage and computation. In practice, therefore, HMMs may be built only for basic pronunciation units such as syllables or phonemes. The acoustic model may of course use other techniques, such as neural networks; the embodiments of the invention impose no limitation here.

The language model serves to represent knowledge such as grammar and semantics more effectively, compensating for the weaknesses of the acoustic model and improving the recognition rate. It may be the statistical language model commonly used in speech recognition, which describes the relations between words in terms of statistical probabilities; that is, the probability of a word $w_k$ is assumed to depend only on the preceding $n-1$ words and is written $P(w_k \mid w_{k-n+1}, \ldots, w_{k-1})$.

The language model may of course also use other techniques, such as word pairs; the embodiments of the invention impose no limitation here.
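For the simplest non-trivial case, n = 2, such a statistical model can be estimated from counts. The sketch below uses plain maximum-likelihood estimates without smoothing, which a practical system would add; the function name and the `<s>` sentence-start marker are illustrative assumptions.

```python
from collections import Counter


def train_bigram(sentences):
    """Return prob(cur, prev), the relative frequency of cur following prev."""
    history, bigram = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + list(words)  # "<s>" marks the sentence start
        for prev, cur in zip(padded, padded[1:]):
            history[prev] += 1
            bigram[(prev, cur)] += 1

    def prob(cur, prev):
        # Unseen histories get probability 0.0 in this unsmoothed sketch.
        return bigram[(prev, cur)] / history[prev] if history[prev] else 0.0

    return prob
```

A hot word that never occurs in the training sentences receives probability zero or close to it under such a model, which is the coverage problem the excitation scheme is meant to work around.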

The decoding recognition network may be built with existing construction methods, using the acoustic model to expand the language model into a search network at the model level. Fig. 2 shows a schematic diagram of a WFST-based decoding recognition network. Other kinds of decoding recognition network may of course be used, such as a dynamic decoding recognition network based on history word tree copies.

In the embodiments of the invention, the user may define hot words at any time, to extend the recognition range of the system and adapt to a constantly changing vocabulary. Specifically, after the user enters a new hot word, the hot word may be saved into a hot-word set. In other words, the hot-word set may be a dynamically updated library of hot-word entries.

In addition, a word segmentation algorithm may be applied when the hot-word dictionary is updated: the text of each newly added user-defined hot word is segmented using the system's existing preset dictionary, so that every hot-word entry is expressed as a sequence of basic word units already known to the system, and the segmentation result is saved into the hot-word dictionary. The segmentation results may be managed in the tree structure shown in Fig. 3, in which each branch corresponds to one hot-word entry, as follows:

Hot word A: Word_i1 Word_i2 Word_i3;

Hot word B: Word_i1 Word_i4;

Hot word C: Word_i5;

...;

Hot word N: Word_ij ... Word_ik.

The hot-word dictionary may be a dynamically updated hot-word segmentation dictionary.
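A tree-structured dictionary of this kind can be sketched as a prefix tree (trie) over word units. The nested-dictionary layout, the `END` marker and the function name are illustrative assumptions; the word segmentation step is taken as already done.

```python
END = "<end>"  # hypothetical marker: a hot-word entry finishes at this node


def build_hotword_trie(entries):
    """Build a prefix tree from hot-word entries given as word-unit sequences."""
    root = {}
    for units in entries:
        node = root
        for unit in units:
            node = node.setdefault(unit, {})  # shared prefixes share nodes
        node[END] = True
    return root


# Entries shaped like hot words A, B and C in the listing above.
trie = build_hotword_trie([
    ["Word1", "Word2", "Word3"],
    ["Word1", "Word4"],
    ["Word5"],
])
```

Adding a user-defined hot word is then a matter of inserting one more branch, with no retraining of the language model.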

Based on the hot-word set or hot-word dictionary described above, different excitation methods may be applied to paths in the decoding recognition network that may match a hot word, as detailed below.

1. Optimizing the history paths obtained by decoding according to the hot-word set

Fig. 4 is a flowchart of exciting the history paths obtained by decoding according to the hot-word set in an embodiment of the invention.

In this flow, a suitable score bonus is given when the word sequence on the history path of an active node forms a hot-word entry, raising the priority of that history path.

The flow comprises the following steps:

Step 401: obtain the history paths and cumulative history path probabilities of all active nodes produced by decoding the current voice signal frame.

Step 402: determine, according to the hot-word set, whether adjacent words on the history path form a hot word; if so, go to step 403; otherwise, go to step 404.

Step 403: raise the cumulative history path probability of the history path.

For example, a fixed bonus may be added to raise the priority of the path on which the hot word lies, so that this path is more likely to be kept. The value of the fixed bonus may be preset as required. In general, the higher the bonus, the more readily hot words are matched, and the accuracy of hot-word recognition rises correspondingly.

Balancing the recognition rate of hot words against that of other words, a reference value of 300 may be used for the fixed bonus; other values are of course possible, and the embodiments of the invention impose no limitation here.

Step 404: keep the cumulative history path probability of the history path unchanged.

Note that steps 402 to 404 must be carried out for each history path obtained in step 401: history paths that contain no hot word keep their path probability, and the probabilities of history paths that contain a hot word are boosted.
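Steps 401 to 404 can be sketched as a single pass over the current history paths. The sketch assumes that scores are additive (log-domain) quantities on a scale where a bonus of 300 is meaningful, and that a hot word is detected when it forms the most recent words of a path; both are assumptions, as are all names used.

```python
FIXED_BONUS = 300.0  # reference value quoted in the text; score scale is assumed


def ends_with_hot_word(words, hot_words):
    """True if the most recent adjacent words on the path form a hot word."""
    return any(
        len(h) <= len(words) and tuple(words[-len(h):]) == h
        for h in hot_words
    )


def excite_paths(paths, hot_words, bonus=FIXED_BONUS):
    """paths: list of (word list, cumulative score); hot_words: set of tuples."""
    return [
        (words, score + bonus if ends_with_hot_word(words, hot_words) else score)
        for words, score in paths
    ]
```

A decoder would call this only when a path has just emitted a new word, so that the same hot word is not rewarded again on every following frame.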

Note also that, in practice, before the next voice signal frame is decoded, the history paths of the active nodes may be pruned to delete unlikely paths and make the subsequent search more efficient. Specifically, probability-based threshold pruning may be used: the highest cumulative history path probability among the current active nodes is found first; the difference between each active node's cumulative history path probability and this maximum is then computed; any active node whose difference exceeds a preset pruning threshold is marked inactive and removed from the subsequent search paths, ending the search that starts from that node.

In the embodiments of the invention, this pruning takes place after the history paths of all active nodes obtained by decoding the current voice signal frame have been excited according to the flow of Fig. 4. The new active nodes are determined from the cumulative history path probabilities of all current active nodes (both excited and unexcited), and the subsequent paths are extended from them.
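The threshold pruning can be sketched in a few lines. The function name, the dictionary layout and the use of log-domain scores are illustrative assumptions.

```python
def prune(active, beam):
    """Keep only nodes whose score lies within `beam` of the best score.

    active: {node: cumulative log score after any hot-word excitation}
    """
    if not active:
        return {}
    best = max(active.values())
    return {node: s for node, s in active.items() if best - s <= beam}
```

Because pruning runs on the scores after excitation, a path that has just received a hot-word bonus is less likely to fall outside the threshold.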

In the flow of Fig. 4, the priority of the path on which a hot word lies is raised by a fixed bonus, thereby optimizing that path.

In practice, a hot word often consists of two or more segments. Excitation that is applied only once a hot-word entry has been matched completely may come too late: the path on which the entry lies may already have been pruned, so the excitation never takes effect and the accuracy of hot-word recognition suffers. For this reason, in the embodiments of the invention, an excitation method based on path prediction may also be used, which raises the priority of the hot-word path step by step.

2. Optimizing the history paths obtained by decoding according to the hot-word dictionary

Fig. 5 is a flowchart of exciting the history paths obtained by decoding according to the hot-word dictionary in an embodiment of the invention, comprising the following steps:

Step 501: obtain the history paths and cumulative history path probabilities of all active nodes produced by decoding the current voice signal frame.

Step 502: determine whether a new word has appeared on the history path of the active node; if so, go to step 503; otherwise, go to step 504.

Step 503: update the cumulative history path probability of the history path according to how the new word extends the hot-word segments of the hot-word dictionary already matched on the history path.

Step 504: keep the cumulative history path probability of the history path unchanged.

Note that steps 502 to 504 must be carried out for the history paths of all active nodes for every voice signal frame. In addition, before the next voice signal frame is decoded, the history paths of the active nodes may be pruned to delete unlikely paths and make the subsequent search more efficient. Specifically, probability-based threshold pruning may be used: the highest cumulative history path probability among the current active nodes is found first; the difference between each active node's cumulative history path probability and this maximum is then computed; any active node whose difference exceeds a preset pruning threshold is marked inactive and removed from the subsequent search paths, ending the search that starts from that node.

Unlike the approach above, which excites a history path only once a hot word has been matched completely, step 503 excites the history path during decoding as soon as a hot word is partially matched; that is, the cumulative history path probability is updated according to the extension result of the history path. Raising the cumulative path probability in advance better ensures that the hot-word path survives.

Step 503 may be implemented in several different ways, for example:

Fig. 6 is a flowchart of one implementation of updating the cumulative history path probability of a history path according to the extension result of the history path in an embodiment of the invention, comprising the following steps:

Step 601: determine whether the new word is a subsequent segment of the hot-word segment matched on the history path; if so, go to step 605; otherwise, go to step 602.

Step 602: determine whether the new word is an initial segment in the hot-word dictionary; if so, go to step 603; otherwise, go to step 604.

Step 603: add the bonus corresponding to the initial segment to the cumulative history path probability of the history path.

Step 604: keep the cumulative history path probability of the history path unchanged.

Step 605: add the bonus corresponding to the subsequent segment to the cumulative history path probability of the history path.
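The flow of steps 601 to 605 can be sketched as a function that is called whenever a new word appears on a history path. The hot-word dictionary is assumed to be a nested-dictionary prefix tree, and the partial match is tracked as the tree node reached so far; the function name, argument names and bonus values are hypothetical.

```python
def update_fig6(score, node, new_word, trie, initial_bonus, follow_bonus):
    """Return (new score, tree node reached by the partial match, or None).

    node is the prefix-tree node matched so far on this path (None if no
    hot-word segment has been matched yet).
    """
    if node is not None and new_word in node:   # step 601: yes -> step 605
        return score + follow_bonus, node[new_word]
    if new_word in trie:                        # step 602: yes -> step 603
        return score + initial_bonus, trie[new_word]
    return score, None                          # step 604: score unchanged


# Hot words "W1 W2 W3", "W1 W4" and "W5" as a nested-dictionary tree.
trie = {"W1": {"W2": {"W3": {}}, "W4": {}}, "W5": {}}
```

Starting from `(0.0, None)`, the word sequence W1, W4 earns one initial bonus and one follow-up bonus, while a following unrelated word leaves the score as it is and resets the match.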

Fig. 7 is a flowchart of another implementation of updating the cumulative history path probability of a history path according to the extension result of the history path in an embodiment of the invention, comprising the following steps:

Step 701: determine whether the word sequence that precedes the new word on its history path is a complete hot word in the hot-word dictionary; if so, go to step 705; otherwise, go to step 702.

Step 702: determine whether the new word is a subsequent segment of the hot-word segment matched on the history path; if so, go to step 703; otherwise, go to step 704.

Step 703: add the bonus corresponding to the subsequent segment to the cumulative history path probability of the history path.

Step 704: revoke the bonus previously added to the cumulative history path probability of the history path.

Step 705: determine whether the new word is an initial segment in the hot-word dictionary; if so, go to step 706; otherwise, go to step 707.

Step 706: add the bonus corresponding to the initial segment to the cumulative history path probability of the history path.

Step 707: keep the cumulative history path probability of the history path unchanged.

The flow of Fig. 7 further avoids mistakenly exciting history paths on which non-hot words lie.
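A minimal sketch of this second flow is given below. It tracks, for each path, the bonus earned by the match currently in progress so that step 704 can take it back. One interpretive assumption is needed: a path with no match in progress is routed to step 705, since the flow does not say how a match begins. All names and bonus values are hypothetical.

```python
END = "<end>"  # hypothetical marker: a hot-word entry finishes at this node


def update_fig7(score, pending, node, new_word, trie, initial_bonus, follow_bonus):
    """Return (score, pending, node) after a new word appears on the path.

    pending: bonus earned so far by the match in progress (revocable).
    node:    prefix-tree node reached by that match, or None.
    """
    in_progress = node is not None and END not in node
    if in_progress:                               # step 701: not a complete hot word
        if new_word in node:                      # step 702: yes -> step 703
            return score + follow_bonus, pending + follow_bonus, node[new_word]
        return score - pending, 0.0, None         # step 704: revoke earlier bonus
    if new_word in trie:                          # step 705: yes -> step 706
        return score + initial_bonus, initial_bonus, trie[new_word]
    return score, 0.0, None                       # step 707: score unchanged


# Hot words "W1 W2" and "W5".
trie = {"W1": {"W2": {END: True}}, "W5": {END: True}}
```

With this state, a path that matched only "W1" and then moved on to an unrelated word ends up with exactly the score it would have had without any excitation, whereas a path that completed "W1 W2" keeps its bonus.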

In practice, moreover, if only one hot-word segment match record is kept for the history path of each active node, decoding cannot guarantee that every path containing a hot word will be found. For example, suppose the word sequence of hot word A is Word1 Word2 Word3, that of hot word B is Word2 Word4, and the input word sequence is Word1 Word2 Word4. When "Word2" is decoded, the Word2 segment of hot word A is matched preferentially and the Word2 segment of hot word B is ignored, so the word sequence Word1 Word2 Word4 ultimately fails to match hot word B (Word2 Word4). To address this, in the embodiments of the invention, several hot-word segment match records may be kept for the history path of a single active node to make hot-word matching more reasonable. That is, when "Word2" is decoded, both the partial match "Word1 Word2" of hot word A and the partial match "Word2" of hot word B are kept, and the corresponding bonus values are stored separately for the same history path. Later in decoding, when it is determined that one of these hot-word segment matches cannot be extended further, the bonus that this match contributed to the cumulative history path probability of the history path is revoked.
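Keeping several match records per path can be sketched as follows, using the Word1 Word2 Word4 example above. Each record stores the prefix-tree node reached and the bonus it has earned, so a record that can no longer be extended gives back exactly its own contribution. A single bonus value per word and all names are illustrative assumptions.

```python
END = "<end>"  # hypothetical marker: a hot-word entry finishes at this node


def advance_matches(score, matches, new_word, trie, bonus):
    """matches: list of (tree node, bonus earned) for every live partial match."""
    survivors = []
    for node, earned in matches:
        if new_word in node:                 # this match can be extended
            score += bonus
            survivors.append((node[new_word], earned + bonus))
        elif END not in node:                # incomplete match fails: revoke
            score -= earned
    if new_word in trie:                     # the new word may also start a match
        score += bonus
        survivors.append((trie[new_word], bonus))
    return score, survivors


# Hot word A = Word1 Word2 Word3, hot word B = Word2 Word4.
trie = {
    "Word1": {"Word2": {"Word3": {END: True}}},
    "Word2": {"Word4": {END: True}},
}
score, matches = 0.0, []
for word in ["Word1", "Word2", "Word4"]:
    score, matches = advance_matches(score, matches, word, trie, 10.0)
```

After the three words, the bonus earned by the failed match for hot word A has been taken back and only the completed match for hot word B remains, contributing its two bonuses to the score.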

In practical applications, the path probability may be boosted according to how well the decoded word matches the hot word segments. For example, when decoding produces a new word, the system-preset weight of the matching hot word segment may be obtained and used to boost the path. In particular, the boosting process can be simplified by setting weights only for the first and last characters of the entries in the hot word segmentation dictionary and setting the weights of all other segments to 0. For example, suppose the current hot word dictionary is as shown in Figure 8 and consists of "中国" (China), "中华" (China), "人民" (the people) and "人们" (people). During decoding, if "中" or "人" appears on the historical path of an active node, that is, if an initial character of the hot word dictionary is matched, the historical path is given the score boost corresponding to the hot word segment "中" or "人"; in other words, the bonus corresponding to that initial character is added to the accumulated historical path probability of the historical path. Then, when the path containing the hot word segment is extended, if "国" or "华", or "民" or "们", appears on the extended path so that the hot word, or part of a hot word, "中国", "中华", "人民" or "人们" is formed, the path is further given the score boost corresponding to "国", "华", "民" or "们". Otherwise, when the path containing the hot word segment extends to a non-hot word, the path is not boosted, or the boost added earlier is removed.
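As an illustration of the simplified weighting in which only the first and last segments of a hot word carry a weight, a weight table might be built as in the sketch below. The function name `head_tail_weights` and the default weight values are assumptions chosen for the example and do not come from the embodiment.

```python
def head_tail_weights(hot_words, head=1.0, tail=1.0):
    """Map (hot_word, segment_index) to a boost weight.

    Only the first and last segments get a non-zero weight; every other
    segment is set to 0, as in the simplified scheme.
    """
    weights = {}
    for segments in hot_words:
        last = len(segments) - 1
        for i in range(len(segments)):
            if i == 0:
                weights[(segments, i)] = head
            elif i == last:
                weights[(segments, i)] = tail
            else:
                weights[(segments, i)] = 0.0
    return weights

hot_words = [("Word1", "Word2", "Word3"), ("Word2", "Word4")]
weights = head_tail_weights(hot_words, head=1.0, tail=2.0)
```

During decoding, the weight looked up for the matched segment would then be used as the bonus added to the path.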

It can be seen that the speech recognition method of this embodiment uses a boosting method based on predicted paths: by boosting step by step, it continually raises the priority of the path on which a hot word lies, thereby optimizing that path and improving hot word recognition.

Correspondingly, an embodiment of the present invention further provides a speech recognition system. Figure 9 is a schematic structural diagram of this system.

In this embodiment, the system comprises:

A network construction unit 901, configured to build a decoding recognition network;

A decoding unit 902, configured to decode each voice signal frame of a received voice signal according to the decoding recognition network;

An excitation unit 903, configured to boost the historical paths of active nodes according to hot words during decoding by the decoding unit 902, so as to raise the accumulated historical path probability of the paths on which hot words lie;

An optimal node determination unit 904, configured to select, after the decoding unit 902 finishes decoding the last voice signal frame, the active node with the maximum cumulative probability as the optimal node;

A backtracking unit 905, configured to trace back from the optimal node through the decoding states to obtain the optimal path and the corresponding word sequence.

In the embodiment of the present invention, the network construction unit 901 may build the decoding recognition network online, or may build it offline. In the offline case, the already built decoding recognition network is loaded directly at system start-up, which reduces the system's computation and memory requirements and further improves decoding efficiency. The network construction unit 901 may build the network from a preset acoustic model, a language model and the like, which is not described in detail here.

In the embodiment of the present invention, the process by which the decoding unit 902 decodes the voice signal input by the user using the decoding recognition network is a process of computing, for each voice signal frame, the accumulated historical path probability of reaching each active node in the decoding recognition network. After the decoding unit 902 decodes each voice signal frame, the historical paths and accumulated historical path probabilities of all current active nodes are obtained. The excitation unit 903 may boost the historical paths of active nodes according to hot words in a number of different ways so as to raise the accumulated historical path probability of the paths on which hot words lie, as described in detail later.

It should be noted that, in practical applications, before the decoding unit 902 decodes the next voice signal frame, the historical paths of the active nodes may also be optimized by deleting unlikely paths to improve the efficiency of the subsequent search. Specifically, a probability-based threshold pruning method may be used: first find the maximum accumulated historical path probability among the current active nodes; then compute, for each active node, the difference between its accumulated historical path probability and this maximum; any active node whose difference exceeds a preset pruning threshold is set as inactive and pruned from the subsequent search paths, which terminates any subsequent search starting from that node.
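The threshold pruning just described can be sketched as follows. The sketch assumes that each active node carries a single score (for example a log probability) stored in a dictionary; the names `prune_active_nodes` and `beam` are illustrative only.

```python
def prune_active_nodes(active, beam):
    """Keep only nodes whose score is within `beam` of the best score.

    active: dict mapping node id -> accumulated path score (assumed layout).
    beam:   preset pruning threshold on the difference to the best score.
    """
    best = max(active.values())
    # A node is pruned when its gap to the best score exceeds the threshold.
    return {node: score for node, score in active.items()
            if best - score <= beam}

active = {"n1": -10.0, "n2": -12.5, "n3": -30.0}
survivors = prune_active_nodes(active, beam=15.0)
```

Because pruning compares each node with the current best score, a path on which a hot word lies has a better chance of surviving if it has already received a bonus before pruning takes place.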

It can thus be seen that the speech recognition system of the embodiment of the present invention uses historical path boosting based on hot word matching to optimize the accumulated historical path probability of the paths on which hot words lie, which improves hot word recognition. With this system, no re-estimation of system parameters is needed, and hot words and user-specific words can be recognized quickly and accurately. This offers a feasible way for the system to support customized personal dictionaries or personalized language models: the user can obtain recognition support for personalized vocabulary simply by updating the hot word dictionary supported by the system.

In the embodiment of the present invention, the user is allowed to define hot words at any time, so as to extend the recognition range of the system and adapt to a constantly changing vocabulary.

Figure 10 is a schematic diagram of a specific application structure of the speech recognition system of the embodiment of the present invention.

In this embodiment, the system further comprises a hot word set updating unit 911, configured to obtain hot words entered by the user and save them to a hot word set. The hot word set may be a dynamically updated library of hot word entries.

Correspondingly, the excitation unit 913 boosts the historical paths of active nodes according to the hot word set during decoding.

In this embodiment, the excitation unit 913 specifically comprises an obtaining subunit 9131, a judgment subunit 9132 and an excitation subunit 9133, wherein:

The obtaining subunit 9131 is configured to obtain the historical paths and accumulated historical path probabilities of all active nodes obtained when the decoding unit 702 decodes the current voice signal frame;

The judgment subunit 9132 is configured to judge, according to the hot word set, whether adjacent words on the historical path form a hot word;

The excitation subunit 9133 is configured to raise the accumulated historical path probability of the historical path after the judgment subunit 9132 judges that adjacent words on the historical path form a hot word, and to keep the accumulated historical path probability of the historical path unchanged after the judgment subunit 9132 judges that they do not. When boosting, a fixed bonus may be used to raise the priority of the path on which the hot word lies, so that this path is more likely to be retained. The specific value of the fixed bonus may be preset as needed; in general, the higher the bonus, the higher the degree of hot word matching, and the higher the hot word recognition accuracy correspondingly.
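A minimal sketch of this full-match boosting with a fixed bonus is given below. It assumes that a path history is a list of words, that hot words are stored as space-joined strings in a set, and that the function is called once each time a new word is appended to the path; the name `boost_on_full_match` and the bonus value are hypothetical.

```python
def boost_on_full_match(history, score, hot_word_set, bonus=2.0):
    """Return the path score, raised by `bonus` if the most recent adjacent
    words on the path form a hot word; otherwise return it unchanged."""
    for n in range(1, len(history) + 1):
        # Check every run of adjacent words that ends at the newest word.
        if " ".join(history[-n:]) in hot_word_set:
            return score + bonus
    return score

hot_word_set = {"hot word"}
boosted = boost_on_full_match(["a", "hot", "word"], -5.0, hot_word_set)
unchanged = boost_on_full_match(["a", "hot"], -5.0, hot_word_set)
```

In this scheme the bonus is only granted once the whole hot word has appeared on the path, so a path that is pruned before the last word arrives never receives it.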

The speech recognition system of this embodiment applies an appropriate score boost when the corresponding word sequence on the historical path of an active node forms a hot word entry, thereby optimizing the path on which the hot word lies and improving hot word recognition.

In practical applications, because a hot word often consists of two or more word segments, the above approach of boosting only when a hot word entry is fully matched may fail to achieve its purpose, since the path on which the hot word entry lies may be pruned too early, and this affects the accuracy of hot word recognition. For this reason, in another embodiment of the speech recognition system of the present invention, the excitation unit may instead use a boosting method based on predicted paths, continually raising the priority of the path on which a hot word lies by boosting step by step.

Figure 11 is a schematic diagram of another specific application structure of the speech recognition system of the embodiment of the present invention.

In this embodiment, the system further comprises a hot word obtaining unit 921 and a segmentation unit 922, wherein the hot word obtaining unit 921 is configured to obtain hot words entered by the user, and the segmentation unit 922 is configured to perform text word segmentation on the hot words obtained by the hot word obtaining unit and to save the resulting word segments to a hot word dictionary.

Correspondingly, the excitation unit 923 boosts the historical paths of active nodes according to the hot word dictionary during decoding.

In this embodiment, the excitation unit 923 specifically comprises an obtaining subunit 9231, a new word judgment subunit 9232 and an excitation subunit 9233, wherein:

The obtaining subunit 9231 is configured to obtain the historical paths and accumulated historical path probabilities of all active nodes obtained by decoding the current voice signal frame;

The new word judgment subunit 9232 is configured to judge whether a new word appears on the historical path of the active node;

The excitation subunit 9233 is configured to update, after the new word judgment subunit 9232 judges that a new word appears on the historical path of the active node, the accumulated historical path probability of the historical path according to the extension result of the new word relative to the corresponding hot word dictionary segment on the historical path, and otherwise to keep the accumulated historical path probability of the historical path unchanged.

In this embodiment, unlike the excitation subunit 9133 in Figure 10 above, which boosts a historical path only when a hot word is fully matched, the excitation subunit 9233 boosts the historical path during decoding as soon as a hot word is partially matched; that is, it updates the accumulated historical path probability of the historical path according to how the historical path is extended. Raising the accumulated path probability in advance better ensures that the path on which the hot word lies survives. The excitation subunit 9233 may be implemented in several ways, which are described in detail by example below.

Figure 12 is a schematic diagram of a specific structure of the excitation subunit in the embodiment of the present invention.

In this embodiment, the excitation subunit comprises:

A first judgment subunit 121, configured to judge whether the new word is a subsequent word segment of the corresponding hot word segment on the historical path;

A first processing subunit 122, configured to add the bonus corresponding to the subsequent word segment to the accumulated historical path probability of the historical path after the first judgment subunit 121 judges that the new word is a subsequent word segment of the corresponding hot word segment on the historical path;

A second judgment subunit 123, configured to judge whether the new word is an initial word segment in the hot word dictionary after the first judgment subunit 121 judges that the new word is not a subsequent word segment of the corresponding hot word segment on the historical path;

A second processing subunit 124, configured to add the bonus corresponding to the initial word segment to the accumulated historical path probability of the historical path after the second judgment subunit 123 judges that the new word is an initial word segment in the hot word dictionary, and otherwise to keep the accumulated historical path probability of the historical path unchanged.
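The decision order of the Figure 12 structure (first check for a subsequent segment, then for an initial segment, otherwise keep the score) can be sketched as below. The representation of the per-path match state as a `(segments, matched)` pair, the per-segment `bonus` table and the name `update_fig12` are assumptions made for the example.

```python
def update_fig12(score, pending, new_word, hot_dict, bonus):
    """One update of a path when new_word appears on it.

    pending:  (segments, matched) for the hot word being matched, or None.
    hot_dict: list of hot words, each a tuple of word segments.
    bonus:    dict mapping a word segment to its bonus value.
    Returns (new_score, new_pending).
    """
    # First judgment: is new_word the next segment of the pending hot word?
    if pending is not None:
        segments, matched = pending
        if matched < len(segments) and segments[matched] == new_word:
            return score + bonus[new_word], (segments, matched + 1)
    # Second judgment: is new_word an initial segment of some hot word?
    for segments in hot_dict:
        if segments[0] == new_word:
            return score + bonus[new_word], (segments, 1)
    # Neither: keep the accumulated score unchanged.
    return score, None

hot_dict = [("A", "B")]
bonus = {"A": 1.0, "B": 2.0}
score, pending = update_fig12(0.0, None, "A", hot_dict, bonus)
score, pending = update_fig12(score, pending, "B", hot_dict, bonus)
```

Note that in this variant a bonus that has been granted is never taken back, even if the hot word is not completed.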

Figure 13 is a schematic diagram of another specific structure of the excitation subunit in the embodiment of the present invention.

In this embodiment, the excitation subunit comprises:

A hot word judgment subunit 131, configured to judge whether the word sequence preceding the new word on the historical path on which the new word lies is a complete hot word in the hot word dictionary;

A third judgment subunit 132, configured to judge whether the new word is an initial word segment in the hot word dictionary when the hot word judgment subunit 131 judges that the word sequence preceding the new word on the historical path is a complete hot word in the hot word dictionary;

A third processing subunit 133, configured to add the bonus corresponding to the initial word segment to the accumulated historical path probability of the historical path after the third judgment subunit 132 judges that the new word is an initial word segment in the hot word dictionary, and otherwise to keep the accumulated historical path probability of the historical path unchanged;

A fourth judgment subunit 134, configured to judge whether the new word is a subsequent word segment of the corresponding hot word segment on the historical path when the hot word judgment subunit 131 judges that the word sequence preceding the new word on the historical path is not a complete hot word in the hot word dictionary;

A fourth processing subunit 135, configured to add the bonus corresponding to the subsequent word segment to the accumulated historical path probability of the historical path after the fourth judgment subunit 134 judges that the new word is a subsequent word segment of the corresponding hot word segment on the historical path, and otherwise to revoke the bonus previously added to the accumulated historical path probability of the historical path.
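The Figure 13 structure differs from Figure 12 in that a bonus granted for a partial match is revoked when the match fails. A sketch under stated assumptions follows: the per-path state is a hypothetical `(segments, matched, granted)` triple, and a path with no match in progress is treated in the same way as a path whose preceding words form a complete hot word, which is a simplification made for the example rather than something the text specifies.

```python
def update_fig13(score, state, new_word, hot_dict, bonus):
    """One update of a path when new_word appears on it.

    state: (segments, matched, granted) for the hot word being matched,
           where granted is the total bonus added so far, or None.
    Returns (new_score, new_state).
    """
    # Hot word judgment: do the preceding words form a complete hot word?
    # Assumption: "no match in progress" takes the same branch.
    complete = state is None or state[1] == len(state[0])
    if complete:
        # Third judgment: is new_word an initial segment of a hot word?
        for segments in hot_dict:
            if segments[0] == new_word:
                b = bonus[new_word]
                return score + b, (segments, 1, b)
        return score, None
    segments, matched, granted = state
    # Fourth judgment: is new_word the next segment of the partial match?
    if segments[matched] == new_word:
        b = bonus[new_word]
        return score + b, (segments, matched + 1, granted + b)
    # The partial match failed: revoke the bonus added for it so far.
    return score - granted, None

hot_dict = [("A", "B")]
bonus = {"A": 1.0, "B": 2.0}
score, state = update_fig13(0.0, None, "A", hot_dict, bonus)
score, state = update_fig13(score, state, "X", hot_dict, bonus)
```

In the usage above the path first gains the bonus for "A" and then loses it again when "X" shows that the hot word cannot be completed, so the score returns to its starting value.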

Of course, in practical applications the excitation subunit may also be implemented in other ways, which the embodiment of the present invention does not limit. For example, multiple hot word segment match histories may be kept for the historical path of a single active node; later in decoding, when it is determined that a hot word segment match can no longer be extended, the bonus that this match previously added to the accumulated historical path probability of the historical path is revoked, which makes hot word matching still more reasonable.

The speech recognition system of this embodiment uses a boosting method based on predicted paths: by boosting step by step, it continually raises the priority of the path on which a hot word lies, thereby optimizing that path and improving hot word recognition.

The embodiments in this specification are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiments are basically similar to the method embodiments, they are described relatively briefly, and the relevant parts may be found in the description of the method embodiments. The system embodiments described above are merely illustrative; the units and modules described as separate components may or may not be physically separate. In addition, some or all of the units and modules may be selected according to actual needs to achieve the purpose of the solution of an embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.

The above discloses only preferred embodiments of the present invention, but the present invention is not limited thereto. Any non-inventive variation that those skilled in the art can conceive of, and any improvements and refinements made without departing from the principle of the present invention, shall fall within the protection scope of the present invention.

Claims

1. A speech recognition method, characterized by comprising:

building a decoding recognition network;

2. The method according to claim 1, characterized in that the method further comprises: obtaining a hot word entered by a user, and saving the hot word to a hot word set;

wherein boosting the historical paths of active nodes according to hot words during decoding comprises: during decoding, boosting the historical paths of active nodes according to the hot word set.

3. The method according to claim 2, characterized in that boosting the historical paths of active nodes according to the hot word set during decoding comprises:

obtaining the historical paths and accumulated historical path probabilities of all active nodes obtained by decoding the current voice signal frame;

judging, according to the hot word set, whether adjacent words on the historical path form a hot word;

if so, raising the accumulated historical path probability of the historical path;

if not, keeping the accumulated historical path probability of the historical path unchanged.

4. The method according to claim 1, characterized in that the method further comprises: obtaining a hot word entered by a user, performing text word segmentation on the hot word, and saving the resulting word segments to a hot word dictionary;

wherein boosting the historical paths of active nodes according to hot words during decoding comprises: during decoding, boosting the historical paths of active nodes according to the hot word dictionary.

5. The method according to claim 4, characterized in that boosting the historical paths of active nodes according to the hot word dictionary during decoding comprises:

judging whether a new word appears on the historical path of the active node;

if so, updating the accumulated historical path probability of the historical path according to the extension result of the new word relative to the corresponding hot word dictionary segment on the historical path;

6. The method according to claim 5, characterized in that updating the accumulated historical path probability of the historical path according to the extension result of the new word relative to the corresponding hot word dictionary segment on the historical path comprises:

judging whether the new word is a subsequent word segment of the corresponding hot word segment on the historical path;

if so, adding the bonus corresponding to the subsequent word segment to the accumulated historical path probability of the historical path;

if not, judging whether the new word is an initial word segment in the hot word dictionary;

if so, adding the bonus corresponding to the initial word segment to the accumulated historical path probability of the historical path.

7. The method according to claim 5, characterized in that updating the accumulated historical path probability of the historical path according to the extension result of the new word relative to the corresponding hot word dictionary segment on the historical path comprises:

if the word sequence preceding the new word on the historical path on which the new word lies is a complete hot word in the hot word dictionary, judging whether the new word is an initial word segment in the hot word dictionary;

if so, adding the bonus corresponding to the initial word segment to the accumulated historical path probability of the historical path;

if not, keeping the accumulated historical path probability of the historical path unchanged;

if the word sequence preceding the new word on the historical path on which the new word lies is not a complete hot word in the hot word dictionary, judging whether the new word is a subsequent word segment of the corresponding hot word segment on the historical path;

if not, revoking the bonus previously added to the accumulated historical path probability of the historical path.

8. A speech recognition system, characterized by comprising:

9. The system according to claim 8, characterized in that the system further comprises:

a hot word set updating unit, configured to obtain a hot word entered by a user and save the hot word to a hot word set;

wherein the excitation unit boosts the historical paths of active nodes according to the hot word set during decoding by the decoding unit.

10. The system according to claim 9, characterized in that the excitation unit comprises:

an obtaining subunit, configured to obtain the historical paths and accumulated historical path probabilities of all active nodes obtained by decoding the current voice signal frame;

a judgment subunit, configured to judge, according to the hot word set, whether adjacent words on the historical path form a hot word;

an excitation subunit, configured to raise the accumulated historical path probability of the historical path after the judgment subunit judges that adjacent words on the historical path form a hot word, and to keep the accumulated historical path probability of the historical path unchanged after the judgment subunit judges that adjacent words on the historical path do not form a hot word.

11. The system according to claim 8, characterized in that the system further comprises:

a hot word obtaining unit, configured to obtain a hot word entered by a user;

a segmentation unit, configured to perform text word segmentation on the hot word obtained by the hot word obtaining unit and save the resulting word segments to a hot word dictionary;

wherein the excitation unit boosts the historical paths of active nodes according to the hot word dictionary during decoding by the decoding unit.

12. The system according to claim 11, characterized in that the excitation unit comprises:

a new word judgment subunit, configured to judge whether a new word appears on the historical path of the active node;

an excitation subunit, configured to update, after the new word judgment subunit judges that a new word appears on the historical path of the active node, the accumulated historical path probability of the historical path according to the extension result of the new word relative to the corresponding hot word dictionary segment on the historical path, and otherwise to keep the accumulated historical path probability of the historical path unchanged.

13. The system according to claim 12, characterized in that the excitation subunit comprises:

a first judgment subunit, configured to judge whether the new word is a subsequent word segment of the corresponding hot word segment on the historical path;

a first processing subunit, configured to add the bonus corresponding to the subsequent word segment to the accumulated historical path probability of the historical path after the first judgment subunit judges that the new word is a subsequent word segment of the corresponding hot word segment on the historical path;

a second judgment subunit, configured to judge whether the new word is an initial word segment in the hot word dictionary after the first judgment subunit judges that the new word is not a subsequent word segment of the corresponding hot word segment on the historical path;

a second processing subunit, configured to add the bonus corresponding to the initial word segment to the accumulated historical path probability of the historical path after the second judgment subunit judges that the new word is an initial word segment in the hot word dictionary, and otherwise to keep the accumulated historical path probability of the historical path unchanged.

14. The system according to claim 12, characterized in that the excitation subunit comprises:

a hot word judgment subunit, configured to judge whether the word sequence preceding the new word on the historical path on which the new word lies is a complete hot word in the hot word dictionary;

a third judgment subunit, configured to judge whether the new word is an initial word segment in the hot word dictionary when the hot word judgment subunit judges that the word sequence preceding the new word on the historical path is a complete hot word in the hot word dictionary;

a third processing subunit, configured to add the bonus corresponding to the initial word segment to the accumulated historical path probability of the historical path after the third judgment subunit judges that the new word is an initial word segment in the hot word dictionary, and otherwise to keep the accumulated historical path probability of the historical path unchanged;

a fourth judgment subunit, configured to judge whether the new word is a subsequent word segment of the corresponding hot word segment on the historical path when the hot word judgment subunit judges that the word sequence preceding the new word on the historical path is not a complete hot word in the hot word dictionary;

a fourth processing subunit, configured to add the bonus corresponding to the subsequent word segment to the accumulated historical path probability of the historical path after the fourth judgment subunit judges that the new word is a subsequent word segment of the corresponding hot word segment on the historical path, and otherwise to revoke the bonus previously added to the accumulated historical path probability of the historical path.