WO2014101717A1 - Speech recognition method and system for user personalized information (用户个性化信息语音识别方法及系统) - Google Patents

Speech recognition method and system for user personalized information (用户个性化信息语音识别方法及系统)

Info

Publication number
WO2014101717A1
WO2014101717A1 (PCT/CN2013/090037)
Authority
WO
WIPO (PCT)
Prior art keywords
name
language model
user
basic
decoding network
Prior art date
Application number
PCT/CN2013/090037
Other languages
English (en)
French (fr)
Inventor
潘青华
何婷婷
胡国平
胡郁
刘庆峰
Original Assignee
安徽科大讯飞信息科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 安徽科大讯飞信息科技股份有限公司
Priority to EP13869206.6A (EP2940684B1)
Priority to US14/655,946 (US9564127B2)
Publication of WO2014101717A1


Classifications

    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/083 Recognition networks
    • G06F 3/16 Sound input; sound output
    • G06F 3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G10L 15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models (HMMs)
    • G10L 15/183 Speech classification or search using natural language modelling, using context dependencies, e.g. language models
    • G10L 19/00 Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 15/187 Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • G10L 15/197 Probabilistic grammars, e.g. word n-grams
    • G10L 2015/226 Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics
    • G10L 2015/227 Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics of the speaker; human-factor methodology

Definitions

  • The present invention relates to the field of speech recognition technology, and in particular to a user personalized information speech recognition method and system.
  • The contact names in a user's personalized address book may also include some uncommon names; that is, each personalized name list cannot be uniformly covered by the training corpus.
  • The language models used for continuous speech recognition in the prior art therefore cannot model person name words well, especially users' personalized contact names, and name recognition accuracy is often significantly lower than that of other content.
  • How to improve the recognition accuracy of user personalized information, especially person name information, in continuous speech recognition has thus become an urgent problem for continuous speech recognition systems.
  • The invention provides a user personalized information speech recognition method and system to improve the recognition accuracy of user personalized information in continuous speech recognition.
  • An embodiment of the present invention provides a speech recognition method for user personalized information, including: receiving a speech signal; decoding the speech signal frame by frame according to a basic static decoding network to obtain the decoding paths on the active nodes of the basic static decoding network, where the basic static decoding network is a decoding network related to a basic person name language model; if it is determined that a decoding path enters a person name node in the basic static decoding network, performing network extension on the person name node according to the user's auxiliary static decoding network, where the auxiliary static decoding network is a decoding network related to a specific user name language model; and returning the recognition result after decoding is completed.
  • Preferably, the method further includes: determining the user's auxiliary static decoding network before decoding the speech signal according to the basic static decoding network; or determining the user's auxiliary static decoding network after it is determined that a decoding path enters a person name node in the basic static decoding network.
  • Preferably, determining the user's auxiliary static decoding network includes: determining the user identity according to features of the speech signal and then determining the user's auxiliary static decoding network according to the user identity; or determining the user identity according to the user's device code or account and then determining the user's auxiliary static decoding network according to the user identity.
  • Preferably, the method further includes: building a basic person name language model and a specific user name language model; and building, respectively, a basic static decoding network related to the basic person name language model and an auxiliary static decoding network related to the specific user name language model.
  • Preferably, building the basic person name language model includes: separately collecting a person name database and a language model training corpus; performing statistics on regular words and on the relationships between regular words and person name words according to the person name database and the training corpus; and generating the basic person name language model from the statistical results.
  • Preferably, performing statistics on regular words and on the relationships between regular words and person name words according to the person name database and the training corpus includes: performing person name detection in the training corpus according to the names in the person name database; replacing all specific person names in the training corpus with a unified virtual person name; and performing the statistics on the replaced training corpus.
  • Preferably, building the basic static decoding network related to the basic person name language model includes: setting a virtual pronunciation for the virtual person name so that the virtual person name participates as an ordinary word in the static network expansion of the acoustic model; determining the special nodes in the expanded static network according to the virtual pronunciation, the special nodes including nodes entering a person name unit and termination nodes of a person name unit; and expanding the virtual pronunciation units on the incoming or outgoing arcs of the special nodes to obtain the basic static decoding network related to the basic person name language model.
  • Preferably, building the specific user name language model includes: extracting person names from the user-related name information and recording them as person name entries; and setting a word frequency probability for each person name entry and generating the specific user name language model according to these word frequency probabilities. Building the auxiliary static decoding network related to the specific user name language model includes: setting the pronunciations of the sentence-start word and the sentence-end word in the specific user name language model to virtual special pronunciations; and expanding the special pronunciation units on the outgoing arcs of the sentence-start node and the incoming arcs of the sentence-end node to obtain the auxiliary static decoding network related to the specific user name language model.
  • An embodiment of the present invention further provides a user personalized information speech recognition system, including: a receiving unit, configured to receive a speech signal; a decoding unit, configured to decode the speech signal according to the basic static decoding network to obtain the decoding paths on the active nodes of the basic static decoding network, where the basic static decoding network is a decoding network related to the basic person name language model; a decoding path checking unit, configured to determine whether a decoding path enters a person name node in the basic static decoding network; a network extension unit, configured to perform network extension on the person name node according to the user's auxiliary static decoding network after the decoding path checking unit determines that a decoding path has entered a person name node, where the auxiliary static decoding network is a decoding network related to the specific user name language model; and a result output unit, configured to return the recognition result after decoding is completed.
  • Preferably, the system further includes a determining unit, configured to determine the user's auxiliary static decoding network before the decoding unit decodes the speech signal according to the basic static decoding network, or to determine the user's auxiliary static decoding network after the decoding path checking unit determines that a decoding path has entered a person name node in the basic static decoding network.
  • Preferably, the determining unit is specifically configured to determine the user identity according to features of the speech signal and then determine the user's auxiliary static decoding network according to the user identity; or to determine the user identity according to the user's device code or account and then determine the user's auxiliary static decoding network according to the user identity.
  • Preferably, the system further includes: a basic person name language model building unit, configured to build the basic person name language model; a specific user name language model building unit, configured to build the specific user name language model; a basic static decoding network building unit, configured to build the basic static decoding network related to the basic person name language model; and an auxiliary static decoding network building unit, configured to build the auxiliary static decoding network related to the specific user name language model.
  • Preferably, the basic person name language model building unit includes: a name collection unit, configured to collect a person name database; a corpus collection unit, configured to collect a language model training corpus; a statistics unit, configured to perform statistics on regular words and on the relationships between regular words and person name words according to the person name database and the training corpus; and a basic person name language model generation unit, configured to generate the basic person name language model according to the statistical results obtained by the statistics unit.
  • Preferably, the statistics unit includes: a detection subunit, configured to perform person name detection in the training corpus according to the names in the person name database; a replacement subunit, configured to replace all specific person names in the training corpus with a unified virtual person name; and a statistics subunit, configured to perform the statistics on the replaced training corpus.
  • Preferably, the basic static decoding network building unit includes: a virtual pronunciation setting unit, configured to set a virtual pronunciation for the virtual person name so that the virtual person name participates as an ordinary word in the static network expansion of the acoustic model; a special node determination unit, configured to determine the special nodes in the expanded static network according to the virtual pronunciation, the special nodes including nodes entering a person name unit and termination nodes of a person name unit; and a first expansion unit, configured to expand the virtual pronunciation units on the incoming or outgoing arcs of the special nodes to obtain the basic static decoding network related to the basic person name language model.
  • Preferably, the specific user name language model building unit includes: a name extraction unit, configured to extract person names from the user-related name information and record them as person name entries; and a specific user name language model generation unit, configured to set a word frequency probability for each person name entry and generate the specific user name language model according to these probabilities. The auxiliary static decoding network building unit includes: a setting unit, configured to set the pronunciations of the sentence-start word and the sentence-end word in the specific user name language model to virtual special pronunciations; and a second expansion unit, configured to expand the special pronunciation units on the outgoing arcs of the sentence-start node and the incoming arcs of the sentence-end node to obtain the auxiliary static decoding network related to the specific user name language model.
  • With the user personalized information speech recognition method and system provided by the embodiments of the present invention, after a speech signal input by the user is received, the speech signal is decoded according to the basic static decoding network related to the basic person name language model to obtain the decoding paths on the active nodes of the network; if a decoding path is determined to enter a person name node in the basic static decoding network, the person name node is further expanded according to the user's auxiliary static decoding network related to the specific user name language model. This not only improves the recognition accuracy of personalized contact names in continuous speech recognition, but also improves the recognition accuracy of the context around contact names. Contact information is applied at multiple levels of speech recognition, so the overall recognition effect is optimized and the recognition accuracy of user personalized information in continuous speech recognition is improved.
  • FIG. 1 is a flowchart of a user personalized information speech recognition method according to an embodiment of the present invention;
  • FIG. 2 is a specific decoding flowchart in a user personalized information speech recognition method according to an embodiment of the present invention;
  • FIG. 3 is another specific decoding flowchart in a user personalized information speech recognition method according to an embodiment of the present invention
  • FIG. 4 is a flow chart of constructing a basic person name language model in an embodiment of the present invention.
  • FIG. 5 is a flowchart of constructing a language model of a specific user name in an embodiment of the present invention
  • FIG. 6 is a flow chart of constructing a basic static decoding network related to a basic person name language model in an embodiment of the present invention
  • FIG. 7 is a schematic diagram of an extension of a basic human name language model related decoding network in an embodiment of the present invention
  • FIG. 8 is a flowchart of constructing an auxiliary static decoding network related to a specific user name language model in the embodiment of the present invention
  • FIG. 9 is a schematic diagram showing the expansion of a specific user name language model related decoding network in the embodiment of the present invention.
  • FIG. 10 is a schematic structural diagram of a user personalized information voice recognition system according to an embodiment of the present invention.
  • FIG. 11 is a schematic diagram of a specific implementation structure of a user personalized information speech recognition system according to an embodiment of the present invention;
  • FIG. 12 is a schematic diagram showing another specific implementation structure of a user personalized information speech recognition system according to an embodiment of the present invention.
  • FIG. 13 is a schematic diagram showing another specific implementation structure of a user personalized information speech recognition system according to an embodiment of the present invention.
  • FIG. 14 is a schematic structural diagram of a basic person name language model building unit in a user personalized information speech recognition system according to an embodiment of the present invention.
  • FIG. 15 is a schematic structural diagram of a basic static decoding network construction unit in a user personalized information speech recognition system according to an embodiment of the present invention.
  • FIG. 16 is a schematic structural diagram of the specific user name language model building unit and the auxiliary static decoding network building unit in a user personalized information speech recognition system according to an embodiment of the present invention.
  • The embodiments of the present invention provide a user personalized information speech recognition method and system to address the problem that existing language models for continuous speech recognition cannot model person name words well, especially users' personalized contact names, and thereby to improve the recognition accuracy of user personalized information.
  • As shown in FIG. 1, the flowchart of the user personalized information speech recognition method includes: Step 101: Receive a speech signal.
  • Step 102: Decode the speech signal frame by frame according to the basic static decoding network to obtain the decoding paths on all active nodes in the basic static decoding network, where the basic static decoding network is a decoding network related to the basic person name language model.
  • the process of decoding a speech signal using the decoding network is a process of searching for an optimal path in the decoding network to implement speech-to-text conversion.
  • Specifically, the received continuous speech signal may first be sampled into a series of discrete energy values and stored in a data buffer.
  • To further improve the robustness of the system, the received continuous speech signal may also be subjected to noise reduction: by analyzing the short-time energy and short-time zero-crossing rate of the speech signal, the continuous signal is segmented into independent speech segments and non-speech segments, and the segmented speech segments are then subjected to speech enhancement, during which ambient noise in the speech signal can be further suppressed by methods such as Wiener filtering to improve the subsequent system's ability to process the signal.
  • Since the noise-reduced speech signal still contains a large amount of redundant information irrelevant to recognition, effective speech features are extracted from it and stored in a feature buffer. Specifically, MFCC (Mel-Frequency Cepstral Coefficient) features may be extracted: short-time analysis is performed on each frame of speech data with a window length of 25 ms and a frame shift of 10 ms to obtain the MFCC parameters and their first-order and second-order differences, 39 dimensions in total. That is, each frame of the speech signal is quantized into a 39-dimensional feature vector.
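For concreteness, here is a minimal sketch of the 39-dimensional front end described above (25 ms windows, 10 ms frame shift, 13 MFCCs plus first- and second-order differences), using librosa as one possible implementation; the file name and 16 kHz sample rate are assumptions.

```python
import librosa
import numpy as np

# Assumed 16 kHz mono input; 25 ms window = 400 samples, 10 ms shift = 160.
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13,
    win_length=int(0.025 * sr),  # 25 ms analysis window
    hop_length=int(0.010 * sr),  # 10 ms frame shift
)
delta1 = librosa.feature.delta(mfcc, order=1)  # first-order differences
delta2 = librosa.feature.delta(mfcc, order=2)  # second-order differences
features = np.vstack([mfcc, delta1, delta2])   # shape (39, n_frames)
# Each frame of the speech signal is thus quantized into a 39-dim vector.
```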
  • Then, each frame of the speech signal is decoded according to the basic static decoding network, and the decoding paths of the speech signal on all active nodes in the network are obtained. In practice, decoding may also be performed after multiple frames of the speech signal have been received, which is not limited in this embodiment of the present invention.
  • Step 103: If it is determined that a decoding path enters a person name node in the basic static decoding network, perform network extension on the person name node according to the user's auxiliary static decoding network, where the auxiliary static decoding network is a decoding network related to the specific user name language model.
  • In the prior art, the search for decoding paths proceeds as follows: in chronological order from left to right, the cumulative historical path probability of each speech frame arriving at each active node of the decoding network is computed. Specifically, for each speech frame under consideration, the historical paths and cumulative historical path probabilities of all active nodes in the current decoding network relative to that frame are computed first; then the next speech frame is acquired, and decoding is extended from the historical paths that satisfy the system's preset conditions.
  • Since the decoding network in the embodiment of the present invention is the basic static decoding network related to the basic person name language model, when a decoding path enters a person name node in the basic static decoding network, network extension is performed on that node according to the user's auxiliary static decoding network. Because the auxiliary static decoding network is a decoding network related to the specific user name language model, building and applying the user's personalized entries, and in particular applying the user's personalized contact information, effectively improves the recognition accuracy of user personalized information.
  • Step 104: After decoding is completed, return the recognition result.
  • After the last frame of the speech signal is decoded, the active node with the largest cumulative historical path probability is the optimal node; the historical path obtained by backtracking through the decoding states from the optimal node is the optimal path, and the word sequence on the optimal path is the decoding result.
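The frame-synchronous search and final backtracking described in steps 102 to 104 can be sketched as follows; the token and arc interfaces and the beam width are illustrative assumptions rather than the patent's data structures.

```python
def decode(frames, network, start_node, beam=10.0):
    # tokens: active node -> (cumulative log-probability, word history)
    tokens = {start_node: (0.0, [])}
    for feat in frames:                        # process the signal frame by frame
        new_tokens = {}
        for node, (logp, hist) in tokens.items():
            for arc in network.arcs(node):     # arc has .dst, .word, .acoustic_logp()
                score = logp + arc.acoustic_logp(feat)
                new_hist = hist + [arc.word] if arc.word else hist
                if arc.dst not in new_tokens or score > new_tokens[arc.dst][0]:
                    new_tokens[arc.dst] = (score, new_hist)
        best = max(lp for lp, _ in new_tokens.values())
        # Beam pruning: keep only active nodes near the current best path.
        tokens = {n: t for n, t in new_tokens.items() if t[0] > best - beam}
    # The node with the largest cumulative probability is the optimal node;
    # its word history is the word sequence on the optimal path.
    _, (_, hist) = max(tokens.items(), key=lambda kv: kv[1][0])
    return hist
```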
  • With this user personalized information speech recognition method, after a speech signal is received, it is decoded frame by frame according to the basic static decoding network related to the basic person name language model to obtain the decoding paths on all active nodes of the network; if a decoding path is determined to enter a person name node in the basic static decoding network, the person name node is further expanded according to the user's auxiliary static decoding network related to the specific user name language model. This not only improves the recognition accuracy of personalized contact names in continuous speech recognition, but also improves the recognition accuracy of the context around contact names. Contact information is applied at multiple levels of speech recognition, so the overall recognition effect is optimized and the recognition accuracy of user personalized information in continuous speech recognition is improved.
  • It should be noted that, in practice, the above basic static decoding network and auxiliary static decoding network may be built online by the system, or built offline and loaded directly at system startup to reduce computation and memory requirements and further improve decoding efficiency.
  • The basic static decoding network is a decoding network related to the basic person name language model, and the auxiliary static decoding network is a decoding network related to the specific user name language model. The construction of the name-related language models and the related decoding networks in the embodiments of the present invention is described in detail below.
  • Two models are built: a basic person name language model describing the statistical probabilities between common words and between common words and person names, and a user-specific language model describing the statistical probabilities of the specific person names related to a user. The basic person name language model describes the statistical probabilities between common words and between common words and names; the specific user name language model describes the statistical probabilities of the specific person names related to that user.
  • Step 201: Collect a person name database. Specifically, a large-scale person name database can be collected to achieve effective coverage of common names.
  • Step 202: Collect a language model training corpus.
  • Step 203: Perform statistics on regular words and on the relationships between regular words and person name words according to the person name database and the training corpus. Specifically, person name detection may be performed in the training corpus according to the names in the person name database, for example using a traditional name detection algorithm; all specific person names in the training corpus are then replaced with a unified virtual person name, and the statistics on regular words and on the relationships between regular words and person name words are computed on the replaced corpus.
  • Step 204: Generate the basic person name language model according to the statistical results.
  • Through the generalized extraction of person name words, the basic person name language model captures the statistical probabilities between name-attribute words and regular words, and supports whole-word recognition of person names.
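The replacement-and-statistics procedure of steps 203 and 204 can be sketched as follows; name detection is reduced here to dictionary lookup against the name database, a deliberate simplification of the patent's name detection algorithm, and the sentence markers are conventional assumptions.

```python
from collections import Counter

VIRTUAL_NAME = "<NAME>"  # the unified virtual person name

def replace_names(tokens, name_set):
    # Replace every detected person name with the unified virtual name.
    return [VIRTUAL_NAME if t in name_set else t for t in tokens]

def bigram_counts(corpus_sentences, name_set):
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus_sentences:
        toks = ["<s>"] + replace_names(sent, name_set) + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

# The statistics now relate regular words to the name token itself:
names = {"张伟", "王芳"}
sents = [["请", "给", "张伟", "打", "电话"], ["王芳", "在", "开会"]]
uni, bi = bigram_counts(sents, names)
print(bi[("给", VIRTUAL_NAME)])  # counts "给 <NAME>", not "给 张伟"
```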
  • Further, a user-related person name language model, that is, the specific user name language model described above, may be built according to user requirements. For example, after contact information uploaded by the user is received, the person names may be extracted from that contact information.
  • The flowchart for building the specific user name language model in the embodiment of the present invention includes: Step 301: Extract person names from the user-related name information and record them as person name entries. The name-related information may be, for example, an address book.
  • Step 302: Set a word frequency probability for each person name entry. In the simplest case, the word frequency probabilities of all person name entries may be set equal; alternatively, they may be set according to statistics of person name words in a massive corpus, or further adjusted according to the user's history records, and subsequent updates are allowed.
  • Step 303: Generate the specific user name language model according to the word frequency probabilities of the person name entries.
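Steps 301 to 303 can be sketched as follows, assuming equal word frequency probabilities in the simplest case and an optional weighting by the user's history; the function and data shapes are illustrative.

```python
from collections import Counter

def build_user_name_lm(contacts, history=None):
    # contacts: person name entries extracted from, e.g., the address book.
    # history: optional Counter of how often each name occurred for this user.
    if history:
        total = sum(history[name] + 1 for name in contacts)  # add-one smoothing
        return {name: (history[name] + 1) / total for name in contacts}
    return {name: 1.0 / len(contacts) for name in contacts}  # equal probabilities

lm = build_user_name_lm(["张伟", "王芳", "李娜"], history=Counter({"张伟": 8}))
```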
  • In practice, preset models such as the dictionary and the acoustic model may be combined with the language models built above (that is, the basic person name language model and the specific user name language model) to obtain the corresponding static decoding-search networks. Specifically, a low-order acoustic model, such as a monophone model, may be selected, and the decoding network is formed by expanding the acoustic units of the words in the language model. To improve the discriminability between different pronunciation units, a higher-order acoustic model, such as a biphone (diphone) or triphone model, may also be selected.
  • As described above, all specific person names in the training corpus may first be replaced with a unified virtual person name, and statistics are then collected to build the basic person name language model. That is, the basic person name language model contains a virtual person name unit whose specific pronunciation cannot be determined before decoding.
  • To this end, an embodiment of the present invention further provides a network extension method based on the acoustic model to build the static decoding network related to the basic person name language model. As shown in FIG. 4, the flowchart of building the basic static decoding network related to the basic person name language model in the embodiment of the present invention includes the following steps:
  • Step 401: Set a virtual pronunciation for the virtual person name so that the virtual person name participates as an ordinary word in the static network expansion of the acoustic model.
  • Step 402: Determine the special nodes in the expanded static network, where the special nodes include nodes entering a person name unit and termination nodes of a person name unit.
  • Specifically, the virtual pronunciation is denoted $C. Taking a triphone acoustic model as an example, the expanded network mainly contains three types of nodes: regular nodes (node A) and two types of special nodes (node S and node E). Here $C represents the pronunciation unit of the virtual person name, referred to as the virtual pronunciation unit for convenience of description.
  • Node A is a regular node: the triphone models on the arcs entering and leaving node A can be determined in advance.
  • Node S is a special node whose outgoing arc is a person name unit, that is, a node entering the person name unit. Because the specific person name on that arc is uncertain, the right-context expansion of the triphone models on the arcs entering node S is undetermined, as shown in the figure by x-b+$C and y-b+$C.
  • Node E is a special node whose incoming arc is a person name unit, that is, the termination node of the person name unit. The left context of the triphone models on the arcs leaving node E is likewise undetermined, as shown in the figure by $C-a+x and $C-a+y.
  • Step 403: Expand the virtual pronunciation units on the incoming or outgoing arcs of the special nodes to obtain the basic static decoding network related to the basic person name language model.
  • Specifically, $C is replaced with all possible phone units; correspondingly, the arc x-b+$C expands into a collection of triphone models, including x-b+a, x-b+b, and so on. In practice, the expansion may also be restricted according to the triphone combinations that actually occur with the x-b context.
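Step 403 can be sketched as follows: each arc whose triphone carries the virtual pronunciation $C as context is expanded over all possible phone units; the phone inventory and arc representation are illustrative assumptions.

```python
PHONES = ["a", "b", "x", "y"]  # stand-in for the full phone inventory

def expand_virtual_arc(arc):
    # arc: (src, dst, triphone), triphones written "left-center+right"
    src, dst, tri = arc
    left, rest = tri.split("-")
    center, right = rest.split("+")
    if right == "$C":  # undetermined right context at a name-entry node S
        return [(src, dst, f"{left}-{center}+{p}") for p in PHONES]
    if left == "$C":   # undetermined left context at a name-termination node E
        return [(src, dst, f"{p}-{center}+{right}") for p in PHONES]
    return [arc]       # regular arcs are left unchanged

print(expand_virtual_arc(("A", "S", "x-b+$C")))
# -> x-b+a, x-b+b, x-b+x, x-b+y
```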
  • As shown in FIG. 6, the flowchart of building the auxiliary static decoding network related to the specific user name language model in the embodiment of the present invention includes the following steps:
  • Step 601: Set the pronunciations of the sentence-start word and the sentence-end word in the specific user name language model to virtual special pronunciations.
  • Specifically, a language model usually contains two special words, the sentence-start word <s> and the sentence-end word </s>, which indicate the beginning and the end of a sentence respectively. Traditionally, their pronunciation is defined as silence (sil).
  • In the embodiment of the present invention, the pronunciations of the sentence-start and sentence-end words of the specific user name language model are treated specially in order to build the triphone-expanded static network, as shown in FIG. 7. The pronunciation of the sentence-start word is set to a virtual special pronunciation $S, and the pronunciation of the sentence-end word to a virtual special pronunciation $E. Accordingly, the left context of the triphone models on the arcs leaving the sentence-start node S is undetermined, as shown by $S-a+b and $S-x+y, and the right context of the triphone models on the arcs entering the sentence-end node E is undetermined (for example, a-b+$E and x-y+$E), while the models on all other arcs remain regular triphone models.
  • Step 602: Expand the special pronunciation units on the outgoing arcs of the sentence-start node and the incoming arcs of the sentence-end node to obtain the auxiliary static decoding network related to the specific user name language model.
  • In practice, the above basic static decoding network and auxiliary static decoding network can be built offline. The auxiliary static decoding network is user-specific; that is, different users may correspond to different auxiliary static decoding networks. Therefore, in the process of recognizing a received speech signal, the auxiliary static decoding network of the current user must be loaded, and the loading timing may differ: it may be before the speech signal is decoded frame by frame according to the basic static decoding network, or after it is determined that a decoding path has entered a person name node in the basic static decoding network. Both cases are illustrated below.
  • As shown in FIG. 8, a specific decoding flowchart of the user personalized information speech recognition method in the embodiment of the present invention includes the following steps:
  • Step 801: Receive a speech signal.
  • Step 802: Preprocess the speech signal and extract acoustic features.
  • Step 803: Determine the user's auxiliary static decoding network.
  • Step 804: Decode the speech signal in the basic static network and search for the current decoding paths.
  • Step 805: Determine whether any current decoding path enters a person name node in the basic static decoding network; if so, go to step 806; otherwise, go to step 807.
  • Step 806: Perform network extension on the person name node in the basic static network according to the user's auxiliary static decoding network.
  • Specifically, the person name node in the basic static network may be replaced by the auxiliary static decoding network, or the decoding path entering the person name node may directly enter the user's auxiliary static decoding network.
  • In the latter case, when a new speech frame is received, a decoding path that has entered the user's auxiliary static decoding network searches for its subsequent decoding path within that network, and returns to the termination node of the person name node in the basic static network when it reaches the termination node of the auxiliary static decoding network.
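The second variant of step 806 (redirecting the path into the auxiliary network and returning at its termination node) can be sketched as follows; the token and network interfaces are illustrative assumptions, not the patent's data structures.

```python
def extend_at_name_node(token, name_node, aux_network, basic_network):
    # Redirect a path that reached a person name node into the user's
    # auxiliary static decoding network.
    token.position = aux_network.start_node
    token.pending_return = basic_network.termination_of(name_node)
    return token

def advance_in_aux(token, frame, aux_network):
    # Search the subsequent decoding path inside the auxiliary network ...
    token = aux_network.step(token, frame)
    # ... and rejoin the basic network at the name node's termination node
    # once the path reaches the auxiliary network's termination node.
    if token.position == aux_network.termination_node:
        token.position = token.pending_return
        token.pending_return = None
    return token
```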
  • Step 807: Determine whether the current frame is the last frame, that is, whether decoding has ended; if so, execute step 808; otherwise, go to step 804.
  • Step 808: Return the decoding result.
  • Another specific decoding flowchart includes the following steps:
  • Step 901: Receive a speech signal.
  • Step 902: Preprocess the speech signal and extract acoustic features.
  • Step 903: Decode the speech signal frame by frame in the basic static network and search for the current decoding paths.
  • Step 904: Determine whether any current decoding path enters a person name node in the basic static decoding network; if so, go to step 905; otherwise, go to step 907.
  • Step 905: Determine the user's auxiliary static decoding network.
  • Step 906: Perform network extension on the person name node in the basic static network according to the user's auxiliary static decoding network.
  • Specifically, the person name node in the basic static network may be replaced by the auxiliary static decoding network, or the decoding path entering the person name node may directly enter the user's auxiliary static decoding network. In the latter case, when a new speech frame is received, a decoding path that has entered the user's auxiliary static decoding network searches for its subsequent decoding path within that network, and returns to the termination node of the person name node in the basic static network when it reaches the termination node of the auxiliary static decoding network.
  • Step 907: Determine whether the current frame is the last frame, that is, whether decoding has ended; if so, execute step 908; otherwise, go to step 903.
  • Step 908: Return the decoding result.
  • In this case, the user identity may be determined according to the user's device code or account, and the auxiliary static decoding network is then determined according to the user identity.
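A minimal sketch of this lookup, with an assumed in-memory registry of prebuilt networks:

```python
AUX_NETWORKS = {}  # user_id -> offline-built auxiliary static decoding network

def resolve_user(device_code=None, account=None):
    # Identity from the account if available, otherwise from the device code.
    return account or device_code

def auxiliary_network_for(device_code=None, account=None):
    user_id = resolve_user(device_code, account)
    # None means no personalized network; decoding uses the basic network only.
    return AUX_NETWORKS.get(user_id)
```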
  • With this user personalized information speech recognition method, after a speech signal is received, it is decoded frame by frame according to the basic static decoding network related to the basic person name language model to obtain the decoding paths on all active nodes of the network; if a decoding path is determined to enter a person name node in the basic static decoding network, the person name node is further expanded according to the user's auxiliary static decoding network related to the specific user name language model. This improves the recognition accuracy of personalized contact names in continuous speech recognition as well as the recognition accuracy of the context around contact names. Contact information is applied at multiple levels of speech recognition, so the overall recognition effect is optimized and the recognition accuracy of user personalized information in continuous speech recognition is improved.
  • It should be noted that the user personalized information speech recognition method in the embodiment of the present invention is applicable not only to person name decoding but also to other definable personalized information, such as address recognition.
  • Correspondingly, an embodiment of the present invention further provides a user personalized information speech recognition system, as shown in FIG. 10, which is a schematic structural diagram of the system.
  • The system includes: a receiving unit 111, configured to receive a speech signal;
  • a decoding unit 112, configured to decode the speech signal frame by frame according to the basic static decoding network to obtain the decoding paths on the active nodes of the basic static decoding network, where the basic static decoding network is a decoding network related to the basic person name language model;
  • a decoding path checking unit 113, configured to determine whether a decoding path enters a person name node in the basic static decoding network;
  • a network extension unit 114, configured to perform network extension on the person name node according to the user's auxiliary static decoding network after the decoding path checking unit 113 determines that a decoding path has entered a person name node in the basic static decoding network, where the auxiliary static decoding network is a decoding network related to the specific user name language model.
  • The decoding unit 112 is further configured to return the recognition result after decoding is completed.
  • The process by which the decoding unit 112 decodes the speech signal input by the user is a process of searching for the optimal path in the basic static decoding network to convert speech to text.
  • Specifically, the received continuous speech signal may first be sampled into a series of discrete energy values and stored in a data buffer.
  • The system may further include a preprocessing unit (not shown) for performing noise reduction on the continuous speech signal received by the receiving unit 111 before the decoding unit 112 decodes it. Specifically, the continuous speech signal can be segmented into independent speech segments and non-speech segments by analyzing its short-time energy and short-time zero-crossing rate, and the segmented speech segments are then subjected to speech enhancement, during which ambient noise can be further suppressed by methods such as Wiener filtering to improve the subsequent system's processing ability.
  • Effective speech features are then extracted from the processed speech energy signal and stored in a feature buffer. Specifically, MFCC (Mel-Frequency Cepstral Coefficient) features may be extracted: short-time analysis on each frame of speech data with a window length of 25 ms and a frame shift of 10 ms yields the MFCC parameters and their first-order and second-order differences, 39 dimensions in total. That is, each frame of the speech signal is quantized into a 39-dimensional feature vector.
  • The decoding unit 112 decodes the speech signal according to the basic static decoding network and obtains the decoding paths of the speech signal on all active nodes of the network. After the last frame is decoded, the active node with the largest cumulative historical path probability is the optimal node; the historical path obtained by backtracking through the decoding states from the optimal node is the optimal path, and the word sequence on the optimal path is the decoding result.
  • Since the basic static decoding network is a decoding network related to the basic person name language model, when a decoding path enters a person name node in the basic static decoding network, the network extension unit 114 performs network extension on the person name node according to the user's auxiliary static decoding network. Because the auxiliary static decoding network is a decoding network related to the specific user name language model, building and applying the user's personalized entries, and in particular applying the user's personalized contact information, effectively improves the recognition accuracy of user personalized information.
  • With this user personalized information speech recognition system, after a speech signal is received, it is decoded frame by frame according to the basic static decoding network related to the basic person name language model to obtain the decoding paths on all active nodes of the network; if a decoding path is determined to enter a person name node in the basic static decoding network, the person name node is further expanded according to the user's auxiliary static decoding network related to the specific user name language model. This not only improves the recognition accuracy of personalized contact names in continuous speech recognition, but also improves the recognition accuracy of the context around contact names. Contact information is applied at multiple levels of speech recognition, so the overall recognition effect is optimized and the recognition accuracy of user personalized information in continuous speech recognition is improved.
  • It should be noted that the network extension unit 114 performs network extension on the person name node according to the user's auxiliary static decoding network. If the system has only one user, the auxiliary static decoding network is unique; it can be built online by the system or built offline and loaded directly at system startup. If the system has multiple users, the current user and the auxiliary static decoding network corresponding to that user must be identified. Similarly, these different users' auxiliary static decoding networks can be built online by the system, or built offline and loaded directly at system startup.
  • In addition, the determination of the auxiliary static decoding network corresponding to the current user may be completed at different timings.
  • In one implementation, the user personalized information speech recognition system further includes a determining unit 121, configured to determine the user's auxiliary static decoding network before the decoding unit 112 decodes the speech signal according to the basic static decoding network.
  • In another implementation, the system further includes a determining unit 131, configured to determine the user's auxiliary static decoding network after the decoding path checking unit 113 determines that a decoding path has entered a person name node in the basic static decoding network.
  • In either case, the user identity may be determined from features of the speech signal, or from the user's device code or account, and the user's auxiliary static decoding network is then determined according to the user identity.
  • In practice, the above basic static decoding network and auxiliary static decoding network can be built online by the system, or built offline and loaded directly at system startup to reduce computation and memory requirements and further improve decoding efficiency.
  • Further, the system may also include:
  • a basic person name language model building unit 131, configured to build the basic person name language model;
  • a specific user name language model building unit 132, configured to build the specific user name language model;
  • a basic static decoding network building unit 133, configured to build the basic static decoding network related to the basic person name language model;
  • an auxiliary static decoding network building unit 134, configured to build the auxiliary static decoding network related to the specific user name language model.
  • As shown in FIG. 14, which is a schematic structural diagram of the basic person name language model building unit in the user personalized information speech recognition system of the embodiment of the present invention, the basic person name language model building unit includes:
  • a name collection unit 141, configured to collect a person name database;
  • a corpus collection unit 142, configured to collect a language model training corpus;
  • a statistics unit 143, configured to perform statistics on regular words and on the relationships between regular words and person name words according to the person name database and the training corpus;
  • a basic person name language model generation unit, configured to generate the basic person name language model according to the statistical results obtained by the statistics unit 143.
  • Specifically, the statistics unit 143 may perform person name detection in the training corpus according to the names in the person name database, for example using a traditional name detection algorithm. To this end, the statistics unit 143 may include: a detection subunit, configured to perform person name detection in the training corpus according to the names in the person name database; a replacement subunit, configured to replace all specific person names in the training corpus with a unified virtual person name; and a statistics subunit, configured to perform the statistics on the replaced training corpus.
  • Through the generalized extraction of person name words, the basic person name language model built by the basic person name language model building unit better describes the statistical probabilities of name-attribute words and regular words, and supports whole-word recognition of person names.
  • As shown in FIG. 15, which is a schematic structural diagram of the basic static decoding network building unit in the user personalized information speech recognition system of the embodiment of the present invention, the basic static decoding network building unit includes:
  • a virtual pronunciation setting unit 151, configured to set a virtual pronunciation for the virtual person name so that the virtual person name participates as an ordinary word in the static network expansion of the acoustic model;
  • a special node determination unit 152, configured to determine the special nodes in the expanded static network according to the virtual pronunciation, the special nodes including nodes entering a person name unit and termination nodes of a person name unit;
  • a first expansion unit 153, configured to expand the virtual pronunciation units on the incoming or outgoing arcs of the special nodes to obtain the basic static decoding network related to the basic person name language model.
  • As shown in FIG. 16, which is a schematic structural diagram of the specific user name language model building unit and the auxiliary static decoding network building unit in the user personalized information speech recognition system of the embodiment of the present invention:
  • the specific user name language model building unit includes: a name extraction unit 161, configured to extract person names from the user-related name information and record them as person name entries; and a specific user name language model generation unit 162, configured to set a word frequency probability for each person name entry and generate the specific user name language model according to these probabilities;
  • the auxiliary static decoding network building unit includes: a setting unit 171, configured to set the pronunciations of the sentence-start word and the sentence-end word in the specific user name language model to virtual special pronunciations; and a second expansion unit 172, configured to expand the special pronunciation units on the outgoing arcs of the sentence-start node and the incoming arcs of the sentence-end node to obtain the auxiliary static decoding network related to the specific user name language model.
  • The user personalized information speech recognition system provided by the embodiments of the invention not only improves the recognition accuracy of personalized contact names in continuous speech recognition, but also improves the recognition accuracy of the context around contact names. Contact information is applied at multiple levels of speech recognition, so the overall recognition effect is optimized and the recognition accuracy of user personalized information in continuous speech recognition is improved.
  • The user personalized information speech recognition system of the embodiment of the present invention is applicable not only to person name decoding but also to other definable personalized information, such as address recognition.

Abstract

A speech recognition method and system for user personalized information. The method includes: receiving a speech signal (101); decoding the speech signal according to a basic static decoding network to obtain the decoding paths on the active nodes of the basic static decoding network, where the basic static decoding network is a decoding network related to a basic person name language model (102); if it is determined that a decoding path enters a person name node in the basic static decoding network, performing network extension on the person name node according to the user's auxiliary static decoding network, where the auxiliary static decoding network is a decoding network related to a specific user name language model (103); and returning the recognition result after decoding is completed (104). The method and system can improve the recognition accuracy of user personalized information in continuous speech recognition.

Description

Speech recognition method and system for user personalized information

Technical Field

The present invention relates to the field of speech recognition technology, and in particular to a user personalized information speech recognition method and system.

Background

With the popularization of voice input functions and applications on smart terminals such as mobile phones, users increasingly need voice input on such devices, and higher requirements are placed on the recognition accuracy of user personalized information, especially the contacts in the address book. Due to the limitations of language model training methods and recognition methods, traditional continuous speech recognition systems may fail to provide correct word results for Chinese speech signals containing polyphonic characters. In person name recognition in particular, recognition accuracy is further limited, mainly because:

1. Common Chinese person names are numerous, and dictionaries for continuous speech recognition usually treat person name words as out-of-vocabulary words, so the number of person names covered in the training corpus is extremely limited;

2. Homophones are abundant in Chinese person names, and a common name may correspond to dozens or more combinations of Chinese characters;

3. For each user, the contact names in the user-specific personalized address book may include some uncommon names; that is, each personalized name list cannot be uniformly covered by the training corpus.

For the above reasons, the language models used for continuous speech recognition in the prior art cannot model person name words well, especially users' personalized contact names, and name recognition accuracy is often significantly lower than that of other content. Clearly, how to improve the recognition accuracy of user personalized information, especially person name information, in continuous speech recognition has become an urgent problem for continuous speech recognition systems.
Summary of the Invention

The present invention provides a user personalized information speech recognition method and system to improve the recognition accuracy of user personalized information in continuous speech recognition.

An embodiment of the present invention provides a user personalized information speech recognition method, including: receiving a speech signal; decoding the speech signal frame by frame according to a basic static decoding network to obtain the decoding paths on the active nodes of the basic static decoding network, where the basic static decoding network is a decoding network related to a basic person name language model; if it is determined that a decoding path enters a person name node in the basic static decoding network, performing network extension on the person name node according to the user's auxiliary static decoding network, where the auxiliary static decoding network is a decoding network related to a specific user name language model; and returning the recognition result after decoding is completed.

Preferably, the method further includes: determining the user's auxiliary static decoding network before decoding the speech signal according to the basic static decoding network; or determining the user's auxiliary static decoding network after it is determined that a decoding path enters a person name node in the basic static decoding network.

Preferably, determining the user's auxiliary static decoding network includes: determining the user identity according to features of the speech signal and then determining the user's auxiliary static decoding network according to the user identity; or determining the user identity according to the user's device code or account and then determining the user's auxiliary static decoding network according to the user identity.

Preferably, the method further includes: building a basic person name language model and a specific user name language model; and building, respectively, a basic static decoding network related to the basic person name language model and an auxiliary static decoding network related to the specific user name language model.

Preferably, building the basic person name language model includes: separately collecting a person name database and a language model training corpus; performing statistics on regular words and on the relationships between regular words and person name words according to the person name database and the training corpus; and generating the basic person name language model from the statistical results.

Preferably, performing statistics on regular words and on the relationships between regular words and person name words according to the person name database and the training corpus includes: performing person name detection in the training corpus according to the names in the person name database; replacing all specific person names in the training corpus with a unified virtual person name; and performing the statistics on the replaced training corpus.

Preferably, building the basic static decoding network related to the basic person name language model includes: setting a virtual pronunciation for the virtual person name so that the virtual person name participates as an ordinary word in the static network expansion of the acoustic model; determining the special nodes in the expanded static network according to the virtual pronunciation, the special nodes including nodes entering a person name unit and termination nodes of a person name unit; and expanding the virtual pronunciation units on the incoming or outgoing arcs of the special nodes to obtain the basic static decoding network related to the basic person name language model.

Preferably, building the specific user name language model includes: extracting person names from the user-related name information and recording them as person name entries; and setting a word frequency probability for each person name entry and generating the specific user name language model according to these probabilities. Building the auxiliary static decoding network related to the specific user name language model includes: setting the pronunciations of the sentence-start word and the sentence-end word in the specific user name language model to virtual special pronunciations; and expanding the special pronunciation units on the outgoing arcs of the sentence-start node and the incoming arcs of the sentence-end node to obtain the auxiliary static decoding network related to the specific user name language model.
An embodiment of the present invention further provides a user personalized information speech recognition system, including: a receiving unit, configured to receive a speech signal; a decoding unit, configured to decode the speech signal according to the basic static decoding network to obtain the decoding paths on the active nodes of the basic static decoding network, where the basic static decoding network is a decoding network related to the basic person name language model; a decoding path checking unit, configured to determine whether a decoding path enters a person name node in the basic static decoding network; a network extension unit, configured to perform network extension on the person name node according to the user's auxiliary static decoding network after the decoding path checking unit determines that a decoding path has entered a person name node in the basic static decoding network, where the auxiliary static decoding network is a decoding network related to the specific user name language model; and a result output unit, configured to return the recognition result after decoding is completed.

Preferably, the system further includes: a determining unit, configured to determine the user's auxiliary static decoding network before the decoding unit decodes the speech signal according to the basic static decoding network, or to determine the user's auxiliary static decoding network after the decoding path checking unit determines that a decoding path has entered a person name node in the basic static decoding network.

Preferably, the determining unit is specifically configured to determine the user identity according to features of the speech signal and then determine the user's auxiliary static decoding network according to the user identity; or to determine the user identity according to the user's device code or account and then determine the user's auxiliary static decoding network according to the user identity.

Preferably, the system further includes: a basic person name language model building unit, configured to build the basic person name language model; a specific user name language model building unit, configured to build the specific user name language model; a basic static decoding network building unit, configured to build the basic static decoding network related to the basic person name language model; and an auxiliary static decoding network building unit, configured to build the auxiliary static decoding network related to the specific user name language model.

Preferably, the basic person name language model building unit includes: a name collection unit, configured to collect a person name database; a corpus collection unit, configured to collect a language model training corpus; a statistics unit, configured to perform statistics on regular words and on the relationships between regular words and person name words according to the person name database and the training corpus; and a basic person name language model generation unit, configured to generate the basic person name language model according to the statistical results obtained by the statistics unit.

Preferably, the statistics unit includes: a detection subunit, configured to perform person name detection in the training corpus according to the names in the person name database; a replacement subunit, configured to replace all specific person names in the training corpus with a unified virtual person name; and a statistics subunit, configured to perform the statistics on the replaced training corpus.

Preferably, the basic static decoding network building unit includes: a virtual pronunciation setting unit, configured to set a virtual pronunciation for the virtual person name so that the virtual person name participates as an ordinary word in the static network expansion of the acoustic model; a special node determination unit, configured to determine the special nodes in the expanded static network according to the virtual pronunciation, the special nodes including nodes entering a person name unit and termination nodes of a person name unit; and a first expansion unit, configured to expand the virtual pronunciation units on the incoming or outgoing arcs of the special nodes to obtain the basic static decoding network related to the basic person name language model.

Preferably, the specific user name language model building unit includes: a name extraction unit, configured to extract person names from the user-related name information and record them as person name entries; and a specific user name language model generation unit, configured to set a word frequency probability for each person name entry and generate the specific user name language model according to these probabilities. The auxiliary static decoding network building unit includes: a setting unit, configured to set the pronunciations of the sentence-start word and the sentence-end word in the specific user name language model to virtual special pronunciations; and a second expansion unit, configured to expand the special pronunciation units on the outgoing arcs of the sentence-start node and the incoming arcs of the sentence-end node to obtain the auxiliary static decoding network related to the specific user name language model.
With the user personalized information speech recognition method and system provided by the embodiments of the present invention, after a speech signal input by the user is received, the speech signal is decoded according to the basic static decoding network related to the basic person name language model to obtain the decoding paths on the active nodes of the network; if a decoding path is determined to enter a person name node in the basic static decoding network, the person name node is further expanded according to the user's auxiliary static decoding network related to the specific user name language model. This not only improves the recognition accuracy of personalized contact names in continuous speech recognition, but also improves the recognition accuracy of the context around contact names. Contact information is applied at multiple levels of speech recognition, so the overall recognition effect is optimized and the recognition accuracy of user personalized information in continuous speech recognition is improved.

Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings required in the embodiments are briefly introduced below. Obviously, the drawings described below are only some of the embodiments recorded in the present invention, and those of ordinary skill in the art may derive other drawings from them.
图 1是本发明实施例用户个性化信息语音识别方法的流程图;
图 2是本发明实施例用户个性化信息语音识别方法中的一种 解码 u呈图; 图 3 是本发明实施例用户个性化信息语音识别方法中的另一种具体解码 流程图;
图 4是本发明实施例中构建基础人名语言模型的流程图;
图 5是本发明实施例中构建特定用户人名语言模型的流程图;
图 6是本发明实施例中构建与基础人名语言模型相关的基础静态解码网 络的流程图;
图 7是本发明实施例中基础人名语言模型相关解码网络扩展示意图;
图 8是本发明实施例中构建与特定用户人名语言模型相关的附属静态解码网络的流程图;
图 9是本发明实施例中特定用户人名语言模型相关解码网络扩展示意图;
图 10是本发明实施例用户个性化信息语音识别系统的结构示意图;
图 11是本发明实施例用户个性化信息语音识别系统的一种具体实现结构示意图;
图 12是本发明实施例用户个性化信息语音识别系统的另一种具体实现结 构示意图;
图 13是本发明实施例用户个性化信息语音识别系统的另一种具体实现结 构示意图;
图 14是本发明实施例用户个性化信息语音识别系统中基础人名语言模型 构建单元的结构示意图;
图 15是本发明实施例用户个性化信息语音识别系统中基础静态解码网络 构建单元的结构示意图;
图 16是本发明实施例用户个性化信息语音识别系统中特定用户人名语言 模型构建单元和附属静态解码网络构建单元的结构示意图。
具体实施方式
下面详细描述本发明的实施例, 所述实施例的示例在附图中示出, 其 中自始至终相同或类似的标号表示相同或类似的元件或具有相同或类似功 能的元件。 下面通过参考附图描述的实施例是示例性的, 仅用于解释本发 明, 而不能解释为对本发明的限制。
本发明实施例针对现有的用于连续语音识别的语言模型不能很好地模拟 人名字词,特别是用户个性化联系人名字词的问题,提供了一种用户个性化信 息语音识别方法及系统, 以提高用户个性化信息的识别准确率。
如图 1所示, 本发明实施例用户个性化信息语音识别方法的流程图包括: 步骤 101 , 接收语音信号。
步骤 102, 根据基础静态解码网络逐帧对所述语音信号进行解码, 得到所 述基础静态解码网络中所有活跃节点上的解码路径,所述基础静态解码网络是 与基础人名语言模型相关的解码网络。
利用所述解码网络对语音信号进行解码的过程是一个在该解码网络中搜 索最优路径, 实现语音到文本的转换的过程。
具体地,可以首先对接收的连续语音信号采样为一系列离散能量值存入数 据緩存区。
当然, 为了进一步提高系统的鲁棒性,还可以先对接收到的连续语音信号 进行降噪处理。 首先通过对语音信号的短时能量和短时过零率分析,将连续的 语音信号分割成独立的语音片断和非语音片断,然后对分割得到的语音片断进 行语音增强处理, 在进行语音增强处理时, 可以通过维纳滤波等方法, 将语音 信号中的环境噪声进一步消除, 以提高后续系统对该信号的处理能力。
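作为对上述短时能量与短时过零率切分思路的示意, 下面给出一个极简的 Python 草图 (帧长、帧移及门限均为假设的示例参数, 实际系统的端点检测与维纳滤波增强要复杂得多):

```python
import numpy as np

def split_speech_frames(signal, frame_len=400, hop=160,
                        energy_thresh=1e-3, zcr_thresh=0.25):
    """按短时能量和短时过零率粗略标记每帧是否为语音帧(示意实现)。"""
    flags = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = float(np.mean(frame ** 2))                           # 短时能量
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame))))) / 2.0  # 短时过零率
        # 能量较高且过零率较低的帧更可能是语音帧
        flags.append(energy > energy_thresh and zcr < zcr_thresh)
    return flags
```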
考虑到降噪处理后的语音信号中依然会存在大量语音识别无关的冗余信息, 直接对其识别可能会使运算量和识别准确率降低, 为此, 可以从降噪处理后的语音能量信号中提取识别有效语音特征, 并存入特征缓存区内。 具体地, 可以提取语音的 MFCC (Mel Frequency Cepstrum Coefficient, Mel频率倒谱系数) 特征, 对窗长 25ms、帧移 10ms 的每帧语音数据做短时分析得到 MFCC 参数及其一阶、二阶差分, 共计 39 维。 也就是说, 将每帧语音信号量化为一 39 维的特征序列。 然后, 根据所述基础静态解码网络对其中每帧语音信号进行解码, 获取所述语音信号在所述基础静态解码网络中所有活跃节点上的解码路径。 当然, 在实际应用中, 还可以在接收到多帧语音信号后再进行解码, 对此本发明实施例不做限定。
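与上述 39 维 MFCC 特征的描述相对应, 可以借助开源库 librosa 写出一个示意性的提取草图 (采样率 16kHz 为假设, 本发明并不限定具体实现):

```python
import numpy as np
import librosa

def extract_39d_mfcc(wav_path):
    """提取 13 维 MFCC 及其一阶、二阶差分, 共 39 维(示意实现)。"""
    y, sr = librosa.load(wav_path, sr=16000)   # 假设按 16kHz 重采样
    n_fft = int(0.025 * sr)                    # 窗长 25ms
    hop = int(0.010 * sr)                      # 帧移 10ms
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop)
    d1 = librosa.feature.delta(mfcc)           # 一阶差分
    d2 = librosa.feature.delta(mfcc, order=2)  # 二阶差分
    return np.vstack([mfcc, d1, d2]).T         # 每帧一个 39 维特征向量
```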
步骤 103 ,如果确定有解码路径进入所述基础静态解码网络中的人名节点, 则根据所述用户的附属静态解码网络对所述人名节点进行网络扩展,所述附属 静态解码网络是与特定用户人名语言模型相关的解码网络。
在现有技术中, 解码路径的搜索过程如下: 按照从左到右的时间顺序, 计算每帧语音信号到达解码网络中每个活跃节点的累积历史路径概率。 具体地, 对于需要考察的每一帧语音信号, 可以首先计算当前解码网络中所有活跃节点相对于该帧的历史路径和累积历史路径概率; 然后获取下一帧语音信号, 并从满足系统预设条件的历史路径向后扩展解码。
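上述逐帧累积历史路径概率并扩展解码的过程, 可以用如下 token-passing 风格的 Python 草图示意 (其中 network.out_arcs、acoustic_score 等接口均为便于说明而假设的简化抽象, 并带有简单的 beam 裁剪):

```python
def viterbi_step(active, frame, network, acoustic_score, beam=10.0):
    """对一帧语音做一步帧同步解码: 沿活跃节点的出弧扩展并做 beam 裁剪。

    active: {节点: (累积对数概率, 历史弧列表)}; 接口均为假设的简化抽象。
    """
    new_active = {}
    for node, (logp, hist) in active.items():
        for arc in network.out_arcs(node):              # 假设的网络遍历接口
            score = logp + acoustic_score(frame, arc) + arc.lm_logp
            if arc.dst not in new_active or score > new_active[arc.dst][0]:
                new_active[arc.dst] = (score, hist + [arc])
    if not new_active:
        return new_active
    best = max(score for score, _ in new_active.values())
    return {n: v for n, v in new_active.items() if v[0] > best - beam}
```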
由于本发明实施例中的解码网络是与基础人名语言模型相关的基础静态 解码网络, 因此, 在有解码路径进入所述基础静态解码网络中的人名节点时, 根据所述用户的附属静态解码网络对所述人名节点进行网络扩展。由于所述附 属静态解码网络是与特定用户人名语言模型相关的解码网络,因此通过对用户 个性化词条的构建及应用, 尤其是对用户个性化联系人信息的应用,有效提高 了用户个性化信息的识别准确率。
步骤 104, 在解码完成后, 返回识别结果。
当对最后一帧语音信号帧解码后,其中具有最大累积历史路径概率的活跃 节点即为最优节点 ,从该最优节点通过解码状态回溯得到的历史路径即为最优 路径, 该最优路径上的单词序列即为解码结果。
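承接上面的草图, 解码结束后取累积概率最大的活跃节点并回溯单词序列的步骤可以示意如下:

```python
def best_word_sequence(active):
    """从累积历史路径概率最大的活跃节点回溯, 取出路径上的单词序列。"""
    _, (_, hist) = max(active.items(), key=lambda kv: kv[1][0])
    # 历史路径中带有输出单词的弧依次构成识别结果
    return [arc.word for arc in hist if getattr(arc, "word", None)]
```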
本发明实施例用户个性化信息语音识别方法,在接收到语音信号后,根据 与基础人名语言模型相关的基础静态解码网络逐帧对所述语音信号进行解码, 得到解码网络中所有活跃节点上的解码路径,如果确定有解码路径进入所述基 础静态解码网络中的人名节点,则进一步根据用户的与特定用户人名语言模型 相关的附属静态解码网络对所述人名节点进行网络扩展,从而不仅提高了连续 语音识别中个性化的联系人人名的识别准确率,而且还提高了联系人人名的上 下文内容识别准确率。在语音识别的多个层面应用联系人信息,使整体识别效 果得到了优化, 提高了连续语音识别中用户个性化信息的识别准确率。
需要说明的是,在实际应用中, 上述基础静态解码网络和附属静态解码网 络可以由系统在线构建, 也可以通过离线方式构建, 在系统启动时直接载入, 以减少系统运算量及所需内存, 进一步提高解码效率。
上述基础静态解码网络是与基础人名语言模型相关的解码网络,附属静态 解码网络是与特定用户人名语言模型相关的解码网络,下面进一步详细说明本 发明实施例中人名相关语言模型及相关解码网络的构建过程。
传统语音识别系统通常采用统计模型的方法构建语言模型, 通过模拟语法和语义知识减少识别范围、提高识别率。 一般的, 系统首先根据预设词典对海量训练语料进行分词处理, 然后分别统计各词联合出现的概率, 并采用条件概率的方式构建语言模型。 假设某个词 $w_k$ 出现的概率仅和其前 $n-1$ 个词相关, 记为 $p(w_k \mid w_1^{k-1}) = p(w_k \mid w_{k-n+1}^{k-1})$。
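上述基于分词统计与条件概率的语言模型训练, 可以用如下极大似然估计的 Python 草图示意 (未包含实际系统必需的平滑与回退处理):

```python
from collections import Counter

def train_ngram(corpus, n=3):
    """按极大似然估计统计 p(w_k | w_{k-n+1}...w_{k-1}) 的示意实现。"""
    ngrams, hists = Counter(), Counter()
    for sent in corpus:                    # corpus: 分词后的句子列表
        padded = ["<s>"] * (n - 1) + sent + ["</s>"]
        for i in range(n - 1, len(padded)):
            hist = tuple(padded[i - n + 1:i])
            ngrams[hist + (padded[i],)] += 1
            hists[hist] += 1
    def prob(w, hist):
        hist = tuple(hist)
        return ngrams[hist + (w,)] / hists[hist] if hists[hist] else 0.0
    return prob
```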
然而由于中文人名数量众多, 传统词典很少将人名作为确定字词处理, 因而训练语料分词后的人名数量极其有限,训练得到的语言模型也无法很好地 描述具体人名的出现概率, 进而影响了人名相关整词的识别准确率。
为此, 在本发明实施例中, 分别构建用以描述常用字词间以及常用字词与 人名间的统计概率的基础人名语言模型、以及特定用户相关的用以描述具体人 名统计概率的语言模型。其中,基础人名语言模型用于描述常用字词间以及常 用字词与人名间的统计概率。特定用户人名语言模型用于描述该用户相关的具 体人名的统计概率。
如图 4所示, 是本发明实施例中构建基础人名语言模型的流程图, 包括:
步骤 201, 采集人名数据库。
具体地, 可以采集一个较大规模的人名数据库, 以实现对常用人名的有效 覆盖。
步骤 202, 采集语言模型训练语料。
步骤 203 , 根据所述人名数据库及所述语言模型训练语料, 对常规字词以 及常规字词与人名字词间关联关系进行统计。
具体地,可以根据所述人名数据库中的人名在所述训练语料中进行人名检 测, 比如, 可以采用传统人名检测算法进行人名检测。 然后对所述训练语料中 的所有具体人名用一个统一的虚拟人名替换,然后在替换后的训练语料上对常 规字词以及常规字词与人名字词间关联关系进行统计。
步骤 204, 根据统计结果生成基础人名语言模型。
需要说明的是,在实际应用中,还可以在上述过程中对语料中的具体人名 进行统计,确定各类不同人名词条出现的词频, 以便在构建特定用户人名语言 模型过程中依据该词频设置人名词条的词频概率。
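作为对步骤 203 中人名替换处理的示意, 下面的 Python 草图用简单的人名词表查找代替实际的人名检测算法, 其中虚拟人名记号 <NAME> 与示例人名均为假设:

```python
def replace_names(tokens, name_db, virtual="<NAME>"):
    """把分词后语料中命中人名库的词统一替换为虚拟人名单元(示意)。"""
    return [virtual if tok in name_db else tok for tok in tokens]

# 用法示例(假设语料已按预设词典分词):
name_db = {"张三", "李四"}
print(replace_names(["给", "张三", "打", "电话"], name_db))
# -> ['给', '<NAME>', '打', '电话']
```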
相比于传统的语言模型, 该基础人名语言模型通过对人名字词的归纳提 取, 更好地描述了人名属性字词和常规字词的统计概率, 实现了对人名整词识 别的支持。
上述基础人名语言模型虽然描述了人名属性字词的统计概率,但依然无法 解决具体人名字词识别的问题。 而在中文中, 人名同音字大量存在, 常见人名 有几十个甚至更多的汉字组合, 此外对每个用户来说, 用户特有的个性化通讯 录中联系人人名可能还会有一部分非常用人名,即每个个性化的人名列表在训 练语料中无法均匀覆盖。
因此,为了更好地识别各特定用户相关人名字词,在本发明另一实施例中, 还可进一步根据用户需求构建特定用户相关的人名语言模型,即前面所述的特 定用户人名语言模型。 具体地, 可以在接收到用户上传的联系人信息后从所述 联系人信息中提取获得该用户特定的人名语言模型。
如图 5所示, 本发明实施例中构建特定用户人名语言模型的流程图包括:
步骤 301, 从用户相关的人名相关信息中提取人名, 并将所述人名作为人名词条记录。
所述人名相关信息可以是通讯录等。
步骤 302 , 对每个人名词条设置一个词频概率。
最简单的做法是设置每个人名词条的词频概率均等, 或者根据海量语料中统计的人名词频相应设置; 更进一步地, 还可以根据用户历史使用记录按频度高低对人名词条进行词频设置, 并允许后续对其进行更新。
步骤 303 , 根据人名词条的词频概率生成特定用户人名语言模型。
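对应步骤 302 和步骤 303 中词频概率的几种设置方式, 可以用如下 Python 草图示意 (其中的加一平滑只是假设的示例做法):

```python
def name_entry_probs(entries, usage_counts=None):
    """为人名词条设置词频概率: 无历史记录时均等, 否则按使用频度归一。"""
    if not usage_counts:
        return {name: 1.0 / len(entries) for name in entries}
    total = sum(usage_counts.get(name, 0) + 1 for name in entries)
    # 加一平滑(假设的示例做法), 保证低频联系人也有非零概率
    return {name: (usage_counts.get(name, 0) + 1) / total for name in entries}
```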
在本发明实施例中, 可以利用词典、声学模型等预置模型, 结合上述构建的多重语言模型 (即基础人名语言模型和特定用户人名语言模型) 进行扩展, 获得相应的多重解码搜索静态网络。 具体地, 可以选择低阶的声学模型, 如 uniphone 模型, 对语言模型中的字词进行声学单元的扩展构成解码网络。 进一步地, 为了提高解码准确率, 还可选用更高阶的声学模型, 如 biphone (双音素)、triphone (三音素) 模型等, 提高不同发音单元之间的区分性。 如前所述, 在构建基础人名语言模型时, 可以先对所述训练语料中的所有具体人名用一个统一的虚拟人名替换, 然后在替换后的训练语料上进行统计, 构建基础人名语言模型。 也就是说, 基础人名语言模型中包含有虚拟人名单元, 其在解码前无法明确具体发音。 为此, 本发明实施例还提供一种基于声学模型的网络扩展方法, 以构建与所述基础人名语言模型相关的静态解码网络。
如图 6所示, 是本发明实施例中构建与基础人名语言模型相关的基础静态解码网络的流程图, 包括以下步骤:
步骤 401 , 为所述虚拟人名设置一个虚拟发音, 以使所述虚拟人名作为一 个普通单词参与声学模型的静态网络扩展。
步骤 402, 确定扩展后的静态网络中的特殊节点, 所述特殊节点包括: 进 入人名单元的节点和人名单元的终止节点。
将所述虚拟发音记为 $C, 以 triphone声学模型为例, 如图 7所示, 在扩展后的静态网络中将主要包括三类节点: 常规节点 (节点 A) 和两类特殊节点 (节点 S和节点 E)。
其中, a,b,x,y,n表示普通的发音单元, $C表示虚拟人名的发音单元, 为了 描述方便, 将其称为虚拟发音单元。
节点 A为常规节点, 即进入 A节点以及离开 A节点的弧上的 triphone模 型是可以预先确定的。
节点 S为特殊节点, 其出弧为人名单元, 即进入人名单元的节点, 显然该 节点的入弧上由于具体人名的不确定性导致该入弧上的 triphone模型的右相关 扩展不确定, 如图中 x-b+$C和 y-b+$C。
节点 E为特殊节点, 其入弧为人名单元, 即人名单元的终止节点, 相应的, 其出弧上的 triphone模型左相关也无法确定, 如图中 $C-a+x和$C-a+y。
步骤 403 , 对所述特殊节点的入弧或出弧上的虚拟发音单元进行扩展, 得 到与基础人名语言模型相关的基础静态解码网络。
对于节点 S的入弧, 例如 x-b+$C和 y-b+$C, 对 $C进行扩展, 替换成所有可能的 phone单元, 相应的, 由弧 x-b+$C将扩展出多个 triphone模型的集合, 包括 x-b+a, x-b+b…等。 扩展方式可以根据 x-b的 triphone组合规律确定。
对于节点 E的出弧, 同样采取上述类似操作, 如对 $C-a+x和$C-a+y, 将 $C替换成所有可能的 phone单元, 扩展出相应的准确的 triphone模型。
保持从节点 S到节点 E的弧 *-$C+*不变, 在后续动态解码进入到节点 S 时对其进行具体人名静态解码网络的替换。
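上述对特殊节点入弧、出弧上虚拟发音单元 $C 的扩展, 可以用如下 Python 草图示意 (这里将 triphone 弧简化表示为 left-center+right 形式的字符串, phone 集合为假设):

```python
def expand_virtual_arcs(arcs, phone_set, virtual="$C"):
    """把含虚拟发音单元 $C 的弧展开为具体 triphone 集合(示意实现)。

    弧用 left-center+right 形式的字符串表示, 如 "x-b+$C"。
    """
    expanded = []
    for arc in arcs:
        left_center, right = arc.split("+")
        if right == virtual:                        # 入弧: 右相关不确定
            expanded += [f"{left_center}+{p}" for p in phone_set]
        elif left_center.split("-")[0] == virtual:  # 出弧: 左相关不确定
            center_right = arc.split("-", 1)[1]
            expanded += [f"{p}-{center_right}" for p in phone_set]
        else:                                       # 常规弧(含 *-$C+*)保持不变
            expanded.append(arc)
    return expanded

# 例如: expand_virtual_arcs(["x-b+$C", "$C-a+x"], ["a", "b", "x", "y"])
```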
同样, 在构建特定用户人名语言模型相关的静态解码网络时, 对特定用户 相关的具体语言模型的扩展,在采用高阶声学模型时,也需要采用与上述类似 的方法。
如图 8所示, 是本发明实施例中构建与特定用户人名语言模型相关的附属静态解码网络的流程图, 包括以下步骤:
步骤 601 , 分别设定特定用户人名语言模型中的句首词和句尾词的发音为 虚拟的特殊发音。
语言模型中一般会包含两个特殊的词, 即句首词 <s>和句尾词 </s>, 分别表示句子开始和句子结束, 句首、句尾词的发音一般定义为静音 sil。
在本发明实施例中, 为了保证人名单元在识别过程中和原始静态网络的连接, 可以对该特定用户人名语言模型的句首和句尾词的发音进行特殊处理, 以便构建 triphone模型扩展的静态网络, 具体如图 9所示。 其中, 设定句首词的发音为虚拟的特殊发音 $S, 句尾词的发音为虚拟的特殊发音 $E。 从句首节点 S 出发的弧上的 triphone模型左相关不确定, 如图中 $S-a+b和$S-x+y; 而进入句尾节点 E 的弧上的 triphone模型的右相关不确定, 即形如 *-*+$E 的模型; 其他弧上的模型保持为常规 triphone模型。
步骤 602 , 对于句首节点的出弧和句尾节点的入弧上的特殊发音单元进行 扩展, 得到特定用户人名语言模型相关的附属静态解码网络。
具体地, 对于句首节点 S的出弧, 例如 $S-a+b和$S-x+y, 将 $S替换成所有可能的 phone, 扩展出相应的准确的 triphone模型; 对于句尾节点 E的入弧也做类似操作, 即对形如 *-*+$E 的模型, 将 $E替换成所有可能的 phone, 扩展出相应的准确的 triphone模型。
前面提到,上述基础静态解码网络及附属静态解码网络可以通过离线方式 构建, 其中, 附属静态解码网络是与特定用户相关的, 也就是说, 不同用户可 以对应不同的附属静态解码网络。因此,在对接收的语音信号进行识别过程中, 可以载入针对该用户的附属静态解码网络, 具体载入时机可以不同, 比如, 可 以是在根据基础静态解码网络逐帧对所述语音信号进行解码之前,也可以是在 确定有解码路径进入所述基础静态解码网络中的人名节点之后等,对此, 下面 分别举例说明。
如图 2所示, 是本发明实施例用户个性化信息语音识别方法中的一种具体解码流程图, 包括以下步骤:
步骤 801 , 接收语音信号。
步骤 802, 对所述语音信号进行预处理, 并提取声学特征。
步骤 803 , 确定用户的附属静态解码网络。
步骤 804, 在基础静态网络中对语音信号解码, 搜索当前解码路径。
步骤 805 , 判断当前解码路径中是否有路径进入基础静态解码网络中的人 名节点; 若是, 则执行步骤 806; 否则执行步骤 807。
步骤 806, 根据用户的附属静态解码网络对基础静态网络中的人名节点进 行网络扩展。
具体地,可以利用附属静态解码网络对基础静态网络中的该人名节点进行 替换;或设置所述进入人名节点的解码路径直接进入所述用户的附属静态解码 网络。
需要说明的是,当设置所述进入人名节点的解码路径进入所述用户的附属 静态解码网络时,在接收到新的语音帧信号时, 所述进入用户的附属静态解码 网络的解码路径将在所述用户的附属静态解码网络内搜索后续解码路径,并在 所述路径到达附属的静态解码网络的终止节点时返回到基础静态网络的人名 节点的终止节点。
步骤 807 , 判断当前帧是否最后一帧, 即是否解码结束; 若是, 则执行步 骤 808; 否则转入步骤 804。
步骤 808 , 返回解码结果。
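步骤 806 中解码路径跳入附属静态解码网络、并在其终止节点返回基础网络的控制逻辑, 可以用如下 Python 草图示意 (path、base_net、user_subnet 的接口均为便于说明而假设的简化抽象):

```python
def advance_through_name_node(path, node, base_net, user_subnet):
    """路径到达人名节点时跳入附属网络, 解码至其终止节点后跳回(示意)。"""
    if path.network is base_net and base_net.is_name_node(node):
        path.return_node = base_net.name_exit_node(node)   # 记录返回位置
        path.network = user_subnet
        return user_subnet.start_node                      # 进入附属网络
    if path.network is user_subnet and user_subnet.is_final(node):
        path.network = base_net
        return path.return_node                            # 返回基础网络人名终止节点
    return node                                            # 其余情况按常规扩展
```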
如图 3所示, 是本发明实施例用户个性化信息语音识别方法中的另一种具体解码流程图, 包括以下步骤:
步骤 901 , 接收语音信号。
步骤 902, 对所述语音信号进行预处理, 并提取声学特征。
步骤 903, 在基础静态网络中逐帧对语音信号解码, 搜索当前解码路径。
步骤 904, 判断当前解码路径中是否有路径进入基础静态解码网络中的人名节点; 若是, 则执行步骤 905; 否则执行步骤 907。
步骤 905 , 确定用户的附属静态解码网络。
步骤 906, 根据用户的附属静态解码网络对基础静态网络中的人名节点进 行网络扩展。
具体地,可以利用附属静态解码网络对基础静态网络中的该人名节点进行 替换;或设置所述进入人名节点的解码路径直接进入所述用户的附属静态解码 网络。
需要说明的是,当设置所述进入人名节点的解码路径进入所述用户的附属 静态解码网络时,在接收到新的语音帧信号时, 所述进入用户的附属静态解码 网络的解码路径将在所述用户的附属静态解码网络内搜索后续解码路径,并在 所述路径到达附属的静态解码网络的终止节点时返回到基础静态网络的人名 节点的终止节点。
步骤 907 , 判断当前帧是否最后一帧, 即是否解码结束; 若是, 则执行步 骤 908; 否则转入步骤 903。
步骤 908 , 返回解码结果。
需要说明的是, 上述步骤 803和步骤 905中, 确定所述用户的附属静态解 码网络的方式可以有多种, 比如:
( 1 )根据用户的语音信号特征确定用户的身份, 即具体的用户, 然后根 据用户的身份确定其附属静态解码网络。
( 2 )根据用户的设备码或帐号确定用户的身份, 然后根据用户的身份确 定其附属静态解码网络。
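上述两种确定用户身份并载入其附属静态解码网络的方式, 可以用如下 Python 草图示意 (identify_speaker 声纹识别接口与 subnet_cache 网络缓存均为假设):

```python
def get_user_subnet(subnet_cache, device_id=None,
                    speech_features=None, identify_speaker=None):
    """优先按设备码/帐号确定用户身份, 否则按语音特征做声纹识别(示意)。"""
    if device_id is not None:
        user = device_id                           # 设备码或帐号直接标识用户
    elif speech_features is not None and identify_speaker is not None:
        user = identify_speaker(speech_features)   # 假设的声纹识别接口
    else:
        raise ValueError("unable to determine user identity")
    return subnet_cache[user]                      # 该用户的附属静态解码网络
```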
可见,本发明实施例用户个性化信息语音识别方法,在接收到语音信号后, 根据与基础人名语言模型相关的基础静态解码网络逐帧对所述语音信号进行 解码,得到解码网络中所有活跃节点上的解码路径,如果确定有解码路径进入 所述基础静态解码网络中的人名节点,则进一步根据用户的与特定用户人名语 言模型相关的附属静态解码网络对所述人名节点进行网络扩展,从而不仅提高 了连续语音识别中个性化的联系人人名的识别准确率,而且还提高了联系人人 名的上下文内容识别准确率。在语音识别的多个层面应用联系人信息,使整体 识别效果得到了优化, 提高了连续语音识别中用户个性化信息的识别准确率。
需要说明的是,本发明实施例用户个性化信息语音识别方法不仅适用于用 户人名解码,还适用于其他可定义的个性化信息的语音识别,比如地址识别等。
相应地, 本发明实施例还提供一种用户个性化信息语音识别系统, 如图 10所示, 是该系统的一种结构示意图。
在该实施例中, 所述系统包括: 接收单元 111 , 用于接收语音信号;
解码单元 112,用于根据基础静态解码网络逐帧对所述语音信号进行解码, 得到基础静态解码网络中各活跃节点上的解码路径,所述基础静态解码网络是 与基础人名语言模型相关的解码网络;
解码路径检查单元 113 , 用于确定是否有解码路径进入所述基础静态解码 网络中的人名节点;
网络扩展单元 114, 用于在所述解码路径检查单元 113确定有解码路径进 入所述基础静态解码网络中的人名节点后,根据所述用户的附属静态解码网络 对所述人名节点进行网络扩展,所述附属静态解码网络是与特定用户人名语言 模型相关的解码网络;
所述解码单元 112, 还用于在解码完成后, 返回识别结果。
解码单元 112 对用户输入的语音信号进行解码的过程是一个在所述基础 静态解码网络中搜索最优路径, 实现语音到文本的转换的过程。 具体地, 可以 首先对接收的连续语音信号采样为一系列离散能量值存入数据緩存区。 当然, 为了进一步提高系统的鲁棒性, 在所述系统中还可以包括预处理单元(未图 示), 用于在解码单元 112对用户输入的语音信号进行解码之前, 对接收单元 111接收到的连续语音信号进行降噪处理。 具体地, 可以首先通过对语音信号 的短时能量和短时过零率分析,将连续的语音信号分割成独立的语音片断和非 语音片断, 然后对分割得到的语音片断进行语音增强处理, 在进行语音增强处 理时, 可以通过维纳滤波等方法, 将语音信号中的环境噪声进一步消除, 以提 高后续系统对该信号的处理能力。
考虑到降噪处理后的语音信号中依然会存在大量语音识别无关的冗余信息, 直接对其识别可能会使运算量和识别准确率降低, 为此, 所述预处理单元还可以从降噪处理后的语音能量信号中提取识别有效语音特征, 并存入特征缓存区内。 具体地, 可以提取语音的 MFCC (Mel Frequency Cepstrum Coefficient, Mel频率倒谱系数) 特征, 对窗长 25ms、帧移 10ms 的每帧语音数据做短时分析得到 MFCC 参数及其一阶、二阶差分, 共计 39 维。 也就是说, 将每帧语音信号量化为一 39 维的特征序列。 然后, 再由解码单元 112根据所述基础静态解码网络对语音信号进行解码, 获取所述语音信号在所述基础静态解码网络中所有活跃节点上的解码路径。 当解码完成后, 其中具有最大累积历史路径概率的活跃节点即为最优节点, 从该最优节点通过解码状态回溯得到的历史路径即为最优路径, 该最优路径上的单词序列即为解码结果。
由于所述基础静态解码网络是与基础人名语言模型相关的解码网络, 因 此, 在有解码路径进入所述基础静态解码网络中的人名节点时, 由网络扩展单 元 114根据所述用户的附属静态解码网络对所述人名节点进行网络扩展。由于 所述附属静态解码网络是与特定用户人名语言模型相关的解码网络,因此通过 对用户个性化词条的构建及应用, 尤其是对用户个性化联系人信息的应用,有 效提高了用户个性化信息的识别准确率。
本发明实施例用户个性化信息语音识别系统,在接收到语音信号后,根据 与基础人名语言模型相关的基础静态解码网络逐帧对所述语音信号进行解码, 得到解码网络中所有活跃节点上的解码路径,如果确定有解码路径进入所述基 础静态解码网络中的人名节点,则进一步根据用户的与特定用户人名语言模型 相关的附属静态解码网络对所述人名节点进行网络扩展,从而不仅提高了连续 语音识别中个性化的联系人人名的识别准确率,而且还提高了联系人人名的上 下文内容识别准确率。在语音识别的多个层面应用联系人信息,使整体识别效 果得到了优化, 提高了连续语音识别中用户个性化信息的识别准确率。
上述网络扩展单元 114 需要根据用户的附属静态解码网络对所述人名节 点进行网络扩展。如果系统的用户只有一个, 则所述附属静态解码网络是唯一 的,可以由系统在线构建,也可以通过离线方式构建,在系统启动时直接载入。 如果系统的用户有多个,则需要识别当前的用户以及该用户对应的附属静态解 码网络。 同样, 这些不同用户的附属静态解码网络可以由系统在线构建, 也可 以通过离线方式构建, 在系统启动时直接载入。
需要说明的是,在具体应用中, 当前用户对应的附属静态解码网络的确定 可以在不同时机来完成。
如图 11所示, 是本发明用户个性化信息语音识别系统的一种具体实现结 构示意图。 与图 10不同的是, 在该实施例中, 用户个性化信息语音识别系统 还包括: 确定单元 121 , 用于在所述解码单元 112根据基础静态解码网络对所 述语音信号进行解码之前, 确定所述用户的附属静态解码网络。
如图 12所示, 是本发明用户个性化信息语音识别系统的另一种具体实现 结构示意图。 与图 10不同的是, 在该实施例中, 所述用户个性化信息语音识 别系统还包括: 确定单元 131 , 用于在解码路径检查单元 113确定有解码路径 进入所述基础静态解码网络中的人名节点之后,确定所述用户的附属静态解码 网络。
需要说明的是, 无论是上述确定单元 121还是确定单元 131 , 都可以根据 所述语音信号的特征确定用户身份,然后根据所述用户身份确定所述用户的附 属静态解码网络; 或者根据用户的设备码或帐号确定用户身份, 然后根据所述 用户身份确定所述用户的附属静态解码网络。
在实际应用中,上述基础静态解码网络和附属静态解码网络可以由系统在 线构建, 也可以通过离线方式构建, 在系统启动时直接载入, 以减少系统运算 量及所需内存, 进一步提高解码效率。
由于中文人名数量众多,传统词典很少将人名作为确定字词处理, 因而训 练语料分词后的人名数量极其有限,训练得到的语言模型也无法很好地描述具 体人名的出现概率, 进而影响了人名相关整词的识别准确率。 为此, 在本发明 系统的另一实施例中, 如图 13所示, 还可进一步包括:
基础人名语言模型构建单元 131 , 用于构建基础人名语言模型;
特定用户人名语言模型构建单元 132, 用于构建特定用户人名语言模型; 基础静态解码网络构建单元 133, 用于构建与所述基础人名语言模型相关 的基础静态解码网络; 附属静态解码网络构建单元 134, 用于构建与所述特定用户人名语言模型 相关的附属静态解码网络。
如图 14所示, 是本发明实施例用户个性化信息语音识别系统中基础人名 语言模型构建单元的结构示意图。
所述基础人名语言模型构建单元包括:
人名采集单元 141 , 用于采集人名数据库;
语料采集单元 142, 用于采集语言模型训练语料;
统计单元 143 , 用于根据所述人名数据库及所述语言模型训练语料, 对常 规字词以及常规字词与人名字词间关联关系进行统计;
基础人名语言模型生成单元 144, 用于根据所述统计单元 143得到的统计结果生成基础人名语言模型。
上述统计单元 143 可以根据所述人名数据库中的人名在所述训练语料中进行人名检测, 比如, 可以采用传统人名检测算法进行人名检测; 然后对所述训练语料中的所有具体人名用一个统一的虚拟人名替换, 并根据替换后的训练语料对常规字词以及常规字词与人名字词间关联关系进行统计。
统计单元 143可以包括:
检测子单元,用于根据所述人名数据库中的人名在所述训练语料中进行人 名检测;
替换子单元,用于对所述训练语料中的所有具体人名用一个统一的虚拟人 名替换;
统计子单元,用于根据替换后的训练语料对常规字词以及常规字词与人名 字词间关联关系进行统计。
相比于传统的语言模型,由所述基础人名语言模型构建单元构建的基础人 名语言模型通过对人名字词的归纳提取,更好地描述了人名属性字词和常规字 词的统计概率, 实现了对人名整词识别的支持。
如图 15所示, 是本发明实施例用户个性化信息语音识别系统中基础静态 解码网络构建单元的结构示意图。
所述基础静态解码网络构建单元包括:
虚拟发音设置单元 151 , 用于为虚拟人名设置一个虚拟发音, 以使所述虚 拟人名作为一个普通单词参与声学模型的静态网络扩展;
特殊节点确定单元 152, 用于根据所述虚拟发音确定扩展后的静态网络中 的特殊节点,所述特殊节点包括:进入人名单元的节点和人名单元的终止节点; 第一扩展单元 153 , 用于对所述特殊节点的入弧或出弧上的虚拟发音单元 进行扩展, 得到与基础人名语言模型相关的基础静态解码网络。
如图 16所示, 是本发明实施例用户个性化信息语音识别系统中特定用户 人名语言模型构建单元和附属静态解码网络构建单元的结构示意图。
所述特定用户人名语言模型构建单元包括:
人名提取单元 161 , 用于从用户相关的人名相关信息中提取人名, 并将所 述人名作为人名词条记录; 特定用户人名语言模型生成单元 162, 用于对每个人名词条设置一个词频 概率, 并根据人名词条的词频概率生成特定用户人名语言模型;
所述附属静态解码网络构建单元包括:
设定单元 171 , 用于分别设定特定用户人名语言模型中的句首词和句尾词 的发音为虚拟的特殊发音;
第二扩展单元 172, 用于对于句首节点的出弧和句尾节点的入弧上的特殊 发音单元进行扩展, 得到特定用户人名语言模型相关的附属静态解码网络。
利用本发明实施例提供的用户个性化信息语音识别系统,不仅可以提高连 续语音识别中个性化的联系人人名的识别准确率,而且还可以提高联系人人名 的上下文内容识别准确率。在语音识别的多个层面应用联系人信息,使整体识 别效果得到了优化, 提高了连续语音识别中用户个性化信息的识别准确率。
需要说明的是,本发明实施例用户个性化信息语音识别系统不仅适用于用 户人名解码,还适用于其他可定义的个性化信息的语音识别,比如地址识别等。
本说明书中的各个实施例均采用递进的方式描述, 各个实施例之间相同相似的部分互相参见即可, 每个实施例重点说明的都是与其他实施例的不同之处。 尤其, 对于系统实施例而言, 由于其基本相似于方法实施例, 所以描述得比较简单, 相关之处参见方法实施例的部分说明即可。 以上所描述的系统实施例仅仅是示意性的, 其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的, 作为单元显示的部件可以是或者也可以不是物理单元, 即可以位于一个地方, 或者也可以分布到多个网络单元上。 可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。 本领域普通技术人员在不付出创造性劳动的情况下, 即可以理解并实施。
以上对本发明实施例进行了详细介绍,本文中应用了具体实施方式对本发 明进行了阐述, 以上实施例的说明只是用于帮助理解本发明的方法及设备; 同 时, 对于本领域的一般技术人员, 依据本发明的思想, 在具体实施方式及应用 范围上均会有改变之处,综上所述,本说明书内容不应理解为对本发明的限制。

Claims

权利要求
1、 一种用户个性化信息语音识别方法, 其特征在于, 包括:
接收语音信号;
根据基础静态解码网络对所述语音信号进行解码,得到基础静态解码网络 中各活跃节点上的解码路径,所述基础静态解码网络是与基础人名语言模型相 关的解码网络;
如果确定有解码路径进入所述基础静态解码网络中的人名节点,则根据用 户的附属静态解码网络对所述人名节点进行网络扩展,所述附属静态解码网络 是与特定用户人名语言模型相关的解码网络;
解码完成后, 返回识别结果。
2、 根据权利要求 1所述的方法, 其特征在于, 所述方法还包括: 在根据基础静态解码网络对所述语音信号进行解码之前,确定所述用户的 附属静态解码网络; 或者
在确定有解码路径进入所述基础静态解码网络中的人名节点之后,确定所 述用户的附属静态解码网络。
3、 根据权利要求 2所述的方法, 其特征在于, 所述确定所述用户的附属 静态解码网络包括:
根据所述语音信号的特征确定用户身份,然后根据所述用户身份确定所述 用户的附属静态解码网络; 或者
根据用户的设备码或帐号确定用户身份,然后根据所述用户身份确定所述 用户的附属静态解码网络。
4、根据权利要求 1至 3任一项所述的方法,其特征在于,所述方法还包括: 构建基础人名语言模型和特定用户人名语言模型;
分别构建与所述基础人名语言模型相关的基础静态解码网络和与所述特 定用户人名语言模型相关的附属静态解码网络。
5、 根据权利要求 4所述的方法, 其特征在于, 所述构建基础人名语言模 型包括:
分别采集人名数据库和语言模型训练语料;
根据所述人名数据库及所述语言模型训练语料,对常规字词以及常规字词 与人名字词间关联关系进行统计;
根据统计结果生成基础人名语言模型。
6、 根据权利要求 5所述的方法, 其特征在于, 所述根据所述人名数据库 及所述语言模型训练语料,对常规字词以及常规字词与人名字词间关联关系进 行统计包括:
根据所述人名数据库中的人名在所述训练语料中进行人名检测;
对所述训练语料中的所有具体人名用一个统一的虚拟人名替换;
根据替换后的训练语料对常规字词以及常规字词与人名字词间关联关系进行统计。
7、 根据权利要求 6所述的方法, 其特征在于, 所述构建与所述基础人名 语言模型相关的基础静态解码网络包括:
为所述虚拟人名设置一个虚拟发音,以使所述虚拟人名作为一个普通单词 参与声学模型的静态网络扩展;
根据所述虚拟发音确定扩展后的静态网络中的特殊节点,所述特殊节点包 括: 进入人名单元的节点和人名单元的终止节点;
对所述特殊节点的入弧或出弧上的虚拟发音单元进行扩展,得到与基础人 名语言模型相关的基础静态解码网络。
8、 根据权利要求 4所述的方法, 其特征在于, 所述构建特定用户人名语 言模型包括:
从用户相关的人名相关信息中提取人名, 并将所述人名作为人名词条记 录;
对每个人名词条设置一个词频概率,并根据人名词条的词频概率生成特定 用户人名语言模型;
所述构建与所述特定用户人名语言模型相关的附属静态解码网络包括: 分别设定特定用户人名语言模型中的句首词和句尾词的发音为虚拟的特 殊发音;
对于句首节点的出弧和句尾节点的入弧上的特殊发音单元进行扩展,得到 特定用户人名语言模型相关的附属静态解码网络。
9、 一种用户个性化信息语音识别系统, 其特征在于, 包括:
接收单元, 用于接收语音信号;
解码单元, 用于根据基础静态解码网络对所述语音信号进行解码,得到基 础静态解码网络中各活跃节点上的解码路径,所述基础静态解码网络是与基础 人名语言模型相关的解码网络;
解码路径检查单元,用于确定是否有解码路径进入所述基础静态解码网络 中的人名节点;
网络扩展单元,用于在所述解码路径检查单元确定有解码路径进入所述基 础静态解码网络中的人名节点后,根据用户的附属静态解码网络对所述人名节 点进行网络扩展,所述附属静态解码网络是与特定用户人名语言模型相关的解 码网络;
结果输出单元, 用于在解码完成后, 返回识别结果。
10、 根据权利要求 9所述的系统, 其特征在于, 所述系统还包括: 确定单元,用于在所述解码单元根据基础静态解码网络逐帧对所述语音信 号进行解码之前,确定所述用户的附属静态解码网络; 或者在解码路径检查单 元确定有解码路径进入所述基础静态解码网络中的人名节点之后,确定所述用 户的附属静态解码网络。
11、 根据权利要求 10所述的系统, 其特征在于, 所述确定单元, 具体用 于根据所述语音信号的特征确定用户身份,然后根据所述用户身份确定所述用 户的附属静态解码网络; 或者根据用户的设备码或帐号确定用户身份, 然后根 据所述用户身份确定所述用户的附属静态解码网络。
12、 根据权利要求 9至 11任一项所述的系统, 其特征在于, 所述系统还 包括:
基础人名语言模型构建单元, 用于构建基础人名语言模型;
特定用户人名语言模型构建单元, 用于构建特定用户人名语言模型; 基础静态解码网络构建单元,用于构建与所述基础人名语言模型相关的基 础静态解码网络;
附属静态解码网络构建单元,用于构建与所述特定用户人名语言模型相关 的附属静态解码网络。
13、 根据权利要求 12所述的系统, 其特征在于, 所述基础人名语言模型 构建单元包括:
人名采集单元, 用于采集人名数据库;
语料采集单元, 用于采集语言模型训练语料;
统计单元, 用于根据所述人名数据库及所述语言模型训练语料,对常规字 词以及常规字词与人名字词间关联关系进行统计;
基础人名语言模型生成单元,用于根据所述统计单元得到的统计结果生成 基础人名语言模型。
14、 根据权利要求 13所述的系统, 其特征在于, 所述统计单元包括: 检测子单元,用于根据所述人名数据库中的人名在所述训练语料中进行人 名检测;
替换子单元,用于对所述训练语料中的所有具体人名用一个统一的虚拟人 名替换;
统计子单元,用于根据替换后的训练语料对常规字词以及常规字词与人名 字词间关联关系进行统计。
15、 根据权利要求 14所述的系统, 其特征在于, 所述基础静态解码网络 构建单元包括:
虚拟发音设置单元, 用于为所述虚拟人名设置一个虚拟发音, 以使所述虚 拟人名作为一个普通单词参与声学模型的静态网络扩展;
特殊节点确定单元,用于根据所述虚拟发音确定扩展后的静态网络中的特 殊节点, 所述特殊节点包括: 进入人名单元的节点和人名单元的终止节点; 第一扩展单元,用于对所述特殊节点的入弧或出弧上的虚拟发音单元进行 扩展, 得到与基础人名语言模型相关的基础静态解码网络。
16、 根据权利要求 12所述的系统, 其特征在于, 所述特定用户人名语言 模型构建单元包括:
人名提取单元, 用于从用户相关的人名相关信息中提取人名, 并将所述人 名作为人名词条记录;
特定用户人名语言模型生成单元, 用于对每个人名词条设置一个词频概 率, 并根据人名词条的词频概率生成特定用户人名语言模型;
所述附属静态解码网络构建单元包括:
设定单元,用于分别设定特定用户人名语言模型中的句首词和句尾词的发 音为虚拟的特殊发音;
第二扩展单元,用于对于句首节点的出弧和句尾节点的入弧上的特殊发音 单元进行扩展, 得到特定用户人名语言模型相关的附属静态解码网络。
PCT/CN2013/090037 2012-12-28 2013-12-20 用户个性化信息语音识别方法及系统 WO2014101717A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP13869206.6A EP2940684B1 (en) 2012-12-28 2013-12-20 Voice recognizing method and system for personalized user information
US14/655,946 US9564127B2 (en) 2012-12-28 2013-12-20 Speech recognition method and system based on user personalized information

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201210585934.7A CN103065630B (zh) 2012-12-28 2012-12-28 用户个性化信息语音识别方法及系统
CN201210585934.7 2012-12-28

Publications (1)

Publication Number Publication Date
WO2014101717A1 true WO2014101717A1 (zh) 2014-07-03

Family

ID=48108230

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/090037 WO2014101717A1 (zh) 2012-12-28 2013-12-20 用户个性化信息语音识别方法及系统

Country Status (4)

Country Link
US (1) US9564127B2 (zh)
EP (1) EP2940684B1 (zh)
CN (1) CN103065630B (zh)
WO (1) WO2014101717A1 (zh)


Also Published As

Publication number Publication date
CN103065630B (zh) 2015-01-07
US20150348542A1 (en) 2015-12-03
CN103065630A (zh) 2013-04-24
US9564127B2 (en) 2017-02-07
EP2940684A1 (en) 2015-11-04
EP2940684A4 (en) 2017-01-04
EP2940684B1 (en) 2019-05-22
